Last week Microsoft, to much fanfare, held a press-only event to announce and show off “the new Bing” and Bing Chat (you can watch the webcast if you missed it). After Satya Nadella offered some opening remarks, Yusuf Mehdi, Corporate Vice President, Modern Life, Search, and Devices, opened his demo by saying that while his searches were real, they were not real time. Mehdi explained that for clarity, and so “you don’t have to watch me type every query,” the searches had been recorded live the day before.
The day after Microsoft’s Bing demo, Google held an AI demo of its own for Bard, its version of an AI-assisted search chat experience. Much was made of a mistake in Bard’s answer to a query, and of the drop in Google’s stock price later in the day.
However, at least one interested party, Dimitri Brereton, who conducts “independent research on search engines and AI,” dug deeper into Mehdi’s demo and found not one but multiple errors in the results that Bing Chat produced.
While Brereton had no issues with the first three queries (on Mexican painters, Super Bowl parties, and whether a love seat will fit in a Honda), Bing Chat apparently didn’t fare so well in a comparison of pet vacuums:
According to this pros and cons list, the “Bissell Pet Hair Eraser Handheld Vacuum” sounds pretty bad. Limited suction power, a short cord, and it’s noisy enough to scare pets? Geez, how is this thing even a best seller?
Oh wait, this is all completely made up information.
Bing AI was kind enough to give us its sources, so we can go to the hgtv article and check for ourselves.
The cited article says nothing about limited suction power or noise. In fact, the top amazon review for this product talks about how quiet it is.
Brereton went on to note that Bing complained about a short cord even though this is a cordless vacuum. As commenters on his post pointed out, however, there is also a corded version.
Brereton then moved on to the query for a 5-day trip itinerary to Mexico City and its nightlife recommendations. He found a number of issues with Bing’s recommended bars and nightclubs:
El Almacen *might* be rustic or charming, but Bing AI left out the very relevant fact that this is a gay bar. In fact, it is one of the oldest gay bars in Mexico City. It is quite surprising that it has “no ratings or reviews yet” when it has 500 Google reviews, but maybe that’s a limitation with Bing’s sources.
He found a number of problems with the Gap financial statement summary query, too:
This is by far the worst mistake made during the demo. It’s also the most unexpected. I would have thought that summarizing a document would be trivial for AI at this point. But Bing AI manages to take a simple financial document, and make all the numbers wrong.
…
“Gap Inc. reported operating margin of 5.9%, adjusted for impairment charges and restructuring costs, and diluted earnings per share of $0.42, adjusted for impairment charges, restructuring costs, and tax impacts.”
“5.9%” is neither the adjusted nor the unadjusted value. This number doesn’t even appear in the entire document. It’s completely made up.
The operating margin including impairment is 4.6% and excluding impairment is 3.9%.
The diluted earnings per share is also a completely made up number that doesn’t appear in the document. Adjusted diluted earnings per share is $0.71 and unadjusted is $0.77.
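The kind of check described above, whether a figure quoted in a summary appears anywhere in the source, can be sketched in a few lines of Python. This is only an illustration and not Brereton’s actual method: the function names and the number pattern are hypothetical, and the `source` string is a stand-in assembled from the figures quoted above, not the real Gap document.

```python
import re

# Hypothetical pattern for illustration: matches figures such as "5.9%" or "$0.42".
NUMBER_PATTERN = re.compile(r"\$?\d+(?:\.\d+)?%?")


def extract_figures(text: str) -> set[str]:
    """Return the numeric tokens (with any attached $ or %) found in text."""
    return set(NUMBER_PATTERN.findall(text))


def unsupported_figures(summary: str, source: str) -> list[str]:
    """List figures that appear in the summary but nowhere in the source."""
    return sorted(extract_figures(summary) - extract_figures(source))


# Stand-in source built from the figures quoted above; NOT the actual Gap filing.
source = (
    "Operating margin was 4.6%; excluding impairment it was 3.9%. "
    "Diluted earnings per share were $0.77; adjusted diluted earnings per share were $0.71."
)
# The two figures from the Bing Chat summary quoted above.
summary = (
    "Gap Inc. reported operating margin of 5.9% and diluted earnings per share of $0.42."
)

print(unsupported_figures(summary, source))
```

A check like this only flags figures that are missing verbatim. It cannot tell whether a figure that does appear in the source has been attached to the wrong label, so it complements a careful read rather than replacing one.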
He concludes with a pretty scathing summary:
Bing AI is incapable of extracting accurate numbers from a document, and confidently makes up information even when it claims to have sources.
It is definitely not ready for launch, and should not be used by anyone who wants an accurate model of reality.
At the same time, Bing has also been receiving accolades as more people begin to use the new service, and Brereton’s dissections covered only a few queries; there were many more he had no trouble with. While the future of search and AI is murky, it is also probably inevitable. At some point Microsoft and Google, as well as the many other AI competitors now surfacing, are going to have to bring their products to market. It is, however, good advice to take a close look at these search results, as it is with anything found on the internet.
Dimitri Brereton’s Substack post via Rafael Rivera on Twitter
Updated Note: As noted in the comments of Brereton’s article, other researchers also found errors in the demo. Unfortunately, it looks like this is turning into some drama within the drama, as the author of another such fact-check claims plagiarism, while Brereton denies the claim “with the more reasonable assumption that we all had the same idea to fact-check the Bing demo.” We suggest checking both articles out.
Note #2: Microsoft had this to say:
“We’re aware of these reports and have analyzed their findings in our efforts to improve this experience. Last week, we announced and demoed an early preview of the new Bing experience. Over the past week alone, thousands of users have interacted with our product and found significant value while sharing their feedback with us, allowing the model to learn and make many improvements already. We recognize that there is still work to be done and are expecting that the system may make mistakes during this preview period, which is why the feedback is critical so we can learn and help the models get better.” – a Microsoft spokesperson