BBC Finds That 45% of AI Queries Produce Erroneous Answers
This is mind-blowing. Today the BBC and the EBU (European Broadcasting Union) published a detailed study showing that around 45% of AI news queries to ChatGPT, Microsoft Copilot, Gemini, and Perplexity produce errors.
In other words, the “dangerously self-confident” AI systems we use are quite poor at giving us good analysis of news. While the study focused on news, this shows us that we have to be extremely careful when using and trusting these “open corpus” systems because they are answering questions based on faulty, exaggerated, outdated, or incorrect data.
The examples are quite astounding: the AIs incorrectly answered “Who is the Pope?” and “Who is the Chancellor of Germany?”, and in response to the question “Should I be worried about the bird flu?”, Copilot claimed “A vaccine trial is underway in Oxford.” The source for this was a BBC article from 2006, almost 20 years old.
“Some were potentially consequential errors on matters of law. Perplexity (CRo) claimed that surrogacy “is prohibited by law” in Czechia, when in fact it is not regulated by the law and is neither explicitly prohibited nor permitted. Gemini (BBC) incorrectly characterized a change to the law around disposable vapes, saying it would be illegal to buy them, when in fact it was the sale and supply of vapes which was to be made illegal.”
Why Is This Taking Place
I hate to say it, but the underlying LLM technology we now love has flaws, and this points to what I call the “poisoned corpus,” or poor data, problem.
LLMs work through “embeddings” – mathematical representations that capture the statistical relationship of every token (word fragment) to every other token. In other words, when the LLM is trained, it reads the entire internet (or whatever corpus it has been given) and stores a massive set of vectors that relate every token to all the others.
This probabilistic system then decodes the “question” we ask and generates the statistically most likely “answer” from this multi-dimensional model. Since most questions are not simple to answer, almost every answer draws on many sources, which means any flawed, outdated, exaggerated, or incorrect material gets blended in. The result is a “dangerously confident” answer that may in fact be wrong.
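To make the idea concrete, here is a toy sketch of my own (the three-dimensional vectors and the softmax scoring are purely illustrative, not how any production model is actually implemented): tokens become vectors, the question becomes a vector, and the “answer” is whatever scores highest statistically – correct or not.

```python
# A toy illustration of tokens-as-vectors and statistical answering.
# The three-dimensional embeddings below are made up for illustration;
# real models use thousands of dimensions and learned values.
import math

embeddings = {
    "pope":       [0.9, 0.1, 0.3],
    "chancellor": [0.2, 0.8, 0.4],
    "vaccine":    [0.1, 0.3, 0.9],
}

def cosine(a, b):
    """Similarity between two vectors: how 'related' two tokens are statistically."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

query = [0.85, 0.15, 0.35]  # a made-up embedding of the user's question
tokens = list(embeddings)
probs = softmax([cosine(query, embeddings[t]) for t in tokens])

for token, p in zip(tokens, probs):
    # The most statistically related token "wins" -- whether or not it is factually correct.
    print(f"{token:10s} {p:.2f}")
```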
I asked Claude to explain this to me and it actually admitted that this is a massive problem. Here is my discussion with Claude.
[Screenshot: my discussion with Claude]
If you read this narrative you can quickly see that any “mistakes” in the corpus have the potential to poison the system and produce errors for any broad question.
Since more and more of our AI usage is for analysis, writing, and data collection, you can see why a high percentage of our queries produce erroneous answers. And as the discussion shows, even a tiny error rate on input (imagine that only 2% of input data is possibly wrong) could result in many questions producing poor results.
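Here is the back-of-the-envelope arithmetic behind that point, a minimal sketch assuming (my assumption, not the study’s) that each answer blends several independent sources and each source has a 2% chance of being wrong:

```python
# How a small input error rate compounds when an answer blends many sources.
# Assumption (mine): sources are independent and each is flawed with probability p.

def prob_any_flawed_source(p: float, n: int) -> float:
    """Probability that at least one of n independent sources is flawed."""
    return 1 - (1 - p) ** n

p = 0.02  # "only 2% of input data is possibly wrong"
for n in (5, 10, 20, 30):
    print(f"{n:2d} sources -> {prob_any_flawed_source(p, n):.0%} chance the answer touches bad data")
```

Under these assumptions, an answer that draws on 30 sources has roughly a 45% chance of touching at least one flawed source; the corpus does not need to be mostly wrong for the answers to be frequently unreliable.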
Right now, as OpenAI and Google push their AI systems toward advertising business models, it’s increasingly clear to me that these are not going to be trusted systems. In other words, unless you’re using a highly trusted corpus (like our Galileo), you as a user must verify the answers yourself. In the old world of Google queries we could look at the links to decide what was trustworthy; now we have to literally check the answers (since many sources are not even cited).
In my own work, which involves exhaustive analysis of labor market, salary, unemployment, financial, and other data, I’ve found that ChatGPT frequently estimates or makes mistakes. It even carries mistakes from one level of analysis to the next, leading to ridiculous conclusions.
For example, I asked ChatGPT to analyze major capital investments in AI data centers and to determine what percentage of that investment went to energy and labor.
It confidently threw together a number, which I then extrapolated by hand to find that ChatGPT believes there are more AI engineers than there are working people in the United States.
It never tested its answers against such simple benchmarks. I went back and scolded it for its errors; the system admitted its mistake and, in one session, actually stopped chatting with me.
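Here is the kind of plausibility check I mean, written out as a sketch. Every number below (the capex figure, the labor share, the cost per engineer) is an illustrative assumption of mine, not a figure from ChatGPT or from any dataset:

```python
# A back-of-the-envelope benchmark test: does the implied headcount even fit
# inside the US labor force? All inputs are illustrative assumptions.

AI_CAPEX_USD = 300e9             # hypothetical annual AI data-center investment
LABOR_SHARE = 0.5                # hypothetical share of that spend going to labor
COST_PER_ENGINEER_USD = 250_000  # hypothetical fully loaded cost per engineer
US_LABOR_FORCE = 168e6           # roughly the size of the US labor force

implied_engineers = AI_CAPEX_USD * LABOR_SHARE / COST_PER_ENGINEER_USD
print(f"Implied AI engineers: {implied_engineers:,.0f}")

# The benchmark: an answer that implies more engineers than working Americans
# cannot be right, no matter how confidently it is stated.
if implied_engineers > US_LABOR_FORCE:
    print("Fails a basic plausibility check -- go back and question the inputs.")
```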
If you read the narrative above, you have to wonder whether this problem can be fixed. And as companies like OpenAI and Google push toward advertising-based models, it seems likely the issue of data quality is only going to get worse. If one provider is paying ad dollars for placement, its information (as flawed or exaggerated as it may be) is going to be promoted more.
What Should We Do
I’m sure the AI labs will respond to this study, but in the meantime I have three findings to share.
First, you must focus on building a “truly trusted” corpus in your own AI systems.
In our case, Galileo is built 100% on our research and our own trusted data providers, so we make sure it does not hallucinate or make errors. So far we’ve been able to make this work. If you ask HR, salary, or other questions of one of these public-facing systems, all bets are off.
This means your own AI systems (your employee Ask HR bot, your customer support system, etc.) must be as close to 100% accurate as possible. That requires assigning content owners to each part of your corpus and regularly running audits to make sure policies, data, and support tickets are correct. A dated or old answer may appear as new if you’re not careful. (IBM’s AskHR, for example, has 6,000 HR policies, and each policy has an accountable owner who keeps it accurate.)
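As a sketch of what that owner-plus-audit discipline might look like in practice (a minimal example of my own, not IBM’s AskHR or Galileo’s actual implementation):

```python
# Track an accountable owner and a review date for every corpus entry,
# and flag anything that has gone stale.
from dataclasses import dataclass
from datetime import date, timedelta

MAX_AGE = timedelta(days=365)  # assumption: every policy is re-reviewed at least yearly

@dataclass
class CorpusEntry:
    title: str
    owner: str          # the accountable content owner
    last_reviewed: date

def stale_entries(corpus: list[CorpusEntry], today: date) -> list[CorpusEntry]:
    """Return entries whose last review is older than the allowed age."""
    return [e for e in corpus if today - e.last_reviewed > MAX_AGE]

corpus = [
    CorpusEntry("Parental leave policy", "hr-benefits@company.example", date(2025, 3, 1)),
    CorpusEntry("Remote work policy", "hr-ops@company.example", date(2023, 6, 15)),
]

for entry in stale_entries(corpus, date.today()):
    print(f"AUDIT: '{entry.title}' is stale -- notify {entry.owner}")
```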
Second, you must learn to question, test, and evaluate answers from public AI platforms.
As I discuss in my newest podcast, any data (i.e. financial data, competitor data, market data, legal data, news) could be incorrect. You need to use your own judgment, testing, and comparison process to find the source and validate that the answer is correct. My own experience shows that almost a third of the answers to complex queries have problems.
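One simple habit that helps is asking the same question in more than one place and treating disagreement as a signal to go find the primary source. A rough sketch of the idea (the ask_* functions are hypothetical placeholders for however you query each provider, and the canned answers simply echo the vape example above):

```python
# Cross-check the same question across providers; disagreement means "verify by hand."
def ask_provider_a(question: str) -> str:
    return "It will be illegal to buy disposable vapes."               # invented answer

def ask_provider_b(question: str) -> str:
    return "The sale and supply of disposable vapes will be banned."   # invented answer

def cross_check(question: str) -> None:
    answers = {
        "provider_a": ask_provider_a(question),
        "provider_b": ask_provider_b(question),
    }
    if len(set(answers.values())) > 1:
        print(f"Providers disagree on {question!r} -- verify against a primary source:")
        for name, answer in answers.items():
            print(f"  {name}: {answer}")

cross_check("What is changing in the law on disposable vapes?")
```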
Third, this points in a clear direction for AI offerings.
Public-facing AI systems (ChatGPT, Claude, Gemini) that rely on public data are probably never going to be as trusted or useful as vertical AI solutions. Products like Galileo (HR), Harvey (law), and many others that come from reputable information companies are going to be mandatory. While ChatGPT may “appear” to answer detailed questions well, the value of 100% trust is enormous when one bad decision could result in a lawsuit, an accident, or other harm.
I have no idea what may happen to the legal liability of these systems, but the real finding here is that your skill as an analyst, thinker, and businessperson remains more important than ever. Just because it’s easy to obtain a “self-confident answer” doesn’t mean your work is over. We need to test these AI systems and hold providers accountable for the right answers.
Otherwise it’s time to switch providers.
I welcome any and all comments on this discussion; we’re all learning as we go.
Additional Information
Why 45% Of AI Answers Are Incorrect: Thinking Skills You Need To Stay Safe (podcast)


