AI’s Genius? 🤯 Truth or Hallucination? 🤔


Summary

A recent analysis by The New York Times, using the SimpleQA test of more than 4,000 questions, found that Gemini-powered AI Overviews achieved 90 percent accuracy. Initial tests with Gemini 2.5 showed 85 percent, rising to 91 percent after the Gemini 3 update. Even so, roughly one in ten answers was incorrect, as when the system misidentified the date Bob Marley’s former home became a museum. Google spokesperson Ned Adriance questioned the testing methodology, pointing to discrepancies in a question about Yo-Yo Ma’s induction into a classical music hall of fame. Google’s internal benchmarks, which measure factuality at between 60 and 80 percent, were conducted without web search. Ultimately, the study’s limitations and the potential for AI hallucination underscore the continued need for users to independently verify AI-generated responses.

INSIGHTS


THE RISE OF AI OVERVIEWS: AN ACCURACY PROBLEM
AI Overviews, Google’s Gemini-powered search feature, has faced criticism over its accuracy since its 2024 launch. Despite improvements, the system provides a correct answer only about 90% of the time, meaning roughly one in ten responses is wrong. At Google’s scale, that error rate translates into an enormous volume of misinformation circulating globally, and it highlights the challenge of relying on AI for information retrieval.

THE OUMI ANALYSIS: A RIGOROUS TEST
A new analysis conducted by The New York Times, in collaboration with the startup Oumi, aimed to quantify the accuracy of AI Overviews. Oumi utilized the SimpleQA evaluation, a benchmark consisting of over 4,000 questions with verifiable answers, to assess generative models like Gemini. Initially run with Gemini 2.5, the benchmark revealed an 85% accuracy rate. A subsequent rerun following the Gemini 3 update showed an impressive 91% accuracy. This analysis provides a concrete metric for evaluating AI Overviews' performance.
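In outline, a SimpleQA-style evaluation is just a scored loop over question/answer pairs. The sketch below is a toy illustration with invented questions and a stand-in `ask_model` function; the real benchmark uses thousands of vetted items and an LLM-based grader rather than naive substring matching.

```python
def ask_model(question: str) -> str:
    # Stand-in for a call to a generative model such as Gemini.
    canned = {
        "When did Jamaica gain independence?": "1962",
        "Who composed the Cello Suites?": "Bach",
    }
    return canned.get(question, "I don't know")

def evaluate(pairs: list[tuple[str, str]]) -> float:
    """Score accuracy over (question, reference_answer) pairs."""
    correct = sum(
        1 for q, ref in pairs
        if ref.lower() in ask_model(q).lower()  # naive substring grading
    )
    return correct / len(pairs)

benchmark = [
    ("When did Jamaica gain independence?", "1962"),
    ("Who composed the Cello Suites?", "Bach"),
    ("What year did the Berlin Wall fall?", "1989"),
]
print(f"accuracy: {evaluate(benchmark):.0%}")  # 2 of 3 correct -> 67%
```

The reported 85% and 91% figures are exactly this kind of ratio, computed over a far larger and more carefully graded question set.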

GEMINI 3 AND BEYOND: ACCURACY INCREASES, BUT…
The shift to the Gemini 3 update resulted in a notable increase in accuracy for AI Overviews, reaching 91% on the SimpleQA benchmark. However, this improvement doesn’t eliminate the problem. Extrapolating the miss rate to all Google searches suggests tens of millions of incorrect answers generated daily. The core issue remains that while the underlying models improve, the system’s reliance on faster, less precise models for typical search queries contributes to the ongoing inaccuracies.
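The extrapolation above is simple arithmetic. In the sketch below, the daily search volume and the share of searches that show an AI Overview are assumed round numbers (Google publishes neither); only the 91% accuracy figure comes from the SimpleQA rerun. Even so, a single-digit error rate scales into tens of millions of wrong answers a day.

```python
# Back-of-the-envelope extrapolation; volume and coverage are
# illustrative assumptions, not published figures.
daily_searches = 8_500_000_000  # assumed order of magnitude
overview_share = 0.05           # assumed fraction showing an AI Overview
error_rate = 1 - 0.91           # 9% miss rate after the Gemini 3 update

wrong_per_day = daily_searches * overview_share * error_rate
print(f"~{wrong_per_day / 1e6:.0f} million incorrect answers per day")
```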

CASE STUDIES: WHEN AI OVERVIEWS FAILS
Several examples illustrate the shortcomings of AI Overviews. When asked on what date Bob Marley’s former home became a museum, the AI confidently cited three pages, two of which contained no date at all, while the third gave a contradictory year. Similarly, when asked about Yo-Yo Ma’s induction into the Classical Music Hall of Fame, the AI asserted that the institution did not exist even while citing its website. These instances demonstrate the AI’s propensity to fabricate information or select incorrect details.

GOOGLE’S RESPONSE AND THE COMPLEXITIES OF AI EVALUATION
Google’s response to the Times’ report highlights the challenges in evaluating AI models. Spokesperson Ned Adriance criticized SimpleQA, arguing that it contains incorrect information and doesn’t reflect actual user searches. Google utilizes a similar test, SimpleQA Verified, employing a smaller, more vetted set of questions. Furthermore, the evaluation process itself is often subjective, with companies employing different methodologies to showcase their models’ capabilities. The non-deterministic nature of generative AI adds another layer of complexity, as models can provide correct answers one moment and miss them entirely upon re-querying.
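The non-determinism is easy to demonstrate in miniature. The toy sketch below simulates a model sampling at temperature above zero with a seeded random choice; real models sample over tokens, but the effect is the same: repeating the identical query can change the answer.

```python
import random

def sample_answer(prompt: str, rng: random.Random) -> str:
    # Stand-in for a generative model sampling at temperature > 0:
    # the answer pool is weighted toward the correct "1962".
    return rng.choice(["1962", "1962", "1963"])

rng = random.Random(0)
answers = [sample_answer("When did Jamaica gain independence?", rng)
           for _ in range(10)]
print(set(answers))  # repeated queries can yield more than one answer
```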

HALLUCINATIONS AND MULTI-MODEL APPROACHES
The assessment process is further complicated by the tendency of AI models, including those used by Oumi, to “hallucinate” – generate false information. Recognizing this, Google employs multiple AI models for different queries. While Gemini 3.1 Pro offers the best accuracy, its speed and cost constraints necessitate the use of faster Gemini Flash models for immediate search results. This multi-model approach contributes to the variability in accuracy observed.
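A multi-model setup can be pictured as a router that sends cheap, common queries to a fast model and harder ones to a slower, more accurate one. The model names and the heuristic below are assumptions for illustration, not Google’s actual routing logic.

```python
def route(query: str) -> str:
    # Crude heuristic: long or reasoning-style queries go to the
    # slower, more accurate model; everything else stays fast/cheap.
    needs_reasoning = (
        len(query.split()) > 12
        or any(w in query.lower() for w in ("why", "compare", "explain"))
    )
    return "gemini-pro" if needs_reasoning else "gemini-flash"

print(route("weather today"))                # fast model
print(route("explain why the sky is blue"))  # accurate model
```

Because different queries hit different models, two users asking similar questions can see answers of very different quality, which matches the variability the article describes.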

FACTUAL GROUNDING AND THE ROLE OF BLUE LINKS
Despite the increased accuracy of Gemini models, a fundamental issue persists: AI Overviews relies on external data, primarily the web’s vast store of information. Grounding the AI in that information makes it more accurate than the model operating on its own. But the truth often resides in the “blue links” – the sources Google cites – and AI Overviews encourages users to accept its summaries without checking those sources. This reliance on AI summaries over independent verification raises concerns about the spread of misinformation.
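“Grounding” here essentially means inserting retrieved source snippets into the prompt so the model answers from them rather than from its parametric memory alone. A minimal sketch of the prompt-building step, with an invented snippet and a hypothetical helper name:

```python
def build_grounded_prompt(query: str, snippets: list[str]) -> str:
    """Prepend retrieved web snippets (the "blue links") to the question."""
    lines = ["Answer using only these sources:"]
    lines += [f"- {s}" for s in snippets]
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)

snippets = ["Jamaica gained independence on 6 August 1962."]
print(build_grounded_prompt("When did Jamaica gain independence?", snippets))
```

The catch the article points to is downstream of this step: the snippets are surfaced as citations, but the summary is what users actually read.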

DISCLAIMERS AND THE ACCEPTANCE OF ERROR
The Times’ analysis revealed discrepancies between the AI’s output and the information readily available. This underscores the inherent challenges in assessing AI factuality. Google acknowledges these mistakes, prominently displaying the disclaimer: “AI can make mistakes, so double-check responses.” This constant reminder reflects the current limitations of AI-powered search and the need for critical evaluation of its outputs.

This article is AI-synthesized from public sources and may not reflect original reporting.