The inevitable hallucinations of large language models: taxonomy, benchmarks, and mitigation for reliable AI


Why LLMs hallucinate and how to curb them

Hallucination in large language models is theoretically inevitable, demanding a shift from eradication to robust detection, mitigation, and human oversight. This report formalizes the inevitability, maps a comprehensive taxonomy (intrinsic vs extrinsic; factuality vs faithfulness), and links causes to data, model, prompt, and human factors. It consolidates benchmarks, metrics, and hybrid mitigation strategies—such as RAG, tool use, guardrails, and uncertainty displays—to guide safe deployment in high-stakes domains.

Key points

  • The report by Manuel Cossio (Universitat de Barcelona) formalizes hallucination as an inevitable property of computable LLMs, with theorems showing models will hallucinate on infinitely many inputs and cannot self-eliminate the behavior.
  • Core taxonomy separates intrinsic vs extrinsic hallucinations and factuality vs faithfulness errors, clarifying whether outputs contradict the input, reality, or both.
  • Prevalence indicators highlight distinct risks: in the analyses examined, logical inconsistencies account for 19% of cases, temporal disorientation for 12%, and ethical violations for 6%.
  • Causes span data (quality, bias, outdated knowledge), model factors (auto-regressive generation, decoding randomness, overconfidence, weak reasoning), and prompt factors (adversarial content, confirmatory tendencies).
  • Human factors amplify risks: users overtrust fluent outputs (automation and confirmation biases), and studies report that convincing but incorrect answers are rated highly in blind evaluations.
  • Benchmarks and metrics include TruthfulQA, HalluLens, FActScore, Q2/QuestEval, and domain tools like MedHallu, Med-HALT, CodeHaluEval, and HALLUCINOGEN to capture task-specific hallucinations.
  • Quantitative trends from Epoch AI show accuracy improving roughly 12 percentage points per 10× training compute on GPQA Diamond and about 17 percentage points per 10× on MATH Level 5, correlating with reduced hallucinations.
  • Performance gaps inform risk: proprietary models have led open models by up to 20 points on GPQA Diamond and 29 points on MATH Level 5, though DeepSeek-R1 narrowed the math gap to 2 points, enabling broader community progress.
  • Mitigation emphasizes hybrid systems: Toolformer-style tool use, retrieval-augmented generation (RAG), fine-tuning with synthetic/adversarially filtered data, guardrails, factual filters, logic validators, and calibrated uncertainty displays.
  • Monitoring resources include Artificial Analysis (intelligence, cost, latency), the Vectara Hallucination Leaderboard (summarization hallucination rates), Epoch AI (capability trends), and UC Berkeley’s LM Arena (blind preference testing reflecting real-world reliability).
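
To make the Epoch AI figures above concrete, here is a small sketch of the arithmetic they imply. It assumes the trend is log-linear in training compute (which is what "points per 10×" suggests) and ignores saturation near 100% accuracy; the function name `projected_gain` is hypothetical and only the two slopes come from the summary.

```python
import math

# Slopes quoted in the summary: percentage points of accuracy per 10x training compute.
SLOPE_PP_PER_10X = {"GPQA Diamond": 12.0, "MATH Level 5": 17.0}


def projected_gain(benchmark: str, compute_multiplier: float) -> float:
    """Accuracy gain (percentage points) implied by an assumed log-linear trend."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    # log10 converts a raw multiplier into a number of 10x steps.
    return SLOPE_PP_PER_10X[benchmark] * math.log10(compute_multiplier)


if __name__ == "__main__":
    for name in SLOPE_PP_PER_10X:
        print(name, round(projected_gain(name, 100), 1), "points for 100x compute")
```

Read this as a rough rule of thumb rather than a forecast: the slopes are approximate, and real accuracy curves flatten as they approach the ceiling.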

Takeaway

Want fewer AI tall tales? Start with retrieval grounding, add tool calls for math and code, and slap on guardrails like your model’s wearing a seatbelt. Sprinkle in uncertainty labels, cite sources, and, radical idea, ask a human before you ship to production. If that sounds like overkill, remember: the models are confidently wrong by design—so let’s at least make them wrong politely and with receipts.
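
For a feel of how those pieces fit together, here is a toy sketch of the recipe above: retrieval grounding, a tool call for arithmetic, a guardrail that refuses when nothing supports an answer, and a confidence label with sources. Everything in it is an assumption made for illustration: the in-memory `CORPUS`, the token-overlap retriever, the thresholds, and the names `retrieve`, `calculator`, and `answer`. The language model call is left out entirely; the retrieved passage stands in for a generated answer.

```python
import ast
import operator
import re
from dataclasses import dataclass

# Toy corpus standing in for a real document store (hypothetical content).
CORPUS = {
    "doc-1": "TruthfulQA is a benchmark that measures whether a model gives truthful answers.",
    "doc-2": "Retrieval-augmented generation grounds answers in retrieved documents.",
}

_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}


@dataclass
class Answer:
    text: str
    sources: list
    confidence: str  # "high", "low", or "none"


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, min_overlap=2):
    """Return (score, doc_id, passage) tuples, best first, by shared-token count."""
    q = _tokens(question)
    hits = [(len(q & _tokens(p)), d, p) for d, p in CORPUS.items()]
    return [h for h in sorted(hits, reverse=True) if h[0] >= min_overlap]


def calculator(expr):
    """Tool call: evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def ev(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def answer(question):
    # 1. Tool use: route pure arithmetic to the calculator instead of guessing.
    if re.fullmatch(r"[\d\s.+\-*/()]+", question) and re.search(r"\d", question):
        try:
            return Answer(str(calculator(question)), ["calculator"], "high")
        except (ValueError, SyntaxError, ZeroDivisionError):
            pass  # fall through to retrieval, then to the guardrail
    # 2. Retrieval grounding: only answer from passages we can cite.
    hits = retrieve(question)
    # 3. Guardrail: refuse rather than produce an unsupported answer.
    if not hits:
        return Answer("I don't have a grounded answer for that.", [], "none")
    score, doc_id, passage = hits[0]
    # 4. Uncertainty display: a crude label derived from retrieval overlap.
    return Answer(passage, [doc_id], "high" if score >= 3 else "low")


if __name__ == "__main__":
    for q in ["2 * (3 + 4)", "What does retrieval-augmented generation do?",
              "Who won the 2031 World Cup?"]:
        print(q, "->", answer(q))
```

A real system would swap the token-overlap retriever for a proper search index and put a model between retrieval and the final answer, but the order of the checks is the point: tools first, evidence second, refusal when neither applies, and a human review before shipping.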
