Why AI Models Cannot Reliably Predict Their Labor Impact
As policymakers and executives attempt to forecast workforce disruption, many have relied on large language models (LLMs) to assess their own impact on human occupations. However, economic analysis shows that these AI-generated exposure scores are unstable, highly sensitive to the specific model chosen, and distorted by adoption feedback loops. To make sound strategic decisions, leaders should stop relying solely on hypothetical AI capability rubrics and instead anchor their labor strategies in observable, multi-model market usage data.
Key Points
- Researchers replicated the influential Eloundou et al. (2024) rubric across nearly 19,000 O*NET tasks to test the stability of AI occupational exposure scores.
- The study used three early 2026 frontier models as annotators: ChatGPT-5, Gemini 2.5, and Claude 4.5.
- Cross-model disagreement is substantial, with pairwise agreement on task classification falling as low as 56.9%.
- Mean occupational exposure (labeled E1) differed 3.6-fold across AI annotators, ranging from 0.14 to 0.51.
- Employment impact estimates shift sharply with the annotator; individual-level regressions using Gemini 2.5 exposure scores produced a negative employment coefficient 2.4 times larger than the GPT-4 baseline.
- Qualitative reversals occurred at the county level, where exposure scores from Claude 4.5 yielded a statistically significant negative effect on employment, while scores from ChatGPT-5 and GPT-4 yielded non-significant results.
- Anthropic’s Economic Index revealed a distinct “Adoption Feedback Loop,” where occupations already exhibiting high AI usage received artificially inflated exposure ratings from subsequent model generations.
- Despite these large cross-model differences, within-model test-retest reliability remained above 90%, indicating that the measurement error stems from systematic differences in how each model grades tasks rather than from random noise.
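The two headline statistics in the list above, pairwise agreement and the fold difference in mean E1 exposure, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model names, the toy labels, and the helper functions `pairwise_agreement` and `e1_share` are hypothetical and are not the study's data or code.

```python
from itertools import combinations

# Hypothetical toy labels: one rubric label per task, per annotator model.
# Names and values are illustrative placeholders, not data from the study.
labels = {
    "model_a": ["E1", "E0", "E2", "E1", "E0", "E1"],
    "model_b": ["E1", "E0", "E0", "E1", "E2", "E0"],
    "model_c": ["E0", "E0", "E2", "E1", "E0", "E0"],
}


def pairwise_agreement(a, b):
    """Share of tasks on which two annotators assign the same label."""
    if len(a) != len(b):
        raise ValueError("annotators must label the same set of tasks")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def e1_share(seq):
    """Share of tasks an annotator labels E1 (a simple stand-in for mean E1 exposure)."""
    return sum(x == "E1" for x in seq) / len(seq)


# Agreement for every unordered pair of annotators.
agreement = {
    (m1, m2): pairwise_agreement(labels[m1], labels[m2])
    for m1, m2 in combinations(labels, 2)
}

# Fold difference between the highest and lowest mean E1 exposure.
shares = {model: e1_share(seq) for model, seq in labels.items()}
fold = max(shares.values()) / min(shares.values())

for pair, value in agreement.items():
    print(pair, round(value, 3))
print("E1 fold difference:", round(fold, 2))
```

Applying the same ratio to the reported bounds gives 0.51 / 0.14 ≈ 3.6, which is where the 3.6-fold figure comes from. Raw agreement does not correct for chance agreement; a chance-corrected statistic such as Cohen's kappa would be a natural alternative, though the summary does not say which the study used.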
Takeaway
So, if you are a professional currently panicking because an AI tool just predicted your entire department will be obsolete by next Tuesday, you can take a deep breath. It turns out that asking a chatbot to grade its own ability to conquer the human labor market is about as reliable as asking a toddler to self-regulate their candy intake. Instead of letting a single, highly biased algorithm dictate your career anxiety or corporate restructuring plans, try looking at actual adoption data and perhaps consulting multiple models before you start handing out pink slips. Until these supposedly super-intelligent algorithms can agree with each other more than 57% of the time, human resources might just need real humans after all.





