Google quantifies AI’s serving footprint and cuts it 44x in a year


A comprehensive standard for sustainable AI at scale

Google unveils a full-stack, production-grade method to measure the energy, carbon, and water impacts of AI inference, anchored in first-party telemetry from Gemini Apps. The median text prompt consumes 0.24 Wh, emits 0.03 gCO2e (market-based), and uses 0.26 mL of water—figures far below many public estimates. A year of software, hardware, and operations optimizations drove a 33x energy reduction and a 44x emissions cut per median prompt.

Key points

  • Google proposes a comprehensive serving boundary that includes active AI accelerators, host CPU & DRAM, provisioned idle capacity, and data center overhead (PUE), while excluding external networking, end-user devices, and training.
  • In May 2025, the median Gemini Apps text prompt used 0.24 Wh: 0.14 Wh from accelerators (58%), 0.06 Wh from CPU & DRAM (25%), 0.02 Wh from idle machines (10%), and 0.02 Wh from overhead (8%).
  • A narrower “existing” approach measured 0.10 Wh/prompt; the comprehensive framework is 2.4x higher and implies a 1.72x scaling factor from accelerator-only energy to end-to-end serving.
  • The median prompt generated 0.03 gCO2e (market-based) and consumed 0.26 mL of water, using fleet averages including PUE 1.09 and WUE Category 2 of 1.15 L/kWh.
  • Google’s market-based grid emissions factor improved from 135 gCO2e/kWh (2023) to 94 gCO2e/kWh (2024), reflecting clean energy procurement and workload siting.
  • From May 2024 to May 2025, per-prompt energy fell 33x and total emissions fell 44x for the median Gemini Apps text prompt.
  • Scope 1+3 embodied emissions per prompt decreased 36x due to fewer machine-hours per prompt; market-based emissions intensity fell 1.4x.
  • Efficiency drivers include MoE and hybrid reasoning in Gemini, speculative decoding, quantization (AQT), distillation (Gemini Flash and Flash‑Lite), optimized batching and idling, and co-designed TPUs (Ironwood ~30x more energy-efficient than the first TPU).
  • The 0.24 Wh median prompt uses less energy than watching nine seconds of TV, and its 0.26 mL of water amounts to about five drops. Both figures undercut numerous public estimates by one to two orders of magnitude.
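The per-prompt figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, the inputs are the rounded values quoted in this summary, and the water formula (WUE applied to energy excluding overhead) is an assumption rather than a confirmed description of Google's method. Because the inputs are rounded, the results can differ slightly from the published ones; for example, 0.24 / 0.14 gives about 1.71 rather than the reported 1.72.

```python
# Rounded figures quoted in the key points (May 2025, median Gemini Apps text prompt).
ENERGY_WH = {
    "accelerators": 0.14,
    "cpu_dram": 0.06,
    "idle_machines": 0.02,
    "overhead": 0.02,
}
NARROW_WH = 0.10             # narrower "existing" measurement approach
GRID_G_PER_KWH_2023 = 135.0  # market-based grid emissions factor, 2023
GRID_G_PER_KWH_2024 = 94.0   # market-based grid emissions factor, 2024
WUE_L_PER_KWH = 1.15         # fleet-average WUE (Category 2)


def total_energy_wh(parts):
    """Sum the per-component energy into the comprehensive per-prompt total."""
    return sum(parts.values())


def shares(parts):
    """Return each component's fraction of the total energy."""
    total = total_energy_wh(parts)
    return {name: wh / total for name, wh in parts.items()}


def operational_emissions_g(energy_wh, grid_g_per_kwh):
    """Electricity-related emissions only: Wh -> kWh, times the grid factor."""
    return energy_wh / 1000.0 * grid_g_per_kwh


def water_ml(energy_wh, overhead_wh, wue_l_per_kwh):
    """ASSUMPTION: WUE is applied to energy excluding data center overhead."""
    litres = (energy_wh - overhead_wh) / 1000.0 * wue_l_per_kwh
    return litres * 1000.0


if __name__ == "__main__":
    total = total_energy_wh(ENERGY_WH)
    print(f"total energy: {total:.2f} Wh")
    for name, share in shares(ENERGY_WH).items():
        print(f"  {name}: {share:.0%}")
    print(f"comprehensive vs narrow: {total / NARROW_WH:.1f}x")
    print(f"accelerator-to-total scaling: {total / ENERGY_WH['accelerators']:.2f}x")
    print(f"grid factor improvement: {GRID_G_PER_KWH_2023 / GRID_G_PER_KWH_2024:.1f}x")
    print(f"operational emissions: {operational_emissions_g(total, GRID_G_PER_KWH_2024):.3f} gCO2e")
    print(f"water: {water_ml(total, ENERGY_WH['overhead'], WUE_L_PER_KWH):.2f} mL")
```

With these inputs, the electricity-related emissions come to roughly 0.023 gCO2e and the water estimate to roughly 0.25 mL. The gap to the reported 0.03 gCO2e is plausibly the embodied (Scope 1+3) share mentioned above, which this sketch does not model.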

Takeaway

Want a lower-footprint chat with your AI? Favor providers that publish comprehensive metrics (yes, the boring boundary details too). Use efficiency modes (Flash, lightweight models), keep prompts and outputs sensible, and don’t obsess over every token—batching and smarter serving do the heavy lifting. And if someone claims their single prompt drinks a swimming pool, smile politely and ask for their PUE, WUE, and market-based emissions factor—nothing quenches hype like numbers.
