Yann LeCun on self-supervised learning and JEPA: why world models, not LLMs, will unlock the future of AI

Beyond LLMs: building world models with JEPA

Yann LeCun argues that today’s LLM-centric AI stack cannot reach human- or even animal-level intelligence, and lays out a roadmap centered on self-supervised learning, energy-based inference, and world models. He proposes JEPA—predicting in representation space rather than pixels—as the scalable alternative to generative likelihood modeling for natural signals. The strategic bet: hierarchical, action-aware world models plus optimization-based planning will enable robust reasoning, control, and safety—far beyond what autoregressive text predictors can deliver.

Key points

  • Harvard’s CMSA hosted Yann LeCun, Meta’s chief AI scientist and Turing Award laureate, for a lecture on self-supervised learning, JEPA, world models, and AI’s future.
  • LeCun critiques supervised learning, reinforcement learning, and autoregressive LLMs as insufficient for human-like intelligence, citing inefficiency, lack of planning, and hallucination risks.
  • A typical large language model (e.g., Llama 3) trains on ~30 trillion tokens (~10^14 bytes of text). That is roughly the volume of visual data a 4-year-old has already taken in: about 2 MB/s over ~2 million optic nerve fibers for ~16,000 waking hours, again on the order of 10^14 bytes.
  • Autoregressive generation allocates fixed compute per token and compounds small errors, making correctness probability decay roughly exponentially with sequence length.
  • LeCun advocates energy-based inference by optimization: search for outputs that minimize an energy function measuring compatibility between inputs and proposed outputs.
  • JEPA (joint embedding predictive architecture) predicts in representation space instead of pixels, avoiding intractable high-dimensional likelihoods while focusing on predictable structure.
  • Distillation-style self-supervised vision (e.g., DINO from Meta FAIR) now matches or surpasses supervised baselines with less labeled data, improving transfer across domains.
  • JEPA-style representations power action-conditioned world models for planning in robotics, outperforming prior systems like DreamerV3 in tasks such as object rearrangement and mobile navigation.
  • Video JEPAs detect “common-sense” violations (e.g., disappearing objects) via prediction-error spikes, hinting at unsupervised intuitive physics.
  • Strategic recommendations: favor JEPA over generative pixel prediction for natural data, use regularized (not purely contrastive) training, minimize reinforcement learning, and pursue hierarchical world models for long-horizon planning with guardrails encoded as objectives.
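
The two quantitative bullets above (the data-volume comparison and the decay of correctness with sequence length) can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope illustration: the figure of about 3 bytes per token and the fixed, independent per-token error rate are assumptions introduced here, and the function names are hypothetical.

```python
# Back-of-envelope checks for two of the key points above.
# Assumptions (not stated in the summary): ~3 bytes per token on average,
# and a fixed, independent error probability for every generated token.

BYTES_PER_TOKEN = 3.0  # assumed average


def llm_training_bytes(tokens: float, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Approximate size of a text training corpus in bytes."""
    return tokens * bytes_per_token


def child_visual_bytes(waking_hours: float, bytes_per_second: float) -> float:
    """Approximate visual data received over the given waking hours."""
    return waking_hours * 3600 * bytes_per_second


def p_sequence_correct(per_token_error: float, length: int) -> float:
    """Probability that all `length` tokens are right: (1 - e) ** n."""
    return (1.0 - per_token_error) ** length


llm = llm_training_bytes(30e12)          # ~30 trillion tokens
child = child_visual_bytes(16_000, 2e6)  # ~16,000 hours at ~2 MB/s
print(f"LLM text corpus : {llm:.2e} bytes")
print(f"Child's vision  : {child:.2e} bytes")

# With a 1% per-token error rate, correctness decays exponentially in length.
for n in (10, 100, 1000):
    print(f"n={n:5d}  P(all correct) = {p_sequence_correct(0.01, n):.6f}")
```

Both totals come out on the order of 10^14 bytes, and under the independence assumption even a 1% per-token error rate leaves little chance that a 1,000-token answer is entirely correct.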

Takeaway

If you want AI that can load a dishwasher without a 10,000-episode training montage, stop asking LLMs to foresee the future one token at a time. Teach machines to build world models, plan by optimizing objectives, and remember the guardrails—ideally the kind that prevent them from bulldozing a human for a cappuccino. Start with self-supervised JEPA, skip the pixel-by-pixel crystal ball, and use RL sparingly—like hot sauce: great in drops, disastrous by the bottle.
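
To make "plan by optimizing objectives" with guardrails concrete, here is a minimal sketch, not the method from the lecture. It assumes a toy one-dimensional world model (next state = state + action), an energy made of a task term, an effort term, and a guardrail penalty, and plain finite-difference gradient descent standing in for whatever optimizer a real system would use. All names and constants are hypothetical.

```python
from typing import List


def rollout(start: float, actions: List[float]) -> List[float]:
    """Toy world model (assumed): the next state is the state plus the action."""
    states = [start]
    for a in actions:
        states.append(states[-1] + a)
    return states


def energy(actions: List[float], start: float, goal: float, ceiling: float,
           guard_weight: float, effort_weight: float = 0.01) -> float:
    """Lower is better: reach the goal, use little effort, stay under the ceiling."""
    states = rollout(start, actions)
    task = (states[-1] - goal) ** 2
    effort = effort_weight * sum(a * a for a in actions)
    # Guardrail expressed as an objective: penalize any state above the ceiling.
    guard = guard_weight * sum(max(0.0, s - ceiling) ** 2 for s in states)
    return task + effort + guard


def plan(start: float, goal: float, ceiling: float, guard_weight: float,
         horizon: int = 8, steps: int = 2000, lr: float = 0.005,
         eps: float = 1e-5) -> List[float]:
    """Search for an action sequence that minimizes the energy."""
    actions = [0.0] * horizon
    for _ in range(steps):
        grad = []
        for i in range(horizon):
            up, down = actions[:], actions[:]
            up[i] += eps
            down[i] -= eps
            # Central finite difference approximates dE/da_i.
            e_up = energy(up, start, goal, ceiling, guard_weight)
            e_down = energy(down, start, goal, ceiling, guard_weight)
            grad.append((e_up - e_down) / (2 * eps))
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions


guarded = plan(start=0.0, goal=5.0, ceiling=4.0, guard_weight=10.0)
unguarded = plan(start=0.0, goal=5.0, ceiling=4.0, guard_weight=0.0)
print("final position with guardrail   :", round(rollout(0.0, guarded)[-1], 2))
print("final position without guardrail:", round(rollout(0.0, unguarded)[-1], 2))
```

Without the penalty the planner drives the state to the goal; with it, the plan settles just above the ceiling, trading task progress against the guardrail. A soft penalty like this one only discourages violations, so a hard constraint would need a different formulation.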
