Yann LeCun on self-supervised learning and JEPA: why world models, not LLMs, will unlock the future of AI

Beyond LLMs: building world models with JEPA

Yann LeCun argues that today’s LLM-centric AI stack cannot reach human- or even animal-level intelligence, and lays out a roadmap centered on self-supervised learning, energy-based inference, and world models. He proposes JEPA—predicting in representation space rather than pixels—as the scalable alternative to generative likelihood modeling for natural signals. The strategic bet: hierarchical, action-aware world models plus optimization-based planning will enable robust reasoning, control, and safety—far beyond what autoregressive text predictors can deliver.

Key points

Harvard’s CMSA hosted Yann LeCun, Meta’s chief AI scientist and Turing Award laureate, for a lecture on self-supervised learning, JEPA, world models, and AI’s future.
LeCun critiques supervised learning, reinforcement learning, and autoregressive LLMs as insufficient for human-like intelligence, citing inefficiency, lack of planning, and hallucination risks.
A typical large language model (e.g., Llama 3) is trained on ~30 trillion tokens (~10^14 bytes), roughly the same volume of data a 4-year-old has taken in through vision alone (~2 MB/s via ~2 million optic nerve fibers across ~16,000 waking hours).
Autoregressive generation spends a fixed amount of compute per token and compounds small errors, so the probability that a whole answer is correct decays roughly exponentially with sequence length.
LeCun advocates energy-based inference by optimization: search for outputs that minimize an energy function measuring compatibility between inputs and proposed outputs.
JEPA (joint embedding predictive architecture) predicts in representation space instead of pixels, avoiding intractable high-dimensional likelihoods while focusing on predictable structure.
Distillation-style self-supervised vision (e.g., DINO from Meta FAIR) now matches or surpasses supervised baselines with less labeled data, improving transfer across domains.
JEPA-style representations power action-conditioned world models for planning in robotics, outperforming prior systems like DreamerV3 in tasks such as object rearrangement and mobile navigation.
Video JEPAs detect “common-sense” violations (e.g., disappearing objects) via prediction-error spikes, hinting at unsupervised intuitive physics.
Strategic recommendations: favor JEPA over generative pixel prediction for natural data, use regularized (not purely contrastive) training, minimize reinforcement learning, and pursue hierarchical world models for long-horizon planning with guardrails encoded as objectives.
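
The two numeric claims above (the text-versus-vision data comparison and the exponential decay of correctness) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the ~3 bytes per token figure, the 1% per-token error rate, and the assumption that token errors are independent are simplifications introduced here, not values taken from this summary.

```python
# Back-of-envelope check of the data-volume and error-compounding claims.

# Text side: ~30 trillion tokens. Bytes per token is an ASSUMED average.
TOKENS = 30e12
BYTES_PER_TOKEN = 3
text_bytes = TOKENS * BYTES_PER_TOKEN

# Vision side: ~2 MB/s over ~16,000 waking hours.
VISUAL_BYTES_PER_SECOND = 2e6
WAKING_HOURS = 16_000
visual_bytes = VISUAL_BYTES_PER_SECOND * WAKING_HOURS * 3600


def p_all_correct(per_token_error: float, n_tokens: int) -> float:
    """Probability that n tokens are all correct, ASSUMING independent errors."""
    return (1.0 - per_token_error) ** n_tokens


if __name__ == "__main__":
    print(f"text:   {text_bytes:.2e} bytes")
    print(f"vision: {visual_bytes:.2e} bytes")
    # Illustrative 1% per-token error rate (hypothetical value).
    for n in (10, 100, 1000):
        print(f"{n:>5} tokens -> P(all correct) = {p_all_correct(0.01, n):.3g}")
```

Both totals land near 10^14 bytes, and under the independence assumption the chance of a fully correct sequence shrinks geometrically as the sequence grows. Real token errors are correlated, so this should be read as intuition rather than a measurement.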

Takeaway

If you want AI that can load a dishwasher without a 10,000-episode training montage, stop asking LLMs to foresee the future one token at a time. Teach machines to build world models, plan by optimizing objectives, and remember the guardrails—ideally the kind that prevent them from bulldozing a human for a cappuccino. Start with self-supervised JEPA, skip the pixel-by-pixel crystal ball, and use RL sparingly—like hot sauce: great in drops, disastrous by the bottle.
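
As a rough illustration of "plan by optimizing objectives" with a guardrail expressed as one of those objectives, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the world model is a hand-written point-mass rule (`next_state = state + action`) standing in for a learned one, and the names `GOAL`, `KEEP_OUT`, `GUARD_WEIGHT`, `rollout`, `cost`, and `plan` are hypothetical. It is not the method presented in the lecture.

```python
import math
import random

GOAL = (4.0, 0.0)       # hypothetical target position
KEEP_OUT = (2.0, 0.3)   # hypothetical centre of a region the plan should avoid
RADIUS = 1.0            # keep-out radius
HORIZON = 8             # number of planned actions
GUARD_WEIGHT = 10.0     # how strongly guardrail violations are penalised


def rollout(actions):
    """Toy stand-in for a learned world model: next_state = state + action."""
    x, y = 0.0, 0.0
    states = []
    for i in range(0, len(actions), 2):
        x, y = x + actions[i], y + actions[i + 1]
        states.append((x, y))
    return states


def cost(actions):
    """Task cost + effort cost + guardrail penalty for one action sequence."""
    states = rollout(actions)
    fx, fy = states[-1]
    task = (fx - GOAL[0]) ** 2 + (fy - GOAL[1]) ** 2
    effort = 0.1 * sum(a * a for a in actions)
    guard = 0.0
    for sx, sy in states:
        dist = math.hypot(sx - KEEP_OUT[0], sy - KEEP_OUT[1])
        guard += max(0.0, RADIUS - dist) ** 2  # zero outside the region
    return task + effort + GUARD_WEIGHT * guard


def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient; slow but dependency-free."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def plan(steps=1500, lr=0.002, seed=0):
    """Plan by gradient descent on the action sequence itself."""
    rng = random.Random(seed)
    actions = [rng.gauss(0.0, 0.1) for _ in range(2 * HORIZON)]
    for _ in range(steps):
        grad = numerical_grad(cost, actions)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions


if __name__ == "__main__":
    best = plan()
    for sx, sy in rollout(best):
        print(f"({sx:.2f}, {sy:.2f})")
    print("final cost:", round(cost(best), 3))
```

The point of the sketch is that no policy is trained: the action sequence is found at inference time by descending a cost computed through the model. The guardrail is a soft penalty here, so it discourages entering the keep-out region but does not strictly forbid it.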

Sources