how we accidentally solved robotics by watching 1 million hours of YouTube





the existential crisis we all share

imagine this: you’ve just spent $640 billion training the chonkiest language model known to humanity (lol) and decided to call it “Behemoth”. it can annoy you on whatsapp, try to solve calculus, and argue with you about anything with the sophistication of a philosophy PhD.

but ask it to grab a coffee mug from your kitchen counter? ngmi

turns out scaling LLMs forever still leaves robots clueless. internet-scale language misses the fundamental physics of stuff actually moving around in 3D space. and no amount of “think step by step” or CoT prompting teaches your chatterbox where the trash is in the kitchen

but what if i told you that the solution was hiding in plain sight? what if the secret sauce wasn’t more tokens, but more… videos?


the “why didn’t we think of this sooner” moment

here’s the thing everyone forgot while we were busy making ai agents book flight tickets: robots need to understand physics, not language.

so enter V-JEPA 2, which basically said “hey, what if we fed a neural network 1 million hours of youtube and taught it to predict what happens next?” except instead of predicting the next word, it predicts the next moment in reality.

this is “deploy a robot in a completely new lab and watch it successfully pick up objects it’s never seen before” level of real.


the beauty under the hood

the core insight: predict in representation space, not pixels

remember when everyone was obsessed with making AI generate pretty pictures? well, V-JEPA 2 said “screw noise” and decided to predict in latent space instead (i know this word is thrown around a lot but bear with me)

why? because trying to predict every pixel is like trying to predict every blade of grass in a field when what you really care about is whether the ball is going in the goal.

the magic happens in three parts:

  1. the encoder: a ViT-g with 1 billion parameters that looks at video and goes “ah yes, i understand the essence of this physical situation”

  2. the predictor: a smaller network that takes masked video tokens and tries to fill in the blanks, like a sophisticated game of video madlibs

  3. 3D-RoPE: because regular position embeddings are for 2D peasants

the masking strategy

instead of showing the model everything, V-JEPA 2 randomly masks out chunks of video (called “tubelets” – yes, that’s the technical term). the model then has to predict what’s happening in those missing pieces.
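
to make that concrete, here’s a minimal pytorch-flavoured sketch of masked latent prediction. the function names, mask ratio, EMA target and L1 loss are my own illustrative choices in the spirit of the JEPA recipe, not the actual V-JEPA 2 code:

    # minimal sketch of masked latent prediction (illustrative, not the real V-JEPA 2 code)
    import torch
    import torch.nn.functional as F

    def tubelet_mask(num_tokens, mask_ratio=0.75):
        # randomly pick which spatio-temporal tokens ("tubelets") to hide
        num_masked = int(num_tokens * mask_ratio)
        perm = torch.randperm(num_tokens)
        return perm[:num_masked], perm[num_masked:]   # masked idx, visible idx

    def jepa_loss(encoder, predictor, target_encoder, video_tokens):
        # predict the *representations* of masked tubelets, not their pixels
        masked_idx, visible_idx = tubelet_mask(video_tokens.shape[1])

        # context: encode only the visible tubelets
        context = encoder(video_tokens[:, visible_idx])

        # targets: representations of the full clip from a frozen / EMA copy
        with torch.no_grad():
            targets = target_encoder(video_tokens)[:, masked_idx]

        # the predictor fills in the blanks, in latent space
        preds = predictor(context, masked_idx)
        return F.l1_loss(preds, targets)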


data scaling: from “some videos” to “all the videos”

  • before: 2 million videos (cute)
  • after: 22 million videos + 1 million images (now we’re talking)

they basically hoovered up everything: something-something v2, kinetics, howto100m, and a giant youtube scrape (YT-Temporal-1B)

model scaling: bigger is better (sometimes)

they scaled from 300M to 1B parameters because apparently size does matter. the ViT-g encoder is basically the endgame of vision transformers.

progressive resolution training: the “boiling frog” approach

here’s the clever bit: instead of immediately training on massive high-res videos (which would require selling a kidney to afford the compute), they started small and gradually cranked up the resolution during training.

(curriculum learning bros keep on winning)

16 frames at 256² → 64 frames at 384²
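
a hedged sketch of what such a curriculum could look like in code. the intermediate stages below are invented; only the endpoints (16 frames at 256² → 64 frames at 384²) come from the paper:

    # progressive resolution curriculum (intermediate stages are assumptions,
    # only the endpoints match the paper)
    schedule = [
        # (fraction of training done, frames, resolution)
        (0.00, 16, 256),
        (0.50, 32, 288),   # assumed intermediate stage
        (0.80, 48, 320),   # assumed intermediate stage
        (0.95, 64, 384),
    ]

    def clip_config(progress):
        # pick the clip length / resolution for the current point in training
        frames, res = schedule[0][1], schedule[0][2]
        for start, f, r in schedule:
            if progress >= start:
                frames, res = f, r
        return frames, res

    print(clip_config(0.6))   # -> (32, 288)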

V-JEPA 2-AC: my favourite bit

having a world model that understands physics is cool, but robots need to understand actionable physics, like “if i move my arm this way, what happens to the world?” and the dynamics behind that action

so they took the pretrained V-JEPA 2, froze it solid, and attached a 300M parameter transformer that learns to predict what happens when you actually do stuff. (a model that can just do stuff, hell yeah)
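
conceptually, the frozen encoder gives you a state embedding and the new transformer learns “state + action → next state”. a toy pytorch sketch (the dimensions, depth and the single-action-token trick are my assumptions, not the paper’s architecture):

    import torch
    import torch.nn as nn

    class ActionConditionedPredictor(nn.Module):
        # tiny stand-in for the ~300M action-conditioned transformer: it takes the
        # frozen V-JEPA 2 tokens for the current observation plus an action
        # (e.g. a 7-DoF end-effector delta) and predicts the next-step tokens
        def __init__(self, embed_dim=1408, action_dim=7, depth=4):
            super().__init__()
            self.action_proj = nn.Linear(action_dim, embed_dim)
            layer = nn.TransformerEncoderLayer(
                embed_dim, nhead=16, dim_feedforward=4 * embed_dim, batch_first=True
            )
            self.blocks = nn.TransformerEncoder(layer, num_layers=depth)  # toy depth

        def forward(self, state_tokens, action):
            # append the action as one extra token and let attention mix it in
            a = self.action_proj(action).unsqueeze(1)
            x = torch.cat([state_tokens, a], dim=1)
            return self.blocks(x)[:, : state_tokens.shape[1]]   # predicted next state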

the training data? just 62 hours of robot videos. not “successful robot videos” or “carefully curated robot videos.” just raw footage of a franka arm doing franka arm things, successes and failures included. really interesting bit; there’s a lot of future work to be done experimenting with data curation and the success/failure ratio.

the magic of energy minimization

when it’s time to actually control a robot, V-JEPA 2-AC plays a game of “hot and cold”:

  1. look at current state
  2. look at goal state
  3. imagine a bunch of possible action sequences
  4. pick the one that gets you closest to the goal
  5. execute first action
  6. repeat until done (or until something breaks)

model predictive control on top of a world model is one of the coolest things this paper does
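
here’s a hedged sketch of that loop using cross-entropy-method-style sampling. the paper plans in this spirit, but the horizon, sample counts and helper names below are mine:

    import torch

    def plan_next_action(predictor, state_emb, goal_emb,
                         horizon=3, samples=256, iters=5, elite_frac=0.1, action_dim=7):
        # sample action sequences, roll the world model forward in latent space,
        # score each imagined rollout by distance to the goal embedding, refine,
        # and return only the first action (receding-horizon control)
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)

        for _ in range(iters):
            actions = mean + std * torch.randn(samples, horizon, action_dim)

            # imagine: roll the predictor forward for every candidate sequence
            s = state_emb.expand(samples, *state_emb.shape[1:]).clone()
            for t in range(horizon):
                s = predictor(s, actions[:, t])

            # energy: how far the imagined final state is from the goal state
            energy = (s - goal_emb).abs().mean(dim=(1, 2))

            # keep the lowest-energy elites and refit the sampling distribution
            elite_idx = energy.topk(int(samples * elite_frac), largest=False).indices
            elites = actions[elite_idx]
            mean, std = elites.mean(0), elites.std(0) + 1e-4

        return mean[0]   # execute this, observe, then replan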


zero-shot generalization (aka the money shot)

they took this model, trained entirely on one dataset, and deployed it on franka arms in completely different labs. different lighting, different objects, different everything.

success rates:

  • reach: 100% (because apparently moving to a point in space is trivial when you understand physics)
  • grasp cup: 65% (cups are apparently hard)
  • pick and place: 65-80% (depending on object complexity)

compare this to baseline approaches that basically failed at everything except the most basic reaching tasks.

the speed demon

planning with V-JEPA 2-AC: 16 seconds per action
planning with diffusion models: 4 minutes per action


for robotics folks: the obvious stuff

  • zero-shot generalization: works on novel objects out of the box
  • data efficiency: 62 hours of video vs thousands of hours of careful teleoperation
  • actually deployable: seconds vs minutes for planning

for llm hackers: the plot twist

here’s where it gets spicy. they aligned V-JEPA 2 with an 8B language model and got state-of-the-art results on video question answering.

84.0% on PerceptionTest. 76.9% on TempCompass.

this is a video encoder that was pretrained without any language supervision beating models that were trained on image-text pairs. isn’t this like so cool?? it also makes you wonder what other dynamics are baked into this world model, waiting for us to open up and explore.
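
mechanically, the alignment is the usual multimodal glue: project the video encoder’s tokens into the language model’s embedding space and feed them in alongside the question. a sketch (the projector shape and sizes are illustrative, not the paper’s exact setup):

    import torch.nn as nn

    class VideoToLLMProjector(nn.Module):
        # maps video-encoder tokens into the LLM's token-embedding space so the
        # clip can sit next to the text tokens of a question (sizes illustrative)
        def __init__(self, video_dim=1408, llm_dim=4096):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(video_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, video_tokens):      # (batch, num_tokens, video_dim)
            return self.mlp(video_tokens)     # (batch, num_tokens, llm_dim)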

the conventional wisdom of “you need language supervision to understand the world” just took an uppercut to the jaw.


limitations (aka the “not everything is sunshine and rainbows” section)

camera pose sensitivity

the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.

long-horizon drift

try to plan more than a few steps ahead and the prediction errors compound until the model starts hallucinating. that’s tough.

the language goal problem

right now, you need to show the robot pictures of what you want it to do. want it to “clean the kitchen”? better have a photo of a clean kitchen handy.

future work: teaching it to understand “make me a sandwich” without needing a powerpoint presentation. i’m working on this right now; if you’re interested in helping, hmu


wild speculations about the future

we might be looking at a future where world models rival pure-text models for real-world grounding. imagine a robot that understands physics as well as chatgpt understands language.


tl;dr by claude

property          v-jepa 2    diffusion    bc-policies
------------------------------------------------------
understanding      ✨          🤷           🤷
planning speed     🚀          🐌           🐌  
zero-shot magic    ✅          ❌           ❌
data efficiency    📈          📉           😐
can make coffee    probably    uhh         kinda

ps – here is a cool twitter link visualizing a PCA of V-JEPA features

if you are more interested make sure to check out the paper, the code, or just watch your roomba bump into the same chair leg for the 47th time and contemplate how far we’ve come.


