How Eddy Xu Open-Sourced the Future of Robotics
Eddy Xu manufactured hardware, deployed it across 85 factories, and released 10,000 hours of robotics training data for free. His dataset hit #1 on Hugging Face, beating trillion-dollar companies.
"You can just open-source things!" Clement Delangue, CEO of Hugging Face, posted this week, quote-tweeting an 18-year-old founder who had just done exactly that.
You can just open-source things!
If you think you don’t have enough experience, support or resources to release useful open-source, remind yourself that @eddybuild is 18 and released the currently number one trending dataset on Hugging Face, ahead of Nvidia and Meta… https://t.co/fvyobRSVJr
— clem 🤗 (@ClementDelangue) November 12, 2025
Eddy Xu, CEO of Build AI, didn't just release a dataset. He released Egocentric-10K, the largest egocentric video dataset in history: 10,000 hours of first-person footage from 2,153 factory workers performing real manufacturing jobs. One billion frames of humans doing skilled manual labor, all open-sourced under Apache 2.0, one of the most permissive licenses available.
Why This Actually Matters
Egocentric video (footage from a first-person perspective showing how skilled workers perform tasks) is emerging as the critical training data for embodied AI. If large language models learned from billions of words scraped from the internet, physical AI systems need to learn from billions of frames showing how humans interact with the physical world.
Embodied AI refers to systems integrated into physical agents that perceive, interact, and learn through real-world experience. For humanoid robots operating in human environments (factories, warehouses, homes), this matters because these machines need to connect perception, cognition, and action in ways that handle complex manipulation, navigation, and adaptation to unpredictable conditions.
Previous egocentric datasets existed, but they were limited by design. Ego4D, released by Meta and considered the gold standard, contains 3,670 hours from 923 participants across controlled scenarios: households, workplaces, leisure activities. EPIC-KITCHENS, focused on cooking tasks, has just 55 hours from 32 participants in home kitchens.
Egocentric-10K doesn't just exceed these datasets. It obliterates them. At 10,000 hours, it's nearly three times larger than Ego4D. With 1.08 billion frames versus Ego4D's roughly 400 million and EPIC-KITCHENS' 11.5 million, it roughly triples the former and leaps nearly two orders of magnitude past the latter.
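For concreteness, the gaps can be checked directly from the figures quoted above (Ego4D's frame count is approximate, as stated in the text):

```python
# Dataset figures as quoted in the article: (hours, frames).
datasets = {
    "Egocentric-10K": (10_000, 1_080_000_000),
    "Ego4D":          (3_670,    400_000_000),   # frames approximate
    "EPIC-KITCHENS":  (55,        11_500_000),
}

ego_hours, ego_frames = datasets["Egocentric-10K"]
for name, (hours, frames) in datasets.items():
    if name != "Egocentric-10K":
        print(f"vs {name}: {ego_hours / hours:.1f}x the hours, "
              f"{ego_frames / frames:.0f}x the frames")
```

Against Ego4D the jump is roughly 3x on both axes; against EPIC-KITCHENS it is closer to 100x on frames, which is where the order-of-magnitude claim clearly holds.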
But scale isn't the only difference. Previous datasets were collected in sanitized environments: kitchens, staged workplace scenarios, controlled activities. They're useful for research, but they're fundamentally limited because they don't capture the complexity of real operational environments.
Factory data is different. Factories have dynamic lighting conditions, diverse sensor noise, unexpected obstacles, rare edge cases, and the kind of variability that breaks robots trained only on clean lab data. The dataset shows two hands visible in 76% of frames and active manipulation in 92% of frames (densities that dwarf previous datasets and capture the kind of skilled manual work that robots need to learn).
This is the difference between training a language model on children's books versus the entire internet. One gives you basic competence. The other gives you the edge cases, nuance, and real-world messiness required for actual deployment.
The Hardware Play Nobody Expected
Here's what makes this genuinely impressive: Eddy didn't just collect existing data. He built the hardware to capture it.
Build AI manufactures head-mounted recording devices: monocular cameras with 128° horizontal field of view, recording at 1080p/30fps. They deployed these devices to 2,153 factory workers across 85 facilities and collected footage of skilled manual labor at scale. The complete dataset is 16.4 terabytes, structured in WebDataset format for efficient loading and training.
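The billion-frame figure follows directly from the recording specs: 10,000 hours at 30fps. A quick back-of-the-envelope check (the constant-frame-rate assumption is mine):

```python
HOURS = 10_000
FPS = 30
SECONDS_PER_HOUR = 3_600

# Total frames captured across the whole dataset.
frames = HOURS * SECONDS_PER_HOUR * FPS
print(f"{frames:,} frames")  # 1,080,000,000 -- the "1.08 billion" in the text

# Average compressed footprint per frame, given the stated 16.4 TB total.
total_bytes = 16.4e12
print(f"~{total_bytes / frames / 1024:.1f} KiB per frame on average")
```

That works out to roughly 15 KB of encoded video per frame, or about 3.6 Mbps of stream, a plausible bitrate for compressed 1080p footage, so the 16.4 TB total is internally consistent with the other specs.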
This isn't a scraping operation or a data labeling service. This is vertically integrated infrastructure for capturing how humans perform physical tasks: manufactured hardware, deployed globally, generating training data for the next generation of robotics models.
While AI labs debate whether to open-source their models, an 18-year-old built a company around the thesis that the path to general robotics requires open data at massive scale.
The Philosophy That Drives This
Eddy's personal site is refreshingly direct about what Build AI believes:
"The deployment of general robots at scale will reshape society. The best people should work on the most important problems. The only acceptable speed is as fast as physically possible. The magnitude of the opportunities ahead require[s] a delusional ambition. The success of a company is measured by its research contribution and net-impact."
Most founders optimize for valuation and exit opportunities. Eddy turned down over $25 million in equity to build Build AI. Others on his team left academia, top research labs, and their own companies to join.
The Apache 2.0 license choice is deliberate. It's among the most permissive open-source licenses: anyone can use this data for any purpose, including commercial applications, with essentially no obligation beyond attribution. No strings attached. No strategic moats. Just: here's the data, go build something that matters.
What This Says About the Giants
If an 18-year-old can manufacture hardware, deploy it in 85 factories, collect 10,000 hours of high-quality egocentric video, and release it as the number one trending dataset on Hugging Face, what exactly are companies with trillion-dollar valuations doing?
Meta released Ego4D with significant fanfare and academic partnerships. It remains an important research resource. But it took them years to collect 3,670 hours across controlled scenarios. Eddy collected nearly three times that amount in real production environments in under a year.
Nvidia has the capital to deploy this kind of data collection infrastructure at 10x the scale. Meta has the manufacturing capabilities and global reach to instrument entire supply chains. Both companies have been publicly committed to embodied AI and robotics for years.
Yet neither has released anything approaching this scale of real-world egocentric data. They're getting beaten on execution by a teenager who dropped out of Columbia.
The pattern is familiar. Large organizations move slowly because they can afford to. They have quarterly earnings calls and risk committees and legal reviews. They optimize for avoiding mistakes rather than maximizing velocity.
Eddy's previous track record suggests he understands something about speed that they don't. At 16, he raised $120,000 to build a competition robotics team that won the national championship. He won the DECA business world championship against 200,000 competitors. He built and sold an edtech startup that reached 178,000 users in three months.
Build AI's stated operating principle is "the only acceptable speed is as fast as physically possible."
The Scaling Laws for Physical AI
The AI community spent years debating whether scaling laws applied to language models. Turns out they did: more data, more compute, more parameters led to reliably better performance. GPT-4 exists because OpenAI believed in scaling before it was obvious.
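As an illustration of what "scaling laws" means in practice, language-model loss is often fit with the Chinchilla form L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. The coefficients below are the published language-model fits from Hoffmann et al. (2022); whether physical AI obeys the same functional form is precisely the open question:

```python
# Chinchilla-style scaling law: loss falls predictably as data grows.
# Coefficients are the published fits for language models; applying this
# form to robot training data is an assumption, not an established result.
def loss(n_params: float, n_tokens: float) -> float:
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Hold model size fixed and scale the dataset 100x:
print(loss(1e9, 1e10))  # smaller dataset
print(loss(1e9, 1e12))  # 100x more data: measurably lower predicted loss
```

The point of the sketch is the shape of the curve, not the numbers: if embodied AI follows any law of this form, more real-world frames translate directly into better models, which is the bet Egocentric-10K represents.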
We're at the beginning of discovering whether similar scaling laws apply to physical AI. If they do, datasets like Egocentric-10K become the foundation for the next generation of robotics models. Just as ImageNet catalyzed computer vision research, massive egocentric datasets could catalyze embodied AI.
The difference is that ImageNet was built by academics over years of careful curation. Egocentric-10K was built by an 18-year-old with manufactured hardware and a distribution network in under a year, then given away.
What Happens Next
Build AI's release is a test of whether open-source infrastructure can compete with closed corporate research in physical AI. If the dataset enables breakthrough robotics research (and early signs suggest it will), then the open approach wins. Academic researchers and startups get access to training data that would have required tens of millions in capital to collect independently.
The dataset hit number one on Hugging Face within hours, suggesting the research community was desperate for exactly this kind of resource. That desperation matters. It means the bottleneck for embodied AI research wasn't compute or algorithms. It was data. Real-world data at scale.
Eddy describes Build AI's work as "a research bet that has significant technical risk and low probability of success." Maybe. But betting against someone who manufactured hardware, deployed it globally, and open-sourced a billion frames of training data before his 19th birthday seems like a bad idea.
The era of data scaling in robotics is here. And apparently, you don't need to be a trillion-dollar company to lead it. You just need to move fast and give a shit about research contribution over equity value.
Turns out, you really can just open-source things.