My Bet on Physical Intelligence
Feb 23, 2026

Some spaces you observe from a distance. This one I kept walking closer to. And the company that literally named itself after the space is what pulled me in.
I spent about three years working on psychometric frameworks and tools at D. E. Shaw. No, I wasn't building quant models (though behavioral science and investing are more connected than they look). I was reading people. Designing talent assessments. Trying to figure out how you measure the things about a person that actually predict whether they'll be good at a job, not just whether they interview well.
It is much harder than it sounds. Most of what makes someone genuinely competent at something physical (a surgeon's hand steadiness, a chef's knife intuition, a warehouse picker's spatial efficiency) doesn't show up on any test you can put in front of them at a desk. You either watch them do the thing, or you're guessing.
That background has made me a little allergic to benchmarks. Benchmarks tell you whether a system can recite facts or apply a cookie-cutter framework. They rarely tell you what it can actually do.
Which is why, when I started going deep on AI/ML entering the real world and bumped into Physical Intelligence's research (not as an engineer, but as someone who thinks a lot about how capability is learned and measured), I couldn't stop noticing the same patterns. Physical Intelligence, or PI for short, is an AI robotics company whose entire bet is that you can build a general-purpose physical agent: a robot that can fold laundry, pack groceries, make espresso, and eventually do most of the manual, physical work humans do.
They've published ten research milestones since late 2024, and each one is solving a specific bottleneck that stood between robots and the real world.
The engineering, some of which goes over my head, is genuinely remarkable. And the philosophy and research design underneath it, the implicit theory of how learning works and how intelligence should be evaluated, maps oh so uncannily onto what behavioral and social scientists figured out decades ago. PI is arriving at those conclusions through compute and hardware. Behavioral science arrived there through experiments on humans. They ended up in the same place.
That convergence is why I'm paying attention to this space. Here's how I see it.
They Built the Architecture of Social Cognitive Theory. In a Robot.
In 1977, Albert Bandura published what became the theoretical backbone of social cognitive theory. His core argument, laid out in his foundational work on self-efficacy and behavioral change, was that humans learn complex skills through a sequence: watch someone who knows how to do it, form an internal representation of that behavior, attempt it yourself, get corrective feedback in the moment, then practice autonomously until it becomes fluent. Demonstrate. Correct in context. Practice alone. That's the whole recipe.
The research behind this is fairly unambiguous. A 1990 experiment by Carroll and Bandura showed that the effect of watching an expert demonstrate a motor skill was entirely mediated by how accurately the learner formed an internal cognitive representation of that skill… not by how many times they watched, but by whether the watching produced a usable mental model. And a more recent 2022 systematic review of observational learning in motor skill acquisition confirmed that expert modeling consistently outperforms instruction alone across physical education contexts, precisely because it gives the learner a template to calibrate against.
PI's Recap system, the reinforcement learning architecture they published in November 2025 alongside the π*0.6 model, follows this structural sequence.
Stage one: human operators teleoperate the robot to establish baseline task competency.
Stage two: the robot attempts the task autonomously and, when it fails, a human expert takes over not to restart but to demonstrate recovery from that specific failure. In context, mid-task, just like Bandura's corrective feedback loop.
Stage three: the robot practices entirely alone, logging both its successes and its catastrophic failures, building up an experiential dataset it uses to condition its own behavior toward higher-probability outcomes.
I want to be careful here about what I'm claiming and what I'm not. Bandura's framework is not just about sequence; its centerpiece is self-efficacy: the learner's evolving belief in their own capability to execute the task, which drives how hard they try and how they respond to failure. A robot running Recap doesn't have a belief system. What it has is an "advantage function": an estimate that tells it, mathematically, how much a given action improves its chances of success from a given state.
That's not the same thing as self-efficacy. But I find myself asking: is it a precursor to it? Is the advantage function the earliest computational analog to what Bandura was describing, a system that has, in some form, an internal model of its own expected performance? I don't know the answer to that. I'm not sure anyone does yet. What I do know is that the structural sequence (i.e., demonstrate, correct in context, practice alone) is doing the same work in PI robots that Bandura documented in humans, and that's not nothing.
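The structural sequence is easier to see in code than in prose. Below is a toy sketch of my own, not PI's method: every number, action name, and update rule is invented. A two-action task stands in for a manipulation skill; teleoperated demonstrations seed the policy; autonomous practice logs outcomes; and an advantage-weighted update shifts probability mass toward actions with better-than-expected results.

```python
import math
import random

random.seed(0)

# Toy stand-in for a manipulation task: a "careful" strategy succeeds far
# more often than a "rushed" one. (Hypothetical numbers throughout.)
SUCCESS_P = {"careful": 0.8, "rushed": 0.3}

def attempt(action):
    """One autonomous attempt; returns 1 on success, 0 on failure."""
    return 1 if random.random() < SUCCESS_P[action] else 0

# Stage 1: teleoperated demonstrations seed the policy
# (experts mostly act carefully).
demos = ["careful"] * 8 + ["rushed"] * 2
policy = {a: demos.count(a) / len(demos) for a in SUCCESS_P}

# Stage 2 would splice expert recovery trajectories into the same outcome
# log after failures; omitted here to keep the sketch small.

# Stage 3: autonomous practice, logging successes and failures per action.
logs = {a: [] for a in SUCCESS_P}
for _ in range(500):
    action = random.choices(list(policy), weights=list(policy.values()))[0]
    logs[action].append(attempt(action))

# Advantage A(a) = Q(a) - V: how much better an action is than the
# policy's current expected outcome.
q = {a: sum(r) / len(r) if r else 0.0 for a, r in logs.items()}
v = sum(policy[a] * q[a] for a in policy)
advantage = {a: q[a] - v for a in policy}

# Advantage-weighted update: scale each probability by exp(advantage),
# then renormalize so the policy still sums to 1.
weights = {a: policy[a] * math.exp(2.0 * advantage[a]) for a in policy}
total = sum(weights.values())
policy = {a: w / total for a, w in weights.items()}
```

With these made-up numbers, the policy drifts further toward "careful" because its logged advantage stays positive. None of this matches PI's actual objective; it only illustrates the loop the stages describe: demonstrate, correct, practice, re-weight by expected outcome.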
The result was striking: throughput more than doubled on the hardest tasks, and the model successfully ran as an autonomous barista for 18 hours straight with a greater than 90% success rate. Probably because they gave the model the equivalent of an apprenticeship.
Behavioral science has known that demonstrate-correct-practice is the structure that produces durable physical competency since the late 1970s. PI got there in 2025, through 5 billion parameters and a robotics lab. I find that less surprising than validating.
The Robot Olympics Is a Work Sample Test. A Really Good One.
In December 2025, PI published something they called the Robot Olympics: a five-event challenge covering tasks like folding laundry, using tools, peeling an orange, and cleaning greasy surfaces. Bronze, Silver, and Gold difficulty tiers for each. The model was fine-tuned on under nine hours of data per task and evaluated against each event. 52% absolute success rate overall. 72% average task progress.
More interesting than the scores was the control condition. PI took a standard vision-language model, the kind that can pass a bar exam, score well on medical licensing tests, reason through complex logic problems, and fine-tuned it on the identical physical task data without the foundation robotic pre-training. That model achieved 9% task progress. It could describe folding a shirt. It could not fold one.
This is the exact finding that I-O psychologists documented in Schmidt and Hunter's seminal 1998 meta-analysis of 85 years of personnel selection research. After synthesizing 19 different methods for predicting job performance, they found that work sample tests (i.e., give someone the actual job task, watch them do it) had an operational validity of .54, among the highest single predictors in the entire literature. Structured interviews with proxy questions about what someone would do? Lower. GPA and academic credentials? Lower. The closer the evaluation was to the actual performance, the better it predicted.
A later meta-analysis by Roth, Bobko, and McFarland (and Sackett et al. more recently) refined those estimates, but the core finding held: behavioral demonstration of actual task performance is among the best available measures of real competence. The other methods aren't bad; they're just measuring something adjacent to the thing you actually care about.
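"Operational validity" here is essentially a correlation between a predictor and later job performance. A tiny illustration of why a work sample can dominate a proxy measure; the scores below are entirely invented for the example, not drawn from any of the studies above.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up scores for eight hires: a hands-on work sample, an interview
# proxy score, and later on-the-job performance ratings.
work_sample = [62, 71, 55, 80, 68, 90, 58, 77]
interview   = [70, 65, 72, 75, 60, 80, 74, 66]
performance = [60, 74, 52, 78, 70, 88, 61, 75]

r_sample = pearson_r(work_sample, performance)    # high: same construct
r_interview = pearson_r(interview, performance)   # low: adjacent construct
```

The spread between the two coefficients is the whole argument in miniature. Schmidt and Hunter's .54 is a meta-analytic estimate across decades of real studies, not a single dataset like this one.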
PI didn't call the Robot Olympics a work sample test, nor did they cite Schmidt and Hunter; I wouldn't expect them to. But that's exactly what it is: structured job simulations, evaluated on actual physical output rather than on the model's ability to describe or reason about the task abstractly. The 63-point gap in average task progress between the fully pre-trained model and the benchmark VLM (72% vs. 9%) is the clearest possible empirical demonstration of why proxy measures fail when the real thing is physical.
The question this raises, which I don't think anyone has answered yet, is what a properly standardized version of this looks like. I-O psychology spent decades developing frameworks for measuring human physical competency. Fleishman's taxonomy identified 52 distinct human abilities, including manual dexterity, gross body coordination, control precision, and static strength, through decades of factor-analytic research. The O*NET content model built on that work to create a systematic catalog of physical demands across every major occupation. Frey and Osborne used O*NET to evaluate which jobs were at risk of automation, treating physical manipulation and dexterity as the primary barriers robots hadn't cleared.
They were right about that barrier in 2013. PI is clearing it now. But we don't yet have the equivalent of a standardized ability taxonomy for evaluating robots against those human benchmarks. The Robot Olympics is a start.
The Hardest Thing to Measure Is the Thing That Matters Most
PI's research team made a deliberate choice to use zero classical psychometric evaluation tools in any of their published work. No cognitive benchmarks. No standardized reasoning tests. No IQ-equivalent scores. Their entire evaluation methodology is behavioral: did the robot complete the physical task? How often? How fast? They've been explicit about why: because physical intelligence is subconscious, and you can't measure subconscious skill with a written exam.
This isn't a controversial claim in their field. It maps directly to what's called Moravec's Paradox, or the observation, made independently by several robotics researchers in the 1980s, that the things computers find easy (chess, calculus, logical deduction) are what humans find hard, and vice versa. Peeling an orange, catching a ball, wiping a greasy pan. Evolutionarily ancient, computationally enormous.
What's interesting is that behavioral science arrived at the same insight through a completely different path, under a different name. Michael Polanyi called it tacit knowledge, and his central formulation, "we can know more than we can tell," is the philosophical foundation for a significant chunk of I-O psychology research on expertise. Wagner and Sternberg's 1985 research demonstrated empirically that tacit, procedurally acquired knowledge predicts real-world job success independently of IQ, meaning there's a whole dimension of competence that traditional cognitive measurement misses entirely.
The motor learning literature is even more direct. Masters's 1992 experiments showed that people who acquire motor skills implicitly, without being taught explicit rules, perform better under pressure than people who learn explicitly. Why? Because their skill is encoded in a way that doesn't compete with conscious thought. When you make it conscious, you degrade it. The very act of articulating what an expert does disrupts the performance that made them an expert. A 2018 systematic review on implicit vs. explicit motor learning confirmed this pattern across contexts: implicit learning produces automatization that explicit learning cannot reliably replicate.
This may sound obvious, but it is exactly why PI rejected text-based psychometric evaluation. A written test of a model's physical intelligence is not just a bad proxy. It measures something categorically different from what the model can actually do. The motor cortex doesn't run on language. And a 2021 narrative review of psychomotor ability measurement in medical sciences makes this point with some exasperation: even in medicine, where physical dexterity is literally life-or-death, there's no agreed-upon definition of psychomotor ability, and the available aptitude tests largely fail to predict the skills surgeons actually need. The measurement problem is unsolved in humans, let alone machines.
Where I think it gets genuinely interesting is in building the bridge between PI's behavioral throughput metrics and something more generalizable. PI measures success rate and task completion speed. That's correct given where the field is. But it's task-specific, which means every new environment, every new robot, every new set of tasks requires a new evaluation regime from scratch.
What the field probably needs is closer to what Hernández-Orallo and Dowe proposed in 2010: a universal intelligence framework that applies psychometric principles (things like adaptive difficulty, latent ability estimation, construct validity) to machine evaluation across arbitrary environments. Not benchmarks. Ability profiles. The same shift I-O psychology made when it moved from task-specific job tests toward Fleishman's ability taxonomy: from measuring what someone did on one task to mapping what latent capacities explain performance across tasks.
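One concrete version of "ability profiles" is item response theory's one-parameter (Rasch) model: a single latent ability θ is estimated from pass/fail outcomes on tasks of known difficulty. Here's a minimal sketch with invented difficulties, loosely in the spirit of Bronze/Silver/Gold tiers; none of this is PI's or Hernández-Orallo and Dowe's actual machinery.

```python
import math

def p_success(theta, b):
    """Rasch model: P(success) = logistic(theta - b), ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical evaluation record: (task difficulty in logits, pass/fail).
tasks = [(-1.0, 1), (-0.5, 1), (0.0, 1), (0.5, 0), (1.0, 1), (1.5, 0), (2.0, 0)]

def log_likelihood(theta):
    """Log-likelihood of the observed outcomes under ability theta."""
    ll = 0.0
    for b, passed in tasks:
        p = p_success(theta, b)
        ll += math.log(p) if passed else math.log(1.0 - p)
    return ll

# Maximum-likelihood ability estimate via grid search
# (perfectly adequate for a single parameter).
theta_hat = max((t / 100.0 for t in range(-400, 401)), key=log_likelihood)

# The estimate generalizes: predicted success on an unseen task of
# difficulty 0.25, which never appeared in the evaluation set.
pred = p_success(theta_hat, 0.25)
```

That's the shift in one line: instead of reporting "52% success on these five events," you report a latent ability that predicts success probability on any task whose difficulty has been calibrated, and the difficulties themselves can be calibrated from many systems' results, which is roughly how human ability taxonomies were built.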
PI's Robot Olympics is the closest thing that exists right now to that kind of evaluation for physical AI. The fact that they designed it intuitively, without explicitly citing psychometric theory, is either a sign of convergent validity (two fields reaching the same answer independently), or a missed opportunity for the two literatures to actually talk to each other.
Maybe it’s both?
Why I'm Paying Attention
I can't evaluate the technical architecture of a Vision-Language-Action model. I don't know enough about flow matching or gradient blocking or discrete cosine transforms to have an informed view on PI's engineering choices. But I want to be equally clear: given enough time with the papers and enough conversations with engineers and roboticists (the kind I'm actively seeking out), I think I could get there.
There's also a real tension worth naming. The Robot Olympics exposed something PI themselves were candid about: the intelligence layer is outrunning the hardware. The model knew how to fold a shirt. The gripper couldn't fit inside the sleeve. Peeling an orange required an unauthorized metal tool because two rigid fingers don't approximate what a human hand does. As a recent TechCrunch piece noted, PI co-founder Lachy Groom described hardware as the hardest part of what they do: "Everything we do is so much harder than a software company. Hardware breaks." That's not a reason to discount the bet. It's a reason to take seriously that general-purpose physical AI has at least two hard problems to solve, not one. And only one of them is what PI's research team wakes up thinking about.
What keeps me aligned with their approach is something harder to quantify than success rates. PI is running a distinctly research-first operation. They don't give investors commercialization timelines; they identify a research need and collect the data to meet it. There are competitors in this space taking the opposite approach: deploy commercially now, collect revenue, and let the real-world data flywheel improve the model over time. That might also work. But I have more intuitive confidence in teams that are willing to stay in the problem longer before they declare it solved. That's not a technical judgment. It's a judgment about how hard problems actually get cracked.
The bottleneck in physical AI is not, at this point, purely computational. The robots that work in the long run will be the ones that get the learning architecture right: the sequencing of exposure, correction, and autonomous practice that produces durable, generalizable physical competency. And the evaluation frameworks that emerge to measure those robots will need to grapple with the same measurement problems that I-O psychology has been wrestling with for decades: how do you quantify a capability that is, by its nature, subconscious?
The field that eventually builds that evaluation framework will need to speak both languages. The robotics side is moving fast. The psychometrics side has maybe a century of relevant theory that most robotics researchers have never read.
That gap is where I'm paying attention. I'm not sure who closes it, or when. But I think whoever does will look back on Physical Intelligence's Robot Olympics the same way learning scientists look back on Bandura's modeling experiments: as the moment when the right questions started getting asked, even if the right measurement tools didn't exist yet.
That's enough for me to keep watching.
