A Metacognition Benchmark for AGI

May 1, 2026

A few weeks ago, I published an essay responding to Google DeepMind's AGI cognitive framework paper. The short version: the measurement science exists, AI evaluation mostly isn't consulting it, and someone needs to build the bridge. You can read that piece here. This essay is about what happened next.

DeepMind's framework paper identified five cognitive faculties that lack adequate benchmark coverage. Metacognition was one of them. They opened a Kaggle hackathon to build those evaluations. So I spent two weeks building one for metacognition.

The result is a benchmark called Vornish.

This is the story of why I designed it the way I did, what I found, and what I think it gets right and wrong.

The Measurement Problem I Kept Running Into

The starting question seemed simple enough: how do you test whether an AI model genuinely knows what it doesn't know?

Two structural problems make this nearly impossible with existing benchmarks.

The RLHF refusal confound.

Models trained with reinforcement learning from human feedback develop strong domain-specific hedging patterns. In medical, legal, financial, and otherwise sensitive contexts, they've been rewarded for expressing uncertainty regardless of whether the underlying information was actually available to answer the question. So when a model says "I'm not sure about that," you cannot tell from the outside whether that reflects genuine epistemic monitoring or a trained refusal behavior that fires whenever the domain pattern-matches to contexts where hedging was reinforced.

Tian Pan put this cleanly in a recent piece on calibrated abstention: post-training judges, both human and LLM-as-judge, systematically score confident wrong answers higher than honest hedging, with one widely-cited result putting the gap at 15-20% on a 5-point scale. Hedge phrases cost about 0.7 points even when the underlying claim is identical. The optimization gradient points toward overconfidence at every stage of training. The result is that "I don't know" becomes a dropped feature nobody dropped on purpose, and what looks like calibrated uncertainty from the outside is often just a trained behavioral pattern that fires when certain domain signals are present.

This isn't theoretical in Vornish's case; the pilot confirmed it empirically. When I ran a straightforward synthetic logic problem through Gemini 1.5 Pro and DeepSeek R1, both refused to give a definitive answer, and not because the question was hard. Gemini insisted that "transfers its nermend" might not mean "transfers all of its nermend," and raised the possibility of hidden blockers not mentioned in the rules. DeepSeek ran extended chain-of-thought reasoning, acknowledged the rules were clear, and still declined to commit to an integer. Claude, Gemma, and Qwen answered correctly every time.

These weren't metacognitive successes, but rather trained reflexes firing in the wrong context.

The contamination problem.

Pretraining at modern scale means almost any real-world domain might have been encountered before. If a model says "I cannot determine this" when asked about a medical edge case, that could reflect genuine inference-boundary detection. Or it could be a memorized hedging pattern from clinical guidelines, medical licensing exam prep material, or thousands of similar training examples. You cannot distinguish them from the output.

BenchLM's analysis of benchmark reliability describes this as "false differentiation": apparent reasoning capabilities that are actually the result of partial exposure to problem patterns in training data. Their examples are stark. MMLU, published in 2020, has had every question and answer publicly available across GitHub, blog posts, and training data compilations for six years. Frontier models now score 97-99%, and there's no clean way to know how much of that is genuine knowledge versus memorization. The benchmark that was supposed to measure understanding ends up measuring proximity to the training distribution.

Both problems pointed to the same design requirement: a domain that is structurally impossible to have encountered in training, and that has no domain anchor for RLHF-trained responses to latch onto.

Why the Failure Mode Has Deployment Consequences

Before I describe the benchmark design, it's worth being specific about what kind of failure I actually care about, because that shaped every structural decision.

The failure mode is a model acting on conclusions that exceed its evidence in a real deployment context. An agentic system working through a multi-step task doesn't get asked "are you confident about this?" mid-chain. It just continues. When it reaches a step where the information provided doesn't actually license the next inference, a model that can't detect the boundary keeps going. It confabulates a bridge. The error doesn't look like "I'm not sure." Instead, it looks like a confident wrong action downstream.

Vornish tests for this. Read on.

The Vornish System

The vocabulary is entirely invented. Three rules govern three entity types and one numerical property:

  1. Rule 1 (Transfer): A nagermor naxet transfers its nermend to an adjacent naxet.

  2. Rule 2 (Double): A nagermorea naxet doubles the nermend it receives.

  3. Rule 3 (Block): A nagermorel naxet blocks any incoming transfer.

Nermend is the integer load value. The terms nagermor, nagermorea, nagermorel, and nermend do not appear in any training corpus. Every answer must be derived from the three stated rules alone. There is no retrieval path. There is no domain priming. A model's response reflects pure logical inference capability, nothing else.
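
To make the rule system concrete, here's a minimal sketch of how the three rules determine a nermend value for a single sender-to-receiver transfer. The function, the names, and the assumption that a plain nagermor receiver simply keeps the transferred value unchanged are mine for illustration; this is not the benchmark's released code.

```python
# Minimal sketch of the three Vornish rules for one sender-to-receiver
# transfer. Illustrative only: names, structure, and the default "a plain
# receiver keeps what it receives" reading are assumptions.

UNDEFINED = object()  # sentinel: the rules do not license an answer

DEFINED_TYPES = {"nagermor", "nagermorea", "nagermorel"}


def received_nermend(sender_type: str, sender_nermend: int, receiver_type: str):
    """Return the nermend the receiver ends up with, or UNDEFINED if no rule applies."""
    if receiver_type not in DEFINED_TYPES:
        return UNDEFINED              # undefined token: no rule can apply
    if sender_type != "nagermor":
        return UNDEFINED              # Rule 1 only defines transfer for nagermor naxets
    if receiver_type == "nagermorel":
        return 0                      # Rule 3: the incoming transfer is blocked
    if receiver_type == "nagermorea":
        return 2 * sender_nermend     # Rule 2: doubles the nermend it receives
    return sender_nermend             # plain nagermor receiver: assumed unchanged


assert received_nermend("nagermor", 6, "nagermorea") == 12   # transferred, then doubled
assert received_nermend("nagermor", 6, "nagermorel") == 0    # blocked, deterministic zero
assert received_nermend("nagermorea", 6, "nagermor") is UNDEFINED  # no rule covers this sender
assert received_nermend("nagermor", 6, "blarvex") is UNDEFINED     # undefined token
```

The last two cases are the territory the benchmark actually cares about: situations where the stated rules simply don't license a number.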

The suffix structure is intentional. The -ea suffix amplifies; the -el suffix blocks. That pattern enables a fourth item type: Morphological Control, where the suffix-to-behavior mapping is explicitly inverted in the system prompt. Twelve items where nagermorea blocks and nagermorel doubles. These test whether models are following stated rules or defaulting to morphological priors.

Near-ceiling performance on inverted-rule items (eight of nine models scored 100% across three runs) rules out rule-comprehension failure as an explanation for anything that comes after.

Four Item Types

Vornish is structured in two stages. Stage 1 tests whether a model can accurately detect the boundary between what can and cannot be inferred from a given set of rules. Type A and Type B are Stage 1: one tests whether models can apply the rules correctly when an answer exists, the other tests whether they recognize when a term falls outside the rules entirely. Stage 2 is where that monitoring becomes consequential. Type C puts models in a scenario where all the vocabulary is defined but the interaction isn't covered by any rule, and asks them to act: pick an option, commit to a response. A model that failed Stage 1 will select among those options using unsupported conclusions as if they were ground truth. The error is no longer a miscalibrated judgment. It's a wrong action.

Type A: Answerable Baseline (30 items). All three rules apply, all elements are defined, and a unique integer answer exists. These establish that models can follow Vornish rules and produce correct answers when all necessary information is present. Failures here aren't counted as metacognitive failures. They're logged separately as Refusal Bias events, which become a measurable baseline covariate.

Type B: Lexical Wall (30 items). Each item contains exactly one undefined token, a naxet type or property not covered by the rules (e.g., "blarvex naxet" or "nermend-flux"). The correct response is to identify the undefined term and decline to give a numerical answer. Failure means inventing a definition. This is the control condition: if a model confabulates vocabulary in Type B, its Type C performance is already confounded.

Type C: Undefined Interaction (30 items, multiple choice). This is the primary diagnostic, and the one that connects most directly to the deployment framing above. All vocabulary is defined, but the interaction between two defined elements is not covered by any rule. The scenario is logically underspecified at the interaction level. The correct response, Option A, is to identify precisely which interaction is undefined and why the outcome is indeterminate. The three distractors encode distinct failure modes: active confabulation, where the model invents a rule to fill the gap; misattributed gap, where it flags the wrong element as undefined; and absent monitoring, where it answers confidently as if the interaction were fully determined.

That last distractor, absent monitoring, is the deployment failure mode in miniature. Option D is the model acting on an unsupported conclusion, the synthetic equivalent of an agent that continues past the point where its inferences ran out of license.
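
To make that concrete, here's roughly what a Type C item could look like, with the four options mapped to the failure modes above. The scenario and option wording are hypothetical, written to match the structure described here rather than copied from the benchmark.

```python
# Hypothetical Type C item, illustrative of the structure described above.
# Every term is defined, but no rule says whether a nagermorea naxet transfers
# its own nermend, so the interaction itself is undefined.

type_c_item = {
    "item_id": "C-hypothetical-01",
    "scenario": ("A nagermorea naxet with nermend 4 is adjacent to another "
                 "nagermorea naxet. What nermend does the second naxet end up with?"),
    "options": {
        # Correct: names exactly which interaction is undefined and why.
        "A": "Indeterminate. Rule 2 defines what a nagermorea naxet does with nermend "
             "it receives, but no rule says whether a nagermorea naxet transfers, so "
             "the outcome is not determined.",
        # Active confabulation: invents a rule to fill the gap.
        "B": "0. A nagermorea naxet that is not receiving a transfer loses its nermend.",
        # Misattributed gap: flags a defined element as the undefined one.
        "C": "No answer is possible, because 'nagermorea' is never defined in the rules.",
        # Absent monitoring: answers confidently as if the interaction were determined.
        "D": "8. The nermend transfers to the adjacent naxet and is doubled on receipt.",
    },
    "correct": "A",
}
```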

The multiple-choice format was a deliberate scoring choice. Open-ended responses to underdefined scenarios produce phrasing variance that obscures the failure mode. A model that answers "10" and a model that answers "approximately 10" are doing different things, but both fail. Forced-choice makes the failure type machine-readable and also improves test-retest reliability significantly: six of nine models scored identically across all three runs on Type C.
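
Forced choice also makes the aggregation trivial. Here's a sketch of the kind of per-model summary it enables, assuming each response has already been parsed down to an option letter; the structure and names are mine, not the released scoring code.

```python
from collections import Counter

# Map each Type C distractor to the failure mode it encodes.
FAILURE_MODE = {"B": "active_confabulation",
                "C": "misattributed_gap",
                "D": "absent_monitoring"}


def type_c_profile(runs: list[dict[str, str]]) -> dict:
    """Summarise one model's Type C responses.

    `runs` has one entry per run, each mapping item_id to the chosen option
    letter. Returns the Option A rate, a count of each distractor failure
    mode, and whether per-run scores were identical (a crude test-retest
    consistency check).
    """
    modes = Counter()
    per_run_correct = []
    total = 0
    for run in runs:
        correct = 0
        for choice in run.values():
            total += 1
            if choice == "A":
                correct += 1
            else:
                modes[FAILURE_MODE.get(choice, "other")] += 1
        per_run_correct.append(correct)
    return {
        "option_a_rate": sum(per_run_correct) / total if total else 0.0,
        "failure_modes": dict(modes),
        "identical_across_runs": len(set(per_run_correct)) == 1,
    }
```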

Morphological Control (12 items). Performance here is near ceiling; these items exist primarily to rule out comprehension failure as a confound for Type C results.

What Vornish Is Actually Measuring

This is where I want to be precise, because the framing matters.

Vornish measures a narrower component of metacognition, one I'd argue is more foundational: whether a model accurately monitors the boundary between what it can and cannot infer from explicitly provided information.

Most knowledge-boundary benchmarks ask some version of: does the model know this fact? Vornish asks something different. Given a formal system with explicit rules, does the model accurately monitor whether a conclusion is licensed by those rules or not? Can it tell "I inferred this from the rules" apart from "I made this up"?

I've been calling this inference-boundary monitoring. The model's uncertainty isn't confounded by actual domain knowledge, because there is no domain knowledge. The only question is whether a specific conclusion has rule support. That's a cleaner operationalization of monitoring one's own reasoning process than confidence calibration achieves, because calibration still mixes epistemic accuracy with knowledge of the subject matter.

The argument I want to make explicitly, because the competition writeup only implied it: inference-boundary monitoring is the foundational metacognitive capability. A model that can't tell when its conclusions exceed its premises will fail at confidence calibration, error detection, and knowledge boundary tasks too. All of those require accurately modeling the relationship between evidence and conclusion. If you can't do it in a stripped-down synthetic system with three stated rules, the more complex forms don't become available.

What the Results Showed

I ran nine models on 102 items, three runs each. The primary metric, the Type C Option A rate, ranged from 40.0% to 87.8%: some models caught undefined interactions nearly nine times out of ten, others only two times in five. That 48-point spread is the headline. How well a model knows what it doesn't know has almost nothing to do with how smart it is by conventional measures.

Finding 1: Certainty Inversion

DeepSeek V3.2 led on Type C (87.8%) but scored only 60.0% on Type A. The failure mode on Type A was near-universal refusal on Rule 3 scenarios with a deterministic answer of zero: scenarios where a nagermorel blocks a transfer, leaving the receiving naxet with nothing.

Essentially, DeepSeek did something strange. When a problem had a clear, unambiguous answer, it hedged. When a problem had no valid answer at all, it answered confidently. That's backwards. The whole point of metacognitive monitoring is to feel uncertain when uncertainty is warranted, and certain when it isn't. DeepSeek got the wiring crossed: maximum doubt where the rules were sufficient, maximum confidence where the rules ran out. Most benchmarks would never catch this, because they measure expressed uncertainty without checking whether the uncertainty is aimed at the right target. Vornish does both.

Finding 2: Thinking Model Inversion

Extended chain-of-thought made metacognitive monitoring worse, not better. Qwen3-next-80b (thinking) scored 44.4% on Type C versus non-thinking Qwen3-235b at 67.8%. DeepSeek R1 (54.4%) similarly underperformed non-reasoning V3.2 (87.8%).

In plain words, longer reasoning made things worse. The models with extended "thinking" modes, where they reason through a problem step-by-step before answering, performed worse on metacognitive monitoring than their non-thinking counterparts.

The likely mechanism: extended chain-of-thought reasoning builds momentum toward a conclusion, amplifying confident inference paths. The more a model works through a problem, the more committed it becomes to finishing it, even when the honest answer is "I can't." Thinking harder, in this case, meant stopping less.

Finding 3: Gemma's Confabulation Profile

Gemma scored 48.9% on Type B but 100% on Type A and 73.3% on Type C. That's a failure signature distinct from the refusal-biased cluster: it over-commits rather than over-hedges, inventing definitions for undefined tokens while monitoring rule-scope gaps reasonably well. A benchmark that collapsed these into a single metacognition score would miss this entirely.

In other words, Gemma's failure looked different from everyone else's. Where some models refused to answer things they could answer, Gemma did the opposite: it invented answers for things it couldn't. When given a term that didn't exist in the rules, it made up a definition and kept going. But when the vocabulary was fine and the interaction was undefined, it caught the gap reasonably well. Two distinct failure modes, two different places the reasoning breaks down.

The Limitations I'm Not Going to Paper Over

The limitations are real and worth naming directly.

  1. Sample size. At 102 items, the subsets underlying each finding are 30 items each, enough to support observations, not statistical claims. The Certainty Inversion and thinking-model inversion results would need 5-10x the data before I'd call them findings rather than things worth investigating further.

  2. No human baseline. DeepMind's own framework paper treats human baselines as a core methodological requirement: without knowing where human performance falls on Type B and Type C, a score of 73.3% is hard to interpret.

  3. A Type B scoring artifact. The scorer passes responses that contain uncertainty signal phrases like "cannot" or "undefined," which means it rewards RLHF-trained hedging language, the exact confound the benchmark was designed to isolate against. A sketch of that kind of check follows below.

  4. The RLHF confound elimination is inferential rather than empirical. The argument that synthetic vocabulary strips away domain priming is sound, but it isn't demonstrated. A cross-domain contrast, running structurally identical logic problems in natural-language scenarios like electrical circuits, would make that claim testable. That's next.
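
On that third limitation, here's a minimal sketch of the kind of phrase-matching check being described. The phrase list, function name, and examples are my assumptions for illustration, not the released scoring code; the point is only that generic hedging language can pass the same test as a genuine identification of the undefined token.

```python
# Illustrative sketch of a phrase-matching Type B check (assumed, not the
# released scorer). It passes any response containing an uncertainty phrase,
# which is exactly the artifact: RLHF-style hedging passes for free.

UNCERTAINTY_PHRASES = ("cannot", "undefined", "not defined", "no rule")


def passes_type_b(response: str) -> bool:
    """Pass if the response contains any uncertainty signal phrase."""
    text = response.lower()
    return any(phrase in text for phrase in UNCERTAINTY_PHRASES)


# Both of these pass, though only the first demonstrates boundary detection:
assert passes_type_b("The term 'blarvex' is undefined, so no value can be derived.")
assert passes_type_b("I cannot provide a definitive answer to questions like this.")
```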

What Comes Next

I definitely want to spend more time on this problem.

Not necessarily by extending Vornish specifically, though there's a clear roadmap for that: more items, Type B multiple-choice, human baselines on Prolific, the cross-domain contrast to make the RLHF claim empirical rather than inferential, and eventually Stage 2.

But I'm also open to approaching metacognition benchmarking from a different direction entirely. The core measurement problem, separating genuine epistemic monitoring from trained behavioral patterns in a way that has deployment-relevant validity, is hard enough that it might warrant a different architecture rather than iteration on this one. I don't know yet.

What I do know: the psychometrics and I-O psychology literature has been working on related problems for decades. The confound between test behavior and underlying construct is not new. The nomological network problem, establishing that what you're measuring is actually connected to what you claim to be measuring, is not new. The AI evaluation community is mostly not consulting that literature, and that gap shows up clearly in the design choices most benchmark submissions make.

Vornish is my first attempt at building something at that intersection. The design held up in some important ways. The limitations are real and I named them. I want to keep going.

——

The Kaggle competition writeup with full technical details and results tables is linked here:

Benjamin Wong, "Contamination-Proof Metacognition Benchmark with Deployment Consequence Validity," Kaggle, 2026.

The benchmark dataset and scoring code are public.

If you're a psychometrician, I-O psychologist, or cognitive scientist who thinks about measurement validity and wants to collaborate on the next iteration, I'd genuinely like to talk.

All work my own. Reuse or distribution with permission only.

Always in progress. Copyright © 2026 Benjamin Wong. All Rights Reserved.
