We Need to Slow Down to Measure Right

Mar 25, 2026

My response to DeepMind's new cognitive framework for measuring progress toward artificial general intelligence (AGI). And a call for two fields to get in the same room.

On March 16, 2026, Google DeepMind published a paper called "Measuring Progress Toward AGI: A Cognitive Framework". The core proposal: break general intelligence into 10 cognitive areas:

  1. Perception

  2. Generation

  3. Attention

  4. Learning

  5. Memory

  6. Reasoning

  7. Metacognition

  8. Executive functions

  9. Problem solving

  10. Social cognition

Then measure AI systems against human baselines across each one, generating a "cognitive profile" that maps where AI stands relative to people.

It's a good idea. Grounding AI evaluation in actual cognitive science rather than ad hoc benchmarks is overdue. The team behind this paper includes people who genuinely know the territory. Lead author Ryan Burnell, who works at DeepMind, holds a PhD in cognitive science, with postdoctoral training in psychology departments and a research record in the space. Matthew Botvinick, at Anthropic as of January 2026, is a neuroscientist. Noah Goodman is a computational cognitive scientist. These are, undoubtedly, experienced researchers with stacked credentials in adjacent fields.

And yet, reading this paper, I kept noticing the same absence.

John B. Carroll spent the better part of his career doing something strikingly similar to what DeepMind proposes. Starting in the 1980s, Carroll reanalyzed more than 460 cognitive datasets accumulated over seventy years of research, the most comprehensive factor-analytic survey ever assembled. His 1993 book, Human Cognitive Abilities, proposed a hierarchical model of intelligence organized into approximately 10 broad cognitive abilities: fluid reasoning, crystallized intelligence, memory, visual processing, auditory processing, processing speed, and a handful more. I first learned about this framework in my introductory applied human development class (HOD 1250) at Vanderbilt.

Carroll’s work was later synthesized with Raymond Cattell and John Horn's earlier Gf-Gc model into what the field now calls Cattell-Horn-Carroll (CHC) theory, the current gold standard cognitive taxonomy in psychometrics, underpinning five of the seven major intelligence test batteries used today.

Carroll's taxonomy and DeepMind's taxonomy converge on roughly the same structure: approximately 10 broad abilities, organized into basic building blocks and composite functions. One was derived empirically from 460+ datasets over a career; the other by expert review of the cognitive science literature in a single paper. The convergence is remarkable enough that it demands at least an acknowledgment.

But Carroll does not appear in DeepMind's reference list. CHC theory does not appear. Cattell's 1963 paper introducing fluid versus crystallized intelligence is cited once in passing, not for its central architectural contribution to cognitive taxonomy, but as a subcategory reference under problem solving.

I couldn't help but explore this omission; it's the main topic of this essay.

• • •

I spend some of my time at the intersection of these worlds. Several years working in psychometrics and talent assessment at a hedge fund, a dual degree in molecular biology and human and organizational development, and enough time in the literature (in my free time).

What I want to do here is lay out what that century of work actually found, where the DeepMind framework engages it and where it doesn't. And of course, why it matters, right now, that both communities figure out how to get in the same room!

What DeepMind Proposed

I’d like to be clear, first off, that the paper is genuinely an advance. AI evaluation has been running on ad hoc benchmarks assembled for convenience rather than designed around any coherent theory of what intelligence is. I think this is why DeepMind came out with this paper. I may be wrong.

MMLU, HumanEval, and ARC are a few examples. But these benchmarks have saturated. Multiple models now score above 90% on tasks that were supposed to be hard. Leaderboards became optimization targets rather than diagnostic tools. The field needed something better, and DeepMind's response is to go upstream: define what the cognitive faculties are, evaluate systems against each separately, compare against human baselines using demographically representative samples.

The proposal to build cognitive profiles rather than a single composite score is particularly important. By proposing radar-chart-style profiles instead of an "AGI score," the paper avoids a trap that psychometrics has spent decades trying to escape. Todd Rose, in The End of Average, describes this as the "jaggedness principle": ability is never one-dimensional, and flattening individual profiles into aggregate scores destroys the information that matters most. In Rose’s book (excerpt here), the Air Force studied 4,063 pilots and found not a single one was average across all ten physical dimensions. Composite scores hide the variance. DeepMind understands this, and the profile-based approach reflects it.
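To make the jaggedness point concrete, here is a minimal sketch with invented numbers (not data from the paper or from Rose's study): two hypothetical systems with opposite jagged profiles across the ten faculties collapse to the exact same composite score.

```python
# Minimal sketch with invented scores: two hypothetical systems evaluated on
# DeepMind's ten faculties. The profiles are jagged in opposite ways, yet the
# composite (mean) score is identical -- the information that matters is gone.
import statistics

faculties = ["perception", "generation", "attention", "learning", "memory",
             "reasoning", "metacognition", "executive_functions",
             "problem_solving", "social_cognition"]

system_a = [90, 90, 30, 40, 90, 30, 40, 30, 90, 70]
system_b = [30, 40, 90, 90, 30, 90, 70, 90, 30, 40]

print(statistics.mean(system_a), statistics.mean(system_b))  # both 60
```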

The paper is also honest about its limits. It identifies five of its ten faculties, specifically learning, metacognition, attention, executive functions, social cognition, as lacking adequate benchmark coverage, then launches a public Kaggle hackathon to build them. I love that. An early submission from a participant exposed a specific, structural problem: distinguishing genuine in-context learning from crystallized recall requires procedurally generated synthetic knowledge systems, like invented taxonomies or fabricated rule grammars, because the scale of modern pre-training makes true held-out evaluation nearly impossible otherwise. The community is engaging with real problems.

So: the paper is honest about its limits, staffed with genuine interdisciplinary expertise, and structurally smarter than most AI evaluation work. That matters. My reaction that follows is about what it still doesn't know it doesn't know. Funnily enough, that’s metacognition, one of the faculties in the DeepMind paper.

The Discipline They Didn't Read

Here's the specific pattern.

DeepMind's reference list contains over 100 citations from psychology and neuroscience. Tulving (1972) on episodic memory. Baddeley (1992) on working memory. Miyake et al. (2000) on executive functions. Diamond (2013) on the same. These are neuroscience-adjacent citations — papers explaining the mechanisms of cognition, how particular processes work in the brain.

What's missing is an entirely different tradition within psychology: psychometrics and individual differences.

The science not of how cognition works mechanically, but of how to measure cognitive differences validly. Spearman (1904) is absent. Carroll is absent. Cronbach and Meehl (1955), whose paper on construct validity is one of the most cited documents in all of psychology, is absent. The APA task force report Intelligence: Knowns and Unknowns, the definitive 1996 consensus statement assembled by eleven experts specifically to settle what intelligence testing proves, is absent.

I’m unsure if it’s random or not. Is there a selection bias that runs through how AI has engaged with psychology broadly? The field has imported cognitive and neural science enthusiastically while largely bypassing measurement science and individual differences. François Chollet's 2019 paper on the measure of intelligence (a very important AI document of the last decade) made this explicit: the AI community had been "reinventing the wheel," developing ad hoc benchmarks while psychometrics spent a century developing rigorous measurement theory. Gary Marcus made similar arguments from the cognitive science side. A 2017 target article in Behavioral and Brain Sciences that accumulated over 5,000 citations argued that human-like AI requires cognitive science's understanding of how people learn and reason.

I think the AI field heard the neuroscience argument, but it has not heard the psychometrics argument.

And psychometrics is not peripheral to what DeepMind is trying to do. Psychometrics is the methodological backbone of cognitive measurement. If you want to know whether your taxonomy is measuring what it claims to measure (or whether the faculties you've named correspond to real, distinguishable cognitive processes rather than arbitrary cuts through continuous space), you need psychometrics. Specifically, you need what Cronbach and Meehl (1955) called a nomological network. What the heck is that?

A nomological network is the theoretical scaffolding required to validate a psychological construct. Funnily enough, this is a concept I remember having to wrap my head around when a colleague explained it to me as I first jumped into the world of talent assessments as a psychometrics practitioner. It's an interconnected system of relationships specifying how your construct relates to other constructs, to observable behaviors, and to empirical outcomes. Construct validity (whether a test actually measures the construct it claims to measure) isn't established by naming a faculty and selecting tasks that seem to assess it. It requires embedding the construct in a web of theoretical predictions (via observable behaviors) that can be falsified. Without that network, you haven't validated "reasoning" as a construct. You've measured performance on reasoning-adjacent tasks and assumed the inference is clean.
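A crude way to see what the network demands in practice is convergent and discriminant evidence. The sketch below is hypothetical (simulated scores, invented task names, nothing from the paper): two tasks that supposedly measure reasoning should correlate more strongly with each other than either does with a task that supposedly measures memory, and the theory should predict those patterns before you look.

```python
# Hypothetical convergent/discriminant check, assuming faculty-level scores are
# available for many evaluated systems (or people). All numbers are simulated.
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of evaluated systems

reasoning = rng.normal(size=n)   # latent "reasoning" ability
memory = rng.normal(size=n)      # latent "memory" ability

reasoning_task_1 = reasoning + 0.5 * rng.normal(size=n)
reasoning_task_2 = reasoning + 0.5 * rng.normal(size=n)
memory_task = memory + 0.5 * rng.normal(size=n)

convergent = np.corrcoef(reasoning_task_1, reasoning_task_2)[0, 1]
discriminant = np.corrcoef(reasoning_task_1, memory_task)[0, 1]

# Construct validity requires convergent >> discriminant, plus a theory (the
# nomological network) that predicted these relationships in advance.
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```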

Samuel Messick extended this in 1989 in an ETS report, arguing that valid measurement requires attending not just to what a test measures, but also to the social consequences of using it. If a framework certifies a system as possessing "social cognition" and that system is then deployed in sensitive contexts where it produces biased or harmful outputs, the validity of the initial certification is compromised. Evaluation divorced from deployment consequences is psychometrically incomplete. The DeepMind paper explicitly brackets cognitive benchmarking from end-to-end deployment evaluation. That’s a reasonable methodological choice, but one that Messick would flag.

DeepMind's taxonomy names 10 faculties and maps tasks to them. It does not specify the nomological network. That's not a citation oversight. It's the difference between a taxonomy and a validated measurement framework.

The Question the Framework Doesn't Ask

Charles Spearman's 1904 paper documented what has since been called "arguably the most replicated result in all of psychology": performance across seemingly unrelated cognitive tasks correlates positively. Verbal ability correlates with spatial ability correlates with mathematical ability.

This positive manifold typically explains 40 to 50 percent of between-individual variance on cognitive test batteries.

To explain the correlation, Spearman proposed a general factor — g — underlying performance across all cognitive domains. Whether g represents a genuine biological construct (the parieto-frontal neural efficiency pathway, as Deary, Penke, and Johnson argued in Nature Reviews Neuroscience) or a statistical artifact of the measurement structure (as Stephen Jay Gould argued in The Mismeasure of Man) remains genuinely debated.

What's not debated is that the positive manifold is real. The APA task force report confirmed g's statistical robustness. Nisbett and colleagues' 2012 update in American Psychologist reaffirmed it while attributing group differences in test scores to environmental rather than genetic factors.

The Thurstone-Spearman debate from the 1930s and 40s is instructive. L.L. Thurstone argued for seven independent "primary mental abilities" rather than a unitary g. Subsequent reanalysis found that Thurstone's primary abilities themselves correlated positively, which ended up supporting the existence of a higher-order g factor. The resolution was that both g and group factors exist at different levels of a hierarchy.

This is exactly what Carroll's three-stratum model formalized sixty years later.

DeepMind's framework anticipates that AI systems will have different profiles across its 10 faculties. That's reasonable. AI capabilities are jagged in ways that differ from human variation, and the profile-based approach respects that. But the framework nowhere asks: would scores across the 10 faculties correlate positively in AI systems? Would a general factor emerge from the data? And if it did… what would that mean?

The question is a deep one. If a general AI factor exists (let's call it "g-AI"), differences between systems at the faculty level might be secondary to a more fundamental capability dimension. If the faculties are genuinely independent in AI systems in a way they're not in humans, that would itself be a significant finding requiring explanation. Either way, 120+ years of empirical work on this exact question exists, and I'm unsure why it isn't acknowledged.
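The check itself is not exotic. Here is a sketch of what I mean, assuming a lab has faculty-level scores for a set of systems; the data below is simulated with a general dimension baked in purely for illustration, so of course a factor emerges. The real question is what actual evaluation data would show.

```python
# Sketch: would a general factor emerge from scores on the ten faculties?
# Simulated data with a built-in general dimension, purely for illustration.
import numpy as np

rng = np.random.default_rng(42)
n_systems, n_faculties = 300, 10

g_ai = rng.normal(size=(n_systems, 1))                # hypothetical general factor
specific = rng.normal(size=(n_systems, n_faculties))  # faculty-specific variance
scores = 0.6 * g_ai + 0.8 * specific                  # every faculty loads on both

# Positive manifold: are the pairwise correlations mostly positive?
corr = np.corrcoef(scores, rowvar=False)
off_diag = (corr.sum() - n_faculties) / (n_faculties * (n_faculties - 1))
print(f"mean off-diagonal correlation: {off_diag:.2f}")

# Variance carried by the first principal component of the correlation matrix
# (in human test batteries this is typically around 40-50%).
eigvals = np.linalg.eigvalsh(corr)
print(f"first-factor variance share: {eigvals[-1] / eigvals.sum():.2f}")
```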

The history also carries a warning worth naming plainly. Cognitive classification at scale has a track record. The Army Alpha and Beta tests administered to 1.75 million WWI recruits produced data that was co-opted by eugenicists and contributed to the discriminatory national origin quotas of the Immigration Act of 1924. IQ scores formed the basis for forced sterilization decisions in Buck v. Bell (1927). The Bell Curve (1994) used cognitive testing data to make contested claims about racial differences in intelligence, prompting the APA's task force report in direct response. Buck v. Bell, for the record, has never been formally overruled. These episodes aren't arguments against cognitive measurement, but arguments for doing it with full awareness of the terrain, awareness the early framework from DeepMind hasn't yet demonstrated, if only because it's a starting point.

A 2017 Nature editorial argued explicitly that intelligence research "should not be held back by its past." That this needed to be argued in Nature is itself evidence that the field had been affected by that past. AI researchers picked up the concept of "artificial general intelligence" without any of that political weight, which meant also without the methodological caution the weight generated.

Three Mistakes Already Made

Specific failure modes from psychology that this framework risks repeating.

I. Benchmark corruption

In 1979, Donald Campbell articulated what's now called Campbell's Law: "The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor." The education parallel is direct: when standardized test scores become accountability targets, schools teach to the test. Scores rise faster than actual learning. The correlation between the metric and the underlying capability breaks down.

AI benchmarking has produced this phenomenon at scale.

Campbell’s Law originated in social science in 1979. The social science literature on metric distortion is extensive. AI systems were largely built to optimize single metrics without consulting it. To me, that gap is now visibly costly.

DeepMind's response is to hold out evaluation sets and prevent contamination. That's the right instinct. But Campbell's Law operates even on held-out sets once the framework becomes entrenched: labs optimize architectures for benchmark categories, training paradigms shift toward the 10 faculties as targets.

Psychometrics has spent decades trying to stay ahead of this dynamic in educational testing, with mixed results. Measurement isn't hopeless. The lesson is that any fixed taxonomy becomes a target, and ongoing criterion validation (i.e., measuring whether scores still predict real-world outcomes) is required continuously, not once at launch.
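What continuous criterion validation could look like, as a sketch (the evaluation waves, scores, and outcome numbers below are all hypothetical): at each wave, re-estimate whether benchmark scores still predict a downstream outcome you actually care about, and watch for decay.

```python
# Sketch of ongoing criterion validation. Wave labels, scores, and outcome
# numbers are hypothetical; the point is the recurring check, not the data.
import numpy as np

waves = {
    # wave label: (benchmark scores, downstream outcome metric) for the same systems
    "wave_1": (np.array([61, 70, 74, 82, 88]), np.array([0.52, 0.61, 0.66, 0.74, 0.80])),
    "wave_2": (np.array([78, 84, 90, 93, 95]), np.array([0.55, 0.71, 0.62, 0.68, 0.66])),
}

for label, (scores, outcomes) in waves.items():
    r = np.corrcoef(scores, outcomes)[0, 1]
    print(f"{label}: criterion validity r = {r:.2f}")

# If r decays as the benchmark becomes an optimization target, Campbell's Law
# is operating even though the held-out set was never leaked.
```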

II. The validity that took 85 years to get wrong

This one deserves to sit with you for a moment.

Frank Schmidt and John Hunter's 1998 meta-analysis synthesized 85 years of research on personnel selection. Their conclusion: general mental ability (GMA) was the single best predictor of job performance, with a validity coefficient of r = 0.51. This finding became the cornerstone of industrial-organizational psychology (I-O psych) and drove hiring algorithms for over two decades. Cognitive ability testing, operationalized by high validity assessments, was treated as the gold standard for predicting real-world outcomes.

In 2022, Paul Sackett, Charlene Zhang, and colleagues published a correction in the Journal of Applied Psychology. They identified a systematic error in how Schmidt and Hunter had applied statistical corrections for "restriction of range," the problem that validation studies only include people who were actually hired, artificially compressing the observable variance. When corrected using empirical data rather than assumed rules of thumb, the operational validity of cognitive ability dropped from r = 0.51 to r = 0.31. Structured interviews emerged as stronger predictors. Sackett described it as "the most important paper of my career." Ninety percent of the field's assumptions about cognitive testing and job performance had to be revised.
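To see why the choice of correction matters so much, here is a simulated illustration, not Schmidt and Hunter's or Sackett's actual data: restrict the range by "hiring" only the top-scoring fraction, watch the observed validity shrink, then apply Thorndike's Case II correction with an empirical versus an assumed restriction ratio.

```python
# Simulated illustration of restriction of range (invented setup, not the
# actual Schmidt-Hunter or Sackett datasets). The full applicant pool has a
# true predictor-criterion correlation; we only observe the hired top slice.
import numpy as np

rng = np.random.default_rng(7)
n, true_r = 100_000, 0.5

predictor = rng.normal(size=n)
criterion = true_r * predictor + np.sqrt(1 - true_r**2) * rng.normal(size=n)

hired = predictor > np.quantile(predictor, 0.8)  # hire the top 20%
observed_r = np.corrcoef(predictor[hired], criterion[hired])[0, 1]

def thorndike_case_ii(r, u):
    """Correct a range-restricted r; u = SD(applicants) / SD(hired)."""
    return r * u / np.sqrt(1 - r**2 + (r**2) * (u**2))

u_empirical = predictor.std() / predictor[hired].std()
print(f"observed r among the hired: {observed_r:.2f}")
print(f"corrected with empirical u = {u_empirical:.2f}: {thorndike_case_ii(observed_r, u_empirical):.2f}")
print(f"corrected with an assumed u = 2.00: {thorndike_case_ii(observed_r, 2.00):.2f}")
# The corrected estimate moves with the assumed u -- the crux of the 2022
# critique: rules of thumb versus empirical restriction data.
```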

The people who spent their careers on human cognitive measurement got a headline number, built on 85 years of data, wrong by nearly 40%, and only caught it through painstaking methodological reanalysis.

If that's the track record for human cognitive testing…with century-long data, entire academic fields devoted to the question, thousands of validation studies…the appropriate stance toward a single-paper cognitive taxonomy for AI systems is not confidence.

To me, it's considered, methodologically grounded humility about what the framework is and isn't yet establishing.

III. The Fluid-Crystallized confound

Cattell's 1963 paper proposed the distinction at the center of modern cognitive assessment: fluid intelligence (Gf) versus crystallized intelligence (Gc).

  • Fluid intelligence is the ability to reason through novel problems without prior knowledge. Think: adaptation, pattern recognition, inference in situations you've never encountered.

  • Crystallized intelligence is accumulated knowledge and skills from experience. Think: what you already know, what you can retrieve.

I had to memorize this for my HOD 1250 final: Gf declines with age; Gc is relatively stable. They're meaningfully distinct and require different measurement approaches.

This distinction matters enormously for evaluating large language models. These systems are trained on vast corpora of accumulated human knowledge and text, potentially hundreds of billions of tokens. They build extraordinary reserves of crystallized intelligence: syntax, factual relationships, domain knowledge, stylistic patterns. What remains genuinely unclear is how much they demonstrate fluid intelligence: reasoning through problems that cannot be solved by pattern-matching against training data.

DeepMind specifies that evaluations should use held-out tasks. But the scale of modern pre-training makes traditional task isolation nearly impossible. The same aforementioned DeepMind hackathon participant's work gets at almost exactly this: a potentially reliable way to test genuine in-context learning rather than crystallized retrieval is to construct entirely synthetic knowledge systems (i.e., invented taxonomies, fabricated rule grammars) that provably cannot exist in any model's training data. Without that methodological rigor, tests of "learning" and "reasoning" in large language models risk measuring crystallized recall while labeling it fluid cognition.
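A minimal sketch of what such a synthetic knowledge system could look like; this is my own toy construction, not the hackathon participant's method or anything specified in the paper. Invent a tiny taxonomy and an arbitrary rule, present them in context, and probe a held-out case that cannot exist in any training corpus.

```python
# Toy sketch of a procedurally generated synthetic knowledge system: invented
# category names and an arbitrary rule, so a correct answer cannot come from
# crystallized (pre-trained) knowledge -- only from learning in context.
import random

def make_word(rng):
    """A pronounceable nonsense word, vanishingly unlikely to be in training data."""
    consonants, vowels = "bdfgklmnprstvz", "aeiou"
    return "".join(rng.choice(consonants) + rng.choice(vowels) for _ in range(3))

def make_item(seed):
    rng = random.Random(seed)
    categories = [make_word(rng) for _ in range(3)]
    members = {c: [make_word(rng) for _ in range(3)] for c in categories}
    rule_cat = categories[0]  # the arbitrary invented rule attaches to this category
    context = "\n".join(f"A {m} is a kind of {c}." for c, ms in members.items() for m in ms)
    context += f"\nEvery {rule_cat} is glimmable. Nothing else is glimmable."
    probe_cat = rng.choice(categories)
    probe = make_word(rng)  # held-out member, introduced only in the question
    question = f"A {probe} is a kind of {probe_cat}. Is a {probe} glimmable?"
    answer = "yes" if probe_cat == rule_cat else "no"
    return context, question, answer

context, question, answer = make_item(seed=0)
print(context, question, f"expected: {answer}", sep="\n")
```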

This is the specific failure mode the AI field keeps producing: pointing to exceptional performance on tasks requiring vast accumulated knowledge (bar exams, medical licensing tests, coding benchmarks) as evidence of general intelligence, when those tasks have a strong component of crystallized retrieval.

The CHC literature has been grappling with this exact measurement problem in humans for sixty years. It has partial answers. They haven't been consulted.

Why the Gap Exists

The structural explanation matters because it clarifies what has to change.

Cunningham and Greene's 2024 network analysis of citation patterns in explainable AI found measurable "knowledge silos" and "knowledge gaps": limited transfer from foundational disciplines, including psychology and cognitive science, to contemporary AI research.

Part of the explanation is publication ecology. ML researchers publish at NeurIPS, ICML, ICLR. Psychometricians and I-O psychologists publish in Psychological Assessment, Intelligence, Journal of Applied Psychology. These communities don't read each other's work in any systematic way. The former thinks the latter work in a ‘soft science.’ The latter thinks the former are “too technical.”

There is no institutional home, no dedicated journal, conference, or professional society, where they're expected to be in dialogue. “AI” discussions at the Society of Industrial-Organizational Psychology’s annual conference probably share few overlaps with “psychology” (and related) ones at NeurIPS.

Part of it is methods culture. In psychometrics, a test is a diagnostic tool used to understand a latent, unobservable trait. The entire discipline is oriented around the epistemological challenge of inferring something you can't directly observe from something you can measure. In computer science, a benchmark is an optimization target. The discipline is oriented around improving measurable performance.

These are different orientations, and they produce different intuitions about what it means to "measure" something. Psychometricians ask, "what does this score permit me to infer?" ML researchers ask, "how do we score higher?" (these are overstatements, I know).

And part of it, specifically for general intelligence, is political radioactivity. The concept became charged in psychology in ways that made careful work difficult to do publicly. The Bell Curve controversy, the eugenics history, the decades of contested debates about group differences in IQ scores… all of this made "general intelligence" a topic requiring extraordinary caution to discuss.

I think the AI world got to pick up the concept without any of that weight. Which meant also without any of the methodological caution the weight generated.

Aljoscha Neubauer noted in Intelligence in 2021 that "human intelligence research in particular and psychology in general has so far contributed very little to the ongoing debates on AI." Gonzalez and colleagues asked it more bluntly in the title of their 2019 paper in Personnel Assessment and Decisions: "Where's the I-O?"

The I-O psychologists and psychometricians most equipped to evaluate cognitive measurement validity have been largely absent from the rooms where AGI measurement gets designed.

Hassabis and colleagues' 2017 Neuron paper on neuroscience-inspired AI is instructive here. It covers attention, episodic memory, working memory, continual learning. But the framing is neuroscience throughout: biological mechanisms, brain structures, neural processes. What's absent is psychology-level measurement insight: how to validate that your assessment instrument captures the construct you think it does, what the history of attempting this tells you about failure modes. Neuroscience answers "how does memory work?"; psychometrics, adjacently but separately, answers "how do you know if your test of memory is actually measuring memory?" The AI field has absorbed the first question. I don't think it has fully absorbed the second.

A 2026 paper in Science put the broader version of this gap plainly: "The social and organizational sciences have spent a century studying how team size, composition, hierarchy, role differentiation, conflict norms, institutions, and network structures shape collective performance. Almost none of this research has been brought to bear on AI reasoning."

The People Trying to Build the Bridge

Progress is happening, but structurally fragile.

I came across the following before I started my foray into talent assessment and before ChatGPT came out in late 2022. José Hernández-Orallo's The Measure of All Minds (2017) is the most comprehensive attempt to apply psychometric principles (validity, reliability, bias control) to evaluating machines alongside humans and animals. He was consulted on the DeepMind paper and is acknowledged in the text. François Chollet's ARC Prize functions as a de facto institution grounding AI evaluation in developmental psychology's Core Knowledge framework, explicitly designed to measure fluid reasoning rather than crystallized recall. Eric Schulz's group at the Max Planck Institute produced CogBench and Centaur. Centaur is a model trained on 60,000+ participants across 160 cognitive experiments, the most rigorous demonstration yet of what full psychology-AI collaboration produces. The BAICS Workshop at ICLR has been convening bridge conversations since 2020. The SIOP white paper on AI-based assessment makes the practical case: AI assessment tools must meet the same psychometric standards of validity, reliability, and fairness as traditional tests.

All of this happens at the individual level: researchers with dual training, collaborations between people who happened to meet. Again, there is no institutional home. No dedicated journal. No professional society. No conference where psychometricians and ML researchers are expected to sit in the same room regularly. (Maybe I should start that.)

The DeepMind framework is now public. The hackathon is running. The taxonomy is being operationalized into benchmarks. The architecture is getting built before the measurement science has fully shaped it.

The specific risks are concrete. Labs will optimize for the 10 faculties as categories without the nomological networks that validate them. The Fluid-Crystallized confound will persist in evaluation design. The g-factor question will go unaddressed until the data forces it. And I think the framework will become entrenched, potentially cited, built upon, treated as canonical, before the psychometric community has had a genuine opportunity to engage with it.

What Both Sides Need to Do

The call to action here runs both directions.

For AI labs and researchers: hire measurement scientists. Not just cognitive neuroscientists. Psychometricians, I-O psychologists, people whose careers have been about the gap between what a test scores and what a test claims to measure. Treat the nomological network as a deliverable, not an abstraction. Address the g-factor question empirically: run the 10-faculty evaluations, examine the correlation structure, publish what you find. Map the taxonomy explicitly against CHC theory: acknowledge the convergence, address the divergences, explain your choices. The history of cognitive classification deserves at least a paragraph in the literature review.

For psychology and I-O psychology: engage with AI evaluation as an active research question, not a spectator sport. Publish responses to AI benchmark papers. I think we need to show up at NeurIPS. Write for audiences that aren't reading Journal of Applied Psychology. The people who know the measurement science have real standing to contribute here, and the AI field is, slowly and imperfectly, reaching toward psychology. The outreach deserves a response.

And institutionally: something needs to fill the gap that currently doesn't exist. A dedicated track at a major AI conference where psychometric validity is a first-class evaluation criterion. A joint workshop series with enough structural weight that cross-disciplinary engagement becomes a norm rather than an occasional collaboration. A publication venue that treats measurement science as equally relevant to AI and human cognitive assessment. Something with enough gravity that the two communities develop a shared vocabulary before the frameworks become too entrenched to revise.

What the DeepMind Paper Gets Right

Intellectual honesty requires saying this clearly.

The convergence between Carroll's empirically derived taxonomy and DeepMind's expert-derived taxonomy is not nothing. That two independent approaches from different methodological traditions arrive at roughly the same architecture suggests the architecture might be tracking something real about the structure of cognition. The paper includes metacognition as a first-class faculty, often overlooked in AI evaluation, with meaningful sub-components: confidence calibration, error monitoring, learning strategy selection. It proposes profiles over composites, which is correct. It acknowledges five faculties lack adequate benchmarks and invites public collaboration to build them. Hernández-Orallo was consulted. The team includes real cognitive scientists.

Beaujean and Benson (2019) reanalyzed Carroll's original datasets and found that even Carroll's carefully derived broad abilities add "little-to-no interpretive relevance above and beyond general intelligence" in many cases. The debate about how many cognitive factors exist, and what they mean, is genuinely open even within psychometrics. DeepMind enters that debate rather than ignoring it, even if without full awareness of the 122-year history behind it.

The gap is tractable, and the paper describes itself as a starting point. The window for psychometric rigor to shape the architecture is narrow but not yet closed.

Both communities are sitting on something the other needs. AI has computational scale, architectures for moving fast, and a genuine infrastructure problem that the psychometrics tradition has tools to address. Psychology and psychometrics have more than a century's worth of hard-won lessons about what goes wrong when you measure complex constructs without methodological rigor, lessons derived from studying humans, paid for in failed studies, retracted findings, and in the worst cases, real human harm.

The bridge is half-built. Finishing it isn't optional. We need to slow down to address this, even as the benchmarks are being written and the frameworks are becoming entrenched.

• • •

Selected References

All work my own. Reuse or distribution with permission, only.

Always in progress. Copyright © 2026 Benjamin Wong. All Rights Reserved.
