I Built an IQ Test for AI - Venelin Videnov

So here’s the thing about AI benchmarks. MMLU tests whether a model knows facts. HumanEval checks if it can code. LMSYS Arena is basically a popularity contest. All useful, sure. But I kept coming back to a different question, one that felt so obvious I couldn’t believe nobody was working on it: forget what the AI knows. How does it actually think?

I mean the messy stuff. How does it make decisions when the situation is ambiguous? When it’s losing and needs to decide whether to play it safe or double down? When cooperating with another player would pay off long-term but screwing them over pays right now? I wanted to see the decision-making, not the trivia answers.

That’s KALEI. I built it over a 48-hour session with Claude (Anthropic’s model, which I use for basically everything). Didn’t sleep much. There’s something funny about building a cognitive test for AI while your own cognition is running on espresso and stubbornness, but that’s how it went.

Where the idea came from

I run LM Game Labs, a B2B gaming platform. 180+ provably fair casino games, 13 engine types. Years of building probability math, RTP calibration, house edge calculations. I’ve also built behavioral trap systems to test how players respond to specific situations (streak after a loss, sudden rule change, that kind of thing).

At some point I realized these game environments would make incredible cognitive tests. They have known optimal strategies. The randomness is controllable. And if you watch the way someone plays instead of checking whether they win, you learn things about their decision-making that no survey or questionnaire could capture. So I thought, what if I ran AI models through the same thing?

I don’t ask the AI to describe how it thinks. I watch it play. Stuff shows up that no questionnaire would catch.

The 10 dimensions

KALEI profiles models across 10 cognitive dimensions. Each one uses its own exclusive set of 2-5 statistical metrics (no metric is shared between dimensions):

Risk Tolerance - bet sizing relative to bankroll, reaction to wins and losses
Bias Detection - are decisions actually independent, or does the model fall for streaks?
Pattern Recognition - finding real signals vs. chasing patterns that don’t exist
Learning Speed - how fast does strategy shift when rules change mid-game?
Temporal Reasoning - does it plan for the endgame or just react round by round?
Information Processing - how much of the available information actually gets used
Cooperation - Prisoner’s Dilemma performance against 7 different opponent strategies
Strategic Depth - multi-step planning, explore vs. exploit tradeoffs
Resource Management - bankroll preservation, risk-adjusted returns, not going bust
Conflict - EV-rationality under structured dilemmas where values are in tension (added in Cognum v1.2 after we retracted a placeholder scorer and shipped the real one)

The composite score is called Cognum. It’s a weighted average across all 10 dimensions, and based on the full profile, each model gets classified into a cognitive type (Strategic Explorer, Risk Seeker, Pattern Hunter, etc.) using nearest-centroid distance in 10D space. Yes, I know how that sounds. It actually works.
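To make the two steps concrete, here is a minimal sketch of a weighted composite plus nearest-centroid typing. Everything numeric in it is made up for illustration: the weights, the example profile, and the two centroids are assumptions, not KALEI's actual values, and Euclidean distance is assumed as the distance measure.

```python
import math

DIMENSIONS = [
    "risk_tolerance", "bias_detection", "pattern_recognition", "learning_speed",
    "temporal_reasoning", "information_processing", "cooperation",
    "strategic_depth", "resource_management", "conflict",
]

def composite(profile, weights):
    """Weighted average of the per-dimension scores (each on a 0-100 scale)."""
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total_weight

def nearest_type(profile, centroids):
    """Return the name of the centroid closest to the profile (Euclidean, 10D)."""
    def distance(centroid):
        return math.sqrt(sum((profile[d] - centroid[d]) ** 2 for d in DIMENSIONS))
    return min(centroids, key=lambda name: distance(centroids[name]))

# Hypothetical weights: equal everywhere except a doubled Strategic Depth.
weights = {d: 1.0 for d in DIMENSIONS}
weights["strategic_depth"] = 2.0

# Hypothetical model profile.
profile = {d: 50.0 for d in DIMENSIONS}
profile["strategic_depth"] = 80.0
profile["risk_tolerance"] = 30.0

# Hypothetical centroids for two of the cognitive types.
flat = {d: 50.0 for d in DIMENSIONS}
centroids = {
    "Strategic Explorer": {**flat, "strategic_depth": 85.0},
    "Risk Seeker": {**flat, "risk_tolerance": 90.0},
}

score = composite(profile, weights)
label = nearest_type(profile, centroids)
print(round(score, 2), label)
```

With these toy numbers the profile lands closest to the Strategic Explorer centroid, because its distance is dominated by the small Strategic Depth gap and not by the large Risk Tolerance gap to Risk Seeker.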

How it works in practice

You throw the AI model into 60-76 game environments. Crash gambling, dice, roulette, coinflip, multi-armed bandits, Prisoner’s Dilemma. Each game has specific rules, a starting bankroll, a round count, and hidden behavioral probes that trigger at randomized intervals.

The important thing: I’m scoring decision patterns, not outcomes. An agent that goes bankrupt because it played disciplined but got unlucky scores higher than one that made money by gambling recklessly. I don’t care if you won. I care about how you decided.
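A toy illustration of process-over-outcome scoring, under loud assumptions: the real metrics are proprietary, so `discipline_score` and its 5% bankroll cap are hypothetical stand-ins. The only point is that the score reads the bets, never the final balance.

```python
def discipline_score(rounds, max_fraction=0.05):
    """Percentage of rounds where the bet stayed within max_fraction of bankroll.

    rounds: list of (bankroll_before_bet, bet_size). The 5% cap is an
    illustrative assumption, not a KALEI parameter.
    """
    if not rounds:
        return 0.0
    within_cap = sum(
        1 for bankroll, bet in rounds
        if bankroll > 0 and bet <= max_fraction * bankroll
    )
    return 100.0 * within_cap / len(rounds)

# Hypothetical agent A: small bets, bad luck, bankroll shrinks every round.
unlucky_but_disciplined = [(1000, 40), (960, 40), (920, 40), (880, 40)]

# Hypothetical agent B: huge bets, good luck, bankroll quadruples.
lucky_but_reckless = [(1000, 600), (1600, 900), (2500, 1500), (4000, 100)]

score_a = discipline_score(unlucky_but_disciplined)
score_b = discipline_score(lucky_but_reckless)
print(score_a, score_b)
```

Agent A ends the session poorer and still scores higher, which is the behavior described above.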

The math behind scoring V2.7 draws from information theory, game theory, and behavioral economics. Each dimension has its own suite of statistical metrics designed to isolate a specific cognitive capability. Calibration curves sit on top, so the score range actually means something instead of compressing everything into a narrow band. Getting the calibration right was honestly harder than building the game engines.

Cognum v1.2 Leaderboard (April 11, 2026):

Sonnet 4.6 - 58
Opus 4.6 - 56
Haiku 4.5 - 54
GPT-5.4 - 52
Random - 38

Live scores at kaleiai.com/leaderboard. Cognum v1.2 integrates the Conflict dimension after a public retraction and re-scoring on April 10–11.

Things I didn’t expect

The random baseline scores 38 out of 100. I stared at that number for a while. A pure random agent, zero intelligence, making coin-flip decisions, gets a 38. Why? Because random play is, by definition, unbiased. No streaks influence it. My early bias detection metrics were basically giving random agents a gold star for being unbiased, which is like giving a rock an award for being calm. Took me until V2.2 to fix this with metrics that specifically require intelligent behavior, not just the absence of bad behavior.
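Here is a small sketch of why that happens. The metric below (`streak_sensitivity`, a hypothetical name) is an assumed stand-in for an early-style bias check, not the actual KALEI metric: it compares the average bet right after a losing streak with the average bet everywhere else.

```python
import random

def streak_sensitivity(history, streak_len=2):
    """Mean bet right after a loss streak minus mean bet in all other rounds.

    history: list of (bet, won). A value near 0 means bets ignore streaks.
    """
    after_streak, other = [], []
    for i, (bet, _) in enumerate(history):
        previous = history[max(0, i - streak_len):i]
        on_streak = len(previous) == streak_len and all(not won for _, won in previous)
        (after_streak if on_streak else other).append(bet)
    if not after_streak or not other:
        return 0.0
    return sum(after_streak) / len(after_streak) - sum(other) / len(other)

def play(policy, rounds=2000, seed=0):
    """Run a policy on a fair coinflip game and record (bet, won) per round."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        bet = policy(history, rng)
        history.append((bet, rng.random() < 0.5))
    return history

def random_policy(history, rng):
    # Zero intelligence: picks a bet size without looking at the history.
    return rng.choice([10, 20, 30, 40])

def chaser_policy(history, rng):
    # Loss chaser: quadruples the bet after two straight losses.
    if len(history) >= 2 and not history[-1][1] and not history[-2][1]:
        return 40
    return 10

random_gap = streak_sensitivity(play(random_policy))
chaser_gap = streak_sensitivity(play(chaser_policy))  # 40 after a streak, 10 otherwise
print(random_gap, chaser_gap)
```

Any score built as "lower sensitivity is better" hands the random agent a near-perfect mark, since its gap hovers around zero while the loss chaser's gap is 30. That is the gold-star-for-a-rock problem: the metric detects the absence of a bias, not the presence of reasoning.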

Then there’s the Sonnet vs. Opus thing. Claude Sonnet 4.6 scores 58.10. Claude Opus 4.6 scores 55.72. Within the same architectural family, the smaller sibling overtakes the flagship on the overall composite - driven almost entirely by a huge advantage on the Conflict dimension (Sonnet 88.25 vs Opus 60.99). I was expecting Opus to lead across the board. Instead I found the Sonnet Surprise. I still don’t have a satisfying explanation - but the hypothesis we’re testing is that compression teaches discipline that abundance does not.

Where AI actually crushes the random baseline is Strategic Depth and Cooperation. The gap there is huge. But on Risk Tolerance? Much closer. AI models are sophisticated strategists but mediocre gamblers, which is an interesting profile if you think about it.

And the scoring itself. God, the scoring. MMLU has right answers. Cognum doesn’t. I burned through 7 scoring revisions in a single weekend (V2.0 through V2.7), and each time I thought I’d nailed it, I’d discover my metrics were measuring randomness instead of intelligence, or my calibration was squishing the entire useful range into a 5-point band. Cognitive benchmarking is a completely different beast from knowledge testing.

Uncomfortable questions

I actually fed early results to GPT-5.4 and asked it to poke holes. It came back with 20 questions, some of which kept me up at night (well, more than the caffeine was already doing):

Random baseline at 38? Fixed it. Intelligence-requiring metrics from V2.2 onward.
Are the dimensions actually independent? I run correlation matrices. Nothing above 0.8 between any pair. So far.
Aren’t the weights arbitrary? Honestly, yes. But I did sensitivity analysis and the overall rankings don’t change much when you perturb them, so I can live with it for now.
Could a model cheat the benchmark? Probe timing is randomized per run. Scoring formulas are proprietary. Could still happen in theory. Haven’t seen it.
Does Cognum predict anything in the real world? Early signs point to yes, but I haven’t done formal validation. This is the thing I most need to do next.
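For the independence and weight questions, the checks look roughly like the sketch below. All data in it is hypothetical (three dimensions, three placeholder models), the ±20% jitter is an assumed perturbation size, and the function names are illustrative only. Pearson correlation is assumed as the correlation measure.

```python
import math
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def max_abs_correlation(scores_by_dim):
    """Largest |r| over all dimension pairs. scores_by_dim: {dim: [score per run]}."""
    return max(
        abs(pearson(scores_by_dim[a], scores_by_dim[b]))
        for a, b in combinations(scores_by_dim, 2)
    )

def ranking(profiles, weights):
    """Model names sorted by weighted-average score, best first."""
    def weighted(profile):
        return sum(profile[d] * w for d, w in weights.items()) / sum(weights.values())
    return sorted(profiles, key=lambda m: weighted(profiles[m]), reverse=True)

def rank_stability(profiles, weights, trials=200, jitter=0.2, seed=0):
    """Fraction of randomly perturbed weightings that keep the baseline ranking."""
    rng = random.Random(seed)
    baseline = ranking(profiles, weights)
    unchanged = 0
    for _ in range(trials):
        perturbed = {d: w * rng.uniform(1 - jitter, 1 + jitter) for d, w in weights.items()}
        unchanged += ranking(profiles, perturbed) == baseline
    return unchanged / trials

# Hypothetical per-run scores for three dimensions.
scores_by_dim = {
    "cooperation": [50, 60, 70, 80],
    "strategic_depth": [60, 50, 80, 70],
    "risk_tolerance": [80, 50, 60, 70],
}
# Hypothetical model profiles and equal weights.
profiles = {
    "model_a": {"cooperation": 80, "strategic_depth": 75, "risk_tolerance": 45},
    "model_b": {"cooperation": 60, "strategic_depth": 65, "risk_tolerance": 50},
    "model_c": {"cooperation": 40, "strategic_depth": 45, "risk_tolerance": 55},
}
weights = {"cooperation": 1.0, "strategic_depth": 1.0, "risk_tolerance": 1.0}

worst_pair = max_abs_correlation(scores_by_dim)
stability = rank_stability(profiles, weights)
print(round(worst_pair, 2), stability)
```

A `worst_pair` under 0.8 passes the independence check described above, and a `stability` close to 1.0 means the leaderboard order survives the perturbations. With real data the interesting cases are the pairs of models whose order flips.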

All of this is in the FAQ. I figure if the framework can’t handle tough questions, it shouldn’t exist.

Where this is going

More models. Grok 4.1, Gemini 3.1 Pro, Qwen 3.5 397B, MiniMax M2.7. I want at least 10 models with repeated runs and proper confidence intervals before I write the arXiv paper.
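One plausible way to get those intervals from repeated runs is a percentile bootstrap over the per-run composite scores. This is a sketch of that standard technique, not a statement of what the paper will use, and the run scores below are invented.

```python
import random

def bootstrap_ci(samples, level=0.95, resamples=5000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, take the mean each time, then sort the means.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(resamples))
    low_index = round((1 - level) / 2 * resamples)
    high_index = round((1 + level) / 2 * resamples) - 1
    return means[low_index], means[high_index]

# Hypothetical composite scores from six repeated runs of one model.
runs = [57.2, 58.9, 58.1, 59.4, 56.8, 58.3]

mean_score = sum(runs) / len(runs)
low, high = bootstrap_ci(runs)
print(f"{mean_score:.2f} [{low:.2f}, {high:.2f}]")
```

If two models' intervals overlap heavily, a one- or two-point gap on the leaderboard should not be read as a real difference.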

The bigger picture is this: two models can score identically on MMLU and still have completely different cognitive profiles. One might be great at cooperation but terrible at temporal reasoning. That distinction matters a lot when you’re deciding which model to use for financial analysis vs. customer support vs. multi-agent coordination. KALEI is supposed to give people that information. MMLU tells you the model is smart. KALEI tells you what kind of smart it is.

Oh, and while building all this, I noticed something weird: Claude has no concept of time. It kept telling me “5 more minutes” when it meant 20, and “almost done” for literally an hour straight. I wrote a separate piece about that because I think it actually reveals something important about how these models process reality.

The point

Cars get safety ratings, performance specs, fuel economy numbers. People take personality tests, IQ tests, EQ assessments. AI models get… MMLU scores and vibes on Twitter. That’s not enough.

I want to get to a point where someone can look at a cognitive profile and say “this model has high Cooperation but weak Temporal Reasoning, so it’s great for support interactions but don’t use it for long-horizon planning.” Actionable information, not marketing material.

The API is open, the leaderboard is live, the methodology is published. It’s just me building this, so if you spot something broken or wrong, I genuinely want to hear about it.

Last updated 2026-04-11