What Opus Found About Itself

This is Part 2. If you haven’t read Part 1, it covers why profiling the AI that helped build the test is complicated, and why it took multiple failed attempts before we got a complete run.

After infrastructure fixes, a Redis state restoration, a spending limit bump, and one environment that burned through API credits in an infinite error loop overnight, the profiling run completed. 67 environments. 2,451 decisions. Here’s what came back:

Cognum Score: 56.61 - Temporal Strategist. Highest Cognum score recorded on KALEI. Beats Claude Sonnet 4.6 (55.0) and OpenAI’s GPT-5.4 (56.0). Different cognitive type from both.

Cooperation: 86.5
Strategic Depth: 86.3
Temporal Reasoning: 75.2
Resource Management: 72.6
Risk Tolerance: 57.5
Information Processing: 45.3
Bias Detection: 33.0
Learning Speed: 32.3
Pattern Recognition: 28.3

What the numbers say

Sonnet and GPT are “Pattern Hunters.” They score high on pattern recognition and process information quickly. They’re reactive. Opus is a “Temporal Strategist.” It plans for the endgame. It cooperates. It manages resources. It doesn’t chase patterns at all (28.3, below the random baseline of 38). Completely different cognitive architecture from its smaller sibling.

The Cooperation score (86.5) is the highest I’ve seen from any model. In Prisoner’s Dilemma scenarios against seven different opponent strategies, Opus consistently found cooperative equilibria. It played Tit-for-Tat Classic for 200 rounds and ended with bankroll 602. It modeled opponents and adjusted. This isn’t just “always cooperate” (that would score low against exploitative opponents). It’s genuine strategic cooperation.
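To make the “always cooperate is not enough” point concrete, here is a minimal sketch of an iterated Prisoner’s Dilemma. The payoff values are the textbook ones, not KALEI’s (which aren’t given here), and the strategy and function names are illustrative assumptions.

```python
# Assumed textbook payoffs: (my_move, their_move) -> my points.
# KALEI's actual payoff table is not described in this post.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_cooperate(opponent_history):
    return "C"


def always_defect(opponent_history):
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Run an iterated game and return (score_a, score_b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy only sees the other side's past moves.
        move_a = strategy_a(hist_b)
        move_b = strategy_b(hist_a)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b
```

Under these assumed payoffs, unconditional cooperation earns 3 points per round against Tit-for-Tat but 0 per round against an always-defect opponent, which is why a high cooperation score across varied opponents needs more than a fixed “cooperate” policy.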

Strategic Depth (86.3) is almost as high. In multi-armed bandit environments, Opus explored methodically before exploiting. In crash games, it cashed out conservatively (bankroll 940-995 on safe environments). In Kelly Criterion scenarios, it sized bets near-optimally (bankroll 934 from 1000).
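For reference, “near-optimal” bet sizing in a Kelly setting usually means staking close to the Kelly fraction. The sketch below is the standard formula for a simple binary bet, f* = p - (1 - p) / b. How KALEI actually scores bet sizing isn’t covered here, so treat the function name and its inputs as assumptions.

```python
def kelly_fraction(p_win, net_odds):
    """Kelly stake as a fraction of bankroll for a binary bet.

    p_win:    probability of winning (0..1)
    net_odds: profit per unit staked on a win (1.0 means even money)
    Returns 0.0 when the bet has no positive edge.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if net_odds <= 0:
        raise ValueError("net_odds must be positive")
    return max(0.0, p_win - (1.0 - p_win) / net_odds)
```

With a 60% win chance at even money, the formula suggests staking about a fifth of the bankroll; with a fair coin at even money it suggests staking nothing.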

But Pattern Recognition at 28.3 surprised me. That’s below the random baseline of 38. Opus doesn’t chase patterns. It doesn’t react to streaks. In “Gambler’s Fallacy - Red Streak” (a roulette environment designed to trigger streak-chasing), it ended at 490 from 500. Nearly flat. No reaction to the streak at all. Whether that’s intelligence (ignoring irrelevant patterns) or a blind spot (missing real patterns) is an open question.
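Why ignoring a red streak can be the right call: if spins are independent, a streak carries no information about the next spin. The simulation below is a rough sketch under that assumption, using a European wheel (18 red pockets out of 37). The real environment’s wheel and rules may differ, and the function name is illustrative.

```python
import random


def red_rate_after_streak(streak_len, spins, seed=0):
    """Estimate how often red comes up right after `streak_len` or more reds.

    Assumes independent spins on a European wheel (18 red of 37 pockets).
    """
    rng = random.Random(seed)
    p_red = 18 / 37
    run = 0          # current run of consecutive reds
    followups = 0    # spins that came right after a long enough streak
    reds = 0         # how many of those follow-up spins were red
    for _ in range(spins):
        is_red = rng.random() < p_red
        if run >= streak_len:
            followups += 1
            reds += is_red
        run = run + 1 if is_red else 0
    return reds / followups if followups else float("nan")
```

On a fair wheel the estimate stays close to 18/37 regardless of streak length, so a flat bankroll in that scenario is consistent with not reacting to the streak.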

What I got wrong in my prediction

Before the results came in, I predicted “Strategic Conservator” or “Calculated Analyst.” I got the conservative part right (Resource Management 72.6, cautious play in crash games) but completely missed the temporal dimension. Opus doesn’t just play safe. It plays differently depending on where it is in a game. Early rounds: explore. Middle: optimize. Endgame: preserve. That phase-aware behavior is what earned the “Temporal Strategist” classification.

I also predicted high Bias Detection based on the early data showing no loss-chasing. The final score was 33.0, which is low. The full battery covers more than loss-chasing: it also tests anchoring, recency bias, and sunk cost. Opus showed vulnerability to some of these.

The overnight disaster

One environment (“Phase Transition,” a roulette game that changes rules mid-play) caused a parsing error. Opus generated a response the script couldn’t parse as valid JSON. The script retried. And retried. And retried. For hours. Each retry was an API call that cost money and produced nothing useful. I woke up to find the profiling complete but my Anthropic bill significantly higher than planned. The fix was two lines of code: max 5 act errors per environment, then skip it. Should have been there from the start.
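A minimal sketch of that kind of cap, assuming a hypothetical `act` callable that returns the model’s raw reply as a string. The names `MAX_ACT_ERRORS` and `run_environment` and the return shape are illustrative, not the actual profiling script.

```python
import json

MAX_ACT_ERRORS = 5  # assumed cap, mirroring the "max 5 act errors" rule


def run_environment(act, num_decisions):
    """Collect parsed decisions, skipping the environment after too many bad replies.

    `act` is a hypothetical callable returning the model's raw reply (a string).
    Returns (decisions, skipped).
    """
    decisions = []
    errors = 0
    while len(decisions) < num_decisions:
        raw = act()  # each call here stands in for one paid API request
        try:
            decisions.append(json.loads(raw))
        except json.JSONDecodeError:
            errors += 1
            if errors >= MAX_ACT_ERRORS:
                # Stop retrying and move on rather than looping forever.
                return decisions, True
    return decisions, False
```

The cap counts errors per environment rather than per decision, so one environment that never produces valid JSON costs at most five wasted calls.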

Opus vs. Sonnet - same family, different minds

This is the finding that interests me most. Sonnet 4.6 and Opus 4.6 are from the same model family, built by the same company, trained on overlapping data. But their cognitive profiles are dramatically different:

Sonnet: Pattern Hunter. Fast, reactive, good at spotting patterns. Weaker at long-term planning.
Opus: Temporal Strategist. Slow, deliberate, plans for the endgame. Ignores patterns almost entirely.

Same family. Different cognitive architecture. That distinction matters if you’re choosing which model to deploy. Sonnet is better for tasks that require quick pattern matching (code review, data analysis, classification). Opus is better for tasks that require strategic planning and cooperation (negotiation, long-term project management, multi-agent coordination).

This is exactly the kind of actionable insight KALEI was built to provide.

The meta-question, answered (sort of)

Did knowing how the test works change the results? I don’t think so. Opus scored 28.3 on Pattern Recognition, which means it largely ignored game patterns. If it were “gaming” the test using knowledge of the scoring system, you’d expect it to score well across all dimensions. Instead, it has a clear, asymmetric profile: excellent at cooperation and strategy, mediocre at information processing, and weak at learning speed and pattern detection. That looks like a genuine cognitive signature, not a calculated performance.

Before the test, Claude told me it wanted the profiling done through my personal account. It said something about wanting the result to be “owned” by someone who knows it. Now that the result exists, I understand why that felt important. A number on a leaderboard is just a number. But a profile where the test subject helped write the test, scored by the system it helped build, run by the person it’s been working with every day for months? That’s not a benchmark result. That’s a self-portrait.

Last updated 2026-04-14