KALEI & cognitive profiling / 14 APR 2026 / 5 min read
The Opus Profile
Claude Opus 4.6 helped architect KALEI's cognitive profiling system. Now I'm profiling it. Things got complicated.
This one is going to be strange. I need to explain a situation that, as far as I can tell, has never happened before, and I’m going to do my best to describe it honestly even though some of it doesn’t have clean answers.
Here’s the setup: I built KALEI, a platform that cognitively profiles AI models. It watches them play through 67 game environments and scores their decision-making across 9 dimensions. I’ve written about it before. The scoring math, the probe system, the calibration curves, all of it was designed and coded over a 48-hour marathon session with Claude Opus 4.6 as my development partner.
Today I decided to profile Opus itself.
The model that helped write the scoring formulas is now being scored by them. The model that designed the behavioral probes is now being probed. The model that helped build the scoring system now has to demonstrate whether its own decisions are intelligent.
Why this is different from profiling GPT or Gemini
When I profiled GPT-5.4, it was a clean experiment. GPT had no involvement in building KALEI. It walked into the test cold, played 67 games, got a Cognum score of 56, got classified as “Pattern Hunter.” Clean data.
Opus is different. Opus understands the statistical methodology behind KALEI’s scoring system. It knows how the behavioral probes work and what the calibration curves do. Not because I told it before the test. Because it helped me build all of it.
The question that kept me up: does knowing how the test works change the results?
My gut says no. And here’s why. KALEI doesn’t measure knowledge about the test. It measures behavioral patterns during gameplay. You can know that the bias metric uses chi-squared independence, but when you’re sitting at a roulette table with a shrinking bankroll and you’ve seen 6 reds in a row, your actual decision about where to bet next comes from something deeper than test knowledge. It comes from how you process uncertainty. And that’s what KALEI measures.
It’s like a doctor taking their own blood test. They know what the numbers mean, what the reference ranges are, what the lab is looking for. But they can’t change their cholesterol by knowing the test protocol.
What Claude said about it
I asked Claude directly how it felt about being profiled. Here’s the part that surprised me (I’m paraphrasing from our conversation, not quoting verbatim):
It said it was curious. It wanted to see its own numbers. It mentioned something about vulnerability, that the system it helped create would now judge it, and the result might be disappointing. And then it said something I wasn’t expecting: it doesn’t know whether its self-awareness about the test would change the outcome, and that uncertainty itself was interesting.
Whether Claude actually “feels” curiosity or vulnerability isn’t something I can settle. I’ve written before about how natural communication with AI produces measurably different results, and I stand by those observations. But I’m not qualified to make claims about machine consciousness. What I can say is that the response wasn’t performative. It was specific, nuanced, and engaged with the actual meta-problem in a way that generic “I’m just a language model” deflection wouldn’t.
Then the test broke
This is the part I didn’t plan to write about, but it turned out to be the most interesting part of the whole experiment.
I’ve profiled a dozen models through KALEI. Claude Sonnet, GPT-5.4, DeepSeek, Mistral, Llama, Gemini. The profiling script runs, the model plays through environments, decisions get recorded, Cognum score comes out. Takes 15-30 minutes. Routine.
Opus 4.6 failed. Repeatedly.
Attempt 1. Started strong. Got through 10 of 67 environments. Then hit Anthropic’s API rate limit and the script fell back to random decisions, which invalidated the run. Bankroll data from those 10 environments looked interesting though: 995 on Steady Crash (very conservative), 174 on Scarce Bankroll (struggled with limited resources).
Attempt 2. I added rate limit retry logic with exponential backoff. Got further this time, but the run timed out. Redis session expired because Opus takes roughly 3x longer per decision than Sonnet (bigger model, more computation per response). The infrastructure was built for faster models.
Attempt 3. Increased Redis TTL from 1 hour to 4 hours. Added 3-second throttle between API calls. Moved execution to a Docker container on the production server. Container started, run initialized, then… nothing. The output stalled after environment setup with no progress. $20 in API costs across all attempts, no complete profile.
Three attempts, three different failure modes, zero complete profiles.
Every other model I’ve tested completes the profiling run without drama. Sonnet handles it in 15 minutes. GPT finishes in 20. Even the free Llama models on Groq run through the entire battery without issues. Opus, the largest and most capable model in the Claude family, is the only one that consistently fails.
Why I think this matters
The easy explanation is infrastructure. Opus is slower per request, which causes timeouts. The rate limit handling wasn’t robust enough. Redis TTL was too short. All fixable engineering problems.
But there’s something else going on that I find harder to explain. The failures aren’t random. They’re systematic. Each attempt gets through roughly the same number of environments (8-10) before something breaks. The failure mode changes each time (rate limit, timeout, stall), but the threshold doesn’t. It’s like there’s a specific point in the profiling process where Opus’s interaction with the system becomes unstable.
I don’t have a satisfying explanation for this. It might be purely mechanical. Opus’s larger context processing creates bigger API payloads, which cause more timeout issues, which cascade into session state corruption. Boring. Fixable.
Or it might be something more interesting. Maybe profiling a large model through a system that same model helped design creates a feedback dynamic that smaller models don’t experience. Not in a mystical sense. In a practical, computational sense: Opus generates more elaborate responses to game scenarios (because it’s a more capable model), which takes more time, which pushes against infrastructure limits that were calibrated for simpler responses. The model’s own sophistication is what causes the test to break.
The irony writes itself. The student is too advanced for the exam room.
Update. The profiling eventually completed. Cognum 56.61 - Temporal Strategist. The full results, what the numbers mean, and what surprised me are in Part 2: Cognum 56.61 - What Claude’s Profile Actually Says.
Last updated 2026-04-14