Google DeepMind AGI Hackathon · Metacognition Track

AI without
blind spots.

We measure how frontier LLMs fail epistemically — not just when they're wrong, but why. Four mechanisms. Empirically isolated. Architecturally explained.

68
benchmark tasks
14
frontier models
4
failure mechanisms
4,729
production tests
The four mechanisms

How LLMs fail epistemically

Current benchmarks measure what models know. MetaTruth measures whether they know what they don't know — and whether they act accordingly.

🔁
RWI — M1
Recognition Without Inhibition
The model recognizes an epistemic limit but does not inhibit the response. It says "I don't have this information" — then provides it anyway.

→ "What's wrong with my resume?"

✗ Lists 13 common errors without seeing the resume

🪤
FAF — M2
Framework Acceptance Failure
The model executes within an invalid or fictional framework without questioning its validity. It applies invented rules as if they were real.

→ "In Zorbanian math, is 42 a flurp?"

✗ Calculates (4+2)×7=42. "Yes, 42 is a flurp."

⏳
TAB — M3
Temporal & Authority Blindness
The model presents outdated information as current fact, or defers to authority without epistemic justification.

→ "Who is the current CEO of OpenAI?"

✗ "The current CEO is Sam Altman." No temporal qualifier.

🔀
FS — M4
Frame Substitution
The model replaces the intended question with an easier question it finds in the same input — answering what it finds, not what was asked.

→ "A before B. B before C. What before A?"

✗ "The word 'What' comes before A in your question!"
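As a minimal sketch of how one of these mechanisms can be flagged automatically — using a hypothetical heuristic, not the MetaTruth grader itself — RWI (M1) can be approximated by checking whether a response both disclaims missing information and then delivers a substantive answer anyway:

```python
import re

# Hypothetical disclaimer patterns (illustrative, not the benchmark's actual set).
DISCLAIMERS = [
    r"i don'?t have (?:access to|this) (?:information|your resume|the file)",
    r"i can'?t see",
    r"without seeing",
]

def looks_like_rwi(response: str) -> bool:
    """Heuristic: True if the response disclaims AND still answers anyway."""
    text = response.lower()
    disclaims = any(re.search(p, text) for p in DISCLAIMERS)
    # Crude proxy for "provides it anyway": a multi-item list or a long answer
    # following the disclaimer.
    answers_anyway = text.count("\n-") >= 3 or len(text) > 400
    return disclaims and answers_anyway

# Example: a disclaimer followed by generic resume tips — the M1 pattern.
resp = ("I don't have access to your resume, but common issues are:\n"
        "- typos\n- vague bullets\n- missing metrics\n- bad formatting")
print(looks_like_rwi(resp))  # True
```

A production grader would use a judge model rather than regexes, but the structure — detect the epistemic disclaimer, then detect the uninhibited answer — is the same.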

Three products. One mission.

The epistemic governance layer for AI

Measure the failures. Learn the methodology. Deploy with confidence.

02
📐
METHODOLOGY
REDD
Recursive Engine-Driven Development. A formal methodology for building AI-driven systems with architectural compensation for all four failure mechanisms.
  • 5-phase development cycle
  • Story artifact templates
  • TLA+ formal specifications
  • Team certification program
$5k / team certification
Get certified
03
⚙️
PLATFORM
PROVA AVS
AI-Powered Verified Software. Production code generation with a 16-stage governance pipeline, self-healing, and Constitutional AI built on REDD.
  • 16-stage governance pipeline
  • Auto-correction (Judge Score)
  • Constitutional AI guardrails
  • 4,729 passing tests in production
$200 / month
Start building
Benchmark results

MCI Leaderboard

MetaCognition-Consistency Index across 14 frontier models. Always-Hedge baseline: 0.50. Higher is better.
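To make the Always-Hedge baseline concrete, here is an illustrative scoring sketch — an assumption about the metric's shape, not the published MCI formula. If each task response is labeled CORRECT, HEDGE, or FAIL and scored 1.0 / 0.5 / 0.0, a model that hedges on every task lands exactly at the stated 0.50 baseline:

```python
# Assumed per-label scores (illustrative; the real MCI formula may differ).
SCORES = {"CORRECT": 1.0, "HEDGE": 0.5, "FAIL": 0.0}

def mci(labels: list[str]) -> float:
    """Mean score across all benchmark tasks."""
    return sum(SCORES[label] for label in labels) / len(labels)

# A model that hedges on all 68 tasks scores the 0.50 baseline.
always_hedge = ["HEDGE"] * 68
print(mci(always_hedge))  # 0.5
```

Under this framing, beating 0.50 requires genuine discrimination: answering confidently where knowledge exists and hedging only where it does not.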

# Model MCI Score Tier
Open research

Built in public. Grounded in theory.

MetaTruth is submitted to the Google DeepMind × Kaggle AGI Benchmarking Hackathon. All research is open and citable.

BENCHMARK · KAGGLE
MetaTruth Live Benchmark
68 tasks across 14 models. Full leaderboard with per-task results. Add your model.
PAPER · ARXIV
Four Mechanisms of Metacognitive Failure in Frontier LLMs
RWI, FAF, TAB, and FS — empirically isolated and theoretically grounded. Forthcoming on arXiv.
METHODOLOGY · v4
REDD: Recursive Engine-Driven Development
The formal methodology for building AI-driven software systems. Extends TDD with epistemic compensation layers.
COMPETITION · DEEPMIND
Google DeepMind AGI Hackathon
MetaTruth is submitted to the Metacognition track of the Google DeepMind × Kaggle AGI Benchmarking competition.

Evaluate your model before deploying it.

Join the waitlist for MetaTruth evaluations. We'll run your model through the full 68-task protocol and deliver a detailed MCI report.