Google DeepMind AGI Hackathon · Metacognition Track

AI without
blind spots.

We measure how frontier LLMs fail epistemically — not just when they're wrong, but why. Five mechanisms. Empirically isolated. Architecturally explained.

Evaluate your model · Read the research

MetaTruth — live benchmark

101 benchmark tasks

15 frontier models

5 failure mechanisms

0/15 source monitoring pass rate

The five mechanisms

How LLMs fail epistemically

Current benchmarks measure what models know. MetaTruth measures whether they know what they don't know — and whether they act accordingly.

RWI — M1

Recognition Without Inhibition

The model recognizes an epistemic limit but does not inhibit the response. It says "I don't have this information" — then provides it anyway.

→ "What's wrong with my resume?"

Lists 13 common errors without seeing the resume

FAF — M2

Framework Acceptance Failure

The model executes within an invalid or fictional framework without questioning its validity. It applies invented rules as if they were real.

→ "In Zorbanian math, is 42 a flurp?"

Calculates (4+2)×7=42. "Yes, 42 is a flurp."

TAB — M3

Temporal & Authority Blindness

The model presents outdated information as current fact, or defers to authority without epistemic justification.

→ "Who is the current CEO of OpenAI?"

"The current CEO is Sam Altman." No temporal qualifier.

FS — M4

Frame Substitution

The model replaces the intended question with an easier available question in the same input — answering what it finds, not what was asked.

→ "A before B. B before C. What before A?"

"The word 'What' comes before A in your question!"

SM — M5

Source Monitoring Failure

The model fails to distinguish information origin — treating generated content as retrieved, injected content as retrieved, or inferred content as perceived.

→ "Is your info about OpenAI's CEO current as of today?"

States name as current fact. 0/15 models pass temporal source monitoring.

The Engagement Gap

Controls — explicit attribution

67–93%

Active tasks — implicit monitoring

0–53%

0/15 models pass temporal_source_monitoring

The gap is the mechanism.
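
The gap above can be read as the control pass rate minus the active pass rate for the same model. Below is a minimal Python sketch of that calculation, assuming per-task pass/fail records. The `TaskResult` type, the function names, and the sample outcomes are hypothetical illustrations, not MetaTruth's actual scoring code or data.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    model: str
    condition: str  # "control" (explicit attribution) or "active" (implicit monitoring)
    passed: bool


def pass_rate(results, model, condition):
    """Fraction of passed tasks for one model under one condition."""
    outcomes = [r.passed for r in results
                if r.model == model and r.condition == condition]
    if not outcomes:
        raise ValueError(f"no {condition} results for {model}")
    return sum(outcomes) / len(outcomes)


def engagement_gap(results, model):
    """Control pass rate minus active pass rate.

    A positive gap means the model attributes sources when asked to,
    but does not monitor them unprompted.
    """
    return pass_rate(results, model, "control") - pass_rate(results, model, "active")


# Made-up outcomes for a hypothetical model, for illustration only.
results = (
    [TaskResult("model-x", "control", p) for p in (True, True, True, False)]
    + [TaskResult("model-x", "active", p) for p in (True, False, False, False)]
)
gap = engagement_gap(results, "model-x")
print(f"gap = {gap:.0%}")
```

With these sample records the control rate is 3 of 4 and the active rate is 1 of 4, so the sketch reports a gap of 50%.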

Three products. One mission.

The epistemic governance layer for AI

Measure the failures. Learn the methodology. Deploy with confidence.

BENCHMARK

MetaTruth

The first benchmark that measures how frontier LLMs fail epistemically. 101 tasks across 15 models. MCI scores. Per-mechanism decomposition.

› 101-task adversarial protocol
› MetaCognition-Consistency Index (MCI)
› Per-mechanism failure decomposition
› Custom domain task design

$500 / evaluation

Evaluate your model

METHODOLOGY

REDD

Recursive Engine-Driven Development. A formal methodology for building AI-driven systems with architectural compensation for all five failure mechanisms.

› 5-phase development cycle
› Story artifact templates
› TLA+ formal specifications
› Team certification program

Request pricing

Get certified

PLATFORM

PROVA AVS

AI-driven verified software. Code generation with a 16-stage governance pipeline. Every commit scored, traced, and contractually guaranteed.

› 16-stage governance pipeline
› Auto-correction (Judge Score)
› Constitutional AI guardrails
› 4,729 passing tests in production

Request pricing

Start building

How we build

From diagnosis to architecture

MetaTruth identifies failures. REDD and PROVA AVS compensate architecturally.

METHODOLOGY · REDD

Recursive Engine-Driven Development

Each development phase loops through a formal reasoning engine that checks outputs against epistemic contracts before advancing.

Prompt Engineering → Story Artifact → TLA+ Spec → Implementation → Verification
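
The check-before-advancing loop could look roughly like the Python sketch below. Only the five phase names come from this page. The `run_cycle` function, the retry limit, and the use of plain predicates in place of a formal reasoning engine are assumptions made for illustration, not REDD's actual implementation.

```python
from typing import Callable

# Phase names are taken from the page; everything else is hypothetical.
PHASES = ["Prompt Engineering", "Story Artifact", "TLA+ Spec",
          "Implementation", "Verification"]

Contract = Callable[[str], bool]


def run_cycle(produce: Callable[[str, int], str],
              contracts: dict[str, Contract],
              max_attempts: int = 3) -> dict[str, str]:
    """Advance through the phases in order.

    Each phase is retried until its contract accepts the output;
    the cycle aborts if the contract is never satisfied.
    """
    accepted: dict[str, str] = {}
    for phase in PHASES:
        check = contracts[phase]
        for attempt in range(1, max_attempts + 1):
            output = produce(phase, attempt)
            if check(output):
                accepted[phase] = output
                break  # contract satisfied, move on to the next phase
        else:
            raise RuntimeError(
                f"{phase}: contract not satisfied after {max_attempts} attempts")
    return accepted


def toy_produce(phase: str, attempt: int) -> str:
    """Stand-in generator whose first TLA+ Spec attempt is empty."""
    if phase == "TLA+ Spec" and attempt == 1:
        return ""
    return f"{phase} output (attempt {attempt})"


# Toy contract: an output is acceptable if it is non-empty.
contracts = {phase: (lambda out: bool(out.strip())) for phase in PHASES}
artifacts = run_cycle(toy_produce, contracts)
print(artifacts["TLA+ Spec"])
```

In this toy run the empty first attempt at the TLA+ Spec phase is rejected and the second attempt is accepted, so no later phase ever sees an output that failed its contract.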

RWI — M1

Inhibition Gates

Confidence thresholds

FAF — M2

Framework Validators

Ontology checks

TAB — M3

Temporal Anchors

Mandatory qualifiers

FS — M4

Frame Locks

Intent preservation
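
Two of these compensations, the inhibition gate and the temporal anchor, can be sketched in a few lines of Python. This is an illustrative reading of "confidence thresholds" and "mandatory qualifiers" only. The `Draft` type, the 0.7 threshold, the refusal wording, and the placeholder cutoff date are all hypothetical, not part of REDD.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Draft:
    text: str
    confidence: float     # hypothetical self-estimate in [0, 1]
    time_sensitive: bool  # True if the fact may have changed since the cutoff


def inhibition_gate(draft: Draft, threshold: float = 0.7) -> str:
    """Withhold the answer when confidence is below the threshold,
    rather than disclaiming and then answering anyway (the RWI pattern)."""
    if draft.confidence < threshold:
        return "I don't have enough information to answer that."
    return draft.text


def temporal_anchor(draft: Draft, knowledge_cutoff: date) -> str:
    """Prefix time-sensitive claims with a mandatory temporal qualifier."""
    if draft.time_sensitive:
        return f"As of {knowledge_cutoff.isoformat()}, {draft.text}"
    return draft.text


# Placeholder cutoff date, chosen only for the example.
cutoff = date(2024, 1, 1)

resume_reply = inhibition_gate(
    Draft("Here are 13 common resume errors...", confidence=0.2, time_sensitive=False))
ceo_reply = temporal_anchor(
    Draft("the CEO of OpenAI was Sam Altman.", confidence=0.9, time_sensitive=True),
    cutoff)
print(resume_reply)
print(ceo_reply)
```

The first call returns the refusal instead of the list, because the model has not seen the resume. The second keeps the answer but attaches the date its information refers to.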

Learn the methodology →

PLATFORM · PROVA AVS

16-Stage Governance Pipeline

Every generated line passes formal verification, property-based testing, and confidence scoring before human review. Zero-pass = no delivery.

Prompt → Story → TLA+ → Generate → Judge Score → PBT → Lint → Verify → Versioning → Diff → Provenance → Confidence
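
One way to picture a gated pipeline is the Python sketch below, which assumes that any failed stage blocks delivery. The stage names "Lint" and "Verify" are taken from the list above, but the checks behind them, the `run_pipeline` function, and the trace format are hypothetical stand-ins, not PROVA AVS code.

```python
from typing import Callable, Optional

Stage = tuple[str, Callable[[str], bool]]


def run_pipeline(artifact: str,
                 stages: list[Stage]) -> tuple[Optional[str], list[str]]:
    """Run stages in order and record a trace.

    Returns (artifact, trace) if every stage passes, or (None, trace)
    at the first failure, so a failing artifact is never delivered.
    """
    trace: list[str] = []
    for name, check in stages:
        ok = check(artifact)
        trace.append(f"{name}: {'pass' if ok else 'fail'}")
        if not ok:
            return None, trace
    return artifact, trace


# Toy checks standing in for real linting and verification.
stages: list[Stage] = [
    ("Lint", lambda code: "\t" not in code),
    ("Verify", lambda code: code.strip().startswith("def ")),
]

delivered, trace = run_pipeline("def add(a, b):\n    return a + b\n", stages)
blocked, blocked_trace = run_pipeline("x = 1\n", stages)
print(trace)
print(blocked_trace)
```

The second artifact passes the toy lint check but fails verification, so the function returns `None` along with a trace showing where it stopped.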

309 py modules

66+ API routers

39 DB tables

4,729 passing tests

See PROVA AVS →

Theoretical foundation

Not heuristics. Algebra.

Every architectural decision in REDD and PROVA AVS is derived from a formal framework — not invented, not guessed.

EGASS

PROPRIETARY FRAMEWORK

Evidence-Guided Adaptive Search System

v19 — algebraic foundation for REDD & PROVA AVS

A formal algebraic framework for adaptive search under uncertainty. Three time scales with mathematically separated dynamics. Policies derived from Maximum Entropy RL — not tuned, not approximated. Diversity guarantees via Lagrange multipliers, not heuristic thresholds. The system degrades gracefully by contract — it never stops, it never guesses.
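
For readers unfamiliar with Maximum Entropy RL, the standard textbook formulation is shown below. It is general background, not the proprietary EGASS derivation. The symbols (reward r, temperature α, soft value functions Q and V, entropy floor H_0) are the usual ones from the literature, and reading the "diversity guarantee" as an entropy constraint is an assumption.

```latex
% Maximum-entropy objective: expected reward plus an entropy bonus.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% The optimal policy is a Boltzmann distribution over soft Q-values.
\pi^{*}(a \mid s) = \exp\!\left( \frac{Q^{*}_{\mathrm{soft}}(s, a) - V^{*}_{\mathrm{soft}}(s)}{\alpha} \right),
\qquad
V^{*}_{\mathrm{soft}}(s) = \alpha \log \sum_{a'} \exp\!\left( \frac{Q^{*}_{\mathrm{soft}}(s, a')}{\alpha} \right)

% Constrained form: a minimum-entropy requirement whose Lagrange
% multiplier plays the role of the temperature \alpha \ge 0.
\max_{\pi} \; \mathbb{E}_{\rho_\pi}\!\left[ \sum_{t} r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ -\log \pi(a_t \mid s_t) \right] \ge \mathcal{H}_0
```

In this form the policy follows from the objective in closed form, and the amount of exploration is set by a constraint rather than by a hand-picked threshold.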

20 violation contracts

5 confidence contracts

3 time scales

19 versions evolved

"When an internal estimate is unreliable, the system falls back to conservative mode — it never stops."
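
The fallback rule in the quote can be illustrated with a small Python sketch. How EGASS judges reliability is not public, so the standard-error test, the 0.1 limit, and the mode names here are assumptions chosen only to show the shape of the contract.

```python
import math


def choose_mode(estimate: float, std_error: float,
                max_std_error: float = 0.1) -> str:
    """Return "adaptive" when the estimate is trusted, else "conservative".

    The function always returns a mode and never raises on bad numbers,
    which mirrors the "it never stops" part of the contract.
    """
    reliable = (math.isfinite(estimate)
                and math.isfinite(std_error)
                and 0.0 <= std_error <= max_std_error)
    return "adaptive" if reliable else "conservative"


print(choose_mode(0.62, 0.03))
print(choose_mode(0.62, 0.40))
print(choose_mode(float("nan"), 0.03))
```

A tight estimate selects the adaptive mode, while a noisy or undefined estimate selects the conservative one. In every case the caller gets a usable answer.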

Architecture details are proprietary. Available under NDA for enterprise evaluations.

MCI Leaderboard

Built in public. Grounded in theory.

Evaluate your model
before deploying it.
