CausalBench
The first comprehensive benchmark for causal reasoning in AI

CausalBench tests models on all three layers of Pearl's Causal Hierarchy — with a focus on Layer 2 (intervention) and Layer 3 (counterfactual) — the tasks LLMs cannot solve by design.

Foundation

Pearl's Causal Hierarchy

Three layers of causal reasoning, each strictly more powerful than the last. CausalBench focuses on L2 and L3 — the layers where LLMs fundamentally fail.

L1

Association

P(Y | X)

"What is Y, given that I observe X?"

Passive observation. Pattern matching. This is where all current LLMs operate.

L2

Intervention

P(Y | do(X))

"What happens to Y if I set X to value v?"

Active manipulation. Requires understanding causal mechanisms, not just correlations.

L3

Counterfactual

P(Y_x | X', Y')

"Would Y have been different if X had been different?"

Imagined alternatives. The highest rung — requires full structural causal models.

Each layer is strictly more expressive than the one below: L1 ⊂ L2 ⊂ L3. Without causal assumptions, no amount of L1 data can answer L2 questions. This is a mathematical impossibility, not an engineering limitation.
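The gap between L1 and L2 can be made concrete with a toy confounded system (an illustration only, not CausalBench code): a hidden cause Z drives both X and Y, so observing X tells you a lot about Y, but setting X does nothing to Y.

```python
# Illustration (not CausalBench internals): a confounded system where the
# observational P(Y | X) and the interventional P(Y | do(X)) disagree.
import random

random.seed(0)
N = 100_000

def sample(do_x=None):
    z = random.random() < 0.5                                    # hidden confounder
    x = (random.random() < (0.9 if z else 0.1)) if do_x is None else do_x
    y = random.random() < (0.8 if z else 0.2)                    # Y depends only on Z
    return x, y

# L1: passive observation. P(Y=1 | X=1) is high because X and Y share the cause Z.
obs = [sample() for _ in range(N)]
p_y_given_x1 = sum(y for x, y in obs if x) / sum(x for x, _ in obs)

# L2: intervention. do(X=1) cuts the Z -> X edge, so Y's distribution is unchanged.
do1 = [sample(do_x=True) for _ in range(N)]
p_y_do_x1 = sum(y for _, y in do1) / N

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.2f}")   # ~0.74
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.2f}")      # ~0.50
```

A model that has only ever fit the observational distribution will report 0.74 for both questions; the true interventional answer, 0.50, is only recoverable from the causal structure.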

Results

Abel vs large language models

LLMs score near random on Layer 2 and Layer 3 tasks: intervention questions provably cannot be answered by pattern matching over observational data alone.

CausalBench v1.0

March 2026

Model accuracy across Pearl's Causal Hierarchy

Intervention

P(Y | do(X)) — causal manipulation

L2
1. Abel: 84%
2. Claude 3.5: 26%
3. GPT-4o: 23%
4. Gemini 1.5: 20%

LLMs score near random on intervention tasks: in general, P(Y | do(X)) cannot be computed from observational patterns alone.

Results from CausalBench v1.0 (March 2026). Accuracy is exact-match for L1/L2/L3; path fidelity for cross-domain.

Categories

Six dimensions of causal reasoning

Each category tests a distinct aspect of causal competence, spanning all three layers of Pearl's hierarchy.

L1 · 500 pairs

Causal Direction

Given A and B, which causes which?

L2 · 300 queries

Intervention Effect

If I set X to value v, what happens to Y?

L3 · 200 scenarios

Counterfactual

If X had been different, would Y have changed?

L2-L3 · 150 chains

Cross-Domain Chains

Trace causal effects across economic domains

L2 · 100 snapshots

Regime Detection

Has the causal structure changed recently?

L2 · 250 cases

Confounding

Identify and adjust for spurious correlations
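The standard tool for the Confounding category's task is backdoor adjustment: when the confounder Z is observed, P(Y | do(X)) = Σ_z P(Y | X, Z=z) P(Z=z). A minimal sketch on simulated data (illustrative only; not how CausalBench scores models):

```python
# Backdoor adjustment sketch: stratify on the observed confounder Z,
# then reweight by P(Z) to recover the interventional effect.
import random

random.seed(1)
rows = []
for _ in range(200_000):
    z = random.random() < 0.5                       # confounder
    x = random.random() < (0.9 if z else 0.1)       # Z -> X
    y = random.random() < (0.8 if z else 0.2)       # Z -> Y (X has no effect)
    rows.append((z, x, y))

# Naive estimate: confounded by Z.
naive = sum(y for z, x, y in rows if x) / sum(x for z, x, y in rows)

# Adjusted estimate: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) * P(Z=z)
adjusted = 0.0
for zv in (False, True):
    stratum = [(x, y) for z, x, y in rows if z == zv]
    p_z = len(stratum) / len(rows)
    y_given_x1 = [y for x, y in stratum if x]
    adjusted += p_z * (sum(y_given_x1) / len(y_given_x1))

print(f"naive    ~ {naive:.2f}")     # ~0.74, biased upward by Z
print(f"adjusted ~ {adjusted:.2f}")  # ~0.50, the true do-effect (X is inert)
```

The naive estimate mistakes the Z-induced correlation for a causal effect; adjustment removes it.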

Methodology

How CausalBench is constructed

Data Construction

  • Ground-truth causal graphs derived from Abel's live causal discovery engine
  • Intervention and counterfactual questions generated from structural causal models
  • Cross-domain chains validated by expert review and statistical testing
  • Temporal regime shifts identified from PCMCI structural break detection
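One standard recipe for generating counterfactual ground truth from a structural causal model is Pearl's three-step abduction-action-prediction procedure. A minimal sketch on a toy linear SCM (hypothetical; the function and SCM here are illustrations, not Abel's engine):

```python
# Abduction-action-prediction on a toy linear SCM (illustrative only).
# SCM: X = U_x;  Y = 2*X + U_y

def counterfactual_y(x_obs: float, y_obs: float, x_cf: float) -> float:
    # 1. Abduction: infer the exogenous noise consistent with the observation.
    u_y = y_obs - 2 * x_obs
    # 2. Action: replace X's equation with X := x_cf (the do-operation).
    # 3. Prediction: recompute Y under the same noise.
    return 2 * x_cf + u_y

# "We observed X=1, Y=3. What would Y have been had X been 2?"
print(counterfactual_y(x_obs=1, y_obs=3, x_cf=2))  # -> 5
```

Because the noise recovered in step 1 is held fixed, the query is answered for the same individual history, which is exactly what distinguishes L3 from L2.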

Evaluation Criteria

  • Exact-match accuracy for direction and structure identification
  • Distributional distance for intervention effect estimation
  • Counterfactual consistency under alternative histories
  • Cross-domain path fidelity — proportion of correct causal hops
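"Proportion of correct causal hops" admits a simple concrete reading; the sketch below assumes a particular scoring (fraction of predicted directed hops present in the ground-truth chain) rather than reproducing CausalBench's exact implementation, and the chain contents are invented:

```python
# Hypothetical path-fidelity metric: the fraction of a model's predicted
# directed hops (A -> B) that appear in the ground-truth causal chain.

def path_fidelity(predicted: list[str], truth: list[str]) -> float:
    true_hops = set(zip(truth, truth[1:]))
    pred_hops = list(zip(predicted, predicted[1:]))
    if not pred_hops:
        return 0.0
    return sum(h in true_hops for h in pred_hops) / len(pred_hops)

truth = ["oil price", "shipping cost", "CPI", "policy rate"]
pred  = ["oil price", "shipping cost", "policy rate"]      # skips one node
print(path_fidelity(pred, truth))  # 0.5: one of the two predicted hops is real
```

Partial credit for partially correct chains is what makes this metric gentler than exact match on multi-hop questions.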

Contribute to CausalBench

CausalBench is open source. Submit new test cases, propose categories, or benchmark your own model.