CausalBench
The first comprehensive benchmark for causal reasoning in AI

CausalBench tests models on all three layers of Pearl's Causal Hierarchy — with a focus on Layer 2 (intervention) and Layer 3 (counterfactual) — the tasks LLMs cannot solve by design.

Foundation

Pearl's Causal Hierarchy

Three layers of causal reasoning, each strictly more powerful than the last. CausalBench focuses on L2 and L3 — the layers where LLMs fundamentally fail.

L1

Association

P(Y | X)

"What is Y, given that I observe X?"

Passive observation. Pattern matching. This is where all current LLMs operate.

L2

Intervention

P(Y | do(X))

"What happens to Y if I set X to value v?"

Active manipulation. Requires understanding causal mechanisms, not just correlations.

L3

Counterfactual

P(Y_x | X', Y')

"Would Y have been different if X had been different?"

Imagined alternatives. The highest rung — requires full structural causal models.

Each layer is strictly more expressive than the one below: L1 ⊂ L2 ⊂ L3. Without causal assumptions, no amount of L1 data can answer L2 questions. This is a mathematical impossibility, not an engineering limitation.
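The gap between L1 and L2 can be made concrete with a toy confounded system (an illustration only, not CausalBench code): a hidden cause Z drives both X and Y, so observing X tells you a lot about Y, but setting X does nothing to Y.

```python
# Illustration (not CausalBench internals): a confounded system where the
# observational P(Y | X) and the interventional P(Y | do(X)) disagree.
import random

random.seed(0)
N = 100_000

def sample(do_x=None):
    z = random.random() < 0.5                                    # hidden confounder
    x = (random.random() < (0.9 if z else 0.1)) if do_x is None else do_x
    y = random.random() < (0.8 if z else 0.2)                    # Y depends only on Z
    return x, y

# L1: passive observation. P(Y=1 | X=1) is high because X and Y share the cause Z.
obs = [sample() for _ in range(N)]
p_y_given_x1 = sum(y for x, y in obs if x) / sum(x for x, _ in obs)

# L2: intervention. do(X=1) cuts the Z -> X edge, so Y's distribution is unchanged.
do1 = [sample(do_x=True) for _ in range(N)]
p_y_do_x1 = sum(y for _, y in do1) / N

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.2f}")   # ~0.74
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.2f}")      # ~0.50
```

A model that has only ever fit the observational distribution will report 0.74 for both questions; the true interventional answer, 0.50, is only recoverable from the causal structure.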

Results

Abel vs large language models

LLMs score near random on Layer 2 and Layer 3 tasks: intervention questions provably cannot be answered by pattern matching over observational data alone.

CausalBench v1.0

March 2026

Model accuracy across Pearl's Causal Hierarchy

Intervention

P(Y | do(X)) — causal manipulation

L2
1. Abel: 84%
2. Claude 3.5: 26%
3. GPT-4o: 23%
4. Gemini 1.5: 20%

LLMs score near random on intervention tasks: in general, P(Y | do(X)) cannot be computed from observational patterns alone.

Results from CausalBench v1.0 (March 2026). Accuracy is exact-match for L1/L2/L3; path fidelity for cross-domain.

Categories

Six dimensions of causal reasoning

Each category tests a distinct aspect of causal competence, spanning all three layers of Pearl's hierarchy.

L1 · 500 pairs

Causal Direction

Given A and B, which causes which?

L2 · 300 queries

Intervention Effect

If I set X to value v, what happens to Y?

L3 · 200 scenarios

Counterfactual

If X had been different, would Y have changed?

L2-L3 · 150 chains

Cross-Domain Chains

Trace causal effects across economic domains

L2 · 100 snapshots

Regime Detection

Has the causal structure changed recently?

L2 · 250 cases

Confounding

Identify and adjust for spurious correlations
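The standard tool for the Confounding category's task is backdoor adjustment: when the confounder Z is observed, P(Y | do(X)) = Σ_z P(Y | X, Z=z) P(Z=z). A minimal sketch on simulated data (illustrative only; not how CausalBench scores models):

```python
# Backdoor adjustment sketch: stratify on the observed confounder Z,
# then reweight by P(Z) to recover the interventional effect.
import random

random.seed(1)
rows = []
for _ in range(200_000):
    z = random.random() < 0.5                       # confounder
    x = random.random() < (0.9 if z else 0.1)       # Z -> X
    y = random.random() < (0.8 if z else 0.2)       # Z -> Y (X has no effect)
    rows.append((z, x, y))

# Naive estimate: confounded by Z.
naive = sum(y for z, x, y in rows if x) / sum(x for z, x, y in rows)

# Adjusted estimate: P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, Z=z) * P(Z=z)
adjusted = 0.0
for zv in (False, True):
    stratum = [(x, y) for z, x, y in rows if z == zv]
    p_z = len(stratum) / len(rows)
    y_given_x1 = [y for x, y in stratum if x]
    adjusted += p_z * (sum(y_given_x1) / len(y_given_x1))

print(f"naive    ~ {naive:.2f}")     # ~0.74, biased upward by Z
print(f"adjusted ~ {adjusted:.2f}")  # ~0.50, the true do-effect (X is inert)
```

The naive estimate mistakes the Z-induced correlation for a causal effect; adjustment removes it.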

Methodology

How CausalBench is constructed

Data Construction

  • Ground-truth causal graphs derived from Abel's live causal discovery engine
  • Intervention and counterfactual questions generated from structural causal models
  • Cross-domain chains validated by expert review and statistical testing
  • Temporal regime shifts identified from PCMCI structural break detection
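One standard recipe for generating counterfactual ground truth from a structural causal model is Pearl's three-step abduction-action-prediction procedure. A minimal sketch on a toy linear SCM (hypothetical; the function and SCM here are illustrations, not Abel's engine):

```python
# Abduction-action-prediction on a toy linear SCM (illustrative only).
# SCM: X = U_x;  Y = 2*X + U_y

def counterfactual_y(x_obs: float, y_obs: float, x_cf: float) -> float:
    # 1. Abduction: infer the exogenous noise consistent with the observation.
    u_y = y_obs - 2 * x_obs
    # 2. Action: replace X's equation with X := x_cf (the do-operation).
    # 3. Prediction: recompute Y under the same noise.
    return 2 * x_cf + u_y

# "We observed X=1, Y=3. What would Y have been had X been 2?"
print(counterfactual_y(x_obs=1, y_obs=3, x_cf=2))  # -> 5
```

Because the noise recovered in step 1 is held fixed, the query is answered for the same individual history, which is exactly what distinguishes L3 from L2.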

Evaluation Criteria

  • Exact-match accuracy for direction and structure identification
  • Distributional distance for intervention effect estimation
  • Counterfactual consistency under alternative histories
  • Cross-domain path fidelity — proportion of correct causal hops
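"Proportion of correct causal hops" admits a simple concrete reading; the sketch below assumes a particular scoring (fraction of predicted directed hops present in the ground-truth chain) rather than reproducing CausalBench's exact implementation, and the chain contents are invented:

```python
# Hypothetical path-fidelity metric: the fraction of a model's predicted
# directed hops (A -> B) that appear in the ground-truth causal chain.

def path_fidelity(predicted: list[str], truth: list[str]) -> float:
    true_hops = set(zip(truth, truth[1:]))
    pred_hops = list(zip(predicted, predicted[1:]))
    if not pred_hops:
        return 0.0
    return sum(h in true_hops for h in pred_hops) / len(pred_hops)

truth = ["oil price", "shipping cost", "CPI", "policy rate"]
pred  = ["oil price", "shipping cost", "policy rate"]      # skips one node
print(path_fidelity(pred, truth))  # 0.5: one of the two predicted hops is real
```

Partial credit for partially correct chains is what makes this metric gentler than exact match on multi-hop questions.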

Contribute to CausalBench

CausalBench is open source. Submit new test cases, propose categories, or benchmark your own model.