CausalBench
The first comprehensive benchmark for causal reasoning in AI
CausalBench tests models on all three layers of Pearl's Causal Hierarchy — with a focus on Layer 2 (intervention) and Layer 3 (counterfactual) — the tasks LLMs cannot solve by design.
Foundation
Pearl's Causal Hierarchy
Three layers of causal reasoning, each strictly more powerful than the last. CausalBench focuses on L2 and L3 — the layers where LLMs fundamentally fail.
Association
P(Y | X)
"What is Y, given that I observe X?"
Passive observation. Pattern matching. This is where all current LLMs operate.
Intervention
P(Y | do(X))
"What happens to Y if I set X to value v?"
Active manipulation. Requires understanding causal mechanisms, not just correlations.
Counterfactual
P(Y_x | X', Y')
"Would Y have been different if X had been different?"
Imagined alternatives. The highest rung — requires full structural causal models.
Each layer is strictly more powerful: L1 ⊂ L2 ⊂ L3. No amount of L1 data can answer L2 questions — a mathematical impossibility, not an engineering limitation.
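The L1/L2 gap fits in a few lines of simulation. In this toy structural causal model (our illustration, not a CausalBench item), a hidden confounder Z drives both X and Y, and X has no effect on Y at all; conditioning on X = 1 and intervening with do(X = 1) then give very different answers:

```python
import random

random.seed(0)

def sample(do_x=None):
    """One draw from a toy SCM: Z -> X, Z -> Y (X has no effect on Y).
    Passing do_x severs the Z -> X edge, i.e. the do() operator."""
    z = random.random() < 0.5          # hidden confounder
    x = z if do_x is None else do_x    # X copies Z unless we intervene
    y = z                              # Y is driven by Z alone
    return x, y

N = 100_000

# Layer 1: P(Y=1 | X=1) -- condition on having observed X=1
obs = [y for x, y in (sample() for _ in range(N)) if x]
p_obs = sum(obs) / len(obs)

# Layer 2: P(Y=1 | do(X=1)) -- force X=1; Z keeps its own distribution
itv = [y for _, y in (sample(do_x=True) for _ in range(N))]
p_do = sum(itv) / len(itv)

print(f"P(Y=1 | X=1)     = {p_obs:.2f}")   # ~1.00: X and Y share cause Z
print(f"P(Y=1 | do(X=1)) = {p_do:.2f}")    # ~0.50: X has no causal effect
```

No Layer 1 quantity computed from the observational draws alone distinguishes these two numbers; the difference exists only in the mechanism.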
Results
Abel vs large language models
LLMs score near-random on Layer 2/3 tasks because pattern matching over observational data cannot, in general, answer intervention questions: this is the content of the Causal Hierarchy Theorem, not an engineering gap.
CausalBench v1.0
March 2026
Model accuracy across Pearl's Causal Hierarchy
Results from CausalBench v1.0 (March 2026). Accuracy is exact-match for L1/L2/L3; path fidelity for cross-domain.
Categories
Six dimensions of causal reasoning
Each category tests a distinct aspect of causal competence, spanning all three layers of Pearl's hierarchy.
Causal Direction
“Given A and B, which causes which?”
Intervention Effect
“If I set X to value v, what happens to Y?”
Counterfactual
“If X had been different, would Y have changed?”
Cross-Domain Chains
“Trace causal effects across economic domains”
Regime Detection
“Has the causal structure changed recently?”
Confounding
“Identify and adjust for spurious correlations”
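As a sketch of what the Confounding category tests, consider a toy dataset of our own devising (not a benchmark item): a naive observational contrast is biased by a common cause Z, while a backdoor adjustment over Z recovers the true effect.

```python
import random

random.seed(1)

# Toy data with a known confounder Z: Z -> X, Z -> Y, and X -> Y.
# The true causal effect of X on Y is +0.30 by construction.
N = 200_000
data = []
for _ in range(N):
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)
    y = random.random() < 0.2 + 0.3 * x + 0.4 * z
    data.append((z, x, y))

def mean_y(pred):
    """Average Y over rows selected by a predicate on (Z, X)."""
    rows = [y for z, x, y in data if pred(z, x)]
    return sum(rows) / len(rows)

# Naive (Layer 1) contrast: confounded by Z
naive = mean_y(lambda z, x: x) - mean_y(lambda z, x: not x)

# Backdoor adjustment: average within-stratum contrasts over P(Z)
p_z1 = sum(z for z, _, _ in data) / N
adjusted = sum(
    (mean_y(lambda z, x, zv=zv: z == zv and x)
     - mean_y(lambda z, x, zv=zv: z == zv and not x)) * pz
    for zv, pz in [(True, p_z1), (False, 1 - p_z1)]
)

print(f"naive contrast    = {naive:.2f}")     # ~0.54, inflated by Z
print(f"backdoor adjusted = {adjusted:.2f}")  # ~0.30, the true effect
```

The benchmark's Confounding items ask models to notice that the naive contrast is the wrong quantity and to identify Z as the adjustment set.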
Methodology
How CausalBench is constructed
Data Construction
- Ground-truth causal graphs derived from Abel's live causal discovery engine
- Intervention and counterfactual questions generated from structural causal models
- Cross-domain chains validated by expert review and statistical testing
- Temporal regime shifts identified from PCMCI structural break detection
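As a hedged sketch of the second bullet, here is one way an intervention question and its ground-truth answer could be generated from a structural causal model. The three-variable SCM and the `make_intervention_item` helper are illustrative assumptions, not the actual pipeline or an actual CausalBench graph.

```python
import random

random.seed(2)

def scm(do=None):
    """A toy linear-Gaussian SCM: rate -> credit -> housing, rate -> housing.
    Entries in `do` override a variable's structural equation."""
    do = do or {}
    u = {v: random.gauss(0, 1) for v in ("rate", "credit", "housing")}
    rate = do.get("rate", u["rate"])
    credit = do.get("credit", -0.8 * rate + u["credit"])
    housing = do.get("housing", 0.5 * credit - 0.3 * rate + u["housing"])
    return {"rate": rate, "credit": credit, "housing": housing}

def make_intervention_item(var, value, target, n=50_000):
    """Emit a question string and a simulated ground-truth answer."""
    baseline = sum(scm()[target] for _ in range(n)) / n
    shifted = sum(scm({var: value})[target] for _ in range(n)) / n
    question = f"If {var} is set to {value}, what happens to {target}?"
    answer = "increases" if shifted > baseline else "decreases"
    return question, answer

q, a = make_intervention_item("rate", 2.0, "housing")
print(q, "->", a)
```

Because the ground truth comes from simulating the intervened model, the answer key is correct by construction rather than by annotator judgment.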
Evaluation Criteria
- Exact-match accuracy for direction and structure identification
- Distributional distance for intervention effect estimation
- Counterfactual consistency under alternative histories
- Cross-domain path fidelity — proportion of correct causal hops
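Path fidelity, the last metric, can be sketched as the fraction of gold-standard causal hops (directed edges) that also appear in a model's predicted chain. The helper and the example chain below are illustrative, not the exact scoring code:

```python
def path_fidelity(predicted, gold):
    """Proportion of gold causal hops (directed edges) reproduced
    anywhere in the predicted chain."""
    gold_hops = list(zip(gold, gold[1:]))
    pred_hops = set(zip(predicted, predicted[1:]))
    correct = sum(hop in pred_hops for hop in gold_hops)
    return correct / len(gold_hops)

gold = ["oil price", "shipping costs", "retail prices", "consumer spending"]
pred = ["oil price", "shipping costs", "consumer spending"]

# One of three gold hops is reproduced; skipping "retail prices"
# forfeits the two hops through it.
print(round(path_fidelity(pred, gold), 2))
```

Scoring per hop rather than per whole chain gives partial credit for chains that are mostly right but skip or misorder one link.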
Contribute to CausalBench
CausalBench is open source. Submit new test cases, propose categories, or benchmark your own model.