
Transformer Is Inherently a Causal Learner
Transformers trained autoregressively naturally encode time-delayed causal structure: the gradient sensitivities of their outputs with respect to past inputs can recover the underlying causal graph, without any explicit causal objective.
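The core idea can be sketched on a toy case. The following is a minimal, hypothetical illustration (not the paper's code): we simulate a three-variable time series with a known lag-1 causal graph, fit a linear autoregressive next-step predictor in place of a transformer, and read off the gradient sensitivities of each output with respect to the past inputs. For a linear predictor those sensitivities are exactly the learned weight matrix, so thresholding them recovers the causal edges; the threshold value 0.2 and all variable names are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth lag-1 causal graph over 3 variables:
# x0 -> x1 and x1 -> x2, plus self-links (nonzero entries of A).
A = np.array([[0.9, 0.0, 0.0],
              [0.8, 0.5, 0.0],
              [0.0, 0.7, 0.4]])

# Simulate the linear autoregressive process x_t = A x_{t-1} + noise.
T = 5000
X = np.zeros((T, 3))
for t in range(1, T):
    X[t] = A @ X[t - 1] + 0.1 * rng.standard_normal(3)

# "Autoregressive training": least-squares next-step prediction,
# standing in for a transformer trained with a next-token objective.
past, future = X[:-1], X[1:]
W, *_ = np.linalg.lstsq(past, future, rcond=None)
W = W.T  # W[i, j] = d(predicted x_t[i]) / d(x_{t-1}[j])

# Gradient sensitivities, thresholded, give the estimated graph.
G_hat = (np.abs(W) > 0.2).astype(int)
G_true = (np.abs(A) > 0).astype(int)
print(G_hat)
```

With a nonlinear model such as a transformer, `W` would instead be obtained by backpropagating each output through the network to the past inputs and averaging the absolute gradients over time steps; the recovery logic is otherwise the same.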











