Research/Interpretability

White-box LLM / VLM
From Black Box to Auditable, Intervention-Ready Reasoning

Identifying and enhancing modularity and sparsity in large language models, so that specialized submodules emerge as explicit, causal components — transforming LLMs from monolithic black boxes into auditable, steerable white-box systems.

Vision

Turning today's opaque LLMs into white-box systems whose internal mechanisms are legible and steerable

Evidence suggests that large models exhibit latent modularity — subnetworks that specialize in perception, mathematics, coding, chain-of-thought, and physical priors. By identifying, shaping, and strengthening these modules with enforced sparsity, we obtain models that are smaller, faster, and cheaper, with transparent, controllable decision pathways.

[Diagram: from Black Box to White Box]

Expected Outcomes

What white-box models unlock

Compute Efficiency

Activation sparsity and specialist routing reduce inference cost while maintaining quality.
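The routing idea can be sketched in a few lines: a gating function scores all specialist modules but executes only the top-k, so inference cost scales with k rather than with the total number of specialists. This is a minimal illustrative sketch; the shapes, the gating matrix `gate_w`, and the function name are assumptions, not the project's actual architecture.

```python
import numpy as np

def topk_route(hidden, gate_w, k=2):
    """Route one token's hidden state to its top-k specialist modules.

    Only k experts run, so compute scales with k, not the number of
    specialists. Shapes and the gating matrix are illustrative.
    """
    logits = hidden @ gate_w                # one score per specialist
    topk = np.argsort(logits)[-k:]          # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                # softmax over the selected experts
    return topk, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=16)                # token representation
gate_w = rng.normal(size=(16, 8))           # 8 candidate specialists
experts, weights = topk_route(hidden, gate_w, k=2)
```

Quality is maintained because the gate still sees all specialists; sparsity only limits which ones execute.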

Bias Isolation & Removal

Measured by group-fairness and counterfactual-fairness tests, enabling targeted debiasing.

Generalization Gains

Improvements on out-of-distribution splits and cross-domain task transfer.

Faithful Transparency

Decision-trace agreement with interventional ground truth, not post-hoc rationalization.
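The distinction between faithful traces and post-hoc rationalization can be made operational with an ablation test: if a trace attributes an output to a module, removing that module should actually change the output. The toy "model" below is a deliberately simple illustration of that check, not a real evaluation harness; all names and thresholds are assumptions.

```python
# Toy white-box model: two specialist modules whose outputs are summed.
# A faithful decision trace should predict what an intervention does.
def model(x, use_math=True, use_code=True):
    math_module = 2.0 * x if use_math else 0.0    # specialist A
    code_module = -0.5 * x if use_code else 0.0   # specialist B
    return math_module + code_module

x = 4.0
# Suppose the decision trace attributes this output to the math module.
full = model(x)
ablated = model(x, use_math=False)      # intervention: remove that module
effect = abs(full - ablated)
# Faithful if the attributed module carried most of the output
# (the 0.5 threshold is an illustrative choice).
faithful = effect > 0.5 * abs(full)
```

A post-hoc rationalization would name a module whose ablation leaves the output essentially unchanged; this test catches exactly that case.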

Why this matters

White-box LLMs (and VLMs) make it possible to audit, govern, and tune foundation models for regulated, safety-critical, and creative applications. By isolating causal functions and exposing controllable routes, we deliver models that are more efficient, less biased, easier to trust, and that transfer better across tasks and domains.

Case Study

Long-CoT ability is already inside base models

While recent long-CoT systems (e.g., OpenAI-o1, DeepSeek-R1) rely on expensive RL or SFT, our research finds that the long-CoT ability is already present in base models, dormant rather than absent.

Localized Activations

Long-CoT related activations concentrate in the final layers of the model.

Predictable Dynamics

Their dynamics follow a predictable pattern: a sharp rise followed by logarithmic decay.
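A curve of this shape can be written down directly. The sketch below uses a linear ramp for the rise and `log1p` for the decay; the functional form and parameter names are illustrative assumptions, not the fitted model from our results.

```python
import math

def cot_activation_profile(t, peak=1.0, t_rise=0.5, decay=0.3):
    """Toy activation curve: a sharp (here linear) rise to a peak at
    t_rise, then logarithmic decay. Form and parameters are
    illustrative, not fitted values."""
    if t <= t_rise:
        return peak * (t / t_rise)           # sharp rise to the peak
    return peak - decay * math.log1p(t - t_rise)  # slow logarithmic decay
```

The key property is that activation stays elevated long after the peak, since logarithmic decay falls off far more slowly than exponential decay would.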

Reliable Elicitation

Simple activation amplification, combined with reflection triggers (such as a "wait" token), reliably elicits long-form reasoning.
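The amplification step can be sketched as scaling the component of late-layer activations along a direction believed to carry the long-CoT signal, leaving earlier layers untouched. Everything here — the function name, shapes, and the scaling scheme — is an illustrative assumption, not the actual intervention code.

```python
import numpy as np

def amplify_final_layers(activations, cot_direction, alpha=2.0, last_n=2):
    """Scale the component of the last `last_n` layers' activations
    along a putative long-CoT direction by `alpha`.

    activations:   (num_layers, hidden_dim) per-token activations
    cot_direction: unit vector assumed to carry the long-CoT signal
    """
    out = activations.copy()
    for layer in range(len(out) - last_n, len(out)):
        proj = out[layer] @ cot_direction               # scalar component
        out[layer] += (alpha - 1.0) * proj * cot_direction
    return out

rng = np.random.default_rng(1)
acts = rng.normal(size=(6, 8))             # 6 layers, hidden size 8
direction = np.zeros(8)
direction[0] = 1.0                         # toy unit direction
boosted = amplify_final_layers(acts, direction, alpha=3.0)
```

Because only the component along `cot_direction` is scaled, the rest of the representation is preserved; in practice the amplified state would then be paired with a reflection trigger in the prompt.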