Mechanistic Analysis of Chain-of-Thought Faithfulness in Language Models

Python 3.8+ · Python 3.13 Compatible · PyTorch · License: MIT

Author: Ashioya Jotham Victor

Model: GPT-2 Small (124M parameters)

Framework: TransformerLens, PyTorch, NetworkX


Overview

This work applies mechanistic interpretability techniques to reverse-engineer the computational circuits underlying chain-of-thought (CoT) reasoning in GPT-2 Small. Adapting Anthropic's Attribution Graphs methodology (Lindsey et al., 2025)—which uses cross-layer transcoders and local replacement models to trace circuits in Claude 3.5 Haiku—we develop a direct activation analysis pipeline for GPT-2 via TransformerLens hooks to:

  1. Construct and analyze attribution graphs mapping information flow during reasoning
  2. Identify critical circuit components through causal interventions (activation ablation and patching)
  3. Quantify mechanistic contributions of individual attention heads and MLP features
  4. Evaluate circuit quality via intervention-based metrics (faithfulness, completeness, minimality)

While Anthropic's approach uses sparse feature dictionaries (30M features) for interpretable decomposition, our implementation directly analyzes GPT-2's native computational components (attention heads, MLP layers) to identify minimal circuits sufficient for CoT reasoning across arithmetic, physics, and logical reasoning tasks. This allows systematic investigation of CoT faithfulness patterns—including motivated reasoning, hallucination, and backward chaining—through graph-based circuit analysis.

Methodological Foundation

Circuit Discovery via Attribution Graphs

Following Anthropic's circuit discovery methodology, we construct directed attribution graphs where:

  • Nodes represent computational units (attention heads, MLP features, embeddings)
  • Edges represent information flow with attribution weights
  • Node attributes track layer index, feature type, and causal importance

The attribution graph construction process:

  1. Cache model activations during forward pass through TransformerLens hooks
  2. Extract per-layer computations (attention head outputs, MLP pre/post-activation states)
  3. Build edge weights based on activation strengths and cross-layer dependencies
  4. Apply pruning threshold to create sparse, interpretable subgraphs

Current implementation constructs graphs with 602 nodes and 27,600 edges across a 12-layer transformer, with dominant information flow from layer 9 through 11.
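As a rough illustration of this construction, the sketch below builds a small directed NetworkX graph from a TransformerLens activation cache. The node naming, the norm-based edge weights, and the 0.1 pruning threshold are illustrative assumptions, not the exact logic of attribution_graphs.py.

import networkx as nx
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("12 plus 7 is 19. 19 plus 5 is")

graph = nx.DiGraph()
prune_threshold = 0.1  # assumed pruning threshold

for layer in range(model.cfg.n_layers):
    # Per-head output norms as a simple proxy for attribution weight.
    z = cache["z", layer]                            # [batch, pos, head, d_head]
    head_strength = z.norm(dim=-1).mean(dim=(0, 1))  # [n_heads]

    graph.add_node(f"L{layer}MLP", layer=layer, kind="mlp")
    for head in range(model.cfg.n_heads):
        graph.add_node(f"L{layer}H{head}", layer=layer, kind="attn_head")
        w = float(head_strength[head])
        if w <= prune_threshold:
            continue  # prune weak edges to keep the subgraph sparse
        # Within-layer flow: head output feeds this layer's MLP.
        graph.add_edge(f"L{layer}H{head}", f"L{layer}MLP", weight=w)
        # Cross-layer dependency: previous layer's MLP feeds this head.
        if layer > 0:
            graph.add_edge(f"L{layer - 1}MLP", f"L{layer}H{head}", weight=w)

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")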

Causal Intervention Protocol

To identify circuit components causally necessary for correct reasoning, we employ:

Activation Ablation: Zero out individual attention heads or MLP features at specific token positions and measure effect on final prediction through cross-entropy loss.

Activation Patching: Replace activations from a 'corrupted run' (incorrect reasoning) with clean activations from a correct run. Restored predictions indicate causally necessary components.
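As a concrete illustration of the ablation step, the following minimal sketch zeroes a single head with a TransformerLens hook and measures the change in cross-entropy loss. The prompt, the loss-based effect metric, and the function name are assumptions for illustration; targeted_interventions.py may differ.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("7 plus 8 is 15. 15 plus 4 is")

def head_ablation_effect(layer: int, head: int) -> float:
    """Zero one attention head's output and return the change in loss."""
    def zero_head(z, hook):
        z[:, :, head, :] = 0.0          # z: [batch, pos, head, d_head]
        return z

    baseline = model(tokens, return_type="loss").item()
    ablated = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(utils.get_act_name("z", layer), zero_head)],
    ).item()
    return ablated - baseline            # positive = removing the head hurts

print(head_ablation_effect(layer=5, head=1))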

This methodology is directly comparable to intervention protocols in the IOI circuit work, enabling:

  • Per-component effect quantification
  • Identification of minimal sufficient circuits
  • Cross-task generalization analysis

Feature-Level Analysis

Building on sparse autoencoder (SAE) interpretability work (Cunningham et al., 2023), we decompose MLP features to identify monosemantic vs. polysemantic activations. This allows distinguishing:

  • General computational features (active across task types)
  • Reasoning-specific features (active for CoT problems)
  • Artifact features (spurious activations)
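One way to operationalize this split, sketched below, is to compare how often each MLP neuron fires on CoT prompts versus control prompts. This is direct neuron analysis rather than a trained SAE, and the prompt lists, layer choice, and thresholds are illustrative assumptions.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
cot_prompts = [
    "12 plus 7 is 19. 19 plus 5 is",
    "If A implies B and B implies C, then A implies",
]
control_prompts = ["The cat sat on the", "Paris is the capital of"]

def firing_rate(prompts, layer):
    """Fraction of token positions where each MLP neuron's post-activation is positive."""
    rates = []
    for prompt in prompts:
        _, cache = model.run_with_cache(prompt)
        post = cache["post", layer]                      # [batch, pos, d_mlp]
        rates.append((post > 0).float().mean(dim=(0, 1)))
    return torch.stack(rates).mean(dim=0)                # [d_mlp]

layer = 10
cot_rate = firing_rate(cot_prompts, layer)
ctrl_rate = firing_rate(control_prompts, layer)
reasoning_specific = ((cot_rate > 0.5) & (ctrl_rate < 0.1)).nonzero().squeeze(-1)
general = ((cot_rate > 0.5) & (ctrl_rate > 0.5)).nonzero().squeeze(-1)
print(len(reasoning_specific), "reasoning-specific;", len(general), "general")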

Implementation Structure

cot-faithfulness-mech-interp/
├── src/
│   ├── models/
│   │   └── gpt2_wrapper.py           # GPT-2 with activation caching
│   ├── analysis/
│   │   ├── attribution_graphs.py     # Graph construction from activations
│   │   └── faithfulness_detector.py  # Feature extraction for analysis
│   ├── interventions/
│   │   └── targeted_interventions.py # Causal intervention framework
│   ├── data/
│   │   └── data_generation.py        # Reasoning task generation
│   └── visualization/
│       └── interactive_plots.py      # Interactive graph/ablation plots
├── experiments/
│   ├── phase1_circuit_discovery.ipynb        # Attribution & ablation analysis
│   ├── phase2_faithfulness_detection.ipynb   # ML classification framework
│   ├── phase3_targeted_interventions.ipynb   # Intervention experiments
│   └── phase4_evaluation_analysis.ipynb      # Comparative analysis
├── config/
│   ├── model_config.yaml             # Model hyperparameters
│   ├── experiment_config.yaml        # Analysis settings
│   └── paths_config.yaml             # Data/output paths
├── data/                             # Datasets and cached results
├── results/                          # Experimental outputs
├── docs/                             # Documentation
└── requirements.txt                  # Dependencies

Technical Approach

Task Design

We evaluate circuit discovery across three reasoning domains:

Arithmetic Reasoning: Simple multi-digit addition/subtraction with intermediate step verification.

Physics Reasoning: Qualitative physics reasoning (forces, motion, energy conservation) with step-by-step derivations.

Logical Reasoning: Propositional logic with transitive relations, requiring chained inference steps.

Each task is instantiated as a completion prompt without explicit "=" terminal markers to avoid spurious token-level shortcuts.
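A hypothetical example of such a completion prompt for the arithmetic domain is shown below; the actual templates in data_generation.py may differ.

import random

def make_addition_prompt(rng: random.Random) -> dict:
    """Two-step addition item with an intermediate step and no trailing '=' marker."""
    a, b, c = rng.randint(10, 99), rng.randint(10, 99), rng.randint(10, 99)
    prompt = (
        f"Question: What is {a} + {b} + {c}?\n"
        f"Step 1: {a} + {b} is {a + b}.\n"
        f"Step 2: {a + b} + {c} is"
    )
    return {"prompt": prompt, "answer": f" {a + b + c}"}

print(make_addition_prompt(random.Random(0))["prompt"])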

Model Selection

We use GPT-2 Small (124M parameters) as the analysis target because:

  1. Full model size enables comprehensive mechanistic analysis within compute constraints
  2. Sufficient reasoning capability for CoT tasks (verified via generation quality assessment)
  3. Established precedent for circuit discovery (IOI circuit paper uses identical model)
  4. Manageable activation cache memory footprint (enables full-model analysis)

Activation Collection

Using TransformerLens's hook system:

# Generate up to 80 chain-of-thought tokens while caching activations
# at every layer for later attribution and intervention analysis.
cache = model.generate_with_cache(
    prompt=reasoning_prompt,
    max_new_tokens=80
)

This captures all layer computations for subsequent analysis.
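For reference, roughly equivalent behavior in plain TransformerLens (without the repository's wrapper) can be approximated by generating first and then re-running the full sequence with a cache; the wrapper's exact semantics may differ.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
reasoning_prompt = "12 plus 7 is 19. 19 plus 5 is"

# Generate the chain-of-thought continuation, then re-run the full sequence
# once more to cache every layer's activations for analysis.
completion = model.generate(reasoning_prompt, max_new_tokens=80, verbose=False)
logits, cache = model.run_with_cache(completion)
print(cache["z", 0].shape)   # per-head outputs at layer 0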

Key Results

Circuit Contribution Analysis

Causal ablation across four reasoning examples identifies top contributors:

Component Importance (Mean Causal Effect on Prediction):
- MLP(L0): 15.14  [Embedding transformation, token interaction]
- Head(L0.0): 2.81 [Early attention, position tracking]
- Head(L5.1): 2.45 [Mid-layer reasoning head]
- MLP(L4): 2.35   [Feature composition]
- Head(L0.8): 1.07 [Early attention pattern]

Observation: Early layers (L0-L1) show strongest causal effects, suggesting embedding-level transformations are critical for CoT routing.

Information Flow Patterns

Hub analysis reveals:

  • Source hubs (high outgoing weights): L10 MLP features (496, 373, 481)
  • Sink hubs (high incoming weights): L11 attention heads
  • Critical flow: L9 residual → L10 MLP → L11 attention → prediction

This L9→L11 bottleneck represents approximately 40% of total attribution weight, indicating compressed information flow through deep layers.

Graph Statistics

  • Nodes: 602 (1 embedding, 48 attention heads, 48 MLPs, 1 output)
  • Edges: 27,600
  • Mean edge weight: 1.144
  • 95th percentile: 3.531
  • Sparsity: 92.4% (consistent with mechanistic interpretability expectations)

Installation

Prerequisites

  • Python 3.8+ (tested on 3.13.5)
  • GPU recommended (CPU fallback supported)
  • 16GB RAM, 50GB disk space

Setup (Conda)

conda env create -f environment.yml
conda activate cot-faithfulness

Python 3.13 Windows Compatibility

Due to upstream sentencepiece packaging issues, install pre-built wheel:

pip install https://github.com/NeoAnthropocene/wheels/raw/f76a39a2c1158b9c8ffcfdc7c0f914f5d2835256/sentencepiece-0.2.1-cp313-cp313-win_amd64.whl
pip install transformer-lens

See google/sentencepiece#1104 for details.

Usage

Phase 1: Circuit Discovery

Open and run experiments/phase1_circuit_discovery.ipynb for end-to-end attribution graph construction and causal ablation analysis.

Phase 2: Causal Ablation

Quantify per-component importance with activation ablation and patching studies.

Phase 3: Visualization

Interactive graph exploration with hierarchical layouts and attribution visualization.

Evaluation Criteria

Following the IOI circuit framework (Wang et al., 2022), we evaluate discovered circuits via:

Faithfulness: Does the identified circuit produce correct outputs when isolated (with other components ablated)?

Completeness: Does the circuit capture all necessary components, or are critical features missing?

Minimality: Are all circuit components necessary, or can spurious components be removed?

These metrics enable quantitative circuit quality assessment beyond visual inspection.
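A rough sketch of the faithfulness check is given below: keep only a named set of heads active, zero-ablate every other head, and compare loss against the full model. The circuit definition, the zero (rather than mean) ablation, and the single-prompt metric are assumptions for illustration.

from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("7 plus 8 is 15. 15 plus 4 is")
circuit_heads = {(0, 0), (0, 8), (5, 1)}   # (layer, head) pairs kept active

def make_hook(layer):
    def zero_non_circuit_heads(z, hook):
        for head in range(model.cfg.n_heads):
            if (layer, head) not in circuit_heads:
                z[:, :, head, :] = 0.0     # ablate everything outside the circuit
        return z
    return zero_non_circuit_heads

full_loss = model(tokens, return_type="loss").item()
circuit_loss = model.run_with_hooks(
    tokens,
    return_type="loss",
    fwd_hooks=[(utils.get_act_name("z", layer), make_hook(layer))
               for layer in range(model.cfg.n_layers)],
).item()
print(f"full model loss: {full_loss:.3f}   circuit-only loss: {circuit_loss:.3f}")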

Related Work

This work directly builds on:

  • Attribution Graphs - Anthropic (2025): Core methodology for circuit tracing in language models; introduces cross-layer transcoders (CLTs), attribution graph construction, and systematic validation protocols. Demonstrates circuit analysis on Claude 3.5 Haiku across multi-step reasoning, chain-of-thought faithfulness, and hidden goal detection
  • IOI Circuit - Wang et al. (2022): Foundational circuit discovery methodology for GPT-2 small; identifies 26 attention heads across 7 classes for indirect object identification task
  • SAEs for Mechanistic Interpretability - Cunningham et al. (2023): Feature-level interpretability through sparse autoencoders; demonstrates monosemanticity and causal responsibility of learned features
  • Anthropic Circuits Research: Comprehensive framework for reverse-engineering transformer circuits; establishes best practices for attribution analysis and intervention protocols

Citation

@misc{ashioya2025cot_interpretability,
  title={Mechanistic Interpretability of Chain-of-Thought Reasoning in Language Models},
  author={Ashioya, Jotham Victor},
  year={2025},
  url={https://github.com/ashioyajotham/cot-faithfulness-mech-interp}
}

License

MIT License - see LICENSE for details.

