A novel way of labelling the attention heads of BERT, described in detail in the following paper, accepted at AAAI'21: https://arxiv.org/abs/2101.09115. There are four high-level functional roles: (a) Local, (b) Syntactic, (c) Block and (d) Delimiter.
We test this approach on four GLUE tasks: QNLI, QQP, MRPC and SST-2. The end-to-end process consists of the following steps:
- Compute sieve bias score from the attention weights.
- Apply hypothesis testing to the scores to assign functional roles to the heads.
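
For intuition, here is a minimal sketch of what such a score and test could look like for a single head on a single input sequence. The sieve construction, the top-attention definition of the score, and the uniform-attention null used here are simplifying assumptions made for illustration only; the paper gives the exact formulation.

```python
# Illustrative sketch only -- not the paper's exact implementation.
from scipy.stats import binomtest  # use scipy.stats.binom_test on older SciPy

def _sieve_hits(attn, sieve):
    """attn  : (seq_len, seq_len) NumPy attention matrix of one head.
    sieve : list where sieve[i] is the set of key positions that satisfy
    the role for query position i (e.g. {i-1, i+1} for the Local role).
    Returns the number of query positions whose top-attended key lies
    inside the sieve, plus the list of query positions with a non-empty sieve."""
    valid = [i for i in range(attn.shape[0]) if sieve[i]]
    hits = sum(int(attn[i].argmax() in sieve[i]) for i in valid)
    return hits, valid

def sieve_bias_score(attn, sieve):
    """Assumed, simplified score: fraction of query positions whose
    maximum attention falls inside the sieve."""
    hits, valid = _sieve_hits(attn, sieve)
    return hits / max(len(valid), 1)

def test_role(attn, sieve, alpha=0.05):
    """One-sided binomial test of the head's bias towards the sieve against
    a uniform-attention null, where the chance of hitting the sieve at
    query i is |sieve[i]| / seq_len."""
    hits, valid = _sieve_hits(attn, sieve)
    if not valid:
        return False, 1.0
    p_null = sum(len(sieve[i]) for i in valid) / (len(valid) * attn.shape[0])
    res = binomtest(hits, n=len(valid), p=p_null, alternative="greater")
    return res.pvalue < alpha, res.pvalue
```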
- Python 3.5+
- TensorFlow 1.11 or higher
- NumPy, Pickle, SciPy and Pandas
- Visualization libraries: Matplotlib, Seaborn and Plotly.
- After cloning this repo, download the attention weights from here, make a directory named `pkl_dir` in the current folder, and keep all the attention weight files inside it.
- Run the respective ipynb files to compute the sieve scores for the various functional roles.
- A new folder named `sieve_scores` is created, and the scores (for each input sequence and each functional role) are generated and saved in this folder.
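
A minimal, hedged way to peek at the generated files (the file naming and pickled layout below are assumptions; the notebooks define the real ones):

```python
# Hedged example: inspect the generated sieve-score files.
import os
import pickle

score_dir = "sieve_scores"
for fname in sorted(os.listdir(score_dir)):
    with open(os.path.join(score_dir, fname), "rb") as f:
        scores = pickle.load(f)
    # e.g. scores might be a per-sequence collection of per-head sieve scores
    print(fname, type(scores), getattr(scores, "shape", len(scores)))
```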
- For quick startup, we provide sample sieve scores (generated for 20 sentences taken from the test sets) here.
- Download the scores and keep them inside a directory named `sieve_scores` in the current folder where this repository is cloned.
- Run label_heads.ipynb, which generates `<task_name>_gems.pkl`; this file can be fed into the plotting code to visualize the role assignments to heads (a hedged usage example follows below). label_heads.ipynb also contains a quick inline visualization.
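
As an example of how this output could be consumed, the snippet below assumes (possibly incorrectly) that `<task_name>_gems.pkl` unpickles to a (num_layers, num_heads) array-like of role labels and uses a hypothetical file name for QNLI; the actual structure is defined in label_heads.ipynb.

```python
# Hedged example: visualise role assignments from a gems pickle.
import pickle
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

with open("qnli_gems.pkl", "rb") as f:   # hypothetical file name for the QNLI task
    gems = np.asarray(pickle.load(f))    # assumed shape: (num_layers, num_heads)

plt.figure(figsize=(8, 6))
sns.heatmap(gems, annot=True, cbar=False, cmap="viridis")
plt.xlabel("Head")
plt.ylabel("Layer")
plt.title("Functional role assigned to each BERT head (QNLI)")
plt.show()
```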
We derive several new observations and reinforce a few others already present in the literature. To the best of our knowledge, this is the first attempt to bring statistical rigour to the analysis of head behaviour. A few of our key insights are:
- Heads can be multi-functional, i.e., a single head can perform multiple roles; for example, many (42%-88%) heads are both local and syntactic across all the GLUE tasks (a rough counting sketch follows this list).
- Most multi-functional heads are present in the middle layers.
- Heads attending to SEP are present in the later layers, while those attending to the CLS token are present in the initial layers.
- As part of task-specific fine-tuning, the later layers change the most, with the high attention to the SEP token getting redistributed among the other tokens of the sentence.
- We plot a scatter of the p-values obtained in our analysis, since they denote the confidence of our role assignments. Most (around 98%) of the values are near 0 or 1, indicating that the hypothesis test gave a confident verdict for nearly every head.
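
As a rough counting sketch for the multi-functionality observation above, the snippet below assumes (hypothetically) that the tests yield one (num_layers, num_heads) matrix of p-values per role and that a role is assigned when the null hypothesis is rejected at alpha = 0.05; variable names and the thresholding convention are illustrative, not the paper's exact procedure.

```python
# Hedged sketch: count heads assigned both the Local and Syntactic roles.
import numpy as np

rng = np.random.default_rng(0)
p_local = rng.random((12, 12))        # placeholder p-values, 12 layers x 12 heads
p_syntactic = rng.random((12, 12))    # replace with the real test outputs

alpha = 0.05
is_local = p_local < alpha            # role assigned where the null is rejected
is_syntactic = p_syntactic < alpha
both = is_local & is_syntactic

print(f"{both.sum()} of {both.size} heads "
      f"({100 * both.mean():.1f}%) are both local and syntactic")
print("multi-functional heads per layer:", both.sum(axis=1))
```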
@article{pande2021heads,
title={The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT},
author={Pande, Madhura and Budhraja, Aakriti and Nema, Preksha and Kumar, Pratyush and Khapra, Mitesh M},
journal={arXiv preprint arXiv:2101.09115},
year={2021}
}