Computer Science > Computation and Language

arXiv:2210.03841 (cs)

[Submitted on 7 Oct 2022]

Title: Breaking BERT: Evaluating and Optimizing Sparsified Attention

Authors: Siddhartha Brahma, Polina Zablotskaia, David Mimno


Abstract: Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections, along with their quadratic time and memory cost, may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common finetuning tasks, attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighboring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighboring connections that gives fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
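
To make the idea of a neighboring-token sparsity pattern concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`neighbor_mask`, `sparsity`, `masked_attention`) and the `window` parameter are hypothetical, chosen only for illustration. The sketch assumes that a token may attend to positions within a fixed distance of itself and that sparsity is measured as the fraction of token pairs removed.

```python
import math


def neighbor_mask(seq_len, window):
    """Hypothetical helper: mask[i][j] is True when token i may attend
    to token j, i.e. when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]


def sparsity(mask):
    """Fraction of token pairs that the mask removes."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted to the
    allowed pairs. q, k, v are lists of equal-length float vectors."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        # Only score the positions the mask allows for this query.
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d)
                  for j in allowed]
        # Numerically stable softmax over the allowed positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[j][c] for e, j in zip(exps, allowed))
                    for c in range(len(v[0]))])
    return out


mask = neighbor_mask(seq_len=4, window=1)
print(sparsity(mask))
```

Because each query scores only its allowed positions, the work per token in this sketch grows with the window size and not with the sequence length. Whether that translates into real savings depends on the attention kernel used, which the abstract does not specify.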

Comments:	Shorter version accepted to SNN2021 workshop
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2210.03841 [cs.CL]
	(or arXiv:2210.03841v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.03841

Submission history

From: Polina Zablotskaia
[v1] Fri, 7 Oct 2022 22:32:27 UTC (4,941 KB)
