
Analyzing Redundancy in Pretrained Transformer Models

Fahim Dalvi Hassan Sajjad Nadir Durrani Yonatan Belinkov*


{faimaduddin,hsajjad,ndurrani}@hbku.edu.qa
Qatar Computing Research Institute, HBKU Research Complex, Doha 5825, Qatar
*MIT Computer Science and Artificial Intelligence Laboratory and Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, USA
belinkov@csail.mit.edu

Abstract

Transformer-based deep NLP models are trained using hundreds of millions of parameters, limiting their applicability in computationally constrained environments. In this paper, we study the cause of these limitations by defining a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy. We dissect two popular pretrained models, BERT and XLNet, studying how much redundancy they exhibit at a representation-level and at a more fine-grained neuron-level. Our analysis reveals interesting insights, such as: i) 85% of the neurons across the network are redundant and ii) at least 92% of them can be removed when optimizing towards a downstream task. Based on our analysis, we present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons.1

1 The code for the experiments in this paper is available at https://github.com/fdalvi/analyzing-redundancy-in-pretrained-transformer-models

1 Introduction

Large pretrained models have improved the state-of-the-art in a variety of NLP tasks, with each new model introducing deeper and wider architectures, causing a significant increase in the number of parameters. For example, BERT large (Devlin et al., 2019), NVIDIA's Megatron model, and Google's T5 model (Raffel et al., 2019) were trained using 340 million, 8.3 billion and 11 billion parameters respectively.

An emerging body of work shows that these models are over-parameterized and do not require all the representational power lent by the rich architectural choices during inference. For example, these models can be distilled (Sanh et al., 2019; Sun et al., 2019) or pruned (Voita et al., 2019; Sajjad et al., 2020), with a minor drop in performance. Recent research (Mu et al., 2018; Ethayarajh, 2019) analyzed contextualized embeddings in pretrained models and showed that the representations learned within these models are highly anisotropic. While these approaches successfully exploited over-parameterization and redundancy in pretrained models, the choice of what to prune is empirically motivated, and the work does not directly explore the redundancy in the network.

Identifying and analyzing redundant parts of the network is useful in: i) developing a better understanding of these models, ii) guiding research on compact and efficient models, and iii) leading towards better architectural choices.

In this paper, we analyze redundancy in pretrained models. We classify it into general redundancy and task-specific redundancy. The former is defined as the redundant information present in a pretrained model irrespective of any downstream task. This redundancy is an artifact of over-parameterization and other training choices that force various parts of the models to learn similar information. The latter is motivated by pretrained models being universal feature extractors: we hypothesize that several parts of the network are specifically redundant for a given downstream task.

We study both general and task-specific redundancies at the representation-level and at a more fine-grained neuron-level. Such an analysis allows us to answer the following questions: i) how redundant are the layers within a model? ii) do all the layers add significantly diverse information? iii) do the dimensions within a hidden layer represent different facets of knowledge, or are some neurons largely redundant? iv) how much information in a pretrained model is necessary for specific downstream tasks? and v) can we exploit redundancy to enable efficiency?

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4908–4926, November 16–20, 2020. © 2020 Association for Computational Linguistics
We introduce several methods to analyze redundancy in the network. Specifically, for general redundancy, we use Centered Kernel Alignment (Kornblith et al., 2019) for layer-level analysis, and Correlation Clustering for neuron-level analysis. For task-specific redundancy, we use Linear Probing (Shi et al., 2016a; Belinkov et al., 2017) to identify redundant layers, and Linguistic Correlation Analysis (Dalvi et al., 2019) to examine neuron-level redundancy.

We conduct our study on two pretrained language models, BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019). While these networks are similar in the number of parameters, they are trained using different training objectives, which makes for an interesting comparative analysis between these models. For task-specific analysis, we present our results across a wide suite of downstream tasks: four core NLP sequence labeling tasks and seven sequence classification tasks from the GLUE benchmark (Wang et al., 2018). Our analysis yields the following insights:

General Redundancy:

• Adjacent layers are most redundant in the network, with lower layers having greater redundancy with adjacent layers.
• Up to 85% of the neurons across the network are redundant in general, and can be pruned to substantially reduce the number of parameters.
• Up to 94% of neuron-level redundancy is exhibited within the same or neighbouring layers.

Task-specific Redundancy:

• Layers in a network are more redundant w.r.t. core language tasks such as learning morphology as compared to sequence-level tasks.
• At least 92% of the neurons are redundant with respect to a downstream task and can be pruned without any loss in task-specific performance.
• Comparing models, XLNet is more redundant than BERT.
• Our analysis guides research in model distillation and suggests preserving the knowledge of lower layers and aggressively pruning higher layers.

Finally, motivated by our analysis, we present an efficient feature-based transfer learning procedure that exploits the various types of redundancy present in the network. We first target layer-level task-specific redundancy using linear probes and reduce the number of layers required in a forward pass to extract the contextualized embeddings. We then filter out general redundant neurons present in the contextualized embeddings using Correlation Clustering. Lastly, we remove task-specific redundant neurons using Linguistic Correlation Analysis. We show that one can reduce the feature set to fewer than 100 neurons for several tasks while maintaining more than 97% of the performance. Our procedure achieves a speedup of up to 6.2x in computation time for sequence labeling tasks.

2 Related Work

A number of studies have analyzed representations at layer-level (Conneau et al., 2018; Liu et al., 2019; Tenney et al., 2019; Kim et al., 2020; Belinkov et al., 2020) and at neuron-level (Bau et al., 2019; Dalvi et al., 2019; Suau et al., 2020; Durrani et al., 2020). These studies aim at analyzing either the linguistic knowledge learned in representations and in neurons, or the general importance of neurons in the model. The former is commonly done using a probing classifier (Shi et al., 2016a; Belinkov et al., 2017; Hupkes et al., 2018). Recently, Voita and Titov (2020) and Pimentel et al. (2020) proposed probing methods based on information-theoretic measures. The general importance of neurons is mainly captured using similarity- and correlation-based methods (Raghu et al., 2017; Chrupała and Alishahi, 2019; Wu et al., 2020). Similar to the work on analyzing deep NLP models, we analyze pretrained models at representation-level and at neuron-level. Differently from them, we analyze various forms of redundancy in these models. We draw upon various techniques from the literature and adapt them to perform a redundancy analysis.

While the work on pretrained model compression (Cao et al., 2020; Shen et al., 2020; Sanh et al., 2019; Turc et al., 2019; Gordon et al., 2020; Guyon and Elisseeff, 2003) indirectly shows that models exhibit redundancy, little has been done to explore the redundancy in the network directly. Recent studies (Voita et al., 2019; Michel et al., 2019; Sajjad et al., 2020; Fan et al., 2020) dropped attention heads and layers in the network with marginal degradation in performance. Their work is limited in the context of redundancy, as none of the pruning choices are built upon the amount of redundancy present in different parts of the network. Our work identifies redundancy at various levels of the network and can guide the research in model compression.

3 Experimental Setup

3.1 Datasets and Tasks

To analyze the general redundancy in pretrained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks: Sequence Labeling and Sequence Classification.

For the sequence labeling tasks, we study core linguistic tasks: i) part-of-speech (POS) tagging using the Penn Treebank, ii) CCG supertagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017), and iv) syntactic chunking using the CoNLL 2000 shared task dataset (Sang and Buchholz, 2000).

For sequence classification, we study tasks from the GLUE benchmark (Wang et al., 2018), namely i) sentiment analysis (SST-2) (Socher et al., 2013), ii) semantic equivalence classification (MRPC) (Dolan and Brockett, 2005), iii) natural language inference (MNLI) (Williams et al., 2018), iv) question-answering NLI (QNLI) (Rajpurkar et al., 2016), v) question pair similarity (QQP),2 vi) textual entailment (RTE) (Bentivogli et al., 2009), and vii) semantic textual similarity (STS-B) (Cer et al., 2017).3 Complete statistics for all datasets are provided in Appendix A.1.

2 http://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
3 We did not evaluate on CoLA and WNLI because of irregularities in the data and instability during the fine-tuning process: https://gluebenchmark.com/faq.

Other Settings: The neuron activations for each word in our dataset are extracted from the pretrained model for sequence labeling, while the [CLS] token's representation (from a fine-tuned model) is used for sequence classification. The fine-tuning step is essential to optimize the [CLS] token for sentence representation. In the case of sub-words, we pick the last sub-word's representation (Durrani et al., 2019; Liu et al., 2019). For sequence labeling tasks, we use training sets of 150K tokens, and standard development and test splits. For sequence classification tasks, we set aside 5% of the training data to optimize all the parameters involved in the process, and report results on the development sets, since the test sets are not publicly available.

3.2 Models

We present our analysis on two transformer-based pretrained models, BERT-base (Devlin et al., 2019) and XLNet-base (Yang et al., 2019).4 The former is a masked language model, while the latter is auto-regressive in nature. We use the transformers library (Wolf et al., 2019) to fine-tune these models using default hyperparameters.

4 We could not run BERT and XLNet large because of computational limitations. See the official BERT readme describing the issue: https://github.com/google-research/bert#out-of-memory-issues

Classifier Settings: For layer-level probing and neuron-level ranking, we use a logistic regression classifier with ElasticNet regularization. We train the classifier for 10 epochs with a learning rate of 1e-3, a batch size of 128, and a value of 1e-5 for both the L1 and L2 lambda regularization parameters.
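To make these settings concrete, the following is a minimal PyTorch sketch of such a probe: a single linear layer trained with cross-entropy loss and explicit L1/L2 (ElasticNet) penalties, using the hyperparameters listed above. It is an illustration only; the optimizer choice (Adam) and the function name train_probe are our assumptions and may differ from the released implementation.

```python
import torch
import torch.nn as nn

def train_probe(X: torch.Tensor, y: torch.Tensor, num_classes: int,
                epochs: int = 10, lr: float = 1e-3, batch_size: int = 128,
                l1: float = 1e-5, l2: float = 1e-5) -> nn.Linear:
    """Linear (logistic-regression) probe with an ElasticNet penalty.

    X: (num_tokens, num_neurons) float activations; y: (num_tokens,) long class ids.
    """
    probe = nn.Linear(X.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            # ElasticNet regularization: L1 and L2 penalties on the probe weights.
            penalty = l1 * probe.weight.abs().sum() + l2 * probe.weight.pow(2).sum()
            loss = loss_fn(probe(X[idx]), y[idx]) + penalty
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe
```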
4 Problem Definition

Consider a pretrained model M with L layers {l_0, l_1, ..., l_L}, where l_0 is an embedding layer and each layer l_i is of size H. Given a dataset D = {w_1, w_2, ..., w_T} consisting of T words, the contextualized embedding of word w_j at layer l_i is z_j^i = l_i(w_j). A neuron is an individual unit of z_j^i. For example, BERT-base has L = 13 layers, each of size 768, i.e. there are 768 individual neurons in each layer, and the total number of neurons in the model is 13 × 768 = 9984.

We analyze redundancy in M at layer-level (how redundant is a layer l_i?) and at neuron-level (how redundant are the neurons?). We target these two questions in the context of general redundancy and task-specific redundancy.

Notion of redundancy: We broadly define redundancy to cover a range of observations. For example, we take high similarity as a reflection of redundancy. Similarly, for task-specific neuron-level redundancy, we hypothesize that some neurons may additionally be irrelevant for the downstream task at hand, and we consider such irrelevancy as part of the redundancy analysis. Succinctly, two neurons are considered redundant if they serve the same purpose from the perspective of feature-based transfer learning for a downstream task.
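As an illustration of this notation, the sketch below extracts the per-layer contextualized embeddings z_j^i for one sentence from BERT-base with the transformers library, keeping the last sub-word of each word as described in Section 3.1. The model name and extraction details are assumptions for the example, not the authors' exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

words = ["Analyzing", "redundancy", "in", "pretrained", "models"]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# hidden_states is a tuple of L = 13 tensors (embedding layer + 12 transformer
# layers), each of shape (1, num_subwords, 768).
hidden_states = outputs.hidden_states

# Map each word to its last sub-word position and collect z_j^i for every layer.
word_ids = encoding.word_ids(0)
last_subword = {w: pos for pos, w in enumerate(word_ids) if w is not None}
z = torch.stack([layer[0, list(last_subword.values()), :] for layer in hidden_states])
print(z.shape)  # (13, num_words, 768): layers x words x neurons per layer
```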
5 General Redundancy

Neural networks are designed to be distributed in nature and are therefore innately redundant.

Figure 1: Pairwise similarity between the layers of (a) BERT and (b) XLNet. Brighter colors indicate higher similarity.

Additionally, over-parameterization in pretrained models, in combination with various training and design choices, causes further redundancy of information. In the following, we analyze general redundancy at layer-level and at neuron-level.

5.1 Layer-level Redundancy

We compute layer-level redundancy by comparing representations from different layers in a given model using linear Centered Kernel Alignment (cka; Kornblith et al., 2019). cka is invariant to isotropic scaling and orthogonal transformation. In other words, the similarity measure itself does not depend on the various representations having neurons or dimensions with exactly the same distributions, but rather assigns a high similarity if the two representations behave similarly over all the neurons. Moreover, cka is known to outperform other methods such as CCA (Andrew et al., 2013) and SVCCA (Raghu et al., 2017) in identifying relationships between different layers across different architectures. While there are several other methods proposed in the literature to analyze and compare representations (Kriegeskorte et al., 2008; Bouchacourt and Baroni, 2018; Chrupała and Alishahi, 2019; Chrupała, 2019), we do not intend to compare them here and instead use cka to show redundancy in the network. The mathematical definition of cka is provided in Appendix A.6 for the reader.

We compute pairwise similarity between all L layers in the pretrained model and show the corresponding heatmaps in Figure 1. We hypothesize that high similarity entails (general) redundancy. Overall, the similarity between adjacent layers is high, indicating that the change of encoded knowledge from one layer to another takes place in small incremental steps as we move from a lower layer to a higher layer. An exception to this observation is the final pair of layers, l_11 and l_12, whose similarity is much lower than that of other adjacent pairs of layers. We speculate that this is because the final layer is highly optimized for the objective at hand, while the lower layers try to encode as much general linguistic knowledge as possible. This has also been alluded to by others (Hao et al., 2019; Wu et al., 2020).

Figure 2: General neuron-level redundancy in (a) BERT and (b) XLNet, comparing the average reduction of neurons for different CC thresholds and the average accuracy across all downstream tasks. See Appendix A.2 for detailed per-task results.

5.2 Neuron-level Redundancy

Assessing redundancy at the layer level may be too coarse-grained. Even if a layer is not redundant with other layers, a subset of its neurons may still be redundant. We analyze neuron-level redundancy in a network using correlation clustering, CC (Bansal et al., 2004). We group neurons with highly correlated activation patterns over all of the words w_j. Specifically, every neuron in the vector z^i from some layer i can be represented as a T-dimensional vector, where each index is the activation value z_j^i of that neuron for some word w_j, with j ranging from 1 to T. We calculate the Pearson product-moment correlation of every neuron vector with every other neuron. This results in an N × N matrix corr, where N is the total number of neurons and corr(x, y) represents the correlation between neurons x and y. The correlation value ranges from −1 to 1, giving us a relative scale to compare any two neurons. A high absolute correlation value between two neurons implies that they encode very similar information and are therefore redundant. We convert corr into a distance matrix cdist by applying cdist(x, y) = 1 − |corr(x, y)| and cluster the distance matrix cdist using agglomerative hierarchical clustering with average linkage,5 which minimizes the average distance of all data points in pairs of clusters. The maximum distance between any two points in a cluster is controlled by the hyperparameter ct. It ranges from 0 to 1, where a high value results in larger clusters and a smaller number of total clusters.

5 We experimented with other clustering algorithms such as k-means and DBSCAN, and did not see any noticeable difference in the resulting clusters.
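A compact sketch of this clustering procedure (our own illustration, not the released implementation): it computes the Pearson correlation matrix, converts it into the distance cdist(x, y) = 1 − |corr(x, y)|, and cuts an average-linkage dendrogram at the threshold ct.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_neurons(activations: np.ndarray, ct: float = 0.3) -> np.ndarray:
    """Cluster redundant neurons; activations has shape (T, N)."""
    # Pearson product-moment correlation between all N neuron vectors.
    corr = np.corrcoef(activations.T)
    # cdist(x, y) = 1 - |corr(x, y)|, with an exactly zero diagonal.
    cdist = 1.0 - np.abs(corr)
    np.fill_diagonal(cdist, 0.0)
    # Average-linkage agglomerative clustering, cut at threshold ct.
    dendrogram = linkage(squareform(cdist, checks=False), method="average")
    return fcluster(dendrogram, t=ct, criterion="distance")

# Example: keep one (arbitrary) neuron per cluster to form a reduced set,
# as done in the evaluation below.
# labels = cluster_neurons(acts)            # acts: (T, N) activation matrix
# reduced = [np.where(labels == c)[0][0] for c in np.unique(labels)]
```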
A substantial number of neurons are redundant: In order to evaluate the effect of clustering in combining redundant neurons, we randomly pick a neuron from each cluster and form a reduced set of non-redundant neurons. Recall that the clustering is applied independently on the data, without using any task-specific labels. We then build task-specific classifiers for each task on the reduced set and analyze the average accuracy. If the average accuracy of a reduced set is close to that of the full set of neurons, we conclude that the reduced set has filtered out redundant neurons. Figure 2 shows the effect of clustering on BERT and XLNet using different values of ct with respect to average performance across all tasks. It is remarkable to observe that 85% of neurons can be removed without any loss in accuracy (ct = 0.7) in BERT, alluding to a high level of neuron-level redundancy. We observe an even higher reduction in XLNet: at ct = 0.7, 92% of XLNet neurons can be removed while maintaining oracle performance. We additionally visualize a few neurons within a cluster. The activation patterns are quite similar in their behavior, though not identical, highlighting the efficacy of CC in clustering neurons with analogous behavior. An activation heatmap for several neurons is provided in Appendix A.2.

Higher neuron redundancy within and among neighboring layers: We analyze the general makeup of the clusters at ct = 0.3.6 Figure 3 shows the percentage of clusters that contain neurons from the same layer (window size 1), neighboring layers (window sizes 2 and 3), and layers further apart. We can see that a vast majority of clusters (≈ 95%) contain neurons from either the same layer or adjacent layers. This reflects that the main source of redundancy is among the individual representation units in the same layer or neighboring layers of the network. This finding motivates pruning models by compressing layers, as opposed to reducing the overall depth in a distilled version of a model.

Figure 3: Percentage of clusters which contain neurons from the same layer, adjacent layers, within three neighboring layers, and more than three layers apart.

6 The choice of 0.3 avoids aggressive clustering and enables the analysis of the most redundant neurons.

6 Task-specific Redundancy

While pretrained models have a high amount of general redundancy, as shown in the previous section, they may additionally exhibit redundancies specific to a downstream task. Studying redundancy in relation to a specific task helps us understand pretrained models better. It further reflects on how much of the network, and which parts of the network, suffice to perform a task efficiently.

6.1 Layer-level Redundancy

To analyze layer-level task-specific redundancy, we train linear probing classifiers (Shi et al., 2016b; Belinkov et al., 2017) on each layer l_i (layer-classifier). We consider a classifier's performance as a proxy for the amount of task-specific knowledge learned by a layer. Linear classifiers are a popular choice in analyzing deep NLP models due to their better interpretability (Qian et al., 2016; Belinkov et al., 2020). Hewitt and Liang (2019) have shown linear probes to have higher Selectivity, a property deemed desirable for more interpretable probes.

We compare each layer-classifier with an oracle-classifier trained over the concatenation of all layers of the network. For all individual layers that perform close to the oracle (maintaining 99% of its performance in our results), we imply that they encode sufficient knowledge about the task and are therefore redundant in this context. Note that this does not necessarily imply that those layers are identical or that they represent the knowledge in a similar way – instead, they have redundant overall knowledge specific to the task at hand.

Figure 4: Task-specific layer-wise redundant layers, represented by the colored blocks. Appendix A.3 presents fine-grained graphs for a few tasks.

High redundancy for core linguistic tasks: Figure 4 shows the redundant layers that perform within a 1% performance threshold with respect to the oracle on each task. We found high layer-level redundancy for sequence labeling tasks. There are up to 11 redundant layers in BERT and up to 10 redundant layers in XLNet, across different tasks. This is expected, because the sequence labeling tasks considered here are core language tasks, and the information related to them is spread across the network. Comparing models, we found such core language information to be distributed amongst fewer layers in XLNet.

Substantially less redundancy for higher-level tasks: The amount of redundancy is substantially lower for sequence classification tasks, with RTE having the least number of redundant layers in both models. Especially in BERT, we did not find any layer that matched the oracle performance for RTE. It is interesting to observe that all the sequence classification tasks are learned at higher layers, and none of the lower layers were found to be redundant. These results are intuitive given that the sequence classification tasks require complex linguistic knowledge, such as long-range contextual dependencies, which is only learned at the higher layers of the model. Lower layers do not have sufficient sentence-level context to perform these tasks well.

XLNet is more redundant than BERT: While XLNet has slightly fewer redundant layers for sequence labeling tasks, on average across all downstream tasks it shows high layer-level task-specific redundancy. Having high redundancy for sequence-level tasks reflects that XLNet learns the higher-level concepts much earlier in the network, and this information is then passed on to all the subsequent layers. This also shows that XLNet is a much better candidate for model compression, where several higher layers can be pruned with marginal loss in performance, as shown by Sajjad et al. (2020).

6.2 Neuron-level Redundancy

Pretrained models, being universal feature extractors, contain redundant information with respect to a downstream task. We hypothesize that they may also contain information that is not necessary for the underlying task. In task-specific neuron analysis, we consider both redundant and irrelevant neurons as redundancy with respect to a task. Unlike layers, it is combinatorially intractable to exhaustively try all possible neuron subsets that can carry out a downstream task. We therefore aim at extracting only one minimal set of neurons that serves the purpose, and consider the remaining neurons redundant or irrelevant for the task at hand.

Formally, given a task and a set of neurons from a model, we perform feature selection to identify a minimal set of neurons that matches the oracle performance. To accomplish this, we use the Linguistic Correlation Analysis method (Dalvi et al., 2019) to rank neurons with respect to a downstream task, referred to as FS (feature selector) henceforth. For each downstream task, we concatenate representations from all L layers and use FS to extract a minimal set of top-ranked neurons that maintains the oracle performance within a defined threshold. The oracle is the task-specific classification performance obtained using all the neurons for training. The minimum set allows us to answer how many neurons are redundant or irrelevant to the given task. Tables 1 and 2 show the minimum set of top neurons for each task that maintains at least 97% of the oracle performance.
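The sketch below approximates this procedure: neurons are ranked by the magnitude of their weights in a linear probe (a simple proxy for the Linguistic Correlation Analysis ranking of Dalvi et al. (2019), which is more elaborate), and the smallest prefix of top-ranked neurons that stays within a threshold of the oracle accuracy is returned. The scikit-learn probe and the function names are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rank_neurons(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Rank neurons by the magnitude of their weights in a linear probe."""
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    importance = np.abs(probe.coef_).sum(axis=0)  # aggregate over classes
    return np.argsort(-importance)

def select_minimal_neurons(X_tr, y_tr, X_dev, y_dev, threshold=0.03, step=10):
    """Smallest set of top-ranked neurons within `threshold` of the oracle."""
    oracle = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_dev, y_dev)
    ranking = rank_neurons(X_tr, y_tr)
    for k in range(step, X_tr.shape[1] + 1, step):
        subset = ranking[:k]
        acc = LogisticRegression(max_iter=1000).fit(X_tr[:, subset], y_tr).score(
            X_dev[:, subset], y_dev)
        if acc >= (1 - threshold) * oracle:
            return subset, acc, oracle
    return ranking, oracle, oracle
```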

Task     BERT   XLNet
POS      290    280
SEM      330    290
CCG      330    690
Chunk.   750    660

Table 1: Task-specific neuron-level analysis for sequence labeling tasks (# neurons in the minimal set for BERT and XLNet).

Task     BERT   XLNet
SST-2    30     70
MRPC     190    170
MNLI     30     90
QNLI     40     20
QQP      10     20
RTE      320    400
STS-B    290    300

Table 2: Task-specific neuron-level analysis for sequence classification tasks (# neurons in the minimal set for BERT and XLNet).

Complex core language tasks require more neurons: CCG and Chunking are relatively complex tasks compared to POS and SEM. On average across both models, these complex tasks require more neurons than POS and SEM. It is interesting to see that the size of the minimum neuron set is correlated with the complexity of the task.

Less task-specific redundancy for core linguistic tasks compared to higher-level tasks: While the minimum set of neurons per task consists of a small percentage of the total neurons in the network, the core linguistic tasks require substantially more neurons compared to higher-level tasks (comparing Tables 1 and 2). It is remarkable that some sequence-level tasks require as few as 10 neurons to obtain the desired performance. One reason for the large difference in the size of the minimum set of neurons could be the nature of the tasks: since core linguistic tasks are word-level tasks, a much higher capacity is required in the pretrained model to store the knowledge for all of the words, while in the case of sequence classification tasks, the network learns to filter and mold the features to form fewer "high-level" sentence features.

7 Efficient Transfer Learning

In this section, we build upon the redundancy analysis presented in the previous sections and propose a novel method for efficient feature-based transfer learning. In a typical feature-based transfer learning setup, contextualized embeddings are first extracted from a pretrained model, and then a classifier is trained on the embeddings towards the downstream NLP task. The bulk of the computational expense is incurred from the following sources:

• A full forward pass over the pretrained model to extract the contextualized vector, a costly affair given the large number of parameters.
• Classifiers with large contextualized vectors are: a) cumbersome to train, b) inefficient during inference, and c) possibly sub-optimal when supervised data is insufficient (Hameed, 2018).

We propose a three-step process to target these two sources of computational bottleneck:

1. Use the task-specific layer-classifier (Section 6.1) to select the lowest layer that maintains oracle performance. Differently from the analysis, a concatenation of all layers up to the selected layer is used instead of just the individual layers.
2. Given the contextualized embeddings extracted in the previous step, use CC (Section 5.2) to filter out redundant neurons.
3. Apply FS (Section 6.2) to select a minimal set of neurons that are needed to achieve optimum performance on the task.

The three steps explicitly target task-specific layer redundancy, general neuron redundancy, and task-specific neuron redundancy, respectively. We refer to Step 1 as LayerSelector (LS), sketched below, and to Steps 2 and 3 together as CCFS (Correlation Clustering + Feature Selection). For all experiments, we use a performance threshold of 1% for each of LS and CCFS. It is worth mentioning that the trade-off between loss in accuracy and efficiency can be controlled through these thresholds, which can be adjusted to serve faster turn-around or better performance.
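Below is a minimal sketch of Step 1 (LayerSelector): probe increasingly deep prefixes of concatenated layers and return the lowest prefix whose development accuracy is within the threshold of the oracle (all layers concatenated). Steps 2 and 3 can reuse the clustering and feature-selection sketches from Sections 5.2 and 6.2. The scikit-learn probe and the function name layer_selector are our assumptions, not the released implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def layer_selector(layer_feats, y_tr, layer_feats_dev, y_dev, threshold=0.01):
    """Pick the lowest layer prefix within `threshold` of the oracle.

    layer_feats / layer_feats_dev: lists of (num_tokens, 768) matrices,
    one per layer (embedding layer first).
    """
    def score(train, dev):
        clf = LogisticRegression(max_iter=1000).fit(train, y_tr)
        return clf.score(dev, y_dev)

    oracle = score(np.hstack(layer_feats), np.hstack(layer_feats_dev))
    for k in range(1, len(layer_feats) + 1):
        acc = score(np.hstack(layer_feats[:k]), np.hstack(layer_feats_dev[:k]))
        if acc >= (1 - threshold) * oracle:
            return k, acc, oracle  # use layers 0..k-1 concatenated
    return len(layer_feats), oracle, oracle
```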

7.1 Results

Table 3 presents the average results on all sequence labeling and sequence classification tasks. Detailed per-task results are provided in Appendix A.5.1. As expected from our analysis, a significant portion of the network can be pruned by LS for sequence labeling tasks, using fewer than 6 layers out of 13 (embedding + 12 layers) for BERT and fewer than 3 layers for XLNet. Specifically, this reduces the parameters required for a forward pass for BERT by 65% for POS and SEM, 33% for CCG, and 39% for Chunking. On XLNet, LS led to an even larger reduction in parameters: 70% for POS and SEM, and 65% for CCG and Chunking. The results were less pronounced for sequence classification tasks, with LS using 11.6 layers for BERT and 8.1 layers for XLNet on average, out of 13 layers.

             Sequence Labeling     Sequence Classification
             BERT      XLNet       BERT      XLNet
Oracle       93.0%     93.4%       85.5%     84.8%
  Neurons    9984      9984        9984      9984
LS           92.3%     93.2%       85.0%     84.5%
  Layers     5.3       2.5         11.6      8.1
CCFS         92.0%     92.2%       84.0%     84.0%
  Neurons    425       400         90        150
  % Reduct.  95.7%     96.0%       99.0%     98.5%

Table 3: Average results using LS and CCFS with performance thresholds of 1% for each. Oracle uses a concatenation of all layers. Layers shows the average number of selected layers. Neurons is the final number of neurons (features) used for classification. % Reduct. shows the percentage reduction in neurons compared to the full network.

Applying CCFS on top of the reduced layers led to another round of significant efficiency improvements, with the number of neurons needed for the final classifier reducing to just 5% for sequence labeling tasks and 1.5% for sequence classification tasks. The final number of neurons is surprisingly low for some tasks compared to the initial 9984, with some tasks like QNLI using just 10 neurons.

More concretely, taking the POS task as an example: the pretrained oracle BERT model has 9984 features and 110M parameters. LS reduced the feature set to 2304 (embedding + 2 layers) and the number of parameters used in the forward pass to 37M. CCFS further reduced the feature set to 300, maintaining a performance close to oracle BERT's performance on this task (95.2% vs. 93.9%).

An interesting observation in Table 3 is that the sequence labeling tasks require fewer layers but a higher number of features, while sequence classification tasks follow the opposite pattern. As we go deeper into the network, the neurons become much richer and more tuned to the task at hand, and only a few of them are required compared to the much more word-focused neurons in the lower layers. These observations suggest pyramid-shaped architectures that have wider lower layers and narrower higher layers. Such a design choice leads to significant savings of capacity in higher layers, where a few rich neurons are sufficient for good performance. In terms of neuron-based compression methods, these findings suggest aggressive pruning of higher layers while preserving the lower layers when building smaller and accurate compressed models.

7.2 Efficiency Analysis

While the algorithm boosts the theoretical efficiency in terms of the number of parameters reduced and the final number of features, it is important to analyze how this translates to real-world performance. Using LS leads to an average speedup of 2.8x and 6.2x with BERT and XLNet respectively on sequence labeling tasks. On sequence classification tasks, the average speedups are 1.1x and 1.6x with BERT and XLNet respectively. Detailed results are provided in Appendix A.5.2.

Figure 5: BERT: Runtime of the classifier w.r.t. the number of neurons (features). The dots on the line mark the number of features selected by our method. Note that the X-axis is not linear; the lower half of the spectrum has been stretched for clarity.

For the classifier built on the reduced set, we simulate a test scenario with 100,000 tokens and compute the total runtime for 10 iterations of training. The numbers were computed on a 6-core 2.8 GHz AMD Opteron Processor 4184, and were averaged across 3 runs. Figure 5 shows the runtime of each run (in seconds) against the number of features selected. The runtime of the classifier reduced from 50 to 10 seconds in the case of BERT. This 5x speedup can be very useful in heavy-use scenarios where the classifier is queried a large number of times in a short duration.
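A runtime curve of this kind can be approximated by timing classifier training for increasing feature counts, for example with the sketch below (simulated data, scikit-learn's SGD-based logistic regression, and time.perf_counter). This toy measurement will not reproduce the exact numbers above, which come from the authors' own setup.

```python
import time
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
num_tokens, num_neurons = 10_000, 9984   # shrink for a quicker test
X = rng.normal(size=(num_tokens, num_neurons)).astype(np.float32)
y = rng.integers(0, 44, size=num_tokens)  # e.g. 44 POS tags

for k in (10, 100, 300, 1000, 9984):
    start = time.perf_counter()
    # On older scikit-learn versions, use loss="log" instead of "log_loss".
    SGDClassifier(loss="log_loss", penalty="elasticnet",
                  alpha=1e-5, max_iter=10).fit(X[:, :k], y)
    print(f"{k:>5} features: {time.perf_counter() - start:.1f}s")
```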

Training time efficiency: Although the focus of the current application is to improve inference-time efficiency, it is nevertheless important to understand how much computational complexity is added during training time. Let T be the total number of tokens in our training set, and N be the total number of neurons across all layers in a pretrained model. The application presented in this section consists of 5 steps.

1. Feature extraction from the pretrained model: extraction time scales linearly with the number of tokens T.
2. Training a classifier for every layer (LS): with a constant number of neurons N, training time per layer scales linearly with the number of tokens T.
3. Correlation clustering (CC): with a constant number of neurons N, running correlation clustering scales linearly with the number of tokens T.
4. Feature ranking: this step involves training a classifier with the reduced set of features, which scales linearly with the number of tokens T. Once the classifier is trained, its weights are used to extract a feature ranking, with the number of weights scaling linearly with the number of neurons N.
5. Minimal feature set: finding the minimal set of neurons is a brute-force search process, starting with a small number of neurons. For each set of neurons, a classifier is trained, the time for which scales linearly with the total number of tokens T. As the feature set size increases, the training time also goes up, as described in Figure 5.

Appendix A.5.3 provides additional experiments and results used to analyze the training time complexity of our application.
8 Conclusion and Future Directions

We defined a notion of redundancy and analyzed pretrained models for general redundancy and task-specific redundancy, exhibited at the layer-level and at the individual neuron-level. Our analysis of general redundancy showed that i) adjacent layers are most redundant in the network, with the exception of the final layers, which are close to the objective function, and ii) up to 85% and 92% of neurons are redundant in BERT and XLNet respectively. We further showed that networks exhibit varying amounts of task-specific redundancy, with higher layer-level redundancy for core language tasks compared to sequence-level tasks. We found that at least 92% of the neurons are redundant with respect to a downstream task. Based on our analysis, we proposed an efficient transfer learning procedure that directly targets layer-level and neuron-level redundancy to achieve efficiency in feature-based transfer learning.

While our analysis is helpful in understanding pretrained models, it also suggests interesting research directions towards building compact models and models with better architectural choices. For example, the high amount of neuron-level redundancy within the same layer suggests that layer-size compression might be more effective at reducing the pretrained model size while preserving oracle performance. Similarly, our finding that core-linguistic tasks are learned at lower layers and require a higher number of neurons, while sequence-level tasks are learned at higher layers and require fewer neurons, suggests pyramid-style architectures that have wide lower layers and compact higher layers, and may result in smaller models with performance competitive with large models.

Acknowledgements

This research was carried out in collaboration between the HBKU Qatar Computing Research Institute (QCRI) and the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Y.B. was also supported by the Harvard Mind, Brain, and Behavior Initiative (MBB).

References

Lasha Abzianidze and Johan Bos. 2017. Towards universal semantic tagging. In Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017) – Short Papers, pages 1–6, Montpellier, France.
Galen Andrew, Raman Arora, Jeff A. Bilmes, and Karen Livescu. 2013. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 1247–1255. JMLR.org.
Nikhil Bansal, Avrim Blum, and Shuchi Chawla. 2004. Correlation clustering. Machine Learning, 56(1-3):89–113.
D. Anthony Bau, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2019. Identifying and controlling important neurons in neural machine translation. In International Conference on Learning Representations (ICLR).
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver. Association for Computational Linguistics.
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2020. On the linguistic representational power of neural machine translation models. Computational Linguistics, 45(1):1–57.
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Second Text Analysis Conference, TAC 2009, Gaithersburg, Maryland, USA, November 16-17, 2009. NIST.
Diane Bouchacourt and Marco Baroni. 2018. How agents see things: On visual representations in an emergent language game. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 981–985, Brussels, Belgium. Association for Computational Linguistics.
Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, et al. 2020. Faster and just as accurate: A simple decomposition for transformer models. ICLR Openreview.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
Grzegorz Chrupała. 2019. Symbolic inductive bias for visually grounded learning of spoken language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6452–6462, Florence, Italy. Association for Computational Linguistics.
Grzegorz Chrupała and Afra Alishahi. 2019. Correlating neural and symbolic representations of language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2952–2962, Florence, Italy. Association for Computational Linguistics.
Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).
Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, D. Anthony Bau, and James Glass. 2019. What is one grain of sand in the desert? Analyzing individual neurons in deep NLP models. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI, Oral presentation).
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.
William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Nadir Durrani, Fahim Dalvi, Hassan Sajjad, Yonatan Belinkov, and Preslav Nakov. 2019. One size does not fit all: Comparing NMT representations of different granularities. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1504–1516, Minneapolis, Minnesota. Association for Computational Linguistics.
Nadir Durrani, Hassan Sajjad, Fahim Dalvi, and Yonatan Belinkov. 2020. Analyzing individual neurons in pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP-2020), Online. Association for Computational Linguistics.
Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.
Angela Fan, Edouard Grave, and Armand Joulin. 2020. Reducing transformer depth on demand with structured dropout. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2020, Online, July 9, 2020, pages 143–155. Association for Computational Linguistics.
Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182.
Shilan Hameed. 2018. Filter-wrapper combination and embedded feature selection for gene expression data. International Journal of Advances in Soft Computing and its Applications, 10:90–105.
Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4141–4150, Hong Kong, China. Association for Computational Linguistics.
John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.
Julia Hockenmaier. 2006. Creating a CCGbank and a wide-coverage CCG lexicon for German. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, ACL '06, pages 505–512, Sydney, Australia.
Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.
Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3519–3529, Long Beach, California, USA. PMLR.
Nikolaus Kriegeskorte, Marieke Mur, and Peter Bandettini. 2008. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2:4.
Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 14014–14024. Curran Associates, Inc.
Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-theoretic probing for linguistic structure. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4609–4622. Association for Computational Linguistics.
Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Analyzing linguistic knowledge in sequential model of sentence. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 826–835, Austin, Texas. Association for Computational Linguistics.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6078–6087. Curran Associates, Inc.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. Poor man's BERT: Smaller and faster transformer models. ArXiv, abs/2004.03844.
Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Fourth Conference on Computational Natural Language Learning, CoNLL 2000, and the Second Learning Language in Logic Workshop, LLL 2000, Held in cooperation with ICGI-2000, Lisbon, Portugal, September 13-14, 2000, pages 127–132. ACL.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. 2020. Q-BERT: Hessian based ultra low precision quantization of BERT. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8815–8821. AAAI Press.
Xing Shi, Kevin Knight, and Deniz Yuret. 2016a. Why neural translations are the right length. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2278–2282, Austin, Texas. Association for Computational Linguistics.
Xing Shi, Inkit Padhi, and Kevin Knight. 2016b. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1526–1534. The Association for Computational Linguistics.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
Xavier Suau, Luca Zappella, and Nicholas Apostoloff. 2020. Finding experts in transformer models. CoRR, abs/2005.07647.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4322–4331, Hong Kong, China. Association for Computational Linguistics.
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.
Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
John M. Wu, Yonatan Belinkov, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James R. Glass. 2020. Similarity analysis of contextual word representation models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4638–4655. Association for Computational Linguistics.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 5754–5764.

A Appendices Task Train Dev

A.1 Data SST-2 67349 872


MRPC 3668 408
For Sequence labeling tasks, we use the first MNLI 392702 9815
150,000 tokens for training, and standard devel- QNLI 104743 5463
opment and test data for all of the four tasks QQP 363846 40430
(POS, SEM, CCG super tagging and Chunking). RTE 2490 277
The links to all datasets is provided in the code STS-B 5749 1500
README instructions. The statistics for the
datasets are provided in Table 4. Table 5: Data statistics (number of sequences) on the
official training and development sets used in the ex-
Task Train Dev Test Tags periments. All tasks are binary classification tasks, ex-
cept for STS-B which is a regression task. Recall that
POS 149973 44320 47344 44 the test sets are not publicly available, and hence we
SEM 149986 112537 226426 73 use 10% of the official train as development, and the
Chunking 150000 44346 47372 22 official development set as our test set. Exact split in-
CCG 149990 45396 55353 1272 formation is provided in the code README.

Table 4: Data statistics (number of tokens) on training, A.3 Task-Specific Layer-wise redundancy
development and test sets used in the experiments and
the number of tags to be predicted Tables 7a and 7a provide detailed results used to
produce the illustrations in Figure 4.
For the sequence classification tasks, we study Figures 7, 8 and 9 show the layer-wise task-
tasks from the GLUE benchmark (Wang et al., specific redundancy for individual classes within
2018), namely i) sentiment analysis (SST-2) us- POS, SEM and Chunking respectively. We do not
ing the Stanford sentiment treebank (Socher et al., present these fine-grained plots for CCG (over 1000
2013), ii) semantic equivalence classification using classes) or sequence classification tasks (binary
the Microsoft Research paraphrase corpus (MRPC) classification only).
(Dolan and Brockett, 2005), iii) natural language in- A.4 Task-Specific Neuron-level Redundancy
ference corpus (MNLI) (Williams et al., 2018), iv)
question-answering NLI (QNLI) using the SQUAD Tables 8a and 8b provide the per-task detailed re-
dataset (Rajpurkar et al., 2016), iv) question pair sults along with reduced accuracies after running
similarity using the Quora Question Pairs7 dataset task-specific neuron-level redundancy analysis.
(QQP), v) textual entailment using recognizing A.5 Application: Efficient Feature Selection
textual entailment dataset(RTE) (Bentivogli et al.,
A.5.1 Transfer Learning Detailed Results
2009), and vi) semantic textual similarity using the
STS-B dataset (Cer et al., 2017). The statistics for Tables 9 and 10 show the detailed per-task results
the datasets are provided in Table 5. for our proposed feature selection algorithm.
A.2 General Neuron-level Redundancy

Table 6 presents the detailed results for the illustrations in Figures 2a and 2b. As a concrete example, 6 out of 12 tasks (POS, SEM, CCG, Chunking, SST-2, STS-B) can tolerate a reduction of more than 85% in the number of neurons (threshold = 0.7) with very little loss in performance.

Figure 6 visualizes heatmaps of a few neurons that belong to the same cluster built using CC at ct = 0.3, as a qualitative example of a cluster.

Figure 6: Redundant neurons as clustered by correlation clustering on two sentences. The dark red and dark blue
refer to high negative and positive activation values respectively.
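
For concreteness, the following is a minimal sketch of one way to realize the correlation clustering (CC) step referenced above with a clustering threshold ct. The use of average-linkage agglomerative clustering over 1 − |correlation| distances and the choice of keeping the first neuron of each cluster are illustrative assumptions, not necessarily the exact implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_neurons(activations, ct=0.3):
    """Group redundant neurons by clustering their activation vectors.

    activations: T x N matrix (T tokens, N neurons).
    Returns one representative neuron index per cluster (an assumption;
    other representative-selection strategies are possible).
    """
    # Pairwise Pearson correlation between neurons (columns).
    corr = np.corrcoef(activations, rowvar=False)
    # Neurons whose activations are highly (anti-)correlated are nearly
    # redundant, so use 1 - |correlation| as the distance.
    dist = np.clip(1.0 - np.abs(corr), 0.0, None)
    # Condensed upper-triangular distances for agglomerative clustering.
    condensed = dist[np.triu_indices_from(dist, k=1)]
    cluster_ids = fcluster(
        linkage(condensed, method="average"), t=ct, criterion="distance"
    )
    # Keep the first neuron of each cluster as its representative.
    return [np.where(cluster_ids == c)[0][0] for c in np.unique(cluster_ids)]
```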

A.3 Task-Specific Layer-wise Redundancy

Tables 7a and 7b provide the detailed results used to produce the illustrations in Figure 4.

Figures 7, 8 and 9 show the layer-wise task-specific redundancy for individual classes within POS, SEM and Chunking respectively. We do not present these fine-grained plots for CCG (over 1000 classes) or the sequence classification tasks (binary classification only).

A.4 Task-Specific Neuron-level Redundancy

Tables 8a and 8b provide the per-task detailed results, along with reduced accuracies, after running the task-specific neuron-level redundancy analysis.

A.5 Application: Efficient Feature Selection

A.5.1 Transfer Learning Detailed Results

Tables 9 and 10 show the detailed per-task results for our proposed feature selection algorithm.

A.5.2 Pretrained Model Timing Analysis

The average runtime per instance was computed by dividing the total number of seconds taken to run the forward pass for all batches by the total number of sentences. All computation was done on an NVidia GeForce GTX TITAN X, and the numbers are averaged across 3 runs. Figures 10 and 11 show the results for various numbers of layers (with the selected layer highlighted for each task).
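
A minimal sketch of this per-instance runtime measurement is given below; the model and batch objects are hypothetical placeholders, and only the division of total forward-pass seconds by the number of sentences (averaged over runs) is taken from the setup above.

```python
import time

import torch


def average_runtime_per_instance(model, batches, num_runs=3):
    """Total forward-pass seconds over all batches divided by the total
    number of sentences, averaged across multiple runs."""
    model.eval()
    total_sentences = sum(batch["input_ids"].size(0) for batch in batches)
    per_run = []
    for _ in range(num_runs):
        start = time.time()
        with torch.no_grad():
            for batch in batches:
                model(**batch)  # forward pass only, no gradients
        per_run.append((time.time() - start) / total_sentences)
    return sum(per_run) / len(per_run)
```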
A.5.3 Training Time Analysis

Figures 12, 13 and 14 show the runtimes of the various steps of the proposed efficient feature selection for the transfer learning application. Extraction of features and correlation clustering both scale linearly as the number of input tokens increases, while ranking the various features scales linearly with the total number of features.
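
The feature-ranking step mentioned here (and timed in Figure 14) uses the weights of a trained classifier. The following is a minimal sketch; ranking neurons by the magnitude of logistic-regression weights is an assumption about the exact criterion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def rank_features(features, labels):
    """Rank neurons (features) by the weight magnitude a linear classifier
    assigns to them; higher-weighted neurons are treated as more important."""
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    # Aggregate absolute weights across classes for multi-class tasks.
    importance = np.abs(clf.coef_).sum(axis=0)
    return np.argsort(-importance)  # neuron indices, most important first
```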

A.6 Centered Kernel Alignment

For layer-level redundancy, we compare representations from various layers using linear Centered Kernel Alignment (cka; Kornblith et al., 2019). Here, we briefly present the mathematical definitions behind cka. Let $Z$ denote a column-centering transformation. As denoted in the paper, $z_j^i$ represents the contextualized embedding for some word $w_j$ at some layer $l_i$. Let $z^i$ represent the contextual embeddings over all $T$ words, i.e. it is of size $T \times N$ (where $N$ is the total number of neurons). Given two layers $x$ and $y$, let

$$X, Y = Zz^x, Zz^y.$$

The cka similarity is then

$$\mathrm{cka}(z^x, z^y) := \frac{\|X^\top Y\|^2}{\|X^\top X\|\,\|Y^\top Y\|}$$

where $\|\cdot\|$ is the Frobenius norm.
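
A minimal NumPy sketch of the linear cka computation defined above is given below; the function and variable names are illustrative.

```python
import numpy as np


def linear_cka(z_x, z_y):
    """Linear CKA between two layers' representations (each of size T x N)."""
    # Column-center each representation (the Z transformation above).
    X = z_x - z_x.mean(axis=0, keepdims=True)
    Y = z_y - z_y.mean(axis=0, keepdims=True)
    numerator = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator
```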

Figure 7: Layer-wise task-specific redundancy for the POS task ((a) BERT, (b) XLNet). Redundant layers are represented by the colored blocks.

Figure 8: Layer-wise task-specific redundancy for the SEM task ((a) BERT, (b) XLNet). Redundant layers are represented by the colored blocks.

Threshold POS SEM CCG Chunking SST-2 MRPC MNLI QNLI QQP RTE STS-B Average
0.0 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984
0.0 95.7% 92.0% 89.8% 94.5% 90.5% 85.8% 81.7% 90.3% 91.2% 70.0% 89.5% 88.3%
0.1 6841 6809 6844 6749 7415 9441 9398 8525 8993 9647 8129 8072
0.1 95.4% 92.3% 90.3% 94.8% 89.8% 86.3% 81.7% 90.2% 91.2% 69.3% 89.7% 88.3%
0.2 4044 4045 4052 4008 6207 8486 8376 7225 7697 8705 6377 6293
0.2 95.9% 92.9% 90.6% 95.0% 90.6% 86.8% 81.7% 90.1% 91.2% 69.0% 89.6% 88.5%
0.3 2556 2566 2570 2573 4994 7328 7049 6131 6413 7157 4949 4935
0.3 96.2% 93.1% 91.3% 95.1% 90.6% 86.0% 81.8% 89.9% 91.1% 67.1% 89.5% 88.3%
0.4 1729 1752 1729 1709 3812 5779 5681 4961 5077 5587 3674 3772
0.4 96.2% 93.3% 91.4% 95.2% 90.4% 86.5% 81.7% 89.4% 91.0% 67.5% 89.3% 88.4%
0.5 1215 1190 1221 1217 2746 4420 4289 3747 3789 4241 2721 2800
0.5 96.4% 93.2% 91.6% 94.9% 90.3% 86.3% 81.6% 89.6% 91.1% 66.4% 89.0% 88.2%
0.6 876 869 873 876 1962 3287 3041 2712 2767 3170 1962 2036
0.6 96.2% 93.3% 91.5% 94.4% 90.0% 85.5% 81.8% 89.7% 91.1% 66.8% 88.8% 88.1%
0.7 792 789 792 795 1404 2258 2025 1867 1907 2315 1419 1488
0.7 96.2% 93.2% 91.6% 94.1% 89.8% 86.3% 81.7% 89.3% 91.1% 69.0% 87.8% 88.2%
0.8 764 758 762 748 982 1367 1239 1191 1226 1531 982 1050
0.8 96.1% 93.2% 91.3% 94.0% 89.2% 85.0% 80.6% 88.3% 90.0% 62.8% 82.6% 86.7%
0.9 443 378 429 357 778 812 798 797 814 854 785 659
0.9 95.6% 91.8% 89.9% 91.0% 56.5% 70.3% 53.2% 80.0% 77.6% 59.2% 32.5% 72.5%

(a) BERT
Threshold POS SEM CCG Chunking SST-2 MRPC MNLI QNLI QQP RTE STS-B Average
0.0 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984 9984
0.0 96.2% 91.8% 90.6% 93.5% 93.2% 86.5% 78.9% 89.1% 87.4% 69.7% 89.0% 87.8%
0.1 9019 9021 9046 8941 7435 9206 7913 8056 5844 9931 9125 9006.75
0.1 96.3% 92.2% 90.7% 93.9% 93.0% 86.5% 80.3% 89.2% 89.7% 71.8% 89.0% 88.4%
0.2 5338 5392 5346 5302 6257 7685 6668 7393 4952 9244 8011 5344.5
0.2 96.2% 92.3% 90.5% 93.9% 93.0% 86.8% 80.4% 89.9% 90.2% 70.4% 88.9% 88.4%
0.3 3646 3651 3660 3606 5206 6241 5988 6613 4482 7635 6407 3640.75
0.3 96.2% 92.5% 91.0% 93.8% 92.9% 86.8% 80.8% 89.8% 90.1% 71.5% 88.7% 88.5%
0.4 2592 2571 2599 2573 4181 4896 5252 5583 3987 5996 4932 2583.75
0.4 96.3% 92.7% 90.8% 93.7% 93.1% 88.0% 81.0% 89.7% 90.1% 70.4% 88.5% 88.6%
0.5 1754 1746 1756 1758 3207 3675 4172 4426 3271 4573 3669 1753.5
0.5 96.5% 92.8% 91.3% 94.4% 93.2% 87.7% 80.8% 89.6% 90.1% 71.8% 88.3% 88.8%
0.6 1090 1085 1091 1072 2355 2549 2905 3248 2370 3346 2666 1084.5
0.6 96.7% 93.0% 91.8% 93.8% 93.1% 88.0% 81.0% 90.4% 90.0% 70.4% 88.4% 88.8%
0.7 833 833 830 824 1663 1735 1883 2224 1627 2348 1859 830
0.7 96.6% 93.0% 91.9% 93.2% 92.0% 88.2% 79.9% 90.1% 89.7% 71.1% 87.7% 88.5%
0.8 773 775 773 762 1127 1108 1189 1399 1091 1469 1232 770.75
0.8 96.5% 92.9% 91.9% 93.0% 92.4% 85.5% 77.3% 89.4% 87.4% 69.3% 84.5% 87.3%
0.9 470 412 471 414 799 790 805 839 791 832 801 441.75
0.9 96.0% 91.5% 91.0% 90.5% 84.4% 75.0% 65.8% 79.7% 88.3% 63.9% 46.6% 79.3%

(b) XLNet

Table 6: Accuracies and number of neurons across all tasks after running correlation clustering. For each clustering threshold, the first row gives the number of retained neurons and the second row gives the resulting accuracy. Recall that the clustering is run without any task-specific labels, and the evaluation is done across all tasks to analyze the efficacy of correlation clustering as a method to remove redundant neurons.

POS SEM CCG Chunking SST-2 MRPC MNLI QNLI QQP RTE STS-B
Oracle 95.2% 92.0% 90.1% 94.6% 90.6% 86.0% 81.7% 90.2% 91.2% 69.3% 89.7%
1% Loss 94.2% 91.1% 89.2% 93.6% 89.7% 85.2% 80.9% 89.3% 90.2% 68.6% 88.8%
Embedding 89.6% 81.5% 70.0% 77.5% 50.9% 68.4% 31.8% 49.5% 63.2% 52.7% 0.0%
Layer 1 93.1% 87.6% 78.9% 82.1% 78.4% 68.9% 42.8% 59.7% 71.4% 52.7% 6.0%
Layer 2 95.3% 91.7% 86.6% 91.0% 80.2% 71.3% 45.0% 61.2% 73.3% 56.0% 10.4%
Layer 3 95.5% 92.3% 88.0% 92.0% 80.6% 69.6% 54.0% 74.4% 77.2% 54.9% 54.5%
Layer 4 96.0% 93.0% 89.6% 94.0% 81.2% 75.5% 61.8% 81.3% 80.1% 55.6% 84.9%
Layer 5 96.0% 93.2% 90.4% 94.0% 82.3% 76.2% 65.9% 82.9% 84.4% 59.6% 85.8%
Layer 6 96.3% 93.4% 91.6% 94.9% 86.2% 77.5% 71.6% 83.2% 85.8% 62.1% 86.4%
Layer 7 96.2% 93.3% 91.9% 95.1% 88.6% 79.4% 74.9% 83.8% 86.9% 62.5% 86.8%
Layer 8 96.0% 93.1% 91.9% 94.8% 90.6% 77.5% 76.4% 84.4% 87.1% 63.5% 87.1%
Layer 9 95.8% 92.9% 91.6% 94.5% 90.5% 83.3% 79.8% 84.8% 87.7% 63.2% 87.0%
Layer 10 95.6% 92.5% 91.2% 94.1% 90.6% 82.6% 80.3% 86.1% 89.0% 64.3% 87.3%
Layer 11 95.4% 92.3% 90.9% 93.9% 90.4% 85.8% 81.7% 89.8% 91.0% 66.4% 88.9%
Layer 12 95.1% 92.0% 90.2% 93.2% 90.1% 87.3% 82.0% 90.4% 91.1% 66.1% 89.7%

(a) BERT
POS SEM CCG Chunking SST-2 MRPC MNLI QNLI QQP RTE STS-B
Oracle 95.9% 92.5% 90.8% 94.2% 92.4% 86.5% 78.9% 88.7% 87.2% 71.1% 88.9%
1% Loss 95.0% 91.5% 89.9% 93.3% 91.5% 85.7% 78.1% 87.8% 86.4% 70.4% 88.0%
Embedding 89.5% 82.6% 70.5% 77.0% 50.9% 68.4% 32.7% 50.5% 63.2% 52.7% 0.6%
Layer 1 96.3% 92.9% 88.7% 90.8% 79.6% 70.6% 44.2% 58.9% 72.0% 47.3% 8.8%
Layer 2 96.7% 93.6% 91.0% 93.4% 81.1% 70.1% 45.1% 58.6% 73.8% 45.8% 11.0%
Layer 3 96.8% 93.5% 91.8% 94.2% 84.7% 71.1% 61.6% 74.2% 82.4% 47.3% 81.1%
Layer 4 96.7% 93.4% 92.1% 94.2% 88.3% 76.0% 63.7% 74.1% 85.0% 53.1% 82.8%
Layer 5 96.6% 93.2% 92.4% 93.9% 88.6% 79.4% 68.4% 81.3% 89.2% 62.1% 84.9%
Layer 6 96.3% 92.6% 92.0% 94.2% 90.1% 83.1% 73.9% 83.3% 89.9% 63.5% 85.9%
Layer 7 96.1% 92.3% 91.9% 94.0% 92.9% 85.3% 79.1% 88.1% 89.9% 67.1% 86.7%
Layer 8 95.8% 91.9% 91.6% 93.5% 93.6% 87.7% 80.7% 90.0% 89.2% 65.0% 87.6%
Layer 9 95.3% 91.6% 91.4% 93.1% 94.2% 87.5% 80.1% 90.3% 88.4% 69.3% 88.2%
Layer 10 94.9% 91.2% 90.8% 92.1% 93.8% 86.5% 80.1% 90.4% 88.9% 71.8% 88.2%
Layer 11 94.6% 90.8% 90.2% 91.1% 94.5% 86.8% 80.1% 90.5% 88.5% 71.8% 88.5%
Layer 12 92.0% 87.4% 86.0% 85.9% 93.8% 86.5% 80.8% 90.6% 89.3% 71.1% 88.5%

(b) XLNet

Table 7: Task-specific layer-wise results across all tasks. The oracle is trained on all 13 layers combined. Bold numbers highlight the layers for each task that maintain 99% of the oracle's performance.

(a) BERT

Task              Oracle  #Neurons  Reduced Accuracy
POS               95.7%   290       94.3%
SEM               92.2%   330       90.8%
CCG               89.9%   330       88.7%
Chunking          94.4%   750       93.8%
Word Average      93.1%   425       91.9%
SST-2             90.6%   30        88.4%
MRPC              86.3%   190       85.0%
MNLI              81.7%   30        81.8%
QNLI              90.3%   40        89.1%
QQP               91.2%   10        90.8%
RTE               69.7%   320       68.6%
STS-B             89.6%   290       88.3%
Sentence Average  85.6%   130       84.6%

(b) XLNet

Task              Oracle  #Neurons  Reduced Accuracy
POS               96.1%   280       95.6%
SEM               92.2%   290       91.1%
CCG               90.2%   690       89.8%
Chunking          94.1%   660       93.0%
Word Average      93.2%   480       92.4%
SST-2             92.9%   70        91.3%
MRPC              85.8%   170       85.0%
MNLI              79.0%   90        77.9%
QNLI              88.3%   20        88.5%
QQP               87.4%   20        88.0%
RTE               70.4%   400       71.1%
STS-B             88.9%   300       86.6%
Sentence Average  84.7%   152       84.1%

Table 8: Accuracies after running linguistic correlation analysis and extracting the minimal set of neurons from all 9984 neurons.

Figure 9: Layer-wise task-specific redundancy for the Chunking task ((a) BERT, (b) XLNet). Redundant layers are represented by the colored blocks.

           POS    SEM    CCG    Chunking
BERT
Oracle     95.2%  92.0%  90.1%  94.6%
Neurons    9984
LS         94.8%  91.2%  89.2%  94.0%
Layers     3      3      8      7
CCFS       93.9%  90.1%  90.2%  93.7%
Neurons    300    400    400    600
% Reduct.  97%↓   96%↓   96%↓   94%↓
XLNet
Oracle     95.9%  92.5%  90.8%  94.2%
Neurons    9984
LS         96.3%  92.9%  90.3%  93.5%
Layers     2      2      3      3
CCFS       95.6%  91.9%  89.5%  91.8%
Neurons    300    400    300    600
% Reduct.  97%↓   96%↓   97%↓   94%↓

Table 9: Results of sequence labeling tasks using LayerSelector (LS) with performance threshold = 1 and CCFS with performance threshold = 1. Oracle is using a concatenation of all layers. Layers shows the number of the selected layer. Neurons are the final number of neurons (features) used for classification. % Reduct. shows the percentage reduction in neurons compared to the full network.

Figure 10: Average runtime per instance computed across all sequence classification tasks for BERT. Sequence classification tasks all have a near 2x speedup, while most sequence labeling tasks have a 1.08x speedup.

Figure 11: Average runtime per instance computed across all sequence classification tasks for XLNet. Sequence classification tasks all have a near 2x speedup, while most sequence labeling tasks have a 1.08x speedup.

             SST-2   MRPC    MNLI    QNLI    QQP     RTE     STS-B
BERT
Oracle       90.6%   86.0%   81.7%   90.2%   91.2%   69.3%   89.7%
Neurons      9984
LS           88.2%   86.0%   81.6%   89.9%   90.9%   69.3%   89.1%
Layers       8       12      12      12      12      13      12
CCFS         87.0%   86.3%   81.3%   89.1%   89.9%   65.7%   88.6%
Neurons      30      100     30      10      20      30      400
% Reduction  99.7%↓  99.0%↓  99.7%↓  99.9%↓  99.8%↓  99.9%↓  96.0%↓
XLNet
Oracle       92.4%   86.5%   78.9%   88.7%   87.2%   71.1%   88.9%
Neurons      9984
LS           88.2%   86.0%   79.9%   88.8%   89.3%   71.1%   88.1%
Layers       6       9       8       8       6       11      9
CCFS         87.5%   89.0%   78.4%   88.3%   88.8%   69.0%   87.2%
Neurons      50      100     50      200     100     100     400
% Reduction  99.5%↓  99.0%↓  99.5%↓  98.0%↓  99.0%↓  99.0%↓  96.0%↓

Table 10: Results of sequence classification tasks using LayerSelector (LS) with performance threshold = 1 and CCFS with performance threshold = 1. Oracle is using a concatenation of all layers. Layers shows the number of the selected layer. Neurons are the final number of neurons (features) used for classification. % Reduction shows the percentage reduction in neurons compared to the full network.
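
For illustration, a minimal sketch of the layer selection step is given below. The assumption that LS scans layers from lowest to highest and returns the first one whose development score stays within the performance threshold of the oracle (falling back to the best single layer) is a simplification, not necessarily the exact procedure.

```python
def select_layer(layer_dev_scores, oracle_score, threshold=1.0):
    """Hypothetical LayerSelector sketch: return the lowest layer whose dev
    score stays within `threshold` points of the oracle (concatenation of
    all layers); fall back to the best single layer otherwise."""
    for layer, score in enumerate(layer_dev_scores):
        if score >= oracle_score - threshold:
            return layer
    return max(range(len(layer_dev_scores)), key=lambda l: layer_dev_scores[l])
```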

Figure 12: Runtime vs. number of examples when extracting contextual embeddings using BERT.

Figure 13: Runtime vs. number of examples when performing correlation clustering.

Figure 14: Runtime vs. number of features when performing feature ranking using the weights of a trained classifier.

