Learning Multiscale Transformer Models for Sequence Generation

Bei Li, Tong Zheng, Yi Jing, Chengbo Jiao, Tong Xiao, Jingbo Zhu

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:13225-13241, 2022.

Abstract

Multiscale feature hierarchies have proven successful in computer vision. This has motivated researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or by extracting local fine-grained features via convolutions. However, most existing work models local features directly and ignores word-boundary information. This results in redundant and ambiguous attention distributions that lack interpretability. In this work, we define the scales in terms of different linguistic units: sub-words, words, and phrases. We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge. The proposed Universal MultiScale Transformer (UMST) was evaluated on two sequence generation tasks. Notably, it yielded consistent performance gains over a strong baseline on several test sets without sacrificing efficiency.

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-li22ac,
  title     = {Learning Multiscale Transformer Models for Sequence Generation},
  author    = {Li, Bei and Zheng, Tong and Jing, Yi and Jiao, Chengbo and Xiao, Tong and Zhu, Jingbo},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {13225--13241},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/li22ac/li22ac.pdf},
  url       = {https://proceedings.mlr.press/v162/li22ac.html},
  abstract  = {Multiscale feature hierarchies have proven successful in computer vision. This has motivated researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or by extracting local fine-grained features via convolutions. However, most existing work models local features directly and ignores word-boundary information. This results in redundant and ambiguous attention distributions that lack interpretability. In this work, we define the scales in terms of different linguistic units: sub-words, words, and phrases. We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge. The proposed \textbf{U}niversal \textbf{M}ulti\textbf{S}cale \textbf{T}ransformer, namely \textsc{Umst}, was evaluated on two sequence generation tasks. Notably, it yielded consistent performance gains over a strong baseline on several test sets without sacrificing efficiency.}
}

Endnote

%0 Conference Paper
%T Learning Multiscale Transformer Models for Sequence Generation
%A Bei Li
%A Tong Zheng
%A Yi Jing
%A Chengbo Jiao
%A Tong Xiao
%A Jingbo Zhu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-li22ac
%I PMLR
%P 13225--13241
%U https://proceedings.mlr.press/v162/li22ac.html
%V 162
%X Multiscale feature hierarchies have proven successful in computer vision. This has motivated researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or by extracting local fine-grained features via convolutions. However, most existing work models local features directly and ignores word-boundary information. This results in redundant and ambiguous attention distributions that lack interpretability. In this work, we define the scales in terms of different linguistic units: sub-words, words, and phrases. We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge. The proposed Universal MultiScale Transformer (UMST) was evaluated on two sequence generation tasks. Notably, it yielded consistent performance gains over a strong baseline on several test sets without sacrificing efficiency.

APA


Li, B., Zheng, T., Jing, Y., Jiao, C., Xiao, T., & Zhu, J. (2022). Learning Multiscale Transformer Models for Sequence Generation. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:13225-13241. Available from https://proceedings.mlr.press/v162/li22ac.html.
