A Neural CRF-based Hierarchical Approach for Linear Text Segmentation

Inderjeet Nair, Aparna Garimella, Balaji Vasan Srinivasan, Natwar Modani, Niyati Chhaya, Srikrishna Karanam, Sumit Shekhar

Abstract

We consider the problem of linearly segmenting unformatted text and transcripts according to their topical structure. While prior approaches explicitly train to predict segment boundaries, our proposed approach solves this task by inferring the hierarchical segmentation structure associated with the input text fragment. Given the lack of a large annotated dataset for this task, we propose a data curation strategy and create a corpus of over 700K Wikipedia articles with their hierarchical structures. We then propose the first supervised approach to generating hierarchical segmentation structures based on these annotations. Our method is based on a neural conditional random field (CRF), which explicitly models the statistical dependency between a node and its constituent child nodes. As part of our model training strategy, we introduce a new data augmentation scheme that samples a variety of node aggregations, permutations, and removals, all of which help capture both fine-grained and coarse topical shifts in the data and improve model performance. Extensive experiments show that our model outperforms or is competitive with previous state-of-the-art algorithms in the following settings: rich-resource, cross-domain transferability, few-shot supervision, and segmentation when topic label annotations are provided.
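
As a rough illustration of the idea in the abstract (not the authors' implementation), the sketch below shows how a hierarchical section structure can be cut at a chosen depth to yield a linear segmentation, and what node aggregation, permutation, and removal could look like on such a tree. The `Node` class and all function names are hypothetical, the toy document is made up, and the neural CRF itself is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List
import random


@dataclass
class Node:
    """Hypothetical section node: its own sentences plus nested child sections."""
    sentences: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def leaves_text(node: Node) -> List[str]:
    """All sentences under `node`, in document order."""
    out = list(node.sentences)
    for child in node.children:
        out.extend(leaves_text(child))
    return out


def linear_segments(node: Node, depth: int) -> List[List[str]]:
    """Cut the tree at `depth` and return the resulting flat list of segments."""
    if depth == 0 or not node.children:
        return [leaves_text(node)]
    segments = []
    if node.sentences:
        segments.append(list(node.sentences))
    for child in node.children:
        segments.extend(linear_segments(child, depth - 1))
    return segments


def boundaries(segments: List[List[str]]) -> List[int]:
    """Per-sentence labels: 1 marks the last sentence of a segment, else 0."""
    labels = []
    for seg in segments:
        if seg:  # skip empty segments
            labels.extend([0] * (len(seg) - 1) + [1])
    return labels


# Assumed forms of the three augmentations named in the abstract.
def aggregate(node: Node) -> Node:
    """Collapse a subtree into a single leaf (a coarser topical unit)."""
    return Node(sentences=leaves_text(node))


def permute(node: Node, rng: random.Random) -> Node:
    """Return a copy of `node` with its child sections shuffled."""
    kids = list(node.children)
    rng.shuffle(kids)
    return Node(sentences=list(node.sentences), children=kids)


def remove(node: Node, index: int) -> Node:
    """Return a copy of `node` with the child at `index` dropped."""
    kids = [c for i, c in enumerate(node.children) if i != index]
    return Node(sentences=list(node.sentences), children=kids)


# Toy document: section A has two subsections, section B is flat.
doc = Node(children=[
    Node(sentences=["a1", "a2"],
         children=[Node(["a3"]), Node(["a4", "a5"])]),
    Node(sentences=["b1"]),
])

coarse = linear_segments(doc, 1)  # top-level sections only
fine = linear_segments(doc, 2)    # subsections become their own segments
```

Cutting the same tree at different depths gives coarse or fine segmentations of the same text, which is one way to read the abstract's point that a hierarchical structure captures both kinds of topical shift.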

Anthology ID: 2023.findings-eacl.65
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 883–893
URL: https://aclanthology.org/2023.findings-eacl.65
DOI: 10.18653/v1/2023.findings-eacl.65
Cite (ACL): Inderjeet Nair, Aparna Garimella, Balaji Vasan Srinivasan, Natwar Modani, Niyati Chhaya, Srikrishna Karanam, and Sumit Shekhar. 2023. A Neural CRF-based Hierarchical Approach for Linear Text Segmentation. In Findings of the Association for Computational Linguistics: EACL 2023, pages 883–893, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): A Neural CRF-based Hierarchical Approach for Linear Text Segmentation (Nair et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.65.pdf
Video: https://aclanthology.org/2023.findings-eacl.65.mp4
