Abstract
Many information access applications have to handle natural language texts that contain a large proportion of repeated and mostly invariable patterns, called boilerplate, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases and typically span one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this.
We propose a method that detects such motifs automatically and without supervision, and enriches the document representation with specific features for these motifs. We show experimentally that this document recoding strategy improves classification on different collections.
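To make the idea of detection followed by recoding concrete, here is a minimal sketch in Python. It is not the method of the paper, which the abstract does not specify. As a simplified stand-in, it assumes a motif is any fixed-length word n-gram that occurs in at least `min_docs` documents, and that recoding means adding one extra feature per detected motif to a bag-of-words representation. The function names, the `MOTIF::` feature prefix and the toy documents are all hypothetical.

```python
from collections import Counter


def find_repeated_ngrams(docs, n=4, min_docs=2):
    """Return the word n-grams that occur in at least `min_docs` documents.

    Assumption: a fixed-length n-gram shared across documents stands in
    for a boilerplate motif. No labels are used, so this is unsupervised.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Use a set so each n-gram counts once per document.
        seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(seen)
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def recode(tokens, motifs, n=4):
    """Bag-of-words counts plus one extra feature per motif occurrence."""
    features = Counter(tokens)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in motifs:
            # Hypothetical feature naming scheme for motif features.
            features["MOTIF::" + " ".join(gram)] += 1
    return features


# Toy collection: two emails share a disclaimer, the third does not.
docs = [
    "this email is confidential please delete the quarterly report is attached".split(),
    "meeting moved to friday this email is confidential please delete".split(),
    "lunch on tuesday".split(),
]

motifs = find_repeated_ngrams(docs, n=4, min_docs=2)
recoded = [recode(tokens, motifs, n=4) for tokens in docs]
```

In this toy collection the shared disclaimer yields overlapping 4-gram motifs in the first two documents, while the third document keeps its plain bag-of-words representation. A classifier could then be trained on the `recoded` feature counts instead of the raw word counts.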
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gallé, M., Renders, JM. (2014). Boilerplate Detection and Recoding. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_42
DOI: https://doi.org/10.1007/978-3-319-06028-6_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science (R0)