Boilerplate Detection and Recoding

Matthias Gallé & Jean-Michel Renders

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Many information access applications have to tackle natural language texts that contain a large proportion of repeated and mostly invariable patterns, called boilerplates, such as automatic templates, headers, signatures and table formats. These domain-specific standard formulations are usually much longer than traditional collocations or standard noun phrases, and typically cover one or more sentences. Such motifs clearly have a non-compositional meaning, and an ideal document representation should reflect this phenomenon.

We propose a method that detects such motifs automatically and in an unsupervised way, and that enriches the document representation with specific features for these motifs. We show experimentally that this document recoding strategy leads to improved classification on different collections.
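
To make the idea of detection followed by recoding concrete, here is a minimal Python sketch. It is not the authors' method, which the abstract does not specify in detail. It assumes, purely for illustration, that a motif is a fixed-length word n-gram occurring in at least `min_docs` documents, and that recoding means adding one extra feature per motif occurrence to a bag-of-words count. The function names `find_repeated_ngrams` and `recode`, the `MOTIF:` feature prefix, and the sample documents are all hypothetical.

```python
from collections import Counter


def find_repeated_ngrams(docs, n=5, min_docs=2):
    """Return word n-grams that occur in at least `min_docs` documents.

    Assumption: fixed-length n-grams stand in for boilerplate motifs.
    """
    doc_freq = Counter()
    for text in docs:
        words = text.lower().split()
        # Count each n-gram once per document (document frequency).
        seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_freq.update(seen)
    return {gram for gram, df in doc_freq.items() if df >= min_docs}


def recode(text, motifs, n=5):
    """Bag-of-words counts plus one extra feature per motif occurrence."""
    words = text.lower().split()
    features = Counter(words)
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in motifs:
            # Hypothetical feature naming; the motif is kept as one unit.
            features["MOTIF:" + " ".join(gram)] += 1
    return features


# Illustrative documents: two share an e-mail disclaimer, one does not.
docs = [
    "this email and any attachments are confidential please review the budget",
    "meeting moved to friday this email and any attachments are confidential",
    "lunch at noon tomorrow",
]
motifs = find_repeated_ngrams(docs, n=5, min_docs=2)
features = recode(docs[0], motifs, n=5)
```

In this sketch the shared disclaimer produces motif features in the first two documents and none in the third, so a downstream classifier can treat the disclaimer as a single unit rather than as its individual words. Real boilerplate varies in length, so a fixed `n` is a deliberate simplification.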

Author information

Authors and Affiliations

Xerox Research Centre Europe, France
Matthias Gallé & Jean-Michel Renders

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke & Tom Kenter
Centrum Wiskunde en Informatica, Amsterdam, The Netherlands and Delft University of Technology, Delft, The Netherlands
Arjen P. de Vries
University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai
University of Twente, Twente, The Netherlands and Erasmus University Rotterdam, Rotterdam, The Netherlands
Franciska de Jong
SalesPredict, Haifa, Israel
Kira Radinsky
Microsoft Research, Cambridge, UK
Katja Hofmann

Cite this paper

Gallé, M., Renders, JM. (2014). Boilerplate Detection and Recoding. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_42

DOI: https://doi.org/10.1007/978-3-319-06028-6_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science, Computer Science (R0)
