DOI: 10.5555/2051073.2051095
Article

When was it written? Automatically determining publication dates

Published: 17 October 2011

Abstract

Automatically determining the publication date of a document is a complex task, since a document may contain only a few intratextual hints about its publication date. Yet the task has many important applications. The number of digitized historical documents is constantly increasing, but their publication dates are not always properly identified during OCR acquisition. Accurate knowledge of publication dates is crucial for, e.g., studying the evolution of document topics over a given period of time.
In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system combines several individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. It detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.
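
The abstract does not say how the individual systems are combined or scored, so the following is only an illustrative sketch. It assumes each system emits a score per candidate year, that scores are merged by a weighted sum, and that performance is the fraction of excerpts dated to exactly the right year. All function names, weights, and toy scores are hypothetical.

```python
from collections import defaultdict


def combine_year_scores(system_scores, weights):
    """Merge per-year scores from several systems by weighted sum.

    The weighted sum is an assumed combination scheme, not the one
    described in the paper. Systems without a weight default to 1.0.
    """
    combined = defaultdict(float)
    for name, scores in system_scores.items():
        weight = weights.get(name, 1.0)
        for year, score in scores.items():
            combined[year] += weight * score
    return dict(combined)


def predict_year(system_scores, weights):
    """Return the candidate year with the highest combined score."""
    combined = combine_year_scores(system_scores, weights)
    # Iterating over sorted years makes ties resolve to the earliest year.
    return max(sorted(combined), key=lambda year: combined[year])


def exact_year_accuracy(predicted, gold):
    """Fraction of excerpts whose predicted year equals the gold year."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


# Toy input: two hypothetical systems scoring two candidate years.
toy_scores = {
    "supervised": {1900: 0.2, 1901: 0.5},
    "ngram": {1900: 0.4, 1901: 0.1},
}
toy_weights = {"supervised": 1.0, "ngram": 2.0}
toy_prediction = predict_year(toy_scores, toy_weights)
```

Under this metric, the reported figures of 10% and 14% would correspond to `exact_year_accuracy` values of 0.10 and 0.14 on the 300-word and 500-word excerpt sets respectively.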


Cited By

  • (2017) Interactive System for Reasoning about Document Age. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2471-2474. DOI: 10.1145/3132847.3133166. Online publication date: 6-Nov-2017
  • (2016) Digital History Meets Wikipedia. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 17-26. DOI: 10.1145/2910896.2910911. Online publication date: 19-Jun-2016
  • (2014) Survey of Temporal Information Retrieval and Related Applications. ACM Computing Surveys 47:2, pp. 1-41. DOI: 10.1145/2619088. Online publication date: 25-Aug-2014
  • (2012) Large scale analysis of changes in English vocabulary over recent time. Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 2523-2526. DOI: 10.1145/2396761.2398682. Online publication date: 29-Oct-2012

Published In

SPIRE'11: Proceedings of the 18th international conference on String processing and information retrieval
October 2011, 425 pages
ISBN: 9783642245824
Editors: Roberto Grossi, Fabrizio Sebastiani, Fabrizio Silvestri

Publisher

Springer-Verlag, Berlin, Heidelberg
