Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

Sándor Darányi¹,
Peter Wittek¹ &
Milena Dobreva²

Abstract

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out on standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and we test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions to represent document content in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when the classification was reconstructed from abstracts, whereas traditional kernels did slightly better on full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case their thematic coverage. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse the thematic coverage of digital repositories.
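To make the idea of a wavelet-based kernel more concrete, the following is a minimal sketch and not the kernel used in the experiment, whose exact form the abstract does not give. It assumes a Morlet-style translation-invariant wavelet kernel of the form K(x, y) = ∏ cos(1.75 u_i) exp(-u_i² / 2) with u_i = (x_i - y_i) / a. The dilation parameter `a`, the function names and the toy term-weight vectors are all hypothetical and serve only as illustration.

```python
import math


def wavelet_kernel(x, y, a=1.0):
    """Morlet-style translation-invariant wavelet kernel (illustrative assumption).

    x, y: equal-length sequences of term weights.
    a:    dilation parameter (hypothetical default).
    """
    value = 1.0
    for xi, yi in zip(x, y):
        u = (xi - yi) / a
        # Mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2), multiplied over dimensions
        value *= math.cos(1.75 * u) * math.exp(-0.5 * u * u)
    return value


def gram_matrix(docs, kernel):
    """Pairwise kernel values for a list of document vectors."""
    return [[kernel(p, q) for q in docs] for p in docs]


# Hypothetical toy documents: the first two share a dominant term, the third does not.
docs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]

K = gram_matrix(docs, wavelet_kernel)
for row in K:
    print(["%.4f" % v for v in row])
```

A Gram matrix of this kind could be handed to a support vector machine implementation that accepts precomputed kernels. The two thematically similar toy documents receive a much higher kernel value than the dissimilar pair.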

Author information

Authors and Affiliations

Swedish School of Library and Information Science, University of Borås, Borås, Sweden
Sándor Darányi & Peter Wittek
Centre for Digital Library Research, University of Strathclyde, Glasgow, UK
Milena Dobreva

Corresponding author

Correspondence to Sándor Darányi.

About this article

Cite this article

Darányi, S., Wittek, P. & Dobreva, M. Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints. Int J Digit Libr 12, 3–12 (2012). https://doi.org/10.1007/s00799-012-0079-y

Published: 27 January 2012
Issue Date: July 2012
DOI: https://doi.org/10.1007/s00799-012-0079-y
