Abstract
Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and we test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, while the reverse held for full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse the thematic coverage of digital repositories.
Cite this article
Darányi, S., Wittek, P. & Dobreva, M. Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints. Int J Digit Libr 12, 3–12 (2012). https://doi.org/10.1007/s00799-012-0079-y
DOI: https://doi.org/10.1007/s00799-012-0079-y