Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

International Journal on Digital Libraries

Abstract

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out on standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and tests support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions to represent document content in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, while the reverse held for full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case their thematic coverage. Representing specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse the thematic coverage of digital repositories.
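To make the idea of a wavelet-based kernel more concrete, the following is a minimal sketch of one well-known construction, the translation-invariant wavelet kernel built from the mother wavelet h(u) = cos(1.75u) exp(-u²/2). It is an illustration only and should not be read as the kernel or the document representation used in the experiment described above. The function names, the `dilation` parameter, and the toy term-weight vectors are assumptions made for this example.

```python
import math


def mother_wavelet(u):
    """Mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2)."""
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)


def wavelet_kernel(x, y, dilation=1.0):
    """Translation-invariant wavelet kernel.

    k(x, y) = product over dimensions i of h((x_i - y_i) / dilation).
    The dilation value is a free parameter (hypothetical default of 1.0).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    value = 1.0
    for xi, yi in zip(x, y):
        value *= mother_wavelet((xi - yi) / dilation)
    return value


def gram_matrix(vectors, dilation=1.0):
    """Pairwise kernel values for a list of equally long vectors."""
    return [[wavelet_kernel(u, v, dilation) for v in vectors] for u in vectors]


# Made-up term-weight vectors standing in for three short documents.
docs = [
    [0.2, 0.0, 0.5],
    [0.1, 0.1, 0.4],
    [0.9, 0.8, 0.0],
]
K = gram_matrix(docs)
```

A matrix such as `K` could then be handed to any support vector machine implementation that accepts a precomputed kernel matrix. In this toy data the first two vectors are close to each other, so their kernel value is larger than the value between the first and the third.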



Author information

Corresponding author

Correspondence to Sándor Darányi.

About this article

Cite this article

Darányi, S., Wittek, P. & Dobreva, M. Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints. Int J Digit Libr 12, 3–12 (2012). https://doi.org/10.1007/s00799-012-0079-y

  • DOI: https://doi.org/10.1007/s00799-012-0079-y
