Term Similarity and Weighting Framework for Text Representation

Sadiq Sani, Nirmalie Wiratunga, Stewart Massie & Robert Lothian

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6880)

Included in the following conference series:

International Conference on Case-Based Reasoning

Abstract

The expressiveness of natural language is a challenge for text representation, since the same idea can be expressed in many different ways. Terms in a document should therefore not be treated independently of one another, since together they help to disambiguate and establish meaning. Term-similarity measures are often used to improve representation by capturing semantic relationships between terms. Another consideration for representation is the importance of terms: feature selection techniques address this by using statistical measures to quantify term usefulness for retrieval. In this paper we present a framework that combines term similarity and term weighting for text representation. This allows us to study comparatively the impact of term similarity, term weighting, and any synergistic effect that may exist between them. Our study of term similarity is based on approaches that exploit term co-occurrences within document and sentence contexts, whilst term weighting uses the popular Chi-squared test. Our results on text classification tasks show that the combined effect of similarity and weighting is superior to each technique used independently, and that this synergistic effect is obtained regardless of co-occurrence context granularity.
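
The abstract does not give the framework's formulas, so the following is only a minimal sketch of how term similarity and Chi-squared weighting might be combined, not the authors' formulation. It assumes a binary document-term matrix, document-level co-occurrence only, cosine similarity between term columns, two classes, and a multiplicative combination of the two components. The function names `cooccurrence_similarity`, `chi_squared_weights` and `represent` are hypothetical.

```python
import numpy as np

def cooccurrence_similarity(X):
    """Term-term cosine similarity from document-level co-occurrence.

    X is an (n_docs, n_terms) binary matrix. Cosine is an assumed choice.
    """
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0          # avoid division by zero for unused terms
    Xn = X / norms
    return Xn.T @ Xn                 # (n_terms, n_terms), symmetric

def chi_squared_weights(X, y):
    """Chi-squared statistic of each term against binary class labels y."""
    X = (np.asarray(X) > 0).astype(float)
    pos = (np.asarray(y) == 1).astype(float)
    n = X.shape[0]
    a = pos @ X                      # term present, class 1
    b = (1 - pos) @ X                # term present, class 0
    c = pos.sum() - a                # term absent, class 1
    d = (1 - pos).sum() - b          # term absent, class 0
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def represent(X, S, w):
    """Spread term evidence through S, then scale each term by its weight.

    The multiplicative combination is an assumption made for illustration.
    """
    return (X @ S) * w

# Toy corpus: columns are the terms "goal", "match", "vote".
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = [1, 1, 0, 0]                     # 1 = sport, 0 = politics (hypothetical labels)

S = cooccurrence_similarity(X)
w = chi_squared_weights(X, y)
print(represent(X, S, w))
```

In this sketch a document containing only "vote" also receives a non-zero value for "match", because the two terms co-occur in one document, while the Chi-squared weights down-weight "match" relative to the more class-specific "goal" and "vote". Switching to sentence-level co-occurrence would only change how the matrix passed to `cooccurrence_similarity` is built.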


Author information

Authors and Affiliations

School of Computing, The Robert Gordon University, Aberdeen, AB25 1HG, Scotland, UK
Sadiq Sani, Nirmalie Wiratunga, Stewart Massie & Robert Lothian

Editor information

Ashwin Ram, Nirmalie Wiratunga

About this paper

Cite this paper

Sani, S., Wiratunga, N., Massie, S., Lothian, R. (2011). Term Similarity and Weighting Framework for Text Representation. In: Ram, A., Wiratunga, N. (eds) Case-Based Reasoning Research and Development. ICCBR 2011. Lecture Notes in Computer Science, vol 6880. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23291-6_23

DOI: https://doi.org/10.1007/978-3-642-23291-6_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23290-9
Online ISBN: 978-3-642-23291-6
eBook Packages: Computer Science, Computer Science (R0)
