Research on Tibetan hot words, sensitive words tracking and public opinion classification

Guixian Xu¹,
Changzhi Wang¹,
Haishen Yao¹ &
…
Qi Qi¹


Abstract

The rapid development of Tibetan information technology provides rich resources for Tibetan information processing. Corpus construction is foundational work in this field. In this paper, we design a system for collecting Tibetan web data and preprocessing web pages. The system accesses web resources promptly and efficiently and provides a basis for further analysis of Tibetan data. It can be used to build Tibetan corpora and enrich Tibetan digital resources, which alleviates the sparsity of Tibetan corpus data and the lack of resources and makes Tibetan information processing more convenient. Hot words reflect the topics that attract Tibetan people’s attention in a given period. First, the paper proposes a method for reducing the dimensionality of the feature space of Tibetan news text, which effectively reduces the complexity of subsequent processing. Second, a term weighting method based on an improved TFIDF is proposed for Tibetan text information extraction; it assigns different weights to words according to their location in the text in order to extract hot words. For sensitive word discovery and public opinion classification, a sensitive-word thesaurus is compiled manually, and sensitive words are extracted by matching the text against it. Public opinion classification is based on the proposed classification formula and a public opinion thesaurus, and assigns each Tibetan text to one public opinion class. We developed software that automatically collects Tibetan web pages, preprocesses them, extracts text features and hot words, discovers sensitive words, and classifies each Tibetan text into one public opinion class. Experiments show that the Tibetan hot word extraction is effective and that the public opinion classification results are significant.
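
The idea of giving words different weights by location, and of matching text against a manually compiled thesaurus, can be illustrated with a minimal Python sketch. This is not the paper's method: the abstract does not give the improved TFIDF formula, the weight values, or the thesaurus contents. The position names and weights in `POSITION_WEIGHTS`, the smoothed IDF formula, and the function names `position_weighted_tfidf`, `hot_words`, and `find_sensitive` are all assumptions made for illustration, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter

# Hypothetical position weights. The source only says that words in
# different locations receive different weights, not what the values are.
POSITION_WEIGHTS = {"title": 3.0, "body": 1.0}


def position_weighted_tfidf(docs):
    """Score terms with a position-weighted TF-IDF (illustrative only).

    docs: list of dicts mapping a position name ("title", "body") to a
    list of already-segmented tokens. Returns one {term: score} dict
    per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        terms = set()
        for tokens in doc.values():
            terms.update(tokens)
        df.update(terms)

    all_scores = []
    for doc in docs:
        # Term frequency where each occurrence counts as its position weight.
        weighted_tf = Counter()
        for position, tokens in doc.items():
            weight = POSITION_WEIGHTS.get(position, 1.0)
            for token in tokens:
                weighted_tf[token] += weight
        total = sum(weighted_tf.values()) or 1.0

        # Smoothed IDF (an assumed variant, not taken from the paper).
        all_scores.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in weighted_tf.items()
        })
    return all_scores


def hot_words(scores, k):
    """Return the k highest-scoring terms of one document."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


def find_sensitive(tokens, thesaurus):
    """Return tokens found in the thesaurus, in order, without repeats."""
    seen = set()
    found = []
    for token in tokens:
        if token in thesaurus and token not in seen:
            seen.add(token)
            found.append(token)
    return found
```

In this sketch a word that appears in the title counts three times as much as the same word in the body, so title words tend to rank higher among the hot words; any real weighting would have to be taken from the full paper or tuned on data.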



Acknowledgements

This work was supported by the Beijing Social Science Foundation (No. 14WYB040), the First-Class University and First-Class Discipline construction funds of Minzu University of China (No. 2017MDYL12), the National Natural Science Foundation of China (No. 61309012), and the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (No. 2014BAK10B03).

Author information

Authors and Affiliations

Information Engineering College, Minzu University of China, Beijing, 100081, China
Guixian Xu, Changzhi Wang, Haishen Yao & Qi Qi


Corresponding author

Correspondence to Guixian Xu.


About this article

Cite this article

Xu, G., Wang, C., Yao, H. et al. Research on Tibetan hot words, sensitive words tracking and public opinion classification. Cluster Comput 22 (Suppl 4), 9977–9990 (2019). https://doi.org/10.1007/s10586-017-1026-x


Received: 05 April 2017
Revised: 26 June 2017
Accepted: 28 June 2017
Published: 08 July 2017
Issue Date: July 2019
DOI: https://doi.org/10.1007/s10586-017-1026-x
