Abstract
Text classification is the task of assigning text documents to predefined classes using a classifier that learns from training samples, which may be labelled or unlabelled. Binary text classifiers provide a way to separate relevant documents from the rest of a large dataset. However, existing binary text classifiers are prone to overfitting, which limits their effectiveness in practice. They try to find a clear boundary between relevant and irrelevant objects rather than to understand the decision boundary. Normally, the decision boundary cannot be described as a clear boundary because of the numerous uncertainties in text documents. This paper addresses this issue by proposing a model based on a sliding window (SW) technique and the Support Vector Machine (SVM) to deal with the uncertain boundary and to improve the effectiveness of binary text classification. The model sets the decision boundary by dividing the training documents into three distinct regions (positive, boundary, and negative) so that the knowledge extracted to describe relevant information is certain. The model then organizes the training samples for the learning task to build a classifier based on multiple SVMs. Experimental results on the standard Reuters Corpus Volume 1 (RCV1) dataset and TREC topics for text classification show that the proposed model significantly outperforms six state-of-the-art baseline models in binary text classification.
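The three-region split described in the abstract can be illustrated with a small sketch. The abstract does not say how documents are scored or how the sliding window places the two cuts, so everything below is an assumption made for illustration: documents are ranked by a hypothetical relevance score, a fixed-size window slides down the ranking, the positive region ends at the first window that contains an irrelevant document, and the negative region starts after the last window that contains a relevant one. The name `split_regions`, the `Doc` tuple, and the `window` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# (doc_id, relevance score, label) with label 1 = relevant, 0 = irrelevant.
# The score is assumed to come from some earlier ranking step.
Doc = Tuple[str, float, int]


def split_regions(
    docs: Sequence[Doc], window: int = 2
) -> Tuple[List[Doc], List[Doc], List[Doc]]:
    """Split ranked documents into positive, boundary and negative regions.

    Illustrative assumption, not the paper's procedure: the cuts are placed
    where a sliding window over the ranked labels stops being purely
    relevant and where it starts being purely irrelevant.
    """
    if not 1 <= window <= len(docs):
        raise ValueError("window must be between 1 and len(docs)")

    # Rank documents from highest to lowest score.
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    labels = [d[2] for d in ranked]

    # Fraction of relevant documents at each window position.
    purity = [
        sum(labels[i:i + window]) / window
        for i in range(len(labels) - window + 1)
    ]

    # Positive region ends at the first window containing an irrelevant document.
    upper = next((i for i, p in enumerate(purity) if p < 1.0), len(ranked))

    # Negative region starts after the last window containing a relevant document.
    last_mixed = max((i for i, p in enumerate(purity) if p > 0.0), default=-window)
    lower = max(upper, last_mixed + window)

    return ranked[:upper], ranked[upper:lower], ranked[lower:]


if __name__ == "__main__":
    sample = [
        ("d1", 0.9, 1), ("d2", 0.8, 1), ("d3", 0.7, 1), ("d4", 0.6, 0),
        ("d5", 0.5, 1), ("d6", 0.4, 0), ("d7", 0.3, 0), ("d8", 0.2, 0),
    ]
    positive, boundary, negative = split_regions(sample, window=2)
    print([d[0] for d in positive])
    print([d[0] for d in boundary])
    print([d[0] for d in negative])
```

Under these assumptions, the positive and negative regions contain only documents whose neighbourhood in the ranking is unambiguous, while the boundary region collects the mixed stretch in between. How the three regions are then used to train the multiple SVMs is not specified in the abstract and is left out of the sketch.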
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Albqmi, A.R., Li, Y., Xu, Y. (2018). Enhancing Decision Boundary Setting for Binary Text Classification. In: Mitrovic, T., Xue, B., Li, X. (eds) AI 2018: Advances in Artificial Intelligence. AI 2018. Lecture Notes in Computer Science, vol 11320. Springer, Cham. https://doi.org/10.1007/978-3-030-03991-2_72
DOI: https://doi.org/10.1007/978-3-030-03991-2_72
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03990-5
Online ISBN: 978-3-030-03991-2
eBook Packages: Computer Science, Computer Science (R0)