228 International Conference On Engineering Technologies (ICENTE'17)
___________________________________________________________________________________________________________
University, Samsun/Turkey,
{durmus.sahin, oguz.kural, erdal.kilic, armagan.karabina}@bil.omu.edu.tr
___________________________________________________________________________________________________________
E-ISBN: 978-605-67535-4-1 Konya, Turkey, December 7-9, 2017
characters, the stems of all the words were found, and stop words were deleted. We preferred the Porter Stemmer to find the stems of words [18].

B. Term weighting
A term may occur in very few documents, yet appear many times within some of them; term weighting must reveal this distinction. Term Frequency-Inverse Document Frequency (TF.IDF), which is frequently used in text classification, was used as the term weighting method. TF is the number of occurrences of a term in a document. IDF is given in Equation 1:

    IDF = log(N / (A + B))                                        (1)

In Equation 1, N is the number of documents in the whole collection, A is the number of documents in the positive category that contain this term, and B is the number of documents in the negative category that contain this term. The weight of the term is calculated by multiplying TF by IDF.

C. Feature selection
Classification algorithms take a long time to run, and their running time is directly related to the size of the data. The bag-of-words model is preferred in traditional text categorization applications: document vectors are created using all the words in the training set, so documents are represented by tens of thousands of terms. The majority of these terms carry no important information, because they do not affect classification success positively. Instead of using all the terms in the bag of words, a small word vector is preferred; this small vector is the best subset of the bag of words. By means of feature selection, memory waste can be prevented by reducing the vector size, and the runtime of the text classification process is reduced.

Feature selection methods are generally divided into three main groups [19, 20]: filtering, wrapper and embedded methods. In the feature selection step, CHI, one of the filtering methods, is used. CHI is given in Equation 2:

    CHI = N * (AD - CB)^2 / ((A + C)(B + D)(A + B)(C + D))        (2)

where C is the number of documents in the positive category that do not contain the term and D is the number of documents in the negative category that do not contain it.

E. Data set
Table 1 shows the poets and the distribution of their poems. The data set was divided into approximately 60% training and 40% test. Because the numbers of poems per poet are close to each other, this data set is balanced.

Table 1: Poems-poetry distribution

    Poet                    Train   Test
    Adryan Rotica            284     189
    Lamar Cole               241     162
    Richard Allen Beevor     227     152

F. Performance Measure
Performance measurement evaluates the accuracy of the assignments to the relevant class. A sample that is actually labeled positive in the dataset and is classified as positive is called a True Positive (TP). A sample that is actually labeled negative and is classified as negative is called a True Negative (TN). A sample that is actually labeled negative but is classified as positive is called a False Positive (FP). A sample that is actually labeled positive but is classified as negative is called a False Negative (FN). These states are shown in Table 2 as a confusion matrix.

Table 2: Confusion matrix

                  Real
    Predict      C1    C2
    C1           TP    FP
    C2           FN    TN

The most used performance measure in text classification is the F-score, the harmonic mean of precision and recall, which are given in Equation 3 and Equation 4 respectively:

    Precision = TP / (TP + FP)                                    (3)
    Recall    = TP / (TP + FN)                                    (4)
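The weighting, selection and evaluation steps above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the toy corpus, the cutoff k, and the confusion-matrix numbers are made-up assumptions, and only N, A, B, C, D follow the definitions given with Equations 1 and 2.

```python
import math

# Toy two-class corpus (tokens, label); 1 = positive class, 0 = negative.
docs = [
    (["moon", "light", "sea"], 1),
    (["moon", "heart", "night"], 1),
    (["engine", "steel", "night"], 0),
    (["engine", "light", "city"], 0),
]
N = len(docs)                          # documents in the whole collection
P = sum(1 for _, y in docs if y == 1)  # positive documents
Q = N - P                              # negative documents
vocab = sorted({t for toks, _ in docs for t in toks})

def contain(term):
    """A: positive docs containing the term, B: negative docs containing it."""
    A = sum(1 for toks, y in docs if y == 1 and term in toks)
    B = sum(1 for toks, y in docs if y == 0 and term in toks)
    return A, B

def tf_idf(term, tokens):
    """Equation 1: weight = TF * log(N / (A + B))."""
    A, B = contain(term)
    return tokens.count(term) * math.log(N / (A + B))

def chi(term):
    """Equation 2: CHI = N*(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    A, B = contain(term)
    C, D = P - A, Q - B                # docs NOT containing the term
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / den if den else 0.0

# Filtering-style feature selection: keep only the k highest-scoring terms.
k = 3
selected = sorted(vocab, key=chi, reverse=True)[:k]

# Equations 3 and 4 on a made-up confusion matrix.
TP, FP, FN, TN = 45, 12, 15, 48
precision = TP / (TP + FP)             # Equation 3
recall = TP / (TP + FN)                # Equation 4
f_score = 2 * precision * recall / (precision + recall)
```

In this sketch, terms that occur in only one class ("moon", "engine") get the maximum CHI score and survive selection, while class-neutral terms such as "night" score 0 and are dropped; that is exactly the distinction the filtering method exploits.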
Figure 2: Success of classification algorithms

When 30 features are used, the best classifier is NB, while the worst is KNN. As the number of features increases, the success rates of all classifiers except KNN go up. The best classification rate is an F-score of 0.7067, obtained with SMO at 700 features. The success of RF and C4.5 is similar.

As a result, English poems were classified according to their poets. If a larger data set were used, more effective text classification results could be obtained. Instead of traditional text classification methods, a semantic text classification method may be preferred, because poets use very distinctive sentences while writing poetry. As future work, we aim to generate a different poetry data set to classify the genre of poems with text classification.

REFERENCES
[1] A. M. K. Chae, A. Alsadoon, P. W. C. Prasad and S. Sreedharan, "Spam filtering email classification (SFECM) using gain and graph mining algorithm," 2017 2nd International Conference on Anti-Cyber Crimes (ICACC), Abha, pp. 217-222, 2017.
[2] T. Vyas, P. Prajapati and S. Gadhwal, "A survey and evaluation of supervised machine learning techniques for spam e-mail filtering," 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, pp. 1-7, 2015.
[3] S. A. …, "… a genetic algorithm using tagged-…," Expert Systems with Applications, 38(4), pp. 3407-3415, 2011.
[4] M. Yasdi and B. Diri, "Author recognition by Abstract Feature Extraction," 2012 20th Signal Processing and Communications Applications Conference (SIU), Mugla, pp. 1-4, 2012. doi: 10.1109/SIU.2012.6204690
[5] …, "… different term weighting schemes," 2013 21st Signal Processing and Communications Applications Conference (SIU), Haspolat, pp. 1-4, 2013. doi: 10.1109/SIU.2013.6531190
[6] …, pp. …-92, 2010.
[7] … and B. Diri, "Sentence selection methods for text summarization," 2014 22nd Signal Processing and Communications Applications Conference (SIU), Trabzon, pp. 192-195, 2014. doi: 10.1109/SIU.2014.6830198
[8] …, pp. 37-51, 2005.
[9] C. Parlak, B. …
[10] M. Meral and B. Diri, "Sentiment analysis on Twitter," 2014 22nd Signal Processing and Communications Applications Conference (SIU), Trabzon, pp. 690-693, 2014. doi: 10.1109/SIU.2014.6830323
[11] …, "… classification with word and document vectors," 2017 25th Signal Processing and Communications Applications Conference (SIU), Antalya, pp. 1-4, 2017. doi: 10.1109/SIU.2017.7960145
[12] …, "… lyrics," 2016 24th Signal Processing and Communication Application Conference (SIU), Zonguldak, pp. 101-104, 2016. doi: 10.1109/SIU.2016.7495686
[13] C. Akarsu and B. Diri, "Turkish TV rating prediction with Twitter," 2016 24th Signal Processing and Communication Application Conference (SIU), Zonguldak, Turkey, pp. 345-348, 2016. doi: 10.1109/SIU.2016.7495748
[14] M. F. Amasyali and T. Yildirim, "Automatic text categorization of news articles," Proceedings of the IEEE 12th Signal Processing and Communications Applications Conference, pp. 224-226, 2004. doi: 10.1109/SIU.2004.1338299
[15] …, "… articles by using Turkish grammatical features," 2012 20th Signal Processing and Communications Applications Conference (SIU), Mugla, pp. 1-4, 2012. doi: 10.1109/SIU.2012.6204565
[16] …, "… taken from the online news sites," 2015 23rd Signal Processing and Communications Applications Conference (SIU), Malatya, pp. 363-366, 2015. doi: 10.1109/SIU.2015.7129834
[17] https://www.poemhunter.com/ebooks/ Last access: October 2017.
[18] https://tartarus.org/martin/PorterStemmer/ Last access: October 2017.
[19] …, "Hybrid feature selection for text classification," Turkish Journal of Electrical Engineering and Computer Sciences, vol. 20, pp. 1296-1311, 2012.
[20] …, "Text classification using genetic algorithm oriented latent semantic features," Expert Systems with Applications, vol. 41, pp. 5938-5947, 2014.
[21] https://www.cs.waikato.ac.nz/ml/weka/ Last access: October 2017.