
1 Introduction

Authorship analysis is the process of examining the characteristics of a piece of work to identify its authors [1]. The forerunners of authorship attribution (AA) are Mendenhall, who studied the plays of Shakespeare in 1887 [2], and Mosteller and Wallace, who studied the disputed authorship of the Federalist Papers in 1964 [3]. Besides identifying the authors of works published anonymously, popular applications of AA techniques include authenticating documents, detecting plagiarism, and assisting forensic investigations.

In recent years, an emerging application of AA has been the analysis of online texts, due to the increasing popularity of Internet-based communications and the need to address problems such as online bullying and anonymous threats. However, this application raises novel challenges, as online texts are written in an informal style and are often short, thus providing less information about their authors. The accuracy achieved with well-established AA features such as function words and syntactic and lexical markers suffers when they are applied to short-form messages [4], as many of the earliest studies were evaluated on long formal texts such as books. It is thus a challenge to find new features and methods that are applicable to informal or short texts.

Studies on AA have focused on various types of features that are either syntactic or semantic [5]. Patterns based on syntactic features of text are considered reliable because they are used unconsciously by authors. Syntactic features include frequencies of word ngrams, character ngrams, and function words [1]. Baayen, van Halteren, and Tweedie used the frequencies of syntactic rewrite rules for AA, based on two syntactically annotated samples taken from the Nijmegen corpus [6]. Semantic features take advantage of words’ meanings and their similarity, as exemplified by Clark and Hannon, whose classifier quantifies how often an author uses a specific word instead of its synonyms [7]. Moreover, state-of-the-art systems for AA often combine a wide range of features to achieve higher accuracy. For example, the JStylo system offers more than 50 configurable features [8].

Syntactic markers, however, have been less studied because of their language-dependent nature. This paper studies the complex linguistic information carried by part-of-speech (POS) patterns as a novel approach for authorship attribution of informal texts. The hypothesis is that POS patterns appearing more frequently in texts could accurately characterize authors’ styles.

The contributions of this paper are threefold. First, we define a novel feature for identifying authors based on fixed- or variable-length POS patterns (ngrams or skip-grams). A signature is defined for each author as the intersection of the top k most frequent POS patterns found in his or her texts that are less frequent in texts by other authors. With this method, the system automatically chooses its most appropriate features rather than having them predefined in advance. Second, a process is proposed to use these signatures to perform AA. Third, an experimental evaluation using blog posts of 10 authors randomly chosen from the Blog Authorship Corpus [9] shows that the proposed approach is effective at inferring the authors of informal texts. Moreover, the experimental study answers the questions of how many patterns are needed, whether POS skip-grams or ngrams perform better, and whether fixed-length or variable-length patterns should be used.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed approach. Section 4 presents the experimental evaluation. Finally, Sect. 5 draws the conclusions.

2 Related Work

The use of authorship attribution techniques goes back to the 19th century with studies of Shakespeare’s work. This section concentrates on studies using POS tagging or targeting informal short texts.

Stamatatos et al. exclusively used natural language processing (NLP) tools for their three-level text analysis. Along with the token and analysis levels, the phrase level concentrated on the frequencies of single POS tags. They analyzed a corpus of 300 texts by 10 authors, written in Greek. Their method achieved around 80 % accuracy by excluding some less significant markers [10].

Gamon combined syntactic and lexical features to accurately identify authors [11]. He used features such as the frequencies of ngrams, function words, and POS trigrams from a set of eight POS tags obtained using the NLPWin system. Over 95 % accuracy was obtained. However, its limitation was its evaluation using only three texts, written by three authors [11].

More recently, Sidorov et al. introduced syntactic ngrams (sngrams) as a feature for AA. Syntactic ngrams are obtained by considering the order of elements in syntactic trees generated from a text, rather than by finding n contiguous elements appearing in a text. They compared the use of sngrams with ngrams of words, POS, and characters. The 39 documents by three authors from Project Gutenberg were classified using SVM, J48 and Naïve Bayes implementations provided by the WEKA library. The best results were obtained by SVM with sngrams [12]. The limitations of this work are an evaluation with only three authors, a predetermined length of ngrams, and that all 11,000 ngrams/sngrams were used.

Variations of ngrams were also considered. For example, skip-grams were used by García-Hernández et al. [13]. Skip-grams are ngrams in which some words in a sentence may be skipped, up to a threshold named the skip step. Ngrams are the special case of skip-grams where the skip step is equal to 0. The number of skip-grams is very large and their discovery requires complex algorithms [12]. To reduce their number, a cut-off frequency threshold is used [13].

POS ngrams have also been used for problems related to AA. Litvinova et al. used the frequencies of 227 POS bigrams for predicting personality [14].

Unlike previous work using POS patterns (ngrams or skip-grams), the approach presented here considers not only fixed length POS patterns but also variable length POS patterns. Another distinctive characteristic is that it finds only the k most frequent POS patterns in each text (where k is set by the user), rather than using a large set of patterns or using a predetermined cut-off frequency threshold. This allows the proposed approach to use a very small number of patterns to create a signature for each author, unlike many previous works that compute the frequencies of hundreds or thousands of ngrams or skip-grams.

3 The Proposed Approach

The proposed approach takes as input a training corpus \(C_m\) of texts written by m authors. Let \(A = \{a_1,a_2, \ldots , a_m\}\) denote the set of authors. Each author \(a_i\) (\(1 \le i \le m\)) has a set of z texts \(T_i=\{t_1,t_2, \ldots , t_z\}\) in the corpus. The proposed approach is applied in three steps.

Preprocessing. The first step is to prepare blogs from the corpus so that they can be used for generating author signatures. All information that does not carry an author’s style is removed, such as images, tables, videos, and links. In addition, each text is stripped of punctuation and split into sentences using the RiTa NLP library (http://www.rednoise.org/rita/). Then, every text is tagged using the Stanford NLP Tagger (http://nlp.stanford.edu/software/), as it produced 97.24 % accuracy on the Penn Treebank Wall Street Journal corpus [15]. Since the main focus is analyzing how sentences are constructed by authors rather than their choice of words, the words in texts are discarded and only the sequences of POS tags are kept. Thus, each text becomes a set of sequences of POS tags. For example, Table 1 shows the transformation of six sentences from an author in the corpus.

Table 1. An example of blog text transformation.
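A minimal sketch of this preprocessing pipeline is shown below. It uses a naive regex sentence splitter and a toy lookup tagger purely for illustration; the actual approach relies on the RiTa library and the Stanford POS Tagger, and the names `TOY_LEXICON` and `preprocess` are hypothetical.

```python
import re

def split_sentences(text):
    """Naive sentence splitter, standing in for the RiTa library."""
    return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

def strip_punctuation(sentence):
    return re.sub(r'[^\w\s]', '', sentence)

# Toy lookup tagger, standing in for the Stanford POS Tagger.
TOY_LEXICON = {'the': 'DT', 'a': 'DT', 'dog': 'NN', 'cat': 'NN',
               'runs': 'VBZ', 'sleeps': 'VBZ', 'quickly': 'RB'}

def tag_sentence(sentence):
    # Unknown words default to NN, a common tagger fallback.
    return [TOY_LEXICON.get(w.lower(), 'NN') for w in sentence.split()]

def preprocess(text):
    """Turn a raw text into a list of POS tag sequences (words discarded)."""
    return [tag_sentence(strip_punctuation(s)) for s in split_sentences(text)]

print(preprocess("The dog runs quickly. A cat sleeps!"))
# [['DT', 'NN', 'VBZ', 'RB'], ['DT', 'NN', 'VBZ']]
```

After this step, each blog text is reduced to tag sequences such as those shown in Table 1.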

Signature Extraction. The second step of the proposed approach is to extract a signature for each author, defined as a set of part-of-speech patterns (POSP) annotated with their respective frequencies. Signature extraction has four parameters: the number k of POS patterns (ngrams or skip-grams) to be found, the minimum pattern length n, the maximum length x, and the maximum gap maxgap allowed between POS tags. Frequent patterns of POS tags are extracted from each text. The hypothesis is that each text may contain patterns of POS tags unconsciously left by its author, representing his/her writing style, which could be used to identify that author accurately. For each text t, the k most frequent POS patterns are extracted using a general-purpose sequential pattern mining algorithm [16]. Let POS denote the set of POS tags. Consider a sentence \(w_1, w_2, \ldots w_y\) consisting of y part-of-speech tags, and a parameter maxgap (a positive integer). An n-skip-gram is an ordered list of tags \( w_{i_1},w_{i_2},\ldots w_{i_n}\) where \(i_1, i_2, \ldots i_n\) are integers such that \(0 < i_j - i_{j-1} \le maxgap + 1\) for \(1 < j \le n\). An n-skip-gram respecting the constraint \(maxgap = 0\) (i.e. no gaps are allowed) is said to be an ngram. For a given text, the frequency of a sequence seq is the number of sequences (sentences) from the text containing seq. Similarly, the relative frequency of a sequence is its frequency divided by the number of sequences in the text. For example, the frequency of the 5-skip-gram \(\langle NNP,VBP,VBP,PRP,RB \rangle \) is 1 in the transformed sentences of Table 1, while the frequency of the 2-skip-gram \(\langle DT,NN \rangle \) is 4 (this pattern appears in the second, third, fifth, and sixth transformed sentences).
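The extraction of the k most frequent POS patterns can be sketched as follows. This naive enumeration stands in for the far more efficient general-purpose sequential pattern mining algorithm of [16]; the function names and the frequency bookkeeping are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def patterns_in_sentence(tags, n, x, maxgap):
    """All skip-grams of length n..x in one sentence respecting maxgap."""
    found = set()
    for length in range(n, x + 1):
        for idx in combinations(range(len(tags)), length):
            # Consecutive chosen positions may differ by at most maxgap + 1
            # (for maxgap = 0 this yields contiguous ngrams).
            if all(idx[j] - idx[j - 1] <= maxgap + 1 for j in range(1, length)):
                found.add(tuple(tags[i] for i in idx))
    return found

def top_k_patterns(sentences, k, n=1, x=4, maxgap=0):
    """The k most frequent POS patterns of a text with their relative
    frequency (fraction of sentences containing each pattern)."""
    freq = Counter()
    for tags in sentences:
        # A set per sentence: frequency counts sentences, not occurrences.
        freq.update(patterns_in_sentence(tags, n, x, maxgap))
    total = len(sentences)
    return {pat: c / total for pat, c in freq.most_common(k)}
```

For example, on the sentences `[['DT','NN','VBZ'], ['DT','JJ','NN'], ['NN']]`, the bigram `('DT','NN')` has relative frequency 1/3 with `maxgap=0` (contiguous only) but 2/3 with `maxgap=1` (it also matches `DT JJ NN` with one tag skipped).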

In the following, the term part-of-speech pattern of a text t, abbreviated as \((POSPt)_{n,x}^k\), or patterns, is used to refer to the k most frequent POS patterns extracted from a text, annotated with their relative frequency, respecting the maxgap constraint, and having a length no less than n and not greater than x.

Then, the POS patterns of an author \(a_i\) in all his/her texts are found. They are denoted as \((POSPa_i)_{n,x}^k \) and defined formally as the union of the POS patterns found in all of his/her texts: \((POSPa_i)_{n,x}^k = \bigcup _{t \in T_i}{(POSPt)_{n,x}^k}\).

Table 2. Part-of-speech patterns and their relative frequencies.

Then, the signature of each author \(a_i\) is extracted. The signature \(s_{a_i}\) of author \(a_i\) is the intersection of the POS patterns of his/her texts \(T_i\). The signature is formally defined as: \((s_{a_i})_{n,x}^k = \bigcap _{t \in T_i} (POSPt)_{n,x}^k\).
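Assuming each text's patterns are represented as a dictionary mapping a POS pattern to its relative frequency, the signature could be computed as below. Averaging the frequencies over the author's texts is an assumption made for this sketch; the paper itself only specifies the set intersection.

```python
def author_signature(per_text_patterns):
    """Signature of one author: the patterns common to all of his/her texts.

    per_text_patterns: list of {pattern: relative_frequency} dicts, one per
    text (e.g. the output of a top-k pattern extraction step).
    """
    common = set.intersection(*(set(d) for d in per_text_patterns))
    # Annotate each common pattern with its frequency averaged over texts
    # (every text contains it, since it survived the intersection).
    return {p: sum(d[p] for d in per_text_patterns) / len(per_text_patterns)
            for p in common}

texts = [{('DT', 'NN'): 0.8, ('NN',): 0.9, ('PRP', 'VBP'): 0.3},
         {('DT', 'NN'): 0.7, ('NN',): 0.95}]
print(author_signature(texts))  # keeps ('DT','NN') and ('NN',) only
```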

For instance, the part-of-speech patterns \((POSP_{1029959})_{1,5}^4\) of blogger 1029959 from our corpus for \(maxgap = 3\) are shown in Table 2. In this table, POS patterns are ordered by decreasing frequency. It can be seen that the patterns Noun (NN) and Determiner - Noun (DT-NN) appear in four of the six sentences shown in Table 1. Note that the relative frequency of each pattern is calculated over all texts containing the pattern.

Moreover, the POS patterns of an author \(a_i\) may contain patterns with unusual frequencies that truly characterize the author’s style, but also patterns representing common sentence structures of the English language. To tell these two cases apart, a set of reference patterns and their frequencies is extracted, to be used with each signature for authorship attribution. This set of reference patterns is extracted with respect to each author \(a_i\) by computing the union of the POS patterns of all other authors. This set is formally defined as follows. The common POS patterns of all authors excluding an author \(a_i\) is the union of all the other POSPa, that is: \((CPOSa_{i})_{n,x}^k = \bigcup _{a \in A \wedge a \not = a_i} (POSP a)_{n,x}^k \).

For example, the common POS patterns computed using authors 206953 and 2369365, and excluding author 1029959, are the patterns DT, IN and PRP. The relative frequencies (%) of these patterns for author 206953 are 67.9 %, 69.6 % and 79.1 %, while the relative frequencies of these patterns for author 2369365 are 64.2 %, 63.0 % and 58.7 %. Note that the relative frequency of each pattern in CPOS is calculated over all texts containing the pattern.

The revised signature of an author \(a_i\), after removing the common POS patterns of all authors excluding \(a_i\), is defined as: \((s_{a_i}^\prime )^k_{n,x} = (s_{a_i})^k_{n,x} \setminus (CPOSa_{i})_{n,x}^k\). When the revised signature of each author \(a_1, a_2, \ldots , a_m\) has been extracted, the collection of revised signatures \(s^{\prime ,k}_{n,x}=\{ (s_{a_1}^\prime )^k_{n,x}, (s_{a_2}^\prime )^k_{n,x}, \ldots , (s_{a_m}^\prime )^k_{n,x} \}\) is saved.
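The revised signature is then a simple set difference. The sketch below assumes the signature is a dictionary of pattern frequencies and the other authors' pattern collections are plain Python sets; the names are illustrative.

```python
def revised_signature(signature, other_author_patterns):
    """Remove from a signature the patterns that also occur for other
    authors (i.e. subtract the CPOS set).

    signature: {pattern: relative_frequency} for the target author.
    other_author_patterns: iterable of pattern sets, one per other author.
    """
    cpos = set().union(*other_author_patterns)  # union over other authors
    return {p: f for p, f in signature.items() if p not in cpos}

sig = {('DT', 'NN'): 0.75, ('NN', 'VBZ', 'RB'): 0.4}
others = [{('DT', 'NN'), ('IN',)}, {('PRP',)}]
print(revised_signature(sig, others))  # only ('NN','VBZ','RB') survives
```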

Authorship Attribution. The third step of the proposed approach is to use the generated signatures to perform authorship attribution, that is, to identify the author \(a_u\) of an anonymous text \(t_u\) that was not used for training. The algorithm takes as input an anonymous text \(t_u\), the set of signatures \(s^{\prime ,k}_{n,x}\), and the parameters n, x and k. The algorithm first extracts the part-of-speech patterns of the unknown text \(t_u\) with their relative frequencies. Then, it compares the patterns found in \(t_u\) and their frequencies with the patterns in the signature of each author using a similarity function. Each author and the corresponding similarity are stored as a tuple in a list. Finally, the algorithm returns this list sorted by decreasing order of similarity. This list represents a ranking of the most likely authors of the anonymous text \(t_u\). Various metrics may be used to define similarity functions. In this work, the Pearson correlation was chosen as it provided better results in initial experiments.
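This attribution step can be sketched as follows, under the assumption (not specified in the paper) that patterns absent from one side are treated as having frequency 0 when computing the Pearson correlation:

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    du = sqrt(sum((a - mu) ** 2 for a in u))
    dv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (du * dv) if du and dv else 0.0

def rank_authors(text_patterns, signatures):
    """Rank candidate authors of an anonymous text by similarity.

    text_patterns: {pattern: relative_frequency} of the anonymous text.
    signatures: {author: {pattern: relative_frequency}} revised signatures.
    Returns (author, similarity) tuples, most likely author first.
    """
    ranking = []
    for author, sig in signatures.items():
        keys = sorted(set(text_patterns) | set(sig))
        u = [text_patterns.get(p, 0.0) for p in keys]
        v = [sig.get(p, 0.0) for p in keys]
        ranking.append((author, pearson(u, v)))
    return sorted(ranking, key=lambda t: t[1], reverse=True)
```

A usage example: with `signatures = {'a1': {('DT','NN'): 0.8, ('NN',): 0.9}, 'a2': {('PRP','VBP'): 0.7, ('IN',): 0.4}}` and an anonymous text whose patterns are `{('DT','NN'): 0.75, ('NN',): 0.85}`, the first element of the returned ranking is `'a1'`.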

4 Experimental Evaluation

Experiments were conducted to assess the effectiveness of the proposed approach for authorship attribution using either fixed- or variable-length ngrams or skip-grams of POS tags. Our corpus consists of 609 posts from 10 bloggers, obtained from the Blog Authorship Corpus [9]. Blog posts are written in English and were originally collected from the Blogger website (https://www.blogger.com/). The ten authors were chosen randomly. Note that non-verbal expressions (emoticons, smileys, or interjections used in web-blogs or chatrooms such as lol, hihi, and hahaaa) were not removed, because the tagger returned consistent part-of-speech tags for them across the different blogs. The resulting corpus has a total of 265,263 words and 19,938 sentences. Details are presented in Table 3.

Table 3. Corpus statistics.

Then, each text was preprocessed (as explained in Sect. 3). Eighty percent (80 %) of each text was used to extract each author’s signature (training), and the remaining 20 % was used to perform authorship attribution by comparing each test text \(t_u\) with each author’s signature (testing). This produced a ranking from the most likely author to the least likely author, for each text.
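The split might be sketched as follows, assuming the first 80 % of each text's sentences form the training portion (the paper does not specify how the 80 % is selected):

```python
def split_text(sentences, train_frac=0.8):
    """Split one text's POS tag sequences into training and test portions.
    Taking the leading fraction is an assumption for this sketch."""
    cut = int(len(sentences) * train_frac)
    return sentences[:cut], sentences[cut:]

train, test = split_text([['DT', 'NN']] * 10)
print(len(train), len(test))  # 8 2
```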

4.1 Influence of Parameters n, x, k, and maxgap on Overall Results

Recall that the proposed approach takes four parameters as input: the minimum and maximum lengths n and x of part-of-speech patterns, the maximum gap maxgap, and the number k of patterns to be extracted from each text. The influence of these parameters on authorship attribution success was first evaluated. For our experiment, parameter k was set to 50, 100, and 250. For each value of k, the length of the part-of-speech patterns was varied from \(n=1\) to \(x=4\). Moreover, the maxgap parameter was set to 0 (ngrams), and from 1 to 3 (skip-grams). For each combination of parameters, we measured the success rate, defined as the number of correct predictions divided by the number of predictions. Tables 4, 5, 6, and 7 respectively show the results obtained for \(maxgap = 0, 1, 2, 3\), for various values of k, n and x. Furthermore, in these tables, results are also presented by rank. The row \(R_z\) indicates the number of texts where the author was predicted as one of the z most likely authors, divided by the total number of texts (success rate). For example, \(R_3\) indicates the percentage of texts where the author is among the three most likely authors as predicted by the proposed approach. Since there are 10 authors in the corpus, results are shown for \(R_z\) with z varied from 1 to 10.
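The \(R_z\) success rates can be computed from the per-text rankings as in this sketch (the function name and input format are assumptions):

```python
def success_rates(predictions):
    """Compute R_z success rates from ranked predictions.

    predictions: list of (true_author, ranked_author_list) pairs, where
    ranked_author_list orders candidates from most to least likely.
    Returns {z: R_z}: the fraction of texts whose true author appears
    among the z top-ranked candidates.
    """
    m = len(predictions[0][1])  # number of candidate authors
    rates = {}
    for z in range(1, m + 1):
        hits = sum(1 for truth, ranking in predictions if truth in ranking[:z])
        rates[z] = hits / len(predictions)
    return rates

preds = [('a', ['a', 'b', 'c']),
         ('b', ['c', 'b', 'a']),
         ('c', ['a', 'b', 'c'])]
print(success_rates(preds))  # {1: 0.333..., 2: 0.666..., 3: 1.0}
```

Note that \(R_z\) is non-decreasing in z, and \(R_m\) is always 1 when every ranking lists all m authors.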

The first observation is that the best overall results are obtained by setting \(n = 2, x = 2\), \(k = 250\) and \(maxgap = 0\). For these parameters, the author of an anonymous text is correctly identified 73.3 % of the time, and 86.6 % as one of the two most likely authors (\(R_2\)).

The second observation is that excellent results can be achieved using few patterns to build signatures (\(k = 250\)). Note that our approach was also tested with other values of k, such as 50, but they did not provide better results than \(k = 250\) (results are not shown due to space limitations). This is interesting, as it means that signatures can be extracted using a very small number of the k most frequent POS patterns (as low as \(k=100\)) and still characterize the writing style of authors well. This is in contrast with previous works, which generally used a large number of patterns to define an author’s signature. For example, Argamon et al. computed the frequencies of 685 trigrams [17], and Sidorov et al. computed the frequencies of 400/11,000 ngrams/sngrams [12]. By Occam’s razor, it can be argued that models with fewer patterns (i.e. simpler models) may be less prone to overfitting.

Third, it can be observed that good results can also be obtained using POS skip-grams. This is interesting since skip-grams of words or POS tags have received considerably less attention than ngrams in previous studies. Moreover, to our knowledge, no studies had previously compared the results obtained with ngrams and skip-grams for authorship attribution of informal and short texts. Fourth, it is found that using patterns of fixed length (bigrams) is better than using patterns of variable length. This answers the important question of whether fixed-length or variable-length POS patterns should be used, a question not studied in previous works.

Table 4. Overall classification results using ngrams (maxgap = 0)
Table 5. Overall classification results using skip-grams with maxgap = 1
Table 6. Overall classification results using skip-grams with maxgap = 2
Table 7. Overall classification results using skip-grams with maxgap = 3

4.2 Influence of Parameters n, x and k on Authorship Attribution for Each Author

The previous subsection studied the influence of parameters n, x and k on the ranking of authors for all anonymous texts. This subsection analyzes the results for each author separately. Tables 8a and b respectively show the success rates for each author (rank \(R_1\)) for \(maxgap = 0\), with \(k = 100\) and \(k = 250\), when the other parameters are varied.

Table 8. Success rate per author using skip-grams with maxgap = 0 (ngrams)
Table 9. Accuracy per author using ngrams (maxgap = 0)

It can be observed that for most authors, at least 66.7 % of texts are correctly attributed. For example, for \(n=2\), \(x=2\) and \(k=250\), four authors have all their texts correctly identified, four have 66.7 % of their texts correctly classified, and two have 33.3 % of their texts correctly classified. Overall, it can thus be concluded that the proposed approach performs very well.

It can also be seen that some authors were harder to classify (authors 3420481 and 3454871). After investigation, we found that the reason for the incorrect classifications is that both authors posted the same very long blog post, which represents about 50 % of the length of their respective corpora. This content has a distinct writing style, which suggests that it was not written by either of these authors. Thus, it had a great influence on their signatures and led to the poor classification of these authors.

In contrast, some authors were very easily identified, with a high success rate and high accuracy (cf. Table 9 for accuracies). For example, texts by authors 3182387 and 3298664 were almost always correctly classified for the parameter values tested in these tables. The reason is that these authors have distinctive writing styles in terms of part-of-speech patterns.

Table 10. Summary of best success rate results (Best \(R_1\) Rank)

5 Conclusions

A novel approach using fixed- or variable-length patterns of part-of-speech skip-grams or ngrams as features was presented for blog authorship attribution. An experimental evaluation using blog posts from 10 blog authors has shown that authors are accurately classified, with more than a 73.3 % success rate and an average accuracy of 94.7 %, using a small number of POS patterns (e.g. \(k = 250\)), and that it is unnecessary to create signatures with a large number of patterns. Moreover, it was found that fixed-length POS bigrams provided better results than longer POS ngrams or a larger gap between POS tags, and that using fixed-length patterns is preferable to using variable-length patterns. To give the reader a feel for the relative efficacy of this approach on traditional long texts, we present a summary of results on a corpus of 30 books by ten 19th century authors (cf. Table 10 for comparative results) [18, 19]. In future work, we plan to develop other features and to evaluate the proposed features on other types of texts.