
Paper

Comparing intermittency and network measurements of words and their dependence on authorship

Diego Raphael Amancio, Eduardo G Altmann, Osvaldo N Oliveira Jr and Luciano da F Costa

Published 16 December 2011 © IOP Publishing and Deutsche Physikalische Gesellschaft
Citation: Diego Raphael Amancio et al 2011 New J. Phys. 13 123024. DOI: 10.1088/1367-2630/13/12/123024


Abstract

Many features of texts and languages can now be inferred from statistical analyses using concepts from complex networks and dynamical systems. In this paper, we quantify how topological properties of word co-occurrence networks and intermittency (or burstiness) in word distribution depend on the style of authors. Our database contains 40 books by eight authors who lived in the nineteenth and twentieth centuries, for which the following network measurements were obtained: the clustering coefficient, average shortest path lengths and betweenness. We found that the two factors with the strongest dependence on authorship were the skewness in the distribution of word intermittency and the average shortest paths. Other factors, such as the betweenness and the Zipf's law exponent, show only a weak dependence on authorship. Also assessed was the contribution of each measurement to authorship recognition using three machine learning methods. The best performance, about 65% accuracy, was obtained by combining complex network and intermittency features with the nearest-neighbor algorithm. From a detailed analysis of the interdependence of the various metrics, it is concluded that the methods used here are complementary in providing short- and long-scale perspectives on texts, which are useful for applications such as the identification of topical words and information retrieval.


1. Introduction

The application of ideas from statistical physics to text analysis has a long tradition since Shannon's usage of entropy as the central concept in information theory [1]. In recent years, physicists have proposed new approaches based on concepts from complex networks [2-15] and dynamical systems [16-20]. In the former, texts are represented as complex networks with words (nodes) being connected (links) using procedures that depend on their syntactic or semantic relationships [2]. Several of these networks share topological properties such as the scaling in degree [3, 4] and the small-world feature [5, 6]. Co-occurrence networks, where adjacent words are linked to each other, are probably the most popular for applications owing to their ability to capture important syntactic and semantic aspects of texts with a straightforward construction procedure. These networks have been employed to evaluate writing quality [7] and machine translations [8, 9], to generate and evaluate summaries [10], to construct spell checkers [11], to recognize patterns in poetry [12] and prose [13, 14] and to study general properties of written language [15]. While co-occurrence networks focus mainly on short scales, an increasingly popular approach addresses longer text scales [16-18, 21-24]. The usefulness of the latter approach stems from the finding that topical words are unevenly distributed along the text when compared with a random process or with function words. This observation can be quantitatively investigated using different analogies and measures familiar to the communities of statistical physics and dynamical systems, including level statistics [22, 24], burstiness [17-19], entropy [21] and intermittency measures [20]. The author dependence of the features mentioned above has been noted [14, 17], but very little work has so far been devoted to quantifying the extent of this dependence and to testing its usefulness for the automatic detection of authors.

In the field of authorship recognition (or stylometry), one tries to identify the author of documents whose authorship is unknown [25]. Some simple quantitative proposals, such as the use of word length to distinguish between authors, go back to the mid-nineteenth century (see [26] for a historical account). One important recent contribution was by Mosteller and Wallace [27], who showed that the frequency of function words (such as 'any', 'from', 'an', 'there' and 'upon') can be used to characterize the style of authors. This feature is so strong that even letter-pair frequencies can provide good distinguishability between authors [26]. Frequent words are also responsible for the success of the approach, proposed and investigated by physicists, that consists in quantifying the similarity between two books based on the distance between their word-frequency rankings [28-30]. More recently, new features have been proposed: word length and sentence length; frequency of punctuation marks and contractions; and frequency of graphemes, collocations and words. A summary of these recent results is given in [31].

In this paper, we investigate how metrics of complex networks and intermittency, familiar to physicists, depend on the style of authors. We start by quantifying the metrics for each word (section 2). These are then used to define global features for each book, which are tested according to their efficiency in authorship classification algorithms (section 3). Finally, we discuss the importance of each feature (section 4). The primary goal of this paper is not to improve state-of-the-art methods of automatic authorship recognition. Instead, we wish to estimate the authorship dependence of the selected metrics. We evaluate our results using authorship classification tests because they provide a statistically rigorous method for quantifying the importance of different features. Nevertheless, our study reveals interesting insights that are potentially useful in real applications, and we therefore include a comparison with more traditional statistical natural language methods (which increase the efficiency from 62.5 to 90.0% correct attributions, see section 4.2).

2. Statistical quantification of the role of words in texts

2.1. Database

Our database contains five novels from each of eight authors who lived between 1809 and 1975, all available in an online repository (http://www.gutenberg.org/). The list of books is given in section 1 of the supplementary information (SI). To avoid effects from the length of the texts, each book was limited to its first 18 200 tokens, which corresponds to the length of the shortest book. In the remainder of this section, we illustrate our results using the book The Adventures of Sally by P G Wodehouse. The results for all books appear in the SI (section 3) and are discussed in section 3 below.

2.1.1. Pre-processing of the text

Before extracting complex network and intermittency measurements from the texts, some pre-processing steps were applied. Initially, the words on a pre-compiled list of stopwords, including articles, prepositions and adverbs, were removed from the text (see SI, section 2). Previous work on authorship recognition used the frequency of function words, but we decided not to use them in our study because we are interested in the interrelation between words with a pronounced semantic content. This procedure has been employed in many works (see, e.g., [7-9, 12, 15]) and it is crucial for determining how these techniques depend on the writing style of each author. Next, a lemmatization step was applied to the remaining words, aided by the MXPOST part-of-speech tagger based on Ratnaparkhi's model [32]. Table 1 exemplifies the application of these pre-processing steps. With this standardization, we grouped together all words referring to the same concept, despite differences in inflection.
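As an illustration, a minimal sketch of these two steps is given below. The NLTK stopword list and WordNet lemmatizer used here are stand-ins for our pre-compiled list and the MXPOST tagger, so the output only approximates table 1.

```python
# Approximate pre-processing sketch: NLTK's stopword list and WordNet
# lemmatizer replace the paper's pre-compiled list and the MXPOST tagger.
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for resource in ('punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Drop punctuation and stopwords, then lemmatize using POS information."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    content = [t for t in tokens if t not in STOPWORDS]
    # Map Penn Treebank tags to the WordNet classes expected by the lemmatizer.
    wn_class = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return [LEMMATIZER.lemmatize(t, wn_class.get(tag[0], 'n'))
            for t, tag in pos_tag(content)]

print(preprocess("Sally got up quickly, and flitting down the table"))
# expected output close to: ['sally', 'get', 'quickly', 'flit', 'table']
```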

Table 1. Example of the pre-processing steps applied to the texts. An extract (first column) obtained from the book The Adventures of Sally by P G Wodehouse is shown after the removal of the stopwords (second column) and after lemmatization (third column).

Original | Without stopwords | After lemmatization
What's that? asked Sally. | asked Sally | ask Sally
Pay my bill for last week, | pay bill last week | pay bill last week
due this morning. Sally got up | morning Sally got | morning Sally get
quickly, and flitting down the | quickly flitting | quickly flit
table, put her arm round her | table put arm | table put arm
friend's shoulder and whispered | friend shoulder whispered | friend shoulder whisper
in her ear. | ear | ear

2.2. Network measurements

Complex networks have been used to characterize different properties of languages [5-10, 14, 33]. Here we are interested in author-specific characteristics and therefore we adopt a network description based on word co-occurrence [5, 7-9, 14, 33], where nodes are words and links are established between adjacent words. This procedure is illustrated in figure 1. The network is defined by a set V = {v1, v2, ..., vn} of vertices and a set E of edges, and is represented as a nonsymmetric weighted matrix W. By construction, W is a square matrix of size n, where n is the number of distinct words after the pre-processing step. The elements wij of W indicate the strength with which vi is connected to vj (vi → vj), i.e. the number of times word vj appears immediately after word vi. Additionally, we used the nonweighted and undirected network corresponding to W, denoted by the matrix A, whose elements aij = 1 if the words represented by the vertices vi and vj appeared as neighbors at least once in the text, and aij = 0 otherwise.
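A minimal sketch of this construction follows; ignoring self-loops in A is an assumption of the sketch.

```python
# Sketch of the co-occurrence construction: W[i, j] counts how often word j
# appears immediately after word i; A is its unweighted, undirected
# counterpart. Dropping self-loops in A is an assumption of this sketch.
import numpy as np

def cooccurrence_matrices(words):
    vocab = sorted(set(words))
    index = {w: k for k, w in enumerate(vocab)}
    n = len(vocab)
    W = np.zeros((n, n), dtype=int)       # directed, weighted
    for a, b in zip(words, words[1:]):
        W[index[a], index[b]] += 1
    A = ((W + W.T) > 0).astype(int)       # undirected, unweighted
    np.fill_diagonal(A, 0)
    return vocab, W, A

tokens = ['ask', 'sally', 'pay', 'bill', 'last', 'week', 'pay', 'bill']
vocab, W, A = cooccurrence_matrices(tokens)
print(W[vocab.index('pay'), vocab.index('bill')])   # 2: 'bill' follows 'pay' twice
```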


Figure 1. Example of networks: (a) the subgraph obtained for the sentences shown in table 1; and (b) the global network obtained from the first 1000 associations of the same book.


We shall use the statistical properties of complex network measurements in the networks W and A. In this section, we discuss the word-specific local measurements, and in section 3.1, we show how to connect them to obtain a global characterization of the network. The number of occurrences Ni of each word i represented by node vi is

$$N_i = s_{\mathrm{in}}(i) = s_{\mathrm{out}}(i) = \sum_{j} w_{ji} = \sum_{j} w_{ij},\qquad (1)$$

where $s_{\mathrm{in}}$ and $s_{\mathrm{out}}$ are, respectively, the weights (or strengths [34, 35]) of the ingoing and outgoing edges of node vi (see the footnotes for the boundary cases of the first and last words). Therefore, the strength of each node vi equals the frequency of appearance of word vi, and the strength distribution of W is proportional to the normalized frequency ($f_i = N_i/N_{\mathrm{T}}$, $N_{\mathrm{T}} = \sum_i N_i$) distribution of words, given by Zipf's law [36, 37]. Below we discuss three typical measurements: the clustering coefficient, the average shortest path length and the betweenness centrality.

2.2.1. Clustering coefficient C

The clustering coefficient (C) measures the probability that the neighbors of a given vertex vi are connected. This measurement has been widely employed in complex networks, e.g. to verify the presence of communities [38-40] and to distinguish random networks from other small-world networks [5, 41]. Traditionally, the clustering coefficient is defined without considering weights or directions as

$$C_i = \frac{2\,n_{\Delta}(i)}{k_i\,(k_i - 1)},\qquad (2)$$

where $k_i$ is the number of neighbors of $v_i$ in A and $n_{\Delta}(i)$ is the number of edges between those neighbors,

which is equivalent to the fraction of the number of triangles among all possible triads of connected nodes and therefore ranges from 0 to 1. With regard to the interpretation of this measure, Ferrer i Cancho and Solé [5] found that the clustering coefficient of networks representing text was much larger than that expected just by chance (i.e. the value expected for the corresponding random networks).
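A toy illustration of this definition (a hypothetical mini-graph, not one of the book networks):

```python
# Toy illustration of equation (2) with networkx; in the paper, C is computed
# on the undirected co-occurrence matrix A of each book.
import networkx as nx

G = nx.Graph([('sand', 'watch'), ('sand', 'seat'), ('watch', 'seat'),
              ('full', 'peal'), ('full', 'home')])
C = nx.clustering(G)
print(C['sand'], C['full'])   # 1.0 (neighbors interconnected) and 0.0
```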

We singled out the words (and their neighbors) with the highest and lowest values of the clustering coefficient in the book The Adventures of Sally by P G Wodehouse, which are shown in table 2 for the frequency N = 5 (see the footnotes). From the definition, one expects words with the highest C to have neighbors that are also connected to each other. This is the case for the words 'sand' and 'excitement'. On the other hand, words whose neighbors are not related to each other at all display low values of C (e.g. there is no link between the neighbors of 'full' or between the neighbors of 'high'). Qualitatively, the clustering coefficient quantifies how strongly words are attached to specific contexts. Indeed, the words 'sand' and 'excitement' tend to be restricted to a specific context, while 'full' tends to appear in a myriad of contexts. Therefore, the clustering coefficient can be useful in detecting authorship by quantifying the tendency for using semantically specific or generic words.

Table 2. Words of the book The Adventures of Sally with the highest and lowest clustering coefficients (the average clustering 〈C〉 = 0.085), for words with Ni = 5. The five words with C = 0 were randomly selected among the 18 words with N = 5 and C = 0.

Word | Neighbors | N | C
Shortly | twelve, see, say, Sally, news, never, heaven, find, enter and Carmylle | 5 | 0.27
Excitement | thing, suppressed, Sally, mince, last, can, come, bristle, brief and apart | 5 | 0.25
Sand | watch, want, sit, shuffling, seat, here, golden, first, dark and Roville | 5 | 0.18
Nose | voice, tip, sort, smut, smooth, Sally, oh, glance, tell and come | 5 | 0.18
Country | time, still, somewhere, say, place, may, happen, great and glorious | 5 | 0.18
Startle | shy, seem, mill, little, gratify, gather, first, everyday, displeased and considerably | 5 | 0.00
High | recess, mouth, motive, lapse, figure, even, disposal, critical, collar and check | 5 | 0.00
Gold | voice, spin, pencil, loan, knob, information, high, heavy, frame and buy | 5 | 0.00
Gift | tongue, take, sort, potential, mean, few, easily, compensating, blessing and acquire | 5 | 0.00
Full | Tuesday, peal, later, home, happy, gratitude, gleaming, glance, color and battle | 5 | 0.00

2.2.2. Average shortest path length L

A shortest path (or geodesic path) between two nodes is the path whose sum of edge weights is minimum. We start by defining dij as the length of the shortest path between vi and vj (here the unweighted matrix A is employed, so dij is simply the number of edges along the path). Then the average shortest (or geodesic) path length for vi (Li) is the average shortest-path distance to all other (n − 1) nodes of the network:

$$L_i = \frac{1}{n-1} \sum_{j \neq i} d_{ij},\qquad (3)$$

which takes low values if vi is close to the other nodes.

The words with the lowest L include the characters 'Sally' (L = 2.35, N = 347) and 'Fillmore' (L = 2.51, N = 138), in addition to high-frequency words such as 'say' (L = 2.45, N = 349), 'good' (L = 2.46, N = 107) and 'man' (L = 2.50, N = 193). As for the words with the highest L, we found 'white-clad' (L = 6.33, N = 1), 'affability' (L = 6.31, N = 1), 'whirl' (L = 5.89, N = 1), 'jazz' (L = 5.87, N = 1) and 'war-aims' (L = 5.84, N = 1). Interestingly, all five of these words appeared only once in the text, indicating that one of the reasons for a high L could be the low frequency N. However, L is not only a consequence of the frequency N, as low-frequency words can also take low values of L. This is illustrated in table 3, which compares words with the same N but different L. The frequency has a limited influence on L, with a Pearson correlation Corr(L, N) = −0.36 calculated over all words. Actually, the determining factor is the neighborhood of the word. To understand why this happens, consider the words 'affability' and 'repose'. While the former has as neighbors the words 'jaunty' (N = 1) and 'white-clad' (N = 1), the latter has as neighbors the words 'Sally' (N = 347) and 'night' (N = 20). Therefore, one may infer that L actually quantifies the importance of a word according to its distance to the most frequent words. Since we removed stopwords, the shortest path may be thought of as quantifying the distance from a word to the core content words of the book.
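A toy sketch of equation (3) on a hypothetical mini-network, illustrating that hub-adjacent words obtain the lowest Li:

```python
# Toy sketch of equation (3): L_i is the mean geodesic distance from node i
# to the other n - 1 nodes; words attached to the hubs get the lowest values.
import networkx as nx

G = nx.Graph([('sally', 'say'), ('sally', 'night'), ('sally', 'repose'),
              ('say', 'good'), ('say', 'jaunty'), ('jaunty', 'affability')])
n = G.number_of_nodes()
L = {v: sum(nx.single_source_shortest_path_length(G, v).values()) / (n - 1)
     for v in G}
print(min(L, key=L.get), max(L, key=L.get))   # 'say' (hub) ... 'affability'
```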

Table 3. Comparing the average shortest path length L for words with the same frequency N of the book The Adventures of Sally. For a given N, L may vary widely, which shows the dependence of L on the neighborhood connectivity.

Word | Ni | Li | Word | Ni | Li
Red | 5 | 3.71 | Earth | 5 | 2.99
Shudder | 4 | 3.97 | Lucky | 4 | 3.00
Maxwell | 3 | 5.55 | Funny | 3 | 3.10
Dark | 2 | 5.15 | Kiss | 2 | 3.08
Affability | 1 | 6.34 | Repose | 1 | 3.11

2.2.3. Betweenness

Betweenness B is a measurement of centrality, with higher values being assigned to the nodes considered as the most relevant in terms of linking different words. In other words, with B one attempts to quantify the frequency of access of each node, assuming that a given target node in the network is reached from a specific source node via shortest paths. Betweenness is defined as follows. Let ηist be the number of distinct shortest paths between the source node vs and the target node vt that pass through the node vi. If gst is the total number of shortest paths between vs and vt, then Bi is given by

$$B_i = \sum_{s \neq i}\ \sum_{t \neq s,\, t \neq i} \frac{\eta_{ist}}{g_{st}}.\qquad (4)$$

In the context of text analysis, high-frequency words tend to have high B. However, some words may play the role of articulation points by linking concepts related to distinct communities. To illustrate this, we show in table 4 that words with similar N may take very different B. A comparison between the left and right columns suggests that words with high B connect concepts because of their probable appearance in various contexts. Therefore, similarly to the clustering coefficient C, the betweenness centrality B seems to quantify the variety of contexts in which a word can appear. Note, however, that B is based on a global connectivity pattern, in contrast to C.
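A toy sketch of equation (4), using unnormalized counts as in table 4 (the chain graph below is hypothetical):

```python
# Toy sketch of equation (4) with networkx: unnormalized betweenness counts
# the shortest paths through each node, as in the raw values of table 4.
import networkx as nx

G = nx.Graph([('tell', 'know'), ('know', 'say'), ('say', 'sally'),
              ('sally', 'fillmore'), ('fillmore', 'gerald')])
B = nx.betweenness_centrality(G, normalized=False)
print(max(B, key=B.get))   # 'say' or 'sally': the bridges of this chain
```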

Table 4. Comparing the betweenness B for words of the book The Adventures of Sally with the same frequency N. For a given N, the betweenness may vary widely, since low-frequency words may have high betweenness as they may appear in different contexts.

Word | Ni | Bi | Word | Ni | Bi
Say | 349 | 745 634 | Sally | 347 | 1 192 881
Know | 143 | 243 357 | Fillmore | 138 | 393 955
Tell | 65 | 53 904 | Gerald | 62 | 108 528
Allow | 20 | 15 816 | Roville | 21 | 32 449
Heaven | 10 | 1 147 | Second | 10 | 22 004
Rugger | 5 | 855 | Worthy | 5 | 10 503
Fish | 4 | 174 | Spectator | 4 | 14 746
Paper-Knife | 3 | 233 | Group | 3 | 8 320
Worship | 2 | 44 | Sell | 2 | 8 346
Thaw | 1 | 11 | Price | 1 | 8 295

2.3. Intermittency measurements

The uneven distribution of words across different documents is an essential feature exploited in statistical natural language processing. For instance, by investigating words that appear overconcentrated in specific documents (when compared with their overall frequency), one can detect keywords, topics and authorship [27, 42]. This is the basic idea of tf–idf (term frequency–inverse document frequency) and related measures that are also at the core of search engines [42]. However, there are numerous situations where the comparison with a general database is not available or is not of interest: for instance, when authorship has to be attributed without previous knowledge of texts written by the potential authors. Here, we approach these problems by taking advantage of the finding that words are unevenly distributed not only across documents but also within them [16-18, 21-24].

The quantification of the uneven distribution of words has been proposed based on measures commonly used by physicists [16, 24]. Following [21, 22, 24], we use the statistics of recurrence times, a standard quantification of intermittency or burstiness in time series [18, 19]. In texts, time is counted by the number of words, and for each word i the recurrence time Tj is defined as the number of words between two successive occurrences of i (the jth and (j + 1)th occurrences) plus one. For instance, the first two recurrence times for the word 'the' in the previous sentence are T1 = 9 and T2 = 7. A word that appears Ni times in a text of size NT leads to a sequence of Ni − 1 inter-occurrence times {T1, T2, ..., $T_{N_i-1}$}. In order to incorporate also the time Tf until the first occurrence and the time Tl after the last occurrence of the word, we consider $T_{N_i} = T_{\mathrm{f}} + T_{\mathrm{l}}$. In this case $\overline {T} = N_{\mathrm {T}}/N_i$, where the overline denotes the average over the different Tj's. Note that the mean recurrence time $\overline{T}$ gives no information beyond the frequency Ni. The intermittency of the word appears in the fluctuations of the Tj's around $\overline {T}$ and can be quantified by $I\equiv \sigma _{\mathrm {T}}/\overline {T}$, where $\sigma _{\mathrm {T}}=\sqrt {\overline {T^2} - \overline {T}^2}$. Randomly distributed words have I = 1 (in the limit of large Ni and small Ni/NT), intermittent words have I > 1 and words appearing at regular intervals have I < 1. We calculate the intermittency measure $I_i=\sigma _{\mathrm {T}}/\overline {T}$ for all words with Ni ⩾ 5 in each of the books (filtered texts) described in section 2.1. The words with Ni < 5 were considered to lack statistics and were disregarded.
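A minimal sketch of this computation, including the closing term $T_{N_i} = T_{\mathrm{f}} + T_{\mathrm{l}}$ so that $\overline{T} = N_{\mathrm{T}}/N_i$ holds exactly:

```python
# Sketch of the intermittency measure I = sigma_T / mean_T for every word
# with at least min_count occurrences; the closing term T_f + T_l makes the
# recurrence times sum to N_T, so mean_T = N_T / N_i exactly.
import numpy as np

def intermittency(tokens, min_count=5):
    positions = {}
    for pos, word in enumerate(tokens):
        positions.setdefault(word, []).append(pos)
    N_T = len(tokens)
    I = {}
    for word, p in positions.items():
        if len(p) < min_count:
            continue                             # too few occurrences
        T = np.diff(p).tolist()                  # T_1, ..., T_{Ni - 1}
        T.append(p[0] + 1 + (N_T - 1 - p[-1]))   # T_Ni = T_f + T_l
        T = np.asarray(T, dtype=float)
        I[word] = T.std() / T.mean()             # regular ~0, random ~1, bursty >1
    return I

print(intermittency(('a b c d ' * 30).split(), min_count=5))
# perfectly regular words: I = 0 for 'a', 'b', 'c' and 'd'
```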

In table 5, we compare the words with the highest $I = \sigma_{\mathrm{T}}/\overline{T}$ with words of similar frequency. It is clear that the most intermittent words (largest $\sigma _{\mathrm {T}}/\overline {T}$) are topical words (e.g. names of characters and locations), regardless of their frequency. Indeed, 15 out of the 16 most intermittent words are directly connected to specific characters. Similar behavior is observed in all books of our database. Intermittency is therefore a good characterization of topical words, which in turn play an important role in the author-specific characteristics of the texts. The relationship between $\sigma _{\mathrm {T}}/\overline {T}$ and the function of the words has been investigated in detail in [16-18, 21, 22, 24]. In the next section, we explore the fact that these properties are also author specific [17].

Table 5. In the book The Adventures of Sally by P G Wodehouse, there are NT = 15 173 words (tokens), 3657 different word types and 716 words with Ni ⩾ 5. The five words with highest $\sigma _{\mathrm {T}}/\overline {T}$ are shown in the left part of the table. For comparison, in the right we show for each of these words another word with the closest frequency.

Word | Ni | $I_i\equiv \sigma _{\mathrm {T}}/\overline {T}$ | Word | Ni | $I_i\equiv \sigma _{\mathrm {T}}/\overline {T}$
Jules | 26 | 4.31 | Turn | 26 | 1.55
Hobson | 31 | 4.09 | Here | 31 | 1.35
Ginger | 115 | 3.86 | Get | 117 | 1.24
Carmyle | 54 | 3.60 | Feel | 53 | 0.87
Bunbury | 20 | 3.59 | People | 20 | 1.39

3. Evaluating the author dependence

3.1. From properties of words to properties of books

In the previous section, we introduced five quantities characterizing properties of words in a text: frequency (N), average shortest path length (L), betweenness (B), clustering coefficient (C) and intermittency ($I = \sigma_{\mathrm{T}}/\overline{T}$). The values of these quantities for all words in the books of our database can be found in SI (section 3). We now analyze the global distribution of these measurements over all the words in a given book by plotting the empirical probability density function ρ(X) for the measurements X = {N, L, B, C, I}. Figure 2 shows the results for one book; similar distributions were obtained for the other books. The shortest path L, clustering C and intermittency I have a well-defined peak and width (akin to a Gaussian distribution), whereas the frequency N and betweenness B have broad-tailed distributions (as in power-law distributions $\rho(X) \sim X^{-\alpha}$). The tail in N corresponds to the well-known Zipf's law, which also appears in B, as expected from the large correlation between B and N (Corr(B, N) = 0.95 in the book of figure 2). Given these two different types of behavior, we treat the measurements in two groups, one for X = {L, C, I} and another for X = {N, B}.


Figure 2. Probability density function ρ(X) obtained from the different words of the book The Adventures of Sally by P G Wodehouse: (a) X = L, shortest path; (b) X = C, clustering coefficient; (c) X = I, intermittency; (d) X = N, frequency; and (e) X = B, betweenness. In (d, e), the cumulative distribution $\rho (x \geqslant X)\equiv \int _X^\infty \rho (x){\rm d}x$ is shown, with the density ρ(X) depicted in the inset. The legends indicate the features defined in equations (5)–(7) obtained for these distributions, and Corr(X, N) indicates the Pearson correlation coefficient between X and N calculated over all words.


Our goal is to obtain quantities characterizing important features of these distributions to be used as global measurements of the books. The most natural choice is the average value 〈X〉, where $\langle \cdots \rangle \equiv \frac {1}{M} \sum _{i=1}^M \cdots $ corresponds to an average over the M different words. For the measures L, C and I this corresponds to the average value over nodes, a quantity considered characteristic of the network [7, 12, 14, 43]. For X = {N, B}, the high-frequency words contribute strongly to 〈X〉 due to the long tails. To compensate for this effect, we also consider a modified average defined as 〈X〉₂ ≡ 〈log X〉 for X = {N, B}. For X = {L, C, I} the opposite is true, i.e. 〈X〉 is dominated by the large number of low-frequency words. Accordingly, we introduce a modified average as $\langle X \rangle _2 \propto \sum _i X_i \log N_i$, i.e. a weighted average with weights proportional to the logarithm of the frequency. The quantities 〈X〉 and 〈X〉₂ are expected to give a good account of 'typical' values of X. However, in section 2 we mentioned that important information is conveyed by words with large X, i.e. in the tails of the distributions shown in figure 2. In order to characterize the fat-tailed distributions of X = {N, B}, we used the coefficient $\alpha_X$ of a power-law fit to the tails of ρ(X) (see the footnotes for the fitting procedure). An additional motivation for using $\alpha_N$ comes from the suggestion in [45] that it quantifies the style of texts. The large values of X = {L, C, I} were characterized by calculating the skewness of ρ(X), a measure of the asymmetry of the distribution. In summary, the three features we use for each of the five quantities X = {N, B, L, C, I} are

$$\langle X \rangle \equiv \frac{1}{M} \sum_{i=1}^{M} X_i,\qquad (5)$$

$$\langle X \rangle_2 \equiv \langle \log X \rangle \ \text{for } X = \{N, B\}; \quad \langle X \rangle_2 = \frac{\sum_i X_i \log N_i}{\sum_i \log N_i} \ \text{for } X = \{L, C, I\},\qquad (6)$$

$$\gamma(X) \equiv \alpha_X \ \text{(the power-law exponent of the tail of } \rho(X)\text{)} \ \text{for } X = \{N, B\}; \quad \gamma(X) \equiv \text{skewness of } \rho(X) \ \text{for } X = \{L, C, I\}.\qquad (7)$$

These features are given in figure 2 for one book (see SI, section 3, for all 40 books). Obviously, the choice of the quantities above is to some extent arbitrary. Our choice was intended to capture salient features of the distributions rather than to give a parametric description of the full distributions. In particular, the power-law fit in equation (7) is not intended to fully describe the distributions, as is apparent in figures 2(d) and (e).
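For concreteness, a sketch of how the three features can be computed for one quantity follows; the explicit normalization of 〈X〉₂ for X = {L, C, I} and the simple log–log tail fit are assumptions of the sketch (the actual fit uses the logarithmically binned cumulative distribution, see the footnotes and [44]).

```python
# Sketch of equations (5)-(7) for one quantity X over the M words of a book.
# The explicit normalization of <X>_2 for X in {L, C, I} and the plain
# log-log least-squares tail fit are simplifying assumptions of this sketch.
import numpy as np
from scipy import stats

def book_features(X, N, heavy_tailed=False):
    X, N = np.asarray(X, float), np.asarray(N, float)
    mean = X.mean()                                       # equation (5)
    if heavy_tailed:                                      # X in {N, B}
        mean2 = np.log(X[X > 0]).mean()                   # equation (6)
        # equation (7): exponent alpha of rho(X) ~ X^(-alpha) in the tail
        bins = np.logspace(0, np.log10(X.max()), 20)
        hist, edges = np.histogram(X, bins=bins, density=True)
        mids = np.sqrt(edges[1:] * edges[:-1])
        keep = hist > 0
        gamma = -np.polyfit(np.log(mids[keep]), np.log(hist[keep]), 1)[0]
    else:                                                 # X in {L, C, I}
        mean2 = (X * np.log(N)).sum() / np.log(N).sum()   # equation (6)
        gamma = stats.skew(X)                             # equation (7)
    return mean, mean2, gamma
```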

3.2. Machine learning methods and evaluation

In order to quantify the ability of the features described above to distinguish between authors, we employ machine learning algorithms that induce classifiers from a training database. The robustness of our results is tested with three widely used algorithms based on different principles. The first, known as C4.5 [46], generates decision trees based on the information gain of each feature; the second is Naive Bayes [47], which is based on Bayes' theorem; and the third and simplest is the nearest-neighbor algorithm [48], which classifies an unknown instance according to its nearest neighbor in a normalized space involving all features. For more details, see SI (section 4).
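The setup can be sketched as follows; the scikit-learn classifiers are stand-ins (a CART tree in place of C4.5), and the feature matrix below is random placeholder data rather than our measurements.

```python
# Sketch of the evaluation with scikit-learn stand-ins for the three
# algorithms (C4.5 ~ DecisionTreeClassifier, Naive Bayes ~ GaussianNB,
# nearest neighbor ~ 1-NN on normalized features). X is placeholder data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 15))        # 40 books x 15 features (placeholder)
y = np.repeat(np.arange(8), 5)       # 8 authors, 5 books each

cv = StratifiedKFold(n_splits=5)     # each fold holds out one book per author
for name, clf in [('C4.5-like tree', DecisionTreeClassifier()),
                  ('Naive Bayes', GaussianNB()),
                  ('1-NN', make_pipeline(StandardScaler(),
                                         KNeighborsClassifier(n_neighbors=1)))]:
    acc = cross_val_score(clf, X, y, cv=cv).mean()
    print(f'{name}: {acc:.1%}')      # ~12.5% chance level on random data
```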

3.3. Efficiency of the classification

We consider the problem of distinguishing between eight authors, using five books to represent each author's style. More specifically, each book described in section 2.1 was characterized by the set of 15 features discussed in section 3.1 (〈X〉, 〈X〉₂ and γ(X) for X = {N, B, L, C, I}). The authorship assignment was performed using the algorithms of section 3.2 applied to a training dataset independent of the test book, following a cross-validation methodology (see SI, section 4). This technique ensures that the training and evaluation sets are disjoint; it is equivalent to assigning the authorship of each book in an experiment where four books by each of the eight authors form the training dataset. The final output of the algorithms is the assignment of a specific author to each book tested, and the efficiency is quantified simply as the fraction of successful assignments.

The results are summarized in table 6 and indicate accuracy rates between 42.5 and 50.0% when all 15 features were used. These results are highly statistically significant, confirming that the features successfully capture author-specific characteristics. To further explore the accuracy of different methods, we considered cases in which only some of the features were included in the algorithms. We tested all 2¹⁵ = 32 768 combinations of the 15 features and obtained a best result of 65.0% correct assignments.

Table 6. Accuracy rates achieved by the three machine learning algorithms using all 15 features and using the best combination of features. The accuracy is estimated based on 40 authorship assignments. The p-values correspond to the probability of obtaining an equal or higher accuracy by chance in one trial (all features) or in 2¹⁵ = 32 768 trials (the best case). The features included in the best cases can be found in SI (tables S1–S4).

 | Decision tree C4.5 | Nearest-neighbor kNN | Naive Bayes
All 15 features | 50.0% (p = 1 × 10⁻⁸) | 47.5% (p = 6 × 10⁻⁸) | 42.5% (p = 2 × 10⁻⁶)
The best case | 62.5% (p = 5 × 10⁻⁹) | 65.0% (p = 4 × 10⁻¹⁰) | 62.5% (p = 5 × 10⁻⁹)

3.4. Relative importance of different features

In evaluating the importance of the different features for the final results, it is essential to identify their mutual dependence. We start from the list of all 2¹⁵ = 32 768 combinations of features, ordered by decreasing accuracy (as shown in SI, tables S1–S4). We wish to quantify how often a feature y appears near the top of this list. To this end, we count the fraction of the 2¹⁴ feature combinations including y whose accuracy is higher than or equal to a threshold. The final estimate is then given by the area under the curve of the ROC plot obtained by varying this threshold [49]. This procedure is equivalent to the Mann–Whitney U test [50]. The motivation for using this method is that it evaluates the importance of a specific feature by taking into account how it combines with the other features to improve the accuracy of the prediction. The method depends both on the prediction algorithm and on the other features.
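A sketch of this estimate; the `results` input, pairing every evaluated feature subset with its accuracy, is assumed to come from the exhaustive runs described above.

```python
# Sketch of the feature-importance estimate: for each feature y, the AUC
# comparing the accuracies of combinations containing y against the rest
# equals the rescaled Mann-Whitney U statistic.
from scipy.stats import mannwhitneyu

def feature_importance(results, n_features=15):
    """results: iterable of (feature_subset, accuracy) pairs."""
    auc = {}
    for y in range(n_features):
        with_y = [acc for subset, acc in results if y in subset]
        without_y = [acc for subset, acc in results if y not in subset]
        U, _ = mannwhitneyu(with_y, without_y, alternative='greater')
        auc[y] = U / (len(with_y) * len(without_y))   # area under ROC curve
    return sorted(auc, key=auc.get, reverse=True)     # best feature first
```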

The features were ranked based on the method described above. The results for the three prediction algorithms are given in the first three columns of table 7. The three features appearing as most prominent are 〈N〉 (average frequency), γ(I) (skewness of intermittency) and 〈N〉₂ (average logarithmic frequency). In order to assess the importance of features beyond a specific algorithm, it is important to quantify to what extent the results obtained for the three algorithms (the first three columns) are consistent with each other. To this end, we compute Spearman's rank correlation and obtain the values 0.29 (p-value = 0.145), 0.49 (p-value = 0.032) and 0.67 (p-value = 0.003) for the pairs C4.5/kNN, C4.5/Bayes and kNN/Bayes, respectively. The p-values are computed under the null hypothesis that the rankings are independent. Altogether, the three p-values indicate that the three rankings are consistent with each other. This is a strong indication that our analysis goes beyond algorithm-specific results and indeed captures the influence of the features.
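These correlations can be checked directly from the first three rank columns of table 7 (assuming one-sided p-values):

```python
# Consistency of the three multiple-feature rankings (the first three columns
# of table 7), reproducing the Spearman correlations quoted above.
from scipy.stats import spearmanr

c45 = [6, 2, 1, 7, 5, 10, 8, 12, 4, 3, 9, 11, 13, 15, 14]
knn = [1, 2, 6, 4, 8, 3, 7, 11, 13, 14, 9, 10, 5, 12, 15]
bayes = [1, 2, 3, 6, 5, 10, 8, 4, 11, 14, 9, 7, 12, 13, 15]

for name, a, b in [('C4.5/kNN', c45, knn), ('C4.5/Bayes', c45, bayes),
                   ('kNN/Bayes', knn, bayes)]:
    rho, p = spearmanr(a, b, alternative='greater')
    print(f'{name}: rho = {rho:.2f}, one-sided p = {p:.3f}')
# expected: rho = 0.29 (p = 0.145), 0.49 (p = 0.032) and 0.67 (p = 0.003)
```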

Table 7. Ranking of features based on the accuracy rates of the classifiers, where 1 means best, 2 means second best and so on. The results for each classifier algorithm (C4.5, kNN and Bayes) are reported for different ranking procedures: combined with multiple features (Mann–Whitney U test, columns 1–3), information gain (column 4) and accuracy using only one feature (columns 5–7). The last column reports the Pearson correlation between each feature and the vocabulary size M (the number of different words), calculated over the 40 books of our database. The features in the table are ordered according to the decreasing geometric mean of the ranks obtained in the three multiple-feature analyses (the same ordering is achieved by considering, for each feature, the likelihood of reaching by chance a ranking as good as the one in each of the three ranking schemes). The areas under the curve in the multiple-feature analysis ranged between 56 and 69%.

Feature | Multiple: C4.5 | Multiple: kNN | Multiple: Bayes | Info gain | Single: C4.5 | Single: kNN | Single: Bayes | Corr. with M
〈N〉₂ | 6 | 1 | 1 | 3 | 2 | 5 | 1 | −0.90
γ(I) | 2 | 2 | 2 | 10 | 12 | 9 | 10 | −0.08
〈N〉 | 1 | 6 | 3 | 2 | 1 | 2 | 3 | −0.96
〈L〉 | 7 | 4 | 6 | 9 | 5 | 3 | 8 | 0.85
〈B〉 | 5 | 8 | 5 | 1 | 3 | 1 | 2 | 0.98
〈I〉₂ | 10 | 3 | 10 | 15 | 15 | 12 | 12 | −0.34
〈L〉₂ | 8 | 7 | 8 | 8 | 5 | 7 | 5 | 0.85
〈C〉 | 12 | 11 | 4 | 6 | 5 | 5 | 5 | −0.87
γ(L) | 4 | 13 | 11 | 13 | 10 | 9 | 13 | −0.13
γ(B) | 3 | 14 | 14 | 11 | 8 | 9 | 9 | −0.07
〈B〉₂ | 9 | 9 | 9 | 7 | 9 | 14 | 5 | 0.88
〈C〉₂ | 11 | 10 | 7 | 5 | 4 | 3 | 4 | −0.87
〈I〉 | 13 | 5 | 12 | 12 | 13 | 15 | 10 | −0.29
γ(N) | 15 | 12 | 13 | 4 | 10 | 8 | 13 | 0.81
γ(C) | 14 | 15 | 15 | 14 | 14 | 12 | 15 | 0.07

It is interesting to compare these results with evaluations that take each feature separately. This can be done either by considering the accuracy of the prediction using only the specific feature or by computing the information gained by including the feature [51]. The latter method has the advantage of being independent of the prediction algorithm. These results are shown in the last four columns of table 7. Note that some features appearing as very important in the multiple-feature analysis are not informative when taken alone (e.g. the skewness of the intermittency, γ(I)). On the other hand, features that are well ranked in the single-feature analysis do not always appear among the most important features when multiple features are considered (e.g. the weighted average of the clustering, 〈C〉₂). These observations show the nontrivial mutual dependence of the features. To explore this further, we performed a factorial analysis (see SI, section 5) using the 12 most important features of table 7, with the most important combinations summarized in table 8. As expected from table 7, γ(I) appears among the best two combinations of features in all three algorithms, which confirms that its effectiveness stems from its interplay with the other features.

Table 8. The two most relevant combinations of features, as revealed by a factorial analysis for the C4.5, kNN and Bayes classifier algorithms. As expected from table 7, γ(I) provides good results when used in conjunction with other features, such as 〈N〉, 〈N〉₂, γ(B) and 〈I〉₂.

 | C4.5 | kNN | Bayes
The first combination | γ(L), 〈C〉 and 〈C〉₂ | γ(I) and γ(B) | 〈N〉, 〈N〉₂ and 〈B〉
The second combination | γ(I) and 〈N〉 | γ(I) and 〈N〉₂ | 〈I〉₂ and γ(I)

4. Discussion and conclusions

4.1. Interpretation of the results

We are now in a position to use the word-specific analysis (section 2) and the distributions (section 3.1) of the quantities X = {N, B, L, C, I} to assess the importance of the different features 〈X〉, 〈X〉₂ and γ(X) in table 7:

  • Frequency. This was the most efficient quantity for recognizing authorship, with 〈N〉 and 〈N〉₂ among the three most important features. Noting that 〈N〉 is proportional to the inverse of the number of distinct words M,
    $$\langle N \rangle = \frac{1}{M} \sum_{i=1}^{M} N_i = \frac{N_{\mathrm{T}}}{M},\qquad (8)$$
    one infers that the distinguishing feature between the authors captured by 〈N〉 is their different vocabulary sizes. The modified average 〈N〉₂ = 〈log N〉 also captures this aspect, including the proportion of frequent and infrequent words. On the other hand, the poor performance of γ(N) (= α in $\rho(N) \sim N^{-\alpha}$) is a clear signature of the universal, author-independent character of Zipf's law (at a fixed book size [37]).
  • Betweenness. The average betweenness 〈B〉 was useful because of its strong correlation with the vocabulary size M of the book (the last column of table 7). In network terms, this corresponds to a linear relationship between 〈B〉 and network size (M is the number of nodes) and can be understood by noting that the number of terms in the sum defining Bi in equation (4) is proportional to M², so that 〈B〉 is expected to scale linearly with M for a fixed book size. The poor performance of 〈B〉₂ and γ(B) indicates that the number of words with high betweenness is not a relevant feature for distinguishing between authors.
  • Shortest path. This was the network quantity with the best performance. 〈L〉 quantifies the typical distance of words to the central hubs of the network (frequent words), and its good performance points to a dependence on the style of the authors. The poorer performance of 〈L〉₂ and γ(L) indicates that the style dependence of L is more prominent in its typical values than in, respectively, the frequent words and the large-L words.
  • Clustering. The poor performance of all features related to this quantity suggests that authors have very little freedom in choosing the clustering of word co-occurrence networks. The last position of γ(C) in the ranking of table 7 suggests that the fraction of words used in specific contexts (high C) is author independent. The two averages 〈C〉₂ and 〈C〉 take similar values (as seen in figure 2 and in SI, section 3; recall the restriction Ni ⩾ 5 used in section 2.2). They perform well only when used alone, possibly because of their correlation with the vocabulary size M.
  • Intermittency. Apart from the frequency, intermittency was the most important quantity in table 7, with the skewness of the distribution γ(I) playing a prominent role. In view of the results of section 2.3, γ(I) may be interpreted as the fraction of all words that are topical or 'keyword like'. The poor performance of 〈I〉 is not surprising, since I is normalized by the frequency ($I\equiv \sigma _{\mathrm {T}}/\overline {T}$) and therefore 〈I〉 ≈ 1 is expected. Indeed, of all five quantities, the I features showed altogether the smallest absolute correlation with the vocabulary size (table 7), explaining why even γ(I) has a poor relative performance when used alone. Finally, 〈I〉₂ performs better than 〈I〉, suggesting that frequent words are more relevant.

4.2. Comparison with other prediction methods

Even if the main goal of this paper is to evaluate the importance of different factors, it is also useful to compare the accuracy of our results with other methods of authorship attribution. Uzuner and Katz [52] used a database of books similar to ours, produced by eight authors. They used five sets of features, including simple statistics and more sophisticated syntactic analysis (table 3 of [52]). Our best result (accuracy of 65%) is comparable to their second best case, obtained using 'syntactic elements of expression' (62%), being significantly worse only than their best result, achieved using function words (87%). In an extensive review, Grieve [31] reported accuracies obtained with a set of 34 features varying between 33 and 87% for the case of five authors, and between 18 and 80% for ten authors (table 9 of [31]). Our best results are above the median of the results achieved with the different features. Their best results are again based on the relative frequency of function words. These findings are in accordance with the long tradition, started by Mosteller and Wallace [27], of using the frequency of function words to distinguish between authors.

In order to confirm this in our database, we implemented a series of prediction schemes using the frequencies Ni of frequent (mostly function) words. In contrast with the approach described in this paper, which uses average and scaling properties as features, here the frequencies of specific words are used directly as input features of the prediction algorithms. We used only the kNN algorithm because the other algorithms did not provide good results when too many features were included. When the list of 70 stopwords from table 2.5 of [27] was used, we obtained an accuracy of 62.5%, i.e. comparable to our best results. Following [31], we considered two other lists of words: all 1978 words that appear in at least one book of each of the authors, leading to an accuracy of 90%; and all 209 words that appear at least once in every book of our database, leading to an accuracy of 82.5%. We recall that, in order to concentrate on words with pronounced semantic content instead of function words, we deliberately excluded a list of stopwords that covers 80% of the words listed in [27]. Therefore, our best combination of features compares well with other methods that demand more sophisticated syntactic analysis of the text. Measurements of complex networks and intermittency are indeed able to capture many of the author-dependent characteristics.

In order to illustrate how the measures analyzed in this paper can complement traditional methods, we performed a very simple experiment using as features the frequency and intermittency of the set of words composed of the five most frequent words of each book. The accuracy in classifying the books by frequency alone was 72.5% and by intermittency alone was 37.5%. Although the latter accuracy is not impressive, it is statistically significant (p = 2.2 × 10⁻⁴) and shows that the intermittency values of specific words differ across distinct authors. The accuracy increased to 80% when both features were included. It remains to be shown in future work how our results can improve state-of-the-art methods of authorship attribution.

4.3. Summary of conclusions

We have shown that the styles of different authors leave fingerprints in very general statistical measures of texts, based on the network of co-occurrence of words and on the intermittency or burstiness of words. The statistically significant scores obtained in authorship attribution unequivocally show that the style dependence of these features can be used in practice. Regarding the prominence of the different features, we note that both the results and the ranking of features may depend on the database, the selected features and the attribution algorithms. Accordingly, as emphasized in [31], different algorithms and features have to be tested on a given corpus before any real application of authorship attribution. However, the robustness of our results across three radically different attribution algorithms strongly suggests that the importance of the different features goes beyond specific algorithms. Two features should be highlighted: (i) the skewness of the distribution of intermittent words, γ(I), which is based on the long-scale distribution of words and detects the extent to which topical words (keywords, large $I\equiv \sigma _{\mathrm {T}}/\overline {T}$) were used in the book; and (ii) the mean shortest path of the word co-occurrence network, 〈L〉, which is based on the short-range connectivity of words and detects the typical distance of words to all other words. The different nature of these two quantities suggests a complementary role in capturing both short- and long-scale properties of the text, as well as typical and exceptional words.

Our focus in this paper was on the evaluation of the different features, rather than on maximizing the efficiency of the authorship attribution algorithms. This is apparent when comparing the best accuracy rates we achieved using our approach (62.5%) and using previous proposals (90.0%), as discussed in section 4.2. A further limitation of approaches based on intermittency and networks is that they require large pieces of text. While the root of the success of previous methods is the observation that function words are a powerful tool for detecting the style of authors [27, 31, 52], in the complex network and intermittency approaches used in this paper the focus is on content words. In this sense the results we achieve can be thought of as being complementary to the analysis using function words. More specifically, our results suggest that using γ(I) and 〈L〉 can improve authorship recognition techniques when used in combination with the many different features currently employed [31]. Finally, the successful application of these measurements to characterize the style of authors suggests that the quantities discussed here can be further explored in other linguistic tasks, an approach that has been limited to a few works (see, e.g., [14, 53]).

Acknowledgments

We thank Stefan Siegert for valuable suggestions and assistance with the statistical methods and for a careful reading of the manuscript. This work was supported by FAPESP and CNPq (Brazil). DRA acknowledges support from the MPIPKS during his one-month visit to Dresden (Germany).

Footnotes

  • The triple equality in equation (1) is not valid for the first and the last words of the text. The correct expression for the first word is $N_i = s_{\mathrm{out}}(i) = \sum_{j} w_{ij}$ and for the last word it is $N_i = s_{\mathrm{in}}(i) = \sum_{j} w_{ji}$.

  • The words with Ni < 5 were considered to lack statistics and were disregarded in all the analyses involving the clustering C.

  • The fitting was performed on the cumulative distribution with logarithmic binning, as suggested in [44]. A cut-off X > 3 × 10³ was used for X = B (see figure 2(e)); no cut-off was used for X = N.
