Abstract
We propose a power-law growth and decay model for posting data to social networking services before and after social events. We model the time series structure of deviations from the power-law growth and decay with a conditional Poisson autoregressive (AR) model. Online postings related to social events are described by five parameters in the power-law growth and decay model, each of which characterizes different aspects of interest in the event. We assess the validity of parameter estimates in terms of confidence intervals, and compare various submodels based on likelihoods and information criteria.
Citation: Fujiyama T, Matsui C, Takemura A (2016) A Power-Law Growth and Decay Model with Autocorrelation for Posting Data to Social Networking Services. PLoS ONE 11(8): e0160592. https://doi.org/10.1371/journal.pone.0160592
Editor: Eduardo G. Altmann, Max-Planck-Institut für Physik komplexer Systeme, GERMANY
Received: December 16, 2015; Accepted: July 21, 2016; Published: August 9, 2016
Copyright: © 2016 Fujiyama et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Data are available from NTTCom Online Marketing Solutions Corporation (http://www.nttcoms.com/contact/) for researchers who meet the criteria for access to confidential data.
Funding: This work was supported by JSPS Grant-in-Aid No. 15K20939 and JSPS Grant-in-Aid No. 25220001.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the increasing use of social networking services (SNS), including blogs, Facebook, and Twitter, it is becoming increasingly important to extract information from SNS to optimize the use of resources for social events. Many models have been proposed to capture people’s reactions to social events from various viewpoints, for instance, complex networks [1] and time-series analyses [2, 3]. However, it is still difficult to quantitatively verify the models.
To overcome this problem, we use statistical methods in analyzing time-series data of social events. By assuming the power-law growth and decay for the number of postings about social events on SNS, we aim to gain insights into patterns of human interest regarding these events. We classify the patterns of human interest with parameters α: how “rapidly” people become interested/lose interest before/after an event, β: how “long” people remain interested in an event, and γ: how “much” attention is paid by people. The parameters, α and β, can be specified differently for data before and after the event. The last parameter γ only concerns the peak of the time series, and therefore, we consider a total of five parameters. The length of the time series of the data for social events is usually short, covering several weeks before and after the event. The time series are typically non-stationary because there is a sharp peak in the number of postings on the date of the event. Hence, we cannot use standard time series models, such as ARMA models [4].
Social events are usually scheduled in advance, and thus we call them “predictable events”, as the time of occurrence is known in advance. On the other hand, there exist “unpredictable events” such as earthquakes, whose effect on the number of postings on SNS has been studied in [2, 5–7]. In the case of unpredictable events, we only obtain the data after the event, whereas for predictable events such as social events, we have the data before and after the event. Another type of event is a “deadline event”, such as registration data collected up to a deadline [8, 9]. The time series data for a deadline event often show similar behavior to unpredictable events. Since we deal with the time series before and after the event separately using different parameters, our model consists of a growth part and a decay part, which describe deadline events and unpredictable events, respectively. For this reason, our method can be used for any of predictable, unpredictable, and deadline events.
The power-law distribution model is widely used, for instance, in time-series of daily views on YouTube [1] and rumor diffusion on Twitter [2]. The model was applied to blog posting about social events [3]. This model allows the prediction of the number of postings and the time to return to the normal activity level. Here we modify the model by introducing the conditional Poisson autoregressive (AR) model for deviations of data from the power-law model. The conditional Poisson AR model is often used in fields of social science such as economics, political science, and epidemiology [10, 11]. The theoretical aspects of the conditional Poisson AR model are found in [12–14]. Thus, we propose the model consisting of the power-law growth and decay model and the conditional Poisson AR model. The former describes the expected number of postings, while the latter describes deviations from the power-law growth and decay model. We give interpretations of the parameters contained in the model. We show that we are able to obtain the necessary information from the parameter values. The advantage of employing a statistical model is that we can assess the validity of parameter estimates in terms of confidence intervals and we can compare various submodels based on likelihoods and information criteria.
The organization of this paper is as follows. After describing the data, we first introduce the five-parameter power-law growth and decay model with independent Poisson distributions. The model is then extended by introducing autocorrelation through the conditional Poisson AR model. We then analyze the data using these models. The paper closes with concluding remarks.
Materials and Methods
Data description
The data analyzed in this study were provided by NTTCom Online Marketing Solutions Corporation through the BuzzFinder service (those who want to access the data we used can request it through the web page http://www.nttcoms.com/contact/). We used two million blog postings collected over six months, from January 2014 to June 2014. Blog postings were counted from more than ten popular Japanese blog servers, including goo-blog, ameblo, yahoo blog, and livedoor blog, if they contained a keyword we specified.
In Fig 1(a) we show a typical symmetric pattern for the number of postings. The Tokyo Marathon 2014 was held on Sunday, February 23, 2014. It was highly anticipated and it ended without anything unexpected happening. In this and similar events, the number of postings shows a symmetric pattern with a sharp peak on the day of the event. Both sides of the peak seem to exhibit negative power behavior in the time difference from the actual date of the event.
Fig 1. (a) shows the typical symmetric pattern, while (b) shows the strongly asymmetric pattern.
An asymmetric pattern is observed when there is a surprising element to the event. In Fig 1(b) we show the number of postings about Noriaki Kasai around February 16, 2014, when he won a silver medal in ski jumping at the 2014 Winter Olympics. From the data we see that people did not anticipate the medal before the event.
In view of the above characteristics of the number of postings, we propose that SNS posting data presents an interesting challenge for statistical analysis. Since the data is in a time series, we propose models that can account for autocorrelations. As we show in the Results section, our proposed model fits the data very well.
A power-law growth and decay model for the mean number of postings
We propose a model without autocorrelation and an autoregressive model. We then compare these models by Akaike’s Information Criterion (AIC) in the Results section.
The power-law growth and decay model without autocorrelation.
Let t0 denote the date of the event and let yt denote the number of posts on day t about the event. We model the expected value of yt by the following power-law growth and decay function: (1) γ is a common parameter for t < t0 and t > t0. The power-law growth and decay model was proposed for the symmetric case in [3] without the parameter α. They used the least-squares method, while we use the maximum likelihood method for parameter estimation. They do not consider fitting the data close to the peak, which we do by introducing the parameter α. Note that the model (1) consists of the growth part (t < t0) and the decay part (t > t0), each of which models deadline events and unpredictable events, respectively.
The interpretation of the parameters is as follows.
- αa/b: steepness of the curve just after/before the event. How rapidly an event loses/gains interest.
- βa/b: longer decay/growth pattern. How long people remain/get interested in an event.
- γ: impact of the event (peak level, the maximum number of postings). The total attention paid to an event.
When αa = αb and βa = βb, the model (1) shows the symmetric pattern. For the symmetric case, we assume that yt, t = tL, tL + 1, …, tU − 1, tU, tL ≤ t0 ≤ tU, are independent Poisson random variables with mean μt(α, β, γ) in Eq (1). We call this the power-law growth and decay model without autocorrelation. We denote the probability function for the Poisson distribution with mean μ as
$$p(y; \mu) = \frac{\mu^{y} e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \ldots \tag{2}$$
Then the likelihood function is
$$L(\alpha, \beta, \gamma) = \prod_{t = t_L}^{t_U} p\bigl(y_t; \mu_t(\alpha, \beta, \gamma)\bigr). \tag{3}$$
Fig 2. The solid line is the observed data and the dotted line is obtained from the power-law growth and decay model.
Maximization of the log-likelihood function is numerically straightforward. We fitted a Poisson distribution to the Tokyo Marathon data in Fig 2. We chose the estimation period as one week before (tL = t0 − 7) and after (tU = t0 + 7) the event for the reason discussed later in the Results section. The parameter estimates are , and . Our model seems to fit the data well, but there is a slight asymmetry in this data, which is not captured by the symmetric model in Eq (1).
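The maximum likelihood fit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mean curve `power_law_mean` uses an assumed form, γ / (α|t − t0| + 1)^β, chosen to agree with the parameter descriptions (peak level γ at t0, steepness α, exponent β) and not transcribed from Eq (1); the toy counts are hypothetical and are not the Tokyo Marathon data; and the coarse grid search stands in for whatever numerical optimizer the authors used.

```python
import math

def power_law_mean(t, t0, alpha, beta, gamma):
    """ASSUMED mean curve: gamma / (alpha * |t - t0| + 1) ** beta, equal to gamma at t0."""
    return gamma / (alpha * abs(t - t0) + 1.0) ** beta

def poisson_log_pmf(y, mu):
    """Log of the Poisson probability mu**y * exp(-mu) / y!."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def log_likelihood(counts, days, t0, alpha, beta, gamma):
    """Log-likelihood of independent Poisson counts with the assumed mean curve."""
    return sum(
        poisson_log_pmf(y, power_law_mean(t, t0, alpha, beta, gamma))
        for t, y in zip(days, counts)
    )

def grid_search_mle(counts, days, t0, alphas, betas, gammas):
    """Return (log-likelihood, alpha, beta, gamma) maximising the likelihood on a grid."""
    best = None
    for a in alphas:
        for b in betas:
            for g in gammas:
                ll = log_likelihood(counts, days, t0, a, b, g)
                if best is None or ll > best[0]:
                    best = (ll, a, b, g)
    return best

# Hypothetical toy data: one week before and after the event (t0 = 0).
t0 = 0
days = list(range(-7, 8))
counts = [round(power_law_mean(t, t0, 0.8, 1.5, 400.0)) for t in days]

# Coarse, hypothetical search grids for the three parameters.
alphas = [0.2 * k for k in range(1, 11)]
betas = [0.5 + 0.25 * k for k in range(9)]
gammas = [300.0 + 25.0 * k for k in range(9)]

ll, a_hat, b_hat, g_hat = grid_search_mle(counts, days, t0, alphas, betas, gammas)
print("alpha =", a_hat, "beta =", b_hat, "gamma =", g_hat, "log-lik =", ll)
```

In practice a gradient-based optimizer would replace the grid, but the grid keeps the sketch dependency-free and makes the role of the likelihood in Eq (3) explicit.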
Conditional Poisson regression model for autocorrelations.
In the power-law growth and decay model without autocorrelation we assumed that the numbers of postings yt are independent. We generalize this model to allow autocorrelations by using a conditional Poisson regression model. As we saw, the estimate for the parameter γ in Eq (1) is very close to yt0. Hence, in this section we replace γ by yt0 and consider the conditional likelihood given yt0. This is the initial value of our autoregressive scheme, and we model the number of postings after the event yt, t > t0. Conditional likelihoods are much simpler than unconditional likelihoods and are often used in statistical time series analyses [15].
For the data yt, t < t0, before the date of the event, we propose to use the model given in Eq (4) with the time axis reversed. This is similar to the standard AR(1) process xt = ρxt − 1 + ϵt, where the reciprocal of the autoregressive coefficient ρ is used. However, this model for the data preceding the event is somewhat unsatisfactory, in particular for the purpose of predicting yt0 before the date of the event. We discuss this point again in the Discussion section.
We let t0 = 0 for simplicity and replace γ by y0 in Eq (1) and take as the conditional expected value of yt given y0. Then E(yt|y0) is written recursively as We propose the following AR(2) model for yt, t ≥ 2: (4) Note that y1 is given in an AR(1) form. The parameter w linearly connects the AR(1) model and the complete AR(2) model, which is determined solely by the number of postings two days earlier. The weight of the AR(1) model in Eq (4) is therefore given by this parameter.
Then the conditional likelihood function for α, β, w, given y0, for the data y1, …, yT is (5) When we estimate w in L(α, β, w), we restrict w ∈ [0, 1], although the unrestricted MLE of w may exceed 1.
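As a rough illustration of how such a conditional likelihood can be evaluated, the sketch below uses an assumed form of the conditional mean; it is not a transcription of Eq (4). It assumes the decay curve f(t) = (αt + 1)^(−β) with t0 = 0, so that a one-day-ahead prediction rescales y(t−1) by f(t)/f(t−1) and a two-day-ahead prediction rescales y(t−2) by f(t)/f(t−2), with w weighting the two. The function names and the toy counts are hypothetical.

```python
import math

def decay_ratio(t, lag, alpha, beta):
    """f(t) / f(t - lag) for the ASSUMED curve f(t) = (alpha * t + 1) ** (-beta)."""
    return ((alpha * (t - lag) + 1.0) / (alpha * t + 1.0)) ** beta

def conditional_mean(y, t, alpha, beta, w):
    """ASSUMED conditional mean of y[t] given earlier counts, for t >= 1 (t0 = 0)."""
    one_day = y[t - 1] * decay_ratio(t, 1, alpha, beta)
    if t == 1:
        return one_day  # y[1] only has y[0] behind it, so it takes the AR(1) form
    two_day = y[t - 2] * decay_ratio(t, 2, alpha, beta)
    return w * one_day + (1.0 - w) * two_day

def conditional_log_likelihood(y, alpha, beta, w):
    """Poisson log-likelihood of y[1], ..., y[T] conditional on y[0].

    Requires strictly positive conditional means (no zero counts in the lags used).
    """
    total = 0.0
    for t in range(1, len(y)):
        lam = conditional_mean(y, t, alpha, beta, w)
        total += y[t] * math.log(lam) - lam - math.lgamma(y[t] + 1)
    return total

# Hypothetical daily counts starting on the event day (index 0).
y = [400, 170, 90, 66, 45, 38, 27, 25]
for w in (1.0, 0.5):
    print("w =", w, "log-lik =", conditional_log_likelihood(y, 0.8, 1.5, w))
```

Under this assumed form, data lying exactly on the power-law curve give the same conditional mean for every w, so w only matters through the deviations from the curve, which is the role the text assigns to the autoregressive part.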
Note that the model without autocorrelation in Eq (3) and the AR(2) model in Eq (5) are separate models. In the usual AR(1) model for continuous observations xt = ρxt − 1 + ϵt, the model without autocorrelation is a special case of ρ = 0. In order to interpolate between the model without autocorrelation Eq (3) and the AR(2) model (5), we also propose the following unified model with additional parameters u, v ∈ [0, 1] representing the weights of the two models: (6) This unified model reduces to the model without autocorrelation for u = v = 1, while it results in the AR(2) model for u = v = 0. We introduced the parameters u and v separately for increased flexibility.
Results
Analysis of Japanese social networking data
We apply our models to SNS data in Japan. The data are summarized in Table 1. In Table 1, “Date” is the date of the event in the format month/day in 2014. “ID” is our identifier for the events used in later tables. “Searchword” is the word we used in the BuzzFinder service to search for the postings related to the events. “Remarks” are the explanations of the events. The searchwords we used relate to national holidays, major sports events, and cultural events held in the first half of 2014, chosen in order to collect a sufficient amount of data for statistical analysis.
Parameter estimation for the power-law growth and decay model without autocorrelation.
For the model without autocorrelation, we chose the estimation period as one week before (tL = t0 − 7) and after (tU = t0 + 7) the event. In Tables 2 and 3 we show the parameter estimates of the model without autocorrelation fitted to the SNS data collected over 4, 7, 14, and 21 days before and after the events. As the estimation period becomes longer, the estimate of α tends to become larger, while the estimate of β tends to become smaller, converging to 1. The parameter estimates over a long estimation period seem to be affected by the small numbers of postings far from the date of an event. On the other hand, the parameter values estimated over a short estimation period seem to be strongly affected by the peak. To best reflect the behavior around the peak, we chose the estimation period as one week before and after the event.
In Table 4 we show the parameter estimates of the model without autocorrelation fitted to the SNS data collected over the estimation period of one week before and after the events. Because the distribution of postings about events often showed asymmetry, we estimated the before-event parameters αb, βb and the peak level γ for one week before the event, and then estimated the after-event parameters αa, βa separately with the same γ as for the before-event parameters. We also computed 95% confidence intervals based on the Fisher information matrix (S1 File) and the asymptotic normal approximation of the sampling distribution of the parameter estimates.
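For orientation, the standard textbook form of the Fisher information for independent Poisson counts and the resulting Wald-type interval are sketched below. The notation θ = (θ1, θ2, θ3) for the parameter vector and I(θ) for the information matrix is introduced here for illustration only; the expressions actually used for the model are those given in S1 File, and the summation range should match whichever estimation period (before or after the event) is being fitted.

```latex
% Fisher information for independent Poisson counts y_t with means \mu_t(\theta)
I_{jk}(\theta) \;=\; \sum_{t} \frac{1}{\mu_t(\theta)}\,
    \frac{\partial \mu_t(\theta)}{\partial \theta_j}\,
    \frac{\partial \mu_t(\theta)}{\partial \theta_k},
\qquad
% approximate 95% confidence interval for the j-th parameter
\hat{\theta}_j \;\pm\; 1.96\,\sqrt{\bigl[I(\hat{\theta})^{-1}\bigr]_{jj}} .
```

The first expression follows from differentiating the Poisson log-likelihood, whose score is the sum of (y_t/μ_t − 1) ∂μ_t/∂θ_j; the interval then uses the asymptotic normality of the maximum likelihood estimator.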
In Fig 3 we show the data for postings about Valentine’s Day, around February 14, 2014. The graph looks almost symmetric, but the estimated before-event and after-event parameters are different. In Fig 3 the slope just before the date of the event is steeper than after the event, and the number of postings decreases to zero faster after the event than before the event. The estimated parameters in Table 5 reflect this behavior.
The parameter estimates for the power-law growth and decay model without autocorrelation fitted to postings related to a number of different events are given in Table 4. The standard errors of the estimates are shown in parentheses. Because the standard errors are small relative to the estimates, the parameter estimates are generally reliable, except for the data labeled ID: Kasai and ID: W-cup. Some events show strongly asymmetric patterns, which are reflected in the large differences between the before-event and after-event parameter estimates.
The disagreement of the model with Kasai’s data lies in the nature of the data. Kasai’s data show steep growth for t < t0 (Fig 1(b)), which implies that the event belongs to the unpredictable type. For this reason, the growth part of the model does not fit the data, although the decay part fits the data well. The model disagrees with the W-cup data for a different reason. In the collected W-cup data, there are several peaks within a short interval. The poor fit of our model to the W-cup data is therefore due to the existence of another peak before the parameter β converges. This points to a limitation of our model: the power-law growth and decay model only fits single-peak data.
Based on the power-law growth and decay model without autocorrelation, we considered predicting the after-event parameters from the data before the event. However, this was difficult because of the asymmetry of many events. To investigate this, we performed a multiple regression analysis with the before-event parameters αb, βb, γ as explanatory variables and the after-event parameters αa, βa as response variables, but we did not find a significant relationship.
Parameter estimation for the AR(2) model.
For the AR(2) model, we chose the estimation period as two weeks after (tU = t0 + 14) the event. In Tables 6 and 7 we show the parameter estimates of the AR(2) model fitted to the SNS data collected over 4, 7, 14, and 21 days before and after the events. Here s denotes the weight written as w in Eq (4). As the estimation period becomes longer, the estimates of the parameters α, β, and s become stable. We chose the estimation period as two weeks after the event, which, judging from Tables 6 and 7, seems to be long enough for the parameter estimates to be stable.
In Table 8 we show the fit of the AR(2) model to our data. In Table 8 “log-lik.” stands for the log-likelihood of the estimated model. Fig 4 shows the fit of the AR(2) model for “Children’s Day (Japan) 2014” and for “Yuko Oshima (Japanese actress)” as representative examples. The parameter s is estimated as s = 1 for “Children’s Day” (Fig 4(a)), whereas it is estimated as s = 0.87 for “Yuko Oshima” (Fig 4(b)). The parameter s reflects the longevity of interest in the event. It tends to be close to 1 for events with faster decay rates, but tends to be less than 1 for events with long-lasting interest. This is reasonable, because 1 − s represents the effect of the number of postings two days before, and s = 1 means that the autocorrelation is fully explained by the number of postings one day before. We compared the AICs for the AR(2) model with s = 1 and s ≠ 1. For many data sets, the AIC was smaller when s was estimated to be less than 1.
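The AIC comparison described above amounts to computing AIC = 2k − 2 log L for each candidate, where k is the number of estimated parameters and log L is the maximized log-likelihood, and preferring the smaller value. A minimal sketch follows; the log-likelihood values are hypothetical placeholders and are not taken from Table 8. The parameter counts assume that the model with the weight fixed at 1 estimates α and β only, while the model with the weight estimated has one parameter more.

```python
def aic(log_lik, n_params):
    """Akaike's Information Criterion: 2k - 2 log L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_lik

# Hypothetical maximized log-likelihoods and parameter counts for one event.
candidates = {
    "AR(1), weight fixed at 1": (-52.3, 2),   # alpha, beta
    "AR(2), weight estimated": (-49.8, 3),    # alpha, beta, weight
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, "AIC =", round(score, 1))
print("preferred:", best)
```

With these placeholder numbers the extra parameter improves the log-likelihood by more than one unit, so the larger model is preferred; when the gain is smaller than one unit per added parameter, AIC favors the smaller model.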
Parameter estimation for the unified model.
In Table 9, we apply the unified model (6) to the data. In Table 10 we compare the unified model and some relevant AR(1) models (w = 1) based on the AIC. The leftmost column shows the AICs of the unified AR(2) model and the second leftmost column gives the AICs of the unified AR(1) model. The third and fourth columns show the AICs of the unified AR(1) model at the extreme values u = 1, 0. For some cases the unified AR(1) model provides the smallest AIC, even at the extreme values. This suggests that the most general model considered here (the unified AR(2) model) is over-parameterized for some events and that the maximum likelihood estimation is not very stable for these cases.
Discussion and Conclusion
In this paper, we proposed a power-law growth and decay model combined with a conditional Poisson AR model. The conditional Poisson AR model was introduced to model deviations from the power-law growth and decay model. The power-law growth and decay model contains five parameters, which determine how rapidly interest in an event grows or fades, how long people remain interested in an event, and how much attention is paid. The first two aspects contribute four parameters, since the parameters before and after an event can be modeled separately. We also compared the models systematically based on AIC in Tables 8, 9 and 10.
In spite of the good fits of our model, a number of issues remain. Since the lengths of the datasets considered are fairly short, the unified model in Eq (6) with five parameters is probably over-parameterized. This was reflected in the AIC values, which favored models with fewer parameters for some events.
Although we assumed a single peak at t = t0 in the data, some events, such as the Olympic games, may admit more peaks in postings due to their long duration. The number of postings during an event with a longer duration usually reveals a more complicated pattern. The patterns at the beginning and the end of the event seem to be similar to those for single-day events, whereas the pattern around the day of the event is noticeably different. It is not clear how to generalize our model for events with a longer duration.
Furthermore, in order to predict patterns in social networking data, we need to know in advance the peak level γ for the number of postings and the after-event parameters αa and βa from the shape of the before-event pattern. Predicting these parameters from the before-event data is difficult for our data, which suggests that the before- and after-event parameters are independent. For this reason, we estimated the before- and after-event parameters separately, although this is somewhat unsatisfactory for our purposes. Our conditional Poisson AR model is not suitable for predictive purposes. In addition, the prediction of unusual patterns in the post-event data is difficult for certain types of events. To improve the predictive capability of the model, we could include characteristics of the event in the model. For instance, for events set on fixed dates, such as national holidays, we can analyze the inter-annual stability of patterns.
Supporting Information
S1 File. Fisher information matrix for the model without autocorrelation.
We present the Fisher information matrix for the model without autocorrelation, which is needed for the construction of confidence intervals for the parameter estimates in Fig 3 and Table 4.
https://doi.org/10.1371/journal.pone.0160592.s001
(PDF)
Acknowledgments
The authors thank NTTCom Online Marketing Solutions Corporation for providing the data through the BuzzFinder service. They are also grateful to the editor and the reviewers for helpful comments to improve the manuscript. The second author is supported by Grant-in-Aid for Young Scientists (B) No. 15K20939. The third author is supported by Grant-in-Aid for Scientific Research No. 25220001.
Author Contributions
- Conceived and designed the experiments: TF AT.
- Performed the experiments: TF.
- Analyzed the data: CM AT.
- Contributed reagents/materials/analysis tools: TF.
- Wrote the paper: TF CM AT.
References
- 1. Crane R, Sornette D. Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences. 2008;105(41):15649–15653.
- 2. Takayasu M, Sato K, Sano Y, Yamada K, Miura W, Takayasu H. Rumor Diffusion and Convergence during the 3.11 Earthquake: A Twitter Case Study. PLoS One. 2015;10:e0121443. pmid:25831122
- 3. Sano Y, Yamada K, Watanabe H, Takayasu H, Takayasu M. Empirical analysis of collective human behavior for extraordinary events in the blogosphere. Physical Review E. 2013;87:012805.
- 4. Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time Series Analysis: Forecasting and Control. 5th ed. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ; 2015.
- 5. Earle P. Earthquake Twitter. Nature Geoscience. 2010;3:221–222.
- 6. Sakaki T, Okazaki M, Matsuo Y. Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World wide web. ACM New York; 2010. p. 851–860.
- 7. Crooks A, Croitoru A, Stefanidis A, Radzikowski J. #Earthquake: Twitter as a Distributed Sensor System. Transactions in GIS. 2013;17:124–147.
- 8. Alfi V, Gabrielli A, Pietronero L. How people react to a deadline: time distribution of conference registrations and fee payments. Central European Journal of Physics. 2009;7:483–489.
- 9. Fenner T, Levene M, Loizou G. A bi-logistic growth model for conference registration with an early bird deadline. Central European Journal of Physics. 2013;11:904–909.
- 10. Brandt PT, Williams JT. A linear Poisson autoregressive model: the Poisson AR(p) Model. Political Analysis. 2001;9:164–184.
- 11. Freeland RK, McCabe BPM. Analysis of low count time series data by Poisson autoregression. Journal of Time Series Analysis. 2004;25:701–722.
- 12. Fokianos K. In: Handbook of Statistics. vol. 30. North Holland; 2012. p. 315–347.
- 13. Fokianos K, Rahbek A, Tjøstheim D. Poisson autoregression. Journal of the American Statistical Association. 2009;104:1430–1439.
- 14. Fokianos K. Some recent progress in count time series. Statistics. 2011;45:49–58.
- 15. Tsay RS. Multivariate Time Series Analysis: with R and financial applications. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ; 2013.