A Hybrid Computer-Intensive Approach Integrating Machine Learning and Statistical Methods for Fake News Detection †
Figure 1. Binomial distribution.
Figure 2. Coherent (acceptable) and non-coherent (poor) predictions.
Figure 3. Officially recognized fake news.
Figure 4. Not all anomalies are fake news. Quantitative results should always be discussed by human analysts.
Abstract
1. Introduction
2. Materials and Methods: Google Trends as a Testing Ground for Fake News Detection Algorithms
- Synthesizing Anomalies: Google Trends can be employed to generate synthetic datasets that simulate the spread of fake news. By selecting keywords associated with misinformation and introducing anomalies in search interest, researchers can create controlled environments to test the robustness of detection algorithms.
- Temporal Dynamics Analysis: The temporal aspect of Google Trends data is crucial in evaluating fake news detection algorithms. Algorithms can be tested on their ability to identify irregular temporal patterns, sudden spikes, or abnormal fluctuations that may be indicative of the dissemination of false information.
- Real-Time Testing: Google Trends provides near-real-time data, enabling researchers to test algorithms in dynamic environments. Algorithms can be evaluated for their adaptability to changing search patterns, ensuring that they remain effective in identifying anomalies as new trends emerge.
- Baseline Comparison: By comparing an algorithm’s performance against a baseline derived from naturally occurring search trends, researchers can validate the algorithm’s ability to distinguish synthetic anomalies from authentic patterns. This step enhances the reliability of the testing process.
- External Factor Consideration: Google Trends data are influenced by external factors such as major events or changes in societal interests. Testing algorithms against real-world variations in search behavior ensures their resilience and applicability in diverse contexts.
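The first of these points can be made concrete with a short sketch. The snippet below generates a synthetic Google-Trends-like series (normalized to the 0–100 scale used by Google Trends) with yearly seasonality and noise, then injects a sudden spike mimicking a misinformation-driven surge in search interest; all function and parameter names are illustrative choices for this sketch, not part of the paper's method.

```python
import numpy as np

def synthesize_trends_series(n_weeks=260, anomaly_at=200, anomaly_size=60.0, seed=0):
    """Generate a synthetic Google-Trends-like weekly series (0-100 scale)
    with yearly seasonality plus noise, then inject a short spike that
    mimics the search-interest surge often associated with fake news."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_weeks)
    base = 40 + 10 * np.sin(2 * np.pi * t / 52)        # yearly seasonality
    noise = rng.normal(0, 3, n_weeks)                   # sampling noise
    series = base + noise
    series[anomaly_at:anomaly_at + 3] += anomaly_size   # injected anomaly
    return np.clip(series, 0, 100)                      # Trends-style bounds

series = synthesize_trends_series()
```

Because the anomaly location and size are known by construction, a detector's hits and misses on such a series can be scored exactly, which is what makes this a controlled testing ground.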
2.1. The Adopted Bootstrap Scheme
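As a stand-in illustration of a resampling scheme suited to dependent data (the scheme actually adopted in the paper is a maximum-entropy-type bootstrap; the moving-block variant below is only a generic sketch, with all names chosen for illustration), overlapping blocks of the series can be resampled with replacement so that short-range dependence is preserved within each block.

```python
import numpy as np

def moving_block_bootstrap(series, n_replicates=150, block_len=10, seed=0):
    """Generic moving-block bootstrap for a time series: draw overlapping
    blocks with replacement and concatenate them until the original length
    is reached. Illustrative stand-in, not the paper's exact scheme."""
    rng = np.random.default_rng(seed)
    x = np.asarray(series, dtype=float)
    n = len(x)
    # All overlapping blocks of length block_len
    blocks = np.array([x[i:i + block_len] for i in range(n - block_len + 1)])
    n_blocks = int(np.ceil(n / block_len))
    reps = []
    for _ in range(n_replicates):
        idx = rng.integers(0, len(blocks), size=n_blocks)
        reps.append(np.concatenate(blocks[idx])[:n])    # trim to length n
    return np.stack(reps)                               # (n_replicates, n)
```

Each row of the returned array is one bootstrap replication of the original series, to be forecast alongside the original data in the algorithm below.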
2.2. The Employed Forecasting Method: Extreme Learning Machine (ELM) Algorithm
The output of a single-hidden-layer ELM can be written as f(x) = Σ_{i=1}^{N} β_i g(w_i · x + b_i), where:
- N is the number of hidden neurons in the network;
- w_i is the weight vector connecting the input layer to the i-th hidden neuron;
- b_i is the bias associated with the i-th hidden neuron;
- g(·) is the activation function applied element-wise;
- β_i is the weight connecting the i-th hidden neuron to the output layer.
2.3. Initialization
2.4. Training Set Representation
Over the whole training set, the hidden-layer output can be expressed in matrix form as H = g(WᵀX + B), where:
- W is a matrix containing the weight vectors w_i as its columns;
- X is a matrix with the input vectors as its columns;
- B = b1ᵀ is a matrix containing the bias terms as its columns;
- 1 is a column vector of ones;
- g(·) is applied element-wise.
2.5. Output Weight Calculation
2.6. Prediction
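The steps of Sections 2.3–2.6 can be sketched in a minimal implementation: random initialization of input weights and biases, computation of the hidden-layer matrix H, and output weights obtained in closed form via the Moore-Penrose pseudoinverse. Class and parameter names are choices made for this sketch.

```python
import numpy as np

class ELM:
    """Minimal single-hidden-layer Extreme Learning Machine sketch:
    random input weights and biases, hidden-layer matrix H = g(XW + b),
    and output weights beta solved by least squares via the pseudoinverse."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        # X: (n_samples, n_features); y: (n_samples,)
        n_features = X.shape[1]
        # Initialization: input weights and biases drawn at random, never trained
        self.W = self.rng.uniform(-1, 1, (n_features, self.n_hidden))
        self.b = self.rng.uniform(-1, 1, self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # hidden-layer output matrix
        self.beta = np.linalg.pinv(H) @ y     # output weights (least squares)
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```

Because only the output weights are fitted, and in closed form, training is fast enough to be repeated over all 150 bootstrap replications, which is what makes ELM a natural fit for the computer-intensive scheme described next.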
3. The Algorithm
- Define T as the length of the whole time series; the training set comprises the first T_0 observations, whereas the test set spans from T_0 + 1 to T.
- The forecasting of the original data and the related B bootstrap replications, b = 1, …, B, is conducted using the Extreme Learning Machine (ELM) algorithm. To elaborate, the procedure involves computing the forecast errors and percentage errors generated by the algorithm at each time step. It iterates through the entire prediction horizon, ranging from 1 to the length H of the out-of-sample set (H steps ahead). For each step h within this horizon, rolling forecasts are computed for each model, specifically for the time step at horizon h.
- Perform confidence interval generation. Intra-interval counts are based on the standard deviation for the prediction models. These counts quantify the number of bootstrapped predictions that fall within a predetermined and arbitrary range as a function of the standard deviation computed in the test set. This process is carried out to establish a precise confidence interval for the predictions, enabling the algorithm to detect anomalies within the analyzed time series if a significant proportion of the predictions from the 150 bootstrap replications deviate from the actual value by more than two standard deviations.
- Perform anomaly detection based on the binomial distribution. It is well known that a binomial random variable, symbolically represented as X ~ Bin(n, p), characterizes the number of successful outcomes (x) within a series of n independent Bernoulli trials, where the probability of success p remains constant. The binomial probability function can be precisely expressed as follows: P(X = x) = C(n, x) p^x (1 − p)^(n−x), for x = 0, 1, …, n. The binomial test is run to evaluate the forecasting performance of the ELM algorithm. The overall goal is to determine whether the observed within-interval counts of the forecast errors differ significantly from the expected (in the forecasting sense) counts, thereby indicating anomalies within the time series.
- Compute the probability p. It represents the probability that the number of successes for the predictions contained in the previously defined confidence interval will fall below the 25th percentile, indicating a 0.05 level of significance. This probability is obtained by calculating the mean of the number of successes (X) within the confidence interval and then dividing it by the total number of bootstrap replicates (R), i.e., p = mean(X)/R.
- Compute critical values (X_crit) for the binomial distribution associated with the ELM outcomes. These critical values indicate the minimum number of successes within the specified interval (25th percentile). In essence, the binomial distribution function is used to derive these values; the results are stored in a vector that is scanned for the index corresponding to the integer number of successes closest to the pre-selected significance level. This number is then subtracted from the total number of bootstrap replicates to obtain the threshold defining the null hypothesis, that is, the minimum number of successes within the confidence interval required to fail to reject it.
- Conduct hypothesis testing. The hypothesis test is structured as follows:
    - H0: X ≥ X_crit (the forecasts are coherent with the observed series);
    - H1: X < X_crit (an anomaly is present),
where X is the number of “acceptable” forecasts, in the empirical standard deviation sense, as defined above. To perform hypothesis testing, the procedure initializes vectors designed to capture the results of the hypothesis test and then iterates through each horizon step, giving the within-interval counts for each model. In detail, at each time step:
    - the observed number of successes X is computed;
    - a hypothesis test is performed by comparing the observed count with the critical value X_crit;
    - if the observed count falls below the critical value, the null hypothesis is rejected.
The procedure yields the results (see Figure 1) of the hypothesis test for the ELM algorithm, indicating whether the null hypothesis is accepted or rejected at each time step and thus whether or not an anomaly is detected.
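The binomial decision rule above can be sketched as follows: estimate the success probability from the mean within-interval count, derive the lower-tail critical count, and flag every horizon step whose observed count falls below it. Function names and the exact way the critical value is computed are assumptions of this sketch, consistent with but not identical to the paper's procedure.

```python
import numpy as np
from math import comb

def binomial_lower_critical(n, p, alpha):
    """Smallest integer c such that P(X <= c) >= alpha for X ~ Bin(n, p):
    the lower-tail critical count at significance level alpha."""
    cdf = 0.0
    for c in range(n + 1):
        cdf += comb(n, c) * p**c * (1 - p)**(n - c)
        if cdf >= alpha:
            return c
    return n

def binomial_anomaly_flags(within_counts, n_replicates=150, alpha=0.05):
    """Estimate the success probability from the mean within-interval count,
    derive the binomial critical value, and flag as anomalous each horizon
    step whose observed count falls below it (H0 rejected)."""
    counts = np.asarray(within_counts)
    p_hat = counts.mean() / n_replicates          # estimated success probability
    crit = binomial_lower_critical(n_replicates, p_hat, alpha)
    return crit, counts < crit                    # True = anomaly detected
```

For instance, a run of steps in which roughly 140 of the 150 bootstrapped predictions stay within the two-standard-deviation band would not be flagged, while a step where the count drops to 100 would fall below the critical value and be marked as a potential anomaly.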
4. The Empirical Evaluation
5. Results
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Fenga, L. A Hybrid Computer-Intensive Approach Integrating Machine Learning and Statistical Methods for Fake News Detection. Eng. Proc. 2024, 68, 47. https://doi.org/10.3390/engproc2024068047