Abstract
Social data analytics has become a vital asset for organizations and governments. For example, over the last few years, governments have started to extract knowledge and derive insights from vastly growing open data to personalize election advertising, improve government services, predict intelligence activities, and improve national security and public health. A key challenge in analyzing social data is to transform the raw data generated by social actors into curated data, i.e., contextualized data and knowledge that is maintained and made available for use by end-users and applications. To address this challenge, we present the notion of a knowledge lake, i.e., a contextualized Data Lake, which provides the foundation for big data analytics by automatically curating raw social data and preparing it for deriving insights. We present a social data curation foundry, namely DataSynapse, to enable analysts to engage with social data to uncover hidden patterns and generate insight. In DataSynapse, we present a scalable algorithm to transform social items (e.g., a Tweet in Twitter) into semantic items, i.e., contextualized and curated items. This algorithm offers customizable feature extraction to harness desired features from diverse data sources. To link contextualized information items to the domain knowledge, we present a scalable technique that leverages cross-document coreference resolution, assisting analysts in deriving targeted insights. DataSynapse is offered as an extensible and scalable microservice-based architecture that is publicly available on GitHub and supports networks such as Twitter, Facebook, GooglePlus and LinkedIn.
We adopt a typical scenario for analyzing urban social issues from Twitter as it relates to the government budget, to highlight how DataSynapse significantly improves the quality of extracted knowledge compared to the classical curation pipeline (in the absence of feature extraction, enrichment and domain-linking contextualization).
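To make the idea of customizable feature extraction concrete, the following is a minimal sketch of turning a raw social item into a feature-annotated item. It is not the DataSynapse implementation: the `Extractor` type, the extractor functions, the `DEFAULT_EXTRACTORS` registry and the `curate` helper are all hypothetical names chosen for illustration, and the regular expressions are deliberately naive.

```python
import re
from typing import Callable, Dict, List

# Hypothetical extractor type: takes raw text, returns a list of feature values.
Extractor = Callable[[str], List[str]]


def extract_hashtags(text: str) -> List[str]:
    """Return hashtags without the leading '#'."""
    return re.findall(r"#(\w+)", text)


def extract_mentions(text: str) -> List[str]:
    """Return mentioned account names without the leading '@'."""
    return re.findall(r"@(\w+)", text)


def extract_keywords(text: str) -> List[str]:
    """Naive keyword extraction: alphabetic tokens of four or more letters
    that are not part of a hashtag or mention, lowercased."""
    tokens = re.findall(r"(?<![#@\w])[A-Za-z]{4,}", text)
    return [t.lower() for t in tokens]


# Assumed registry: an analyst customizes extraction by adding or removing entries.
DEFAULT_EXTRACTORS: Dict[str, Extractor] = {
    "hashtags": extract_hashtags,
    "mentions": extract_mentions,
    "keywords": extract_keywords,
}


def curate(text: str, extractors: Dict[str, Extractor]) -> Dict[str, object]:
    """Attach the output of every registered extractor to the raw item."""
    return {
        "text": text,
        "features": {name: fn(text) for name, fn in extractors.items()},
    }


item = curate("Hospital waiting times are too long #health @GovAU", DEFAULT_EXTRACTORS)
print(item["features"])
```

Keeping the extractors in a plain dictionary is one simple way to let an analyst choose which features to harvest per data source; enrichment and linking to domain knowledge would be further steps applied to the extracted features.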
Notes
A social actor is a conscious, thinking individual who has an account on a social network such as Facebook (facebook.com) and has the capacity to shape their world in a variety of ways by reflecting on their situation and the choices available to them on social networks.
The notion of a Data Lake has been coined to convey the concept of a centralized repository containing limitless amounts of raw (or minimally curated) data stored in various data islands. The rationale behind a Data Lake is to store raw data and let the data analyst decide how to cook/curate them later.
A Hypernym is a word with a broad meaning constituting a category into which words with more specific meanings fall; a superordinate. For example, colour is a hypernym of red.
A Hyponym is a word of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery.
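The two definitions above describe opposite directions of the same relation. A minimal sketch, assuming a tiny hand-built mapping (the `HYPERNYM_OF` table and both helper functions are hypothetical, and the "fork" entry is an illustrative addition to the two examples given in the notes):

```python
from typing import Dict, List, Optional

# Maps a specific word (hyponym) to its broader category (hypernym).
HYPERNYM_OF: Dict[str, str] = {
    "red": "colour",      # example from the notes
    "spoon": "cutlery",   # example from the notes
    "fork": "cutlery",    # illustrative addition, not from the source
}


def hypernym(word: str) -> Optional[str]:
    """Return the broader category of a word, or None if it is unknown."""
    return HYPERNYM_OF.get(word)


def hyponyms(category: str) -> List[str]:
    """Return all known words that fall under the given category, sorted."""
    return sorted(w for w, h in HYPERNYM_OF.items() if h == category)


print(hypernym("red"))
print(hyponyms("cutlery"))
```

In a curation setting, such lookups would typically be served by a lexical knowledge source rather than a hard-coded table.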
MapReduce (hadoop.apache.org/) is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
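The programming model can be illustrated with a single-process word count. This is only a sketch of the map, shuffle and reduce phases in plain Python; it does not use Hadoop or any distributed runtime, and all function names are chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run the three phases sequentially over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


print(word_count(["big data", "Big insight"]))
```

On a cluster, the map calls would run in parallel over partitions of the input and the reduce calls in parallel over disjoint sets of keys; the sequential version here keeps only the data flow.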
Acknowledgements
We acknowledge the Data to Decisions CRC (D2D CRC) and the Cooperative Research Centres Program for funding this research.
Cite this article
Beheshti, A., Benatallah, B., Tabebordbar, A. et al. DataSynapse: A Social Data Curation Foundry. Distrib Parallel Databases 37, 351–384 (2019). https://doi.org/10.1007/s10619-018-7245-1
DOI: https://doi.org/10.1007/s10619-018-7245-1