Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution

Xiao Chen¹⁵,
Kirity Rapuru¹⁵,
Gabriel Campero Durand¹⁵,
Eike Schallehn¹⁵ &
Gunter Saake¹⁵

Part of the book series: Communications in Computer and Information Science (CCIS, volume 903)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

During the last decade, several big data processing frameworks have emerged, enabling users to analyze large-scale data with ease. These frameworks make it easier to manage distributed programming, failure handling and data partitioning. Entity Resolution is a typical application that requires big data processing frameworks, since its time complexity increases quadratically with the size of the input data. In recent years, Apache Spark has become popular as a big data framework that provides a flexible programming model and supports in-memory computation. Spark offers three APIs: the RDD API, which gives users low-level access to the core data abstraction, and the high-level DataFrame and Dataset APIs, which are part of the Spark SQL library and are subject to query optimization. Because these APIs differ in their features, the choice of API can be expected to influence the performance of the resulting applications. However, few studies offer experimental measurements that characterize the effect of these differences. In this paper, we evaluate the performance impact of this choice for the specific application of parallel entity resolution under two different scenarios, with the goal of offering practical guidelines for developers.
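
The quadratic growth mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the implementation evaluated in the paper and does not use Spark: the function names `similarity` and `naive_resolve`, the sample records, the `difflib`-based string similarity and the 0.8 threshold are all assumptions chosen for illustration. A naive resolver compares every record with every other record, so n records require n(n-1)/2 comparisons.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_resolve(records, threshold=0.8):
    """Compare all record pairs; return matching index pairs and the pair count."""
    comparisons = 0
    matches = []
    # combinations(..., 2) enumerates each unordered pair exactly once
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        comparisons += 1
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches, comparisons


# Hypothetical sample records, not taken from the paper's datasets
records = ["John Smith", "Jon Smith", "Mary Jones", "Maria Jonas"]
matches, comparisons = naive_resolve(records)
print(comparisons)  # 4 records -> 4 * 3 / 2 = 6 comparisons
```

Doubling the input to 8 records raises the number of comparisons from 6 to 28, which is why the pairwise comparison step is the natural candidate for parallelization.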

Acknowledgment

This work was supported by the China Scholarship Council [No. 201408080093].

Author information

Authors and Affiliations

Otto-von-Guericke-University of Magdeburg, Universitaetsplatz 2, Magdeburg, Germany
Xiao Chen, Kirity Rapuru, Gabriel Campero Durand, Eike Schallehn & Gunter Saake

Corresponding author

Correspondence to Xiao Chen.

Editor information

Editors and Affiliations

University of Tunis, Tunis, Tunisia
Mourad Elloumi
MiCS, Media Computer Science, University of Passau, Passau, Bayern, Germany
Michael Granitzer
IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
University of Twente, Enschede, Overijssel, The Netherlands
Christin Seifert
Fak. Medien, Bauhaus Universität Weimar, Weimar, Thüringen, Germany
Benno Stein
Inst. für Softwaretechnik, Vienna University of Technology, Vienna, Austria
A Min Tjoa
FAW, Johannes Kepler University of Linz, Linz, Austria
Roland Wagner

About this paper

Cite this paper

Chen, X., Rapuru, K., Durand, G.C., Schallehn, E., Saake, G. (2018). Performance Comparison of Three Spark-Based Implementations of Parallel Entity Resolution. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_6

DOI: https://doi.org/10.1007/978-3-319-99133-7_6
Published: 07 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)
