Abstract
A spreadsheet is one of the most commonly used forms of representation for datasets of a similar type. Spreadsheets provide considerable flexibility in how data structures are organised. As a result of this flexibility, tables with very complex data structures can be created. In turn, such complexity makes automatic table processing and data extraction a challenging task. Therefore, a table preprocessing step is often required in the data extraction pipeline. This paper proposes a heuristic algorithm for the correction of a table header in a spreadsheet. The aim of the proposed algorithm is to bring the machine-readable structure of the table header into line with its visual representation. The algorithm achieves this aim by iterating through the table header cells and merging some of them according to the proposed heuristics. The transformed structure, in turn, improves the quality of spreadsheet understanding and data extraction further down the pipeline. The proposed algorithm was implemented in the TabbyXL toolset.
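The abstract does not state the heuristics themselves, so the following is only a minimal sketch of the general idea (iterate over header cells and merge some of them), under one assumed rule: an empty cell is merged into its left neighbour when no visible border separates the two. The `Cell` class, its fields, and `merge_header_row` are hypothetical names introduced for illustration; they are not taken from the paper or from TabbyXL.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical header cell; not the TabbyXL data model."""
    col: int
    text: str = ""
    right_border: bool = True  # True if a visible line separates this cell from the next one
    colspan: int = 1


def merge_header_row(cells):
    """Assumed heuristic: fold an empty cell into its left neighbour
    when the neighbour has no visible right border."""
    merged = []
    for cell in cells:
        prev = merged[-1] if merged else None
        if prev is not None and not cell.text.strip() and not prev.right_border:
            # Visually this is one wide cell, so extend the neighbour.
            prev.colspan += cell.colspan
            prev.right_border = cell.right_border
        else:
            # Copy the cell so the input row is left unchanged.
            merged.append(Cell(cell.col, cell.text, cell.right_border, cell.colspan))
    return merged


row = [
    Cell(0, "Year"),
    Cell(1, "Revenue", right_border=False),
    Cell(2, "", right_border=False),
    Cell(3, ""),
    Cell(4, "Notes"),
]
result = merge_header_row(row)
print([(c.text, c.colspan) for c in result])
# [('Year', 1), ('Revenue', 3), ('Notes', 1)]
```

In this example the three physical cells under "Revenue" look like one wide cell to a reader, and the sketch records that as a single cell with a column span of 3. A real implementation would also need to handle vertical merges and other visual cues.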
This work was supported by the Russian Science Foundation, grant number 18-71-10001.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Paramonov, V., Shigarov, A., Vetrova, V. (2020). Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction. In: Lopata, A., Butkienė, R., Gudonienė, D., Sukackė, V. (eds) Information and Software Technologies. ICIST 2020. Communications in Computer and Information Science, vol 1283. Springer, Cham. https://doi.org/10.1007/978-3-030-59506-7_13
DOI: https://doi.org/10.1007/978-3-030-59506-7_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59505-0
Online ISBN: 978-3-030-59506-7
eBook Packages: Computer Science, Computer Science (R0)