Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1283))

Included in the following conference series:

International Conference on Information and Software Technologies

849 Accesses

Abstract

A spreadsheet is one of the most commonly used forms of representation for datasets of similar type. Spreadsheets provide considerable flexibility for data structure organisation. As a result of this flexibility, tables with very complex data structures could be created. In turn, such complexity makes automatic table processing and data extraction a challenging task. Therefore, table preproccessing step is often required in the data extraction pipeline. This paper proposes a heuristic algorithm for the correction of a table header in a spreadsheet. The aim of the proposed algorithm is to transform a machine-readable structure of the table header into its visual representation. The algorithm achieves this aim by iterating through table header cells and merging some of them according to proposed heuristics. The transformed structure, in turn, allows to improve quality of spreadsheet understanding and data extraction further in the pipeline. The proposed algorithm was implemented in the TabbyXL toolset.

This work was supported by the Russian Science Foundation, grant number 18-71-10001.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Heuristic Algorithm for Recovering a Physical Structure of Spreadsheet Header

Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study

TabbyXL: Rule-Based Spreadsheet Data Extraction and Transformation

Notes

References

Calimeri, F., Hamlen, K., Leone, N. (eds.): Practical Aspects of Declarative Languages. 20th International Symposium, PADL 2018, Los Angeles, CA, USA, January 8–9, 2018, Proceedings, 1st edn. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73305-0
Book Google Scholar
Chen, Z., Dadiomov, S., Wesley, R., Xiao, G., Cory, D., Cafarella, M., Mackinlay, J.: Spreadsheet property detection with rule-assisted active learning. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM 2017. ACM Press (2017). https://doi.org/10.1145/3132847.3132882
Dong, H., Liu, S., Fu, Z., Han, S., Zhang, D.: Semantic structure extraction for spreadsheet tables with a multi-task learning architecture. In: Workshop on Document Intelligence (DI 2019) at NeurIPS 2019, December 2019. https://www.microsoft.com/en-us/research/publication/semantic-structure-extraction-for-spreadsheet-tables-with-a-multi-task-learning-architecture/
Doush, I.A., Pontelli, E.: Detecting and recognizing tables in spreadsheets. In: Doermann, D.S., Govindaraju, V., Lopresti, D.P., Natarajan, P. (eds.) The Ninth IAPR International Workshop on Document Analysis Systems, DAS 2010, Boston, Massachusetts, USA, 9–11 June 2010. ACM International Conference Proceeding Series, pp. 471–478. ACM (2010). https://doi.org/10.1145/1815330.1815391
Fang, J., Mitra, P., Tang, Z., Giles, C.L.: Table header detection and classification. In: AAAI (2012)
Google Scholar
Gonsior, J., Rehak, J., Thiele, M., Koci, E., Günther, M., Lehner, W.: Active learning for spreadsheet cell classification. In: Workshop Proceedings of the EDBT/ICTDT 2020 Joint Conference, March 2020. https://sea-data.ml/
Guerrero, H.: Excel Data Analysis. Modeling and Simulation, 2nd edn. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01279-3
Book Google Scholar
Koci, E., et al.: XLIndy. In: Proceedings of the ACM Symposium on Document Engineering 2019 - DocEng 2019. ACM Press (2019). https://doi.org/10.1145/3342558.3345409
Koci, E., Thiele, M., Romero, O., Lehner, W.: Table identification and reconstruction in spreadsheets. In: Dubois, E., Pohl, K. (eds.) CAiSE 2017. LNCS, vol. 10253, pp. 527–541. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59536-8_33
Chapter Google Scholar
Paramonov, V., Shigarov, A., Vetrova, V., Mikhailov, A.: Heuristic algorithm for recovering a physical structure of spreadsheet header. In: Borzemski, L., Świątek, J., Wilimowska, Z. (eds.) ISAT 2019. AISC, vol. 1050, pp. 140–149. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-30440-9_14
Chapter Google Scholar
Raković, L., Sakal, M., Vuković, V.: Improvement of spreadsheet quality through reduction of end-user overconfidence: case study. Periodica Polytech. Soc. Manag. Sci. 27(2), 119–130 (2019). https://doi.org/10.3311/ppso.12392
Article Google Scholar
Ronen, B., Palley, M.A., Lucas, H.C.: Spreadsheet analysis and design. Commun. ACM 32(1), 84–93 (1989). https://doi.org/10.1145/63238.63244
Article Google Scholar
Shigarov, A., Khristyuk, V., Mikhailov, A.: TabbyXL: software platform for rule-based spreadsheet data extraction and transformation. SoftwareX 10, 100270 (2019). https://doi.org/10.1016/j.softx.2019.100270
Article Google Scholar
Shigarov, A.O., Mikhailov, A.A.: Rule-based spreadsheet data transformation from arbitrary to relational tables. Inf. Syst. 71, 123–136 (2017). https://doi.org/10.1016/j.is.2017.08.004
Article Google Scholar
Shigarov, A.O., Paramonov, V.V., Belykh, P.V., Bondarev, A.I.: Rule-based canonicalization of arbitrary tables in spreadsheets. In: Dregvaite, G., Damasevicius, R. (eds.) ICIST 2016. CCIS, vol. 639, pp. 78–91. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46254-7_7
Chapter Google Scholar
Song, J., Koutra, D., Mani, M., Jagadish, H.V.: GeoFlux. In: Proceedings of the 2018 International Conference on Management of Data - SIGMOD 2018. ACM Press (2018). https://doi.org/10.1145/3183713.3193546

Download references

Author information

Authors and Affiliations

Matrosov Institute for System Dynamics and Control Theory of Siberian Branch, Russian Academy of Sciences, Irkutsk, Russia
Viacheslav Paramonov & Alexey Shigarov
Institute of Mathematics and Information Technologies, Irkutsk State University, Irkutsk, Russia
Viacheslav Paramonov & Alexey Shigarov
School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
Varvara Vetrova

Authors

Viacheslav Paramonov
View author publications
You can also search for this author in PubMed Google Scholar
Alexey Shigarov
View author publications
You can also search for this author in PubMed Google Scholar
Varvara Vetrova
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Viacheslav Paramonov .

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Audrius Lopata
Kaunas University of Technology, Kaunas, Lithuania
Rita Butkienė
Kaunas University of Technology, Kaunas, Lithuania
Daina Gudonienė
Kaunas University of Technology, Kaunas, Lithuania
Vilma Sukackė

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Paramonov, V., Shigarov, A., Vetrova, V. (2020). Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction. In: Lopata, A., Butkienė, R., Gudonienė, D., Sukackė, V. (eds) Information and Software Technologies. ICIST 2020. Communications in Computer and Information Science, vol 1283. Springer, Cham. https://doi.org/10.1007/978-3-030-59506-7_13

Download citation

DOI: https://doi.org/10.1007/978-3-030-59506-7_13
Published: 08 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59505-0
Online ISBN: 978-3-030-59506-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics