Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction

  • Conference paper
Information and Software Technologies (ICIST 2020)

Abstract

A spreadsheet is one of the most commonly used forms of representation for datasets of similar type. Spreadsheets provide considerable flexibility for data structure organisation. As a result of this flexibility, tables with very complex data structures can be created. In turn, such complexity makes automatic table processing and data extraction a challenging task. Therefore, a table preprocessing step is often required in the data extraction pipeline. This paper proposes a heuristic algorithm for the correction of a table header in a spreadsheet. The aim of the proposed algorithm is to transform the machine-readable structure of the table header into its visual representation. The algorithm achieves this aim by iterating through the table header cells and merging some of them according to the proposed heuristics. The transformed structure, in turn, improves the quality of spreadsheet understanding and data extraction further along the pipeline. The proposed algorithm was implemented in the TabbyXL toolset.
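The idea of iterating through header cells and merging some of them can be illustrated with a minimal sketch. The abstract does not state the heuristics themselves, so the rule used here (an empty cell with no visible border towards a neighbour is absorbed by that neighbour's region) is an assumption made purely for illustration, as are the names `Cell` and `merge_header_cells`. It is not the TabbyXL implementation.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str = ""
    left_border: bool = True  # is a visible line drawn on the cell's left edge?
    top_border: bool = True   # is a visible line drawn on the cell's top edge?


def merge_header_cells(grid):
    """Group header cells into regions; returns {owner position: [positions]}.

    Hypothetical heuristic, for illustration only: an empty cell with no
    left border joins the region of its left neighbour; otherwise an empty
    cell with no top border joins the region of the cell above it.
    """
    owner = {}
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            owner[(r, c)] = (r, c)  # every cell starts as its own region
            if cell.text.strip():
                continue  # a non-empty cell always starts a new region
            if c > 0 and not cell.left_border:
                owner[(r, c)] = owner[(r, c - 1)]
            elif r > 0 and not cell.top_border:
                owner[(r, c)] = owner[(r - 1, c)]
    regions = {}
    for position, region_owner in owner.items():
        regions.setdefault(region_owner, []).append(position)
    return regions


# A 2x3 header: "Sales" visually spans two columns, "Year" spans two rows.
header = [
    [Cell("Year"), Cell("Sales"), Cell("", left_border=False)],
    [Cell("", top_border=False), Cell("Q1"), Cell("Q2")],
]
regions = merge_header_cells(header)
```

In this example the empty cell to the right of "Sales" joins the "Sales" region and the empty cell under "Year" joins the "Year" region, so the machine-readable grid of six cells is grouped into the four header cells a reader would see. A real implementation would also need to check that merged regions stay rectangular, which this sketch does not do.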

This work was supported by the Russian Science Foundation, grant number 18-71-10001.


Notes

  1. https://github.com/tabbydoc/tabbyxl.

  2. http://tc11.cvc.uab.es/datasets/Troy_200_1.

  3. https://catalog.data.gov/dataset/statistical-abstract-of-the-united-states.


Corresponding author

Correspondence to Viacheslav Paramonov.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Paramonov, V., Shigarov, A., Vetrova, V. (2020). Table Header Correction Algorithm Based on Heuristics for Improving Spreadsheet Data Extraction. In: Lopata, A., Butkienė, R., Gudonienė, D., Sukackė, V. (eds) Information and Software Technologies. ICIST 2020. Communications in Computer and Information Science, vol 1283. Springer, Cham. https://doi.org/10.1007/978-3-030-59506-7_13

  • DOI: https://doi.org/10.1007/978-3-030-59506-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59505-0

  • Online ISBN: 978-3-030-59506-7

  • eBook Packages: Computer Science, Computer Science (R0)
