research-article

Open access

Semantic table structure identification in spreadsheets

Authors:

Dan Ye

ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 283 - 295

https://doi.org/10.1145/3460319.3464812

Published: 11 July 2021

Abstract

Spreadsheets are widely used in various business tasks and contain large amounts of valuable data. However, spreadsheet tables are usually organized in a semi-structured way and contain complicated semantic structures, e.g., header types and relations among headers. Lacking documented semantic table structures, existing data analysis and error detection tools can hardly understand spreadsheet tables. Identifying semantic table structures in spreadsheet tables is therefore of great importance and can greatly benefit various analysis tasks on spreadsheets.

In this paper, we propose Tasi (Table structure identification) to automatically identify semantic table structures in spreadsheets. Based on the contents, styles, and spatial locations in table headers, Tasi adopts a multi-classifier to predict potential header types and relations, and then integrates all header types and relations into consistent semantic table structures. We further propose TasiError to detect spreadsheet errors based on the semantic table structures identified by Tasi. Our experiments on real-world spreadsheets show that Tasi can precisely identify semantic table structures in spreadsheets, and that TasiError can detect real-world spreadsheet errors with higher precision (75.2%) and recall (82.9%) than existing approaches.
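For readers less familiar with the two metrics quoted in the abstract, the following is a minimal sketch of how precision and recall are conventionally computed for an error detector, together with the F1 score they imply. The function names and the true-positive, false-positive, and false-negative counts are hypothetical and chosen only for illustration; they are not taken from the paper. The F1 value is derived here from the reported 75.2% and 82.9% and is not a figure stated in the abstract.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported errors that are real errors."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real errors that the detector reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only (not from the paper).
tp, fp, fn = 30, 10, 6
print(precision(tp, fp))         # 0.75
print(round(recall(tp, fn), 4))  # 0.8333

# F1 implied by the precision and recall reported in the abstract.
print(round(f1(0.752, 0.829), 3))  # 0.789
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a detector cannot score well by trading away one metric for the other.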

Cited By

Huang Y, Yang Y, Shu X, Chen R, Weng D, Wu Y (2024). Table Illustrator: Puzzle-based interactive authoring of plain tables. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-18. Online publication date: 11-May-2024.
https://dl.acm.org/doi/10.1145/3613904.3642415
Poon P, Lau M, Yu Y, Tang S (2024). Spreadsheet quality assurance: a literature review. Frontiers of Computer Science: Selected Publications from Chinese Universities, 18:2. Online publication date: 22-Jan-2024.
https://dl.acm.org/doi/10.1007/s11704-023-2384-6
Wu X, Chen H, Bu C, Ji S, Zhang Z, Sheng V (2023). HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. Data Intelligence, 5:3, 537-559. Online publication date: 1-Aug-2023.
https://doi.org/10.1162/dint_a_00201

Index Terms

Semantic table structure identification in spreadsheets
1. Applied computing
  1. Computers in other domains
    1. Personal computers and PC applications
      1. Spreadsheets
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Information & Contributors

Information

Published In

ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 2021

685 pages

ISBN:9781450384599

DOI:10.1145/3460319

General Chair: Cristian Cadar, Imperial College London, UK
Program Chair: Xiangyu Zhang, Purdue University, USA

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Frontier Science Project of Chinese Academy of Sciences

Conference

ISSTA '21

Sponsor:

SIGSOFT

ISSTA '21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis

July 11 - 17, 2021

Virtual, Denmark

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Citations

Cited By

Wu X, Chen H, Bu C, Ji S, Zhang Z, Sheng V (2022). HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. 2022 IEEE International Conference on Knowledge Graph (ICKG), 329-336. Online publication date: Nov-2022.
https://doi.org/10.1109/ICKG55886.2022.00049
Tong S, Shen D, Kou Y, Nie T (2022). CFCT: The cell function classification method for complex tables. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 2206-2213. Online publication date: Dec-2022.
https://doi.org/10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00326
Shigarov A (2022). Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13:1. Online publication date: 21-Nov-2022.
https://doi.org/10.1002/widm.1482
