DOI:10.1145/3460319.3464812
research-article
Open access

Semantic table structure identification in spreadsheets

Published: 11 July 2021

Abstract

Spreadsheets are widely used in various business tasks and contain large amounts of valuable data. However, spreadsheet tables are usually organized in a semi-structured way and contain complicated semantic structures, e.g., header types and relations among headers. Without documented semantic table structures, existing data analysis and error detection tools can hardly understand spreadsheet tables. Identifying semantic table structures in spreadsheet tables is therefore of great importance and can greatly benefit various analysis tasks on spreadsheets.
In this paper, we propose Tasi (Table structure identification) to automatically identify semantic table structures in spreadsheets. Based on the contents, styles, and spatial locations of table headers, Tasi uses a multi-classifier to predict potential header types and relations, and then integrates all header types and relations into consistent semantic table structures. We further propose TasiError, which detects spreadsheet errors based on the semantic table structures identified by Tasi. Our experiments on real-world spreadsheets show that Tasi can precisely identify semantic table structures in spreadsheets, and that TasiError can detect real-world spreadsheet errors with higher precision (75.2%) and recall (82.9%) than existing approaches.
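
As a rough illustration of the pipeline described above, the sketch below extracts content, style, and spatial features from a header cell, then integrates candidate parent-child header relations into a consistent hierarchy by keeping only the highest-scoring parent for each header. Everything in it is an assumption made for illustration: the `HeaderCell` fields, the feature set, the score values, and the greedy integration rule are hypothetical and are not taken from Tasi, which relies on a trained multi-classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderCell:
    # Hypothetical representation of a header cell; not Tasi's data model.
    text: str
    row: int
    col: int
    bold: bool = False
    indent: int = 0


def extract_features(cell, table_top, table_left):
    """Return content, style, and spatial features for one header cell."""
    text = cell.text.strip()
    return {
        # Content features
        "is_numeric": text.replace(".", "", 1).isdigit(),
        "has_total_keyword": "total" in text.lower(),
        "length": len(text),
        # Style features
        "bold": cell.bold,
        "indent": cell.indent,
        # Spatial features, relative to the table's top-left corner
        "row_offset": cell.row - table_top,
        "col_offset": cell.col - table_left,
    }


def integrate_relations(scored_relations, threshold=0.5):
    """Keep at most one parent per child: the highest-scoring candidate.

    scored_relations maps (parent_text, child_text) to a score in [0, 1],
    standing in for a classifier's predicted probability.
    """
    best = {}
    for (parent, child), score in scored_relations.items():
        if score < threshold:
            continue  # discard low-confidence candidate relations
        if child not in best or score > best[child][1]:
            best[child] = (parent, score)
    return {child: parent for child, (parent, _) in best.items()}


# Made-up scores, for illustration only.
relations = {
    ("Revenue", "Q1"): 0.9,
    ("Total", "Q1"): 0.6,
    ("Revenue", "Q2"): 0.8,
    ("Total", "Q2"): 0.3,
}
hierarchy = integrate_relations(relations)
```

In a fuller version, the feature dictionaries would feed a trained classifier whose predicted probabilities replace the made-up scores, and the integration step would also need to reject cyclic or otherwise contradictory relations.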

Published In

ISSTA 2021: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis
July 2021
685 pages
ISBN:9781450384599
DOI:10.1145/3460319
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Spreadsheet
  2. error detection
  3. table structure

Qualifiers

  • Research-article

Conference

ISSTA '21

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Article Metrics

  • Downloads (Last 12 months): 141
  • Downloads (Last 6 weeks): 17
Reflects downloads up to 14 Nov 2024

Cited By

  • (2024) Table Illustrator: Puzzle-based interactive authoring of plain tables. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-18). DOI:10.1145/3613904.3642415. Online publication date: 11-May-2024
  • (2024) Spreadsheet quality assurance: a literature review. Frontiers of Computer Science: Selected Publications from Chinese Universities 18:2. DOI:10.1007/s11704-023-2384-6. Online publication date: 22-Jan-2024
  • (2023) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. Data Intelligence 5:3 (537-559). DOI:10.1162/dint_a_00201. Online publication date: 1-Aug-2023
  • (2022) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. 2022 IEEE International Conference on Knowledge Graph (ICKG) (329-336). DOI:10.1109/ICKG55886.2022.00049. Online publication date: Nov-2022
  • (2022) CFCT: The cell function classification method for complex tables. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys) (2206-2213). DOI:10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00326. Online publication date: Dec-2022
  • (2022) Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery 13:1. DOI:10.1002/widm.1482. Online publication date: 21-Nov-2022
