DOI:10.1145/3238147.3238222
research-article

Expandable group identification in spreadsheets

Published: 03 September 2018

Abstract

Spreadsheets are widely used in various business tasks. Spreadsheet users may organize similar data and computations by repeating a block of cells (a unit) in their spreadsheets. We call a unit, together with all of its repetitions, an expandable group. All units in an expandable group share the same or similar formats and semantics. Because spreadsheets serve as a data storage and management tool, expandable groups represent the fundamental structure in spreadsheets. However, existing spreadsheet systems do not recognize expandable groups. Therefore, other spreadsheet analysis tools, e.g., for data integration and fault detection, cannot use this structure to perform precise analysis. In this paper, we propose ExpCheck to automatically extract expandable groups from spreadsheets. We observe that consecutive units that share similar formats and semantics are likely to form an expandable group. Inspired by this, we inspect the format of each cell and its corresponding semantics, and classify the cells into expandable groups according to their similarity. We evaluate ExpCheck on 120 spreadsheets randomly sampled from the EUSES and VEnron corpora. The experimental results show that ExpCheck is effective: it detects expandable groups with an F1-measure of 73.1%, significantly outperforming the state-of-the-art techniques (F1-measure of 13.3%).
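
The observation that consecutive, similarly formatted units tend to form an expandable group can be illustrated with a minimal sketch. This is not ExpCheck's algorithm, which the abstract does not describe in detail. It assumes that each unit is a single row, that each cell has already been reduced to a hypothetical format label (`"text"`, `"number"`, `"formula"`, `"empty"`), and that similarity is the fraction of matching labels. The names `similarity` and `find_groups`, and the `threshold` and `min_units` parameters, are illustrative only, and semantic similarity is ignored.

```python
from typing import List, Tuple

# A row is a list of hypothetical per-cell format labels (an assumption of this sketch).
Row = List[str]


def similarity(a: Row, b: Row) -> float:
    """Fraction of column positions whose format labels match."""
    width = max(len(a), len(b))
    if width == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / width


def find_groups(rows: List[Row], threshold: float = 0.8,
                min_units: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive similar rows."""
    groups = []
    start = 0
    for i in range(1, len(rows) + 1):
        # Close the current run at the end of the sheet or when similarity drops.
        if i == len(rows) or similarity(rows[i - 1], rows[i]) < threshold:
            if i - start >= min_units:
                groups.append((start, i - 1))
            start = i
    return groups


sheet = [
    ["text", "text", "text"],          # header row
    ["text", "number", "formula"],     # repeated unit
    ["text", "number", "formula"],     # repeated unit
    ["text", "number", "formula"],     # repeated unit
    ["text", "empty", "formula"],      # total row with a different layout
]
print(find_groups(sheet))  # → [(1, 3)]
```

Here rows 1 to 3 are reported as one candidate group, while the header and the total row are left out because they differ from their neighbours in more than the allowed fraction of cells.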

Information & Contributors

Information

Published In

ASE '18: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering
September 2018
955 pages
ISBN:9781450359375
DOI:10.1145/3238147
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Spreadsheet
  2. expandable group

Qualifiers

  • Research-article

Conference

ASE '18
Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 14 Nov 2024

Citations

Cited By

  • (2024) Tracing and Fixing Inconsistencies in Clone-and-Own Tabular Data Models. Proceedings of the 28th ACM International Systems and Software Product Line Conference, 191-202. DOI:10.1145/3646548.3672595. Online publication date: 2-Sep-2024
  • (2024) Large Language Models for Tabular Data: Progresses and Future Directions. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2997-3000. DOI:10.1145/3626772.3661384. Online publication date: 10-Jul-2024
  • (2024) Table Illustrator: Puzzle-based interactive authoring of plain tables. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-18. DOI:10.1145/3613904.3642415. Online publication date: 11-May-2024
  • (2024) CoInsight: Visual Storytelling for Hierarchical Tables With Connected Insights. IEEE Transactions on Visualization and Computer Graphics 30:6, 3049-3061. DOI:10.1109/TVCG.2024.3388553. Online publication date: Jun-2024
  • (2024) Spreadsheet quality assurance: a literature review. Frontiers of Computer Science 18:2. DOI:10.1007/s11704-023-2384-6. Online publication date: 22-Jan-2024
  • (2023) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. Data Intelligence 5:3, 537-559. DOI:10.1162/dint_a_00201. Online publication date: 1-Aug-2023
  • (2023) An Action-based Model to Handle Cloning and Adaptation in Tabular Data Applications. Proceedings of the 27th ACM International Systems and Software Product Line Conference - Volume A, 201-212. DOI:10.1145/3579027.3608991. Online publication date: 28-Aug-2023
  • (2022) HiTailor: Interactive Transformation and Visualization for Hierarchical Tabular Data. IEEE Transactions on Visualization and Computer Graphics, 1-10. DOI:10.1109/TVCG.2022.3209354. Online publication date: 2022
  • (2022) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. 2022 IEEE International Conference on Knowledge Graph (ICKG), 329-336. DOI:10.1109/ICKG55886.2022.00049. Online publication date: Nov-2022
  • (2021) Semantic table structure identification in spreadsheets. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, 283-295. DOI:10.1145/3460319.3464812. Online publication date: 11-Jul-2021
