DOI: 10.1145/2660193.2660207

CheckCell: data debugging for spreadsheets

Published: 15 October 2014

Abstract

Testing and static analysis can help root out bugs in programs, but not in data. This paper introduces data debugging, an approach that combines program analysis and statistical analysis to automatically find potential data errors. Since it is impossible to know a priori whether data are erroneous, data debugging instead locates data that has a disproportionate impact on the computation. Such data is either very important, or wrong. Data debugging is especially useful in the context of data-intensive programming environments that intertwine data with programs in the form of queries or formulas.
We present the first data debugging tool, CheckCell, an add-in for Microsoft Excel. CheckCell identifies cells that have an unusually high impact on the spreadsheet's computations. We show that CheckCell is both analytically and empirically fast and effective. We show that it successfully finds injected typographical errors produced by a generative model trained with data entry from 169,112 Mechanical Turk tasks. CheckCell is more precise and efficient than standard outlier detection techniques. CheckCell also automatically identifies a key flaw in the infamous Reinhart and Rogoff spreadsheet.
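
The idea of ranking inputs by their impact on a computation can be illustrated with a small sketch. This is not CheckCell's algorithm, which the abstract does not spell out; it is a simplified stand-in that assumes a substitution-based impact score. The names `impact_scores` and `flag_high_impact`, the `ratio` threshold, and the sample values are all hypothetical. Each input is replaced in turn by every other input, the average change in the formula's output is recorded, and inputs whose score far exceeds the median score are flagged.

```python
import statistics


def impact_scores(values, formula):
    """Score each input by the mean absolute change in the output when that
    input is replaced by each of the other inputs in turn.

    Illustrative assumption only, not the method used by CheckCell.
    """
    baseline = formula(values)
    scores = []
    for i in range(len(values)):
        deltas = []
        for j, substitute in enumerate(values):
            if j == i:
                continue
            trial = list(values)
            trial[i] = substitute  # swap in a value drawn from the same range
            deltas.append(abs(formula(trial) - baseline))
        scores.append(statistics.mean(deltas))
    return scores


def flag_high_impact(values, formula, ratio=3.0):
    """Return indices whose score exceeds `ratio` times the median score.

    `ratio` is a hypothetical threshold chosen for this example.
    """
    scores = impact_scores(values, formula)
    typical = statistics.median(scores)
    return [i for i, s in enumerate(scores) if s > ratio * typical]


def average(xs):
    """Stand-in for a spreadsheet formula such as AVERAGE over a range."""
    return sum(xs) / len(xs)


# 140.0 is a plausible typo for 14.0; it dominates the average.
cells = [12.0, 15.0, 11.0, 140.0, 13.0, 14.0]
flagged = flag_high_impact(cells, average)
print(flagged)  # [3]
```

A flagged cell is only a candidate for review: as the abstract notes, a high-impact value is either very important or wrong, and the score alone cannot tell which.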



Published In

OOPSLA '14: Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications
October 2014
946 pages
ISBN: 9781450325851
DOI: 10.1145/2660193

ACM SIGPLAN Notices, Volume 49, Issue 10 (OOPSLA '14)
October 2014
907 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2714064
Editor: Andy Gill

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data-debugging
  2. debugging
  3. errors
  4. inputs
  5. spreadsheets

Qualifiers

  • Research-article

Conference

SPLASH '14
Acceptance Rates

OOPSLA '14 Paper Acceptance Rate 52 of 186 submissions, 28%;
Overall Acceptance Rate 268 of 1,244 submissions, 22%


Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 1
Reflects downloads up to 27 Nov 2024.

Cited By

  • (2024) Spreadsheet quality assurance: a literature review. Frontiers of Computer Science 18:2. DOI: 10.1007/s11704-023-2384-6. Online publication date: 22-Jan-2024
  • (2024) Quantitative Input Usage Static Analysis. NASA Formal Methods, pages 79-98. DOI: 10.1007/978-3-031-60698-4_5. Online publication date: 4-Jun-2024
  • (2020) Surfacing Visualization Mirages. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1-16. DOI: 10.1145/3313831.3376420. Online publication date: 21-Apr-2020
  • (2020) VisuaLint: Sketchy In Situ Annotations of Chart Construction Errors. Computer Graphics Forum 39:3, pages 219-228. DOI: 10.1111/cgf.13975. Online publication date: 18-Jul-2020
  • (2018) ExceLint: automatically finding spreadsheet formula errors. Proceedings of the ACM on Programming Languages 2:OOPSLA, pages 1-26. DOI: 10.1145/3276518. Online publication date: 24-Oct-2018
  • (2018) How are spreadsheet templates used in practice: a case study on Enron. Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 734-738. DOI: 10.1145/3236024.3264834. Online publication date: 26-Oct-2018
  • (2018) Detecting faulty empty cells in spreadsheets. 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 423-433. DOI: 10.1109/SANER.2018.8330229. Online publication date: Mar-2018
  • (2017) ZenSheet studio: a spreadsheet-inspired environment for reactive computing. Proceedings Companion of the 2017 ACM SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity, pages 33-35. DOI: 10.1145/3135932.3135949. Online publication date: 22-Oct-2017
  • (2017) CACheck. IEEE Transactions on Software Engineering 43:3, pages 226-251. DOI: 10.1109/TSE.2016.2584059. Online publication date: 1-Mar-2017
  • (2017) SpreadCluster. Proceedings of the 14th International Conference on Mining Software Repositories, pages 158-169. DOI: 10.1109/MSR.2017.28. Online publication date: 20-May-2017
