research-article

Profiler: integrated statistical analysis and visualization for data quality assessment

Authors:

Andreas Paepcke,

Joseph M. Hellerstein,

Jeffrey Heer

AVI '12: Proceedings of the International Working Conference on Advanced Visual Interfaces

Pages 547 - 554

https://doi.org/10.1145/2254556.2254659

Published: 21 May 2012

Abstract

Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture (including modular components for custom data types, anomaly detection routines and summary visualizations) and describe its application to motion picture, natural disaster and water quality data sets.
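
The abstract does not spell out Profiler's detection routines or its summary data structures, so the following is only an illustrative sketch of the two general ideas it mentions: automatically flagging missing and extreme values, and summarizing a column into a fixed number of bins so that a view stays small however many rows are loaded. The function names `flag_anomalies` and `binned_counts`, the median-absolute-deviation score, and the 3.5 threshold are assumptions made for this example, not taken from the paper.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return (missing_indices, extreme_indices) for one numeric column.

    Hypothetical routine: missing values are represented as None, and
    extreme values are those whose robust z-score, based on the median
    absolute deviation (MAD), exceeds `threshold`.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    if not present:
        return missing, []

    med = median(present)
    mad = median(abs(v - med) for v in present)
    extreme = []
    if mad > 0:  # with zero spread the score is undefined, so flag nothing
        for i, v in enumerate(values):
            if v is not None and abs(0.6745 * (v - med) / mad) > threshold:
                extreme.append(i)
    return missing, extreme


def binned_counts(values, lo, hi, bins):
    """Count values per fixed-width bin over [lo, hi].

    The result always has `bins` entries, so a histogram drawn from it
    costs the same to render for a thousand rows or a million.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if v is None or v < lo or v > hi:
            continue  # missing and out-of-range values are not binned
        idx = min(int((v - lo) / width), bins - 1)  # hi falls in last bin
        counts[idx] += 1
    return counts


column = [10, 12, 11, None, 13, 12, 500, 11]
missing, extreme = flag_anomalies(column)
histogram = binned_counts(column, lo=10, hi=14, bins=4)
```

In this toy column the routine flags the `None` entry as missing and the value 500 as extreme. Whether 500 is actually an error is the context-dependent judgment that the abstract leaves to the analyst.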

Index Terms

Profiler: integrated statistical analysis and visualization for data quality assessment
1. Human-centered computing
  1. Human computer interaction (HCI)

Published In

AVI '12: Proceedings of the International Working Conference on Advanced Visual Interfaces

May 2012

846 pages

ISBN:9781450312875

DOI:10.1145/2254556

Editors:

Genny Tortora (Università di Salerno, Italy),

Stefano Levialdi (Sapienza Università di Roma, Italy),

Maurizio Tucci (Università di Salerno, Italy)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Sponsors

Consulta Umbria SRL
University of Salerno

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Division of Computing and Communication Foundations

Conference

AVI'12: International Working Conference on Advanced Visual Interfaces

May 21 - 25, 2012

Capri Island, Italy

Acceptance Rates

Overall Acceptance Rate 128 of 490 submissions, 26%
