DOI: 10.1145/2254556.2254659
research-article

Profiler: integrated statistical analysis and visualization for data quality assessment

Published: 21 May 2012

Abstract

Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture (including modular components for custom data types, anomaly detection routines and summary visualizations) and describe its application to motion picture, natural disaster and water quality data sets.
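As a rough illustration of the kind of automated flagging the abstract describes, the sketch below marks missing, extreme and duplicate values in a small table. It is not Profiler's implementation: the function name `flag_quality_issues`, the sample rows, and the choice of a median-based modified z-score with a cutoff of 3.5 are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def flag_quality_issues(rows, column, cutoff=3.5):
    """Flag missing, extreme, and duplicate values in a list of dict rows.

    Hypothetical helper for illustration only; not Profiler's API.
    """
    # Missing: the column is absent, None, or an empty string.
    missing = [i for i, r in enumerate(rows) if r.get(column) in (None, "")]
    missing_set = set(missing)

    # Extreme: modified z-score built from the median and the median
    # absolute deviation (MAD), so the outliers themselves do not
    # distort the scale they are measured against.
    present = [(i, float(r[column])) for i, r in enumerate(rows)
               if i not in missing_set]
    extreme = []
    if present:
        med = median(v for _, v in present)
        mad = median(abs(v - med) for _, v in present)
        if mad > 0:
            extreme = [i for i, v in present
                       if 0.6745 * abs(v - med) / mad > cutoff]

    # Duplicate: rows whose full contents repeat an earlier row.
    seen = defaultdict(list)
    for i, r in enumerate(rows):
        seen[tuple(sorted(r.items(), key=lambda kv: kv[0]))].append(i)
    duplicate = sorted(i for idxs in seen.values() for i in idxs[1:])

    return {"missing": missing, "extreme": extreme, "duplicate": duplicate}


# Made-up sample data, loosely in the spirit of a motion picture table.
rows = [
    {"title": "A", "gross": 10.0},
    {"title": "B", "gross": 12.0},
    {"title": "C", "gross": 11.0},
    {"title": "D", "gross": None},
    {"title": "E", "gross": 13.0},
    {"title": "F", "gross": 900.0},
    {"title": "A", "gross": 10.0},
]
flags = flag_quality_issues(rows, "gross")
print(flags)
```

The function only returns candidate row indices. Whether a flagged value is really an error still depends on context, which is why the abstract pairs automated detection with visual inspection by an analyst.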




    Published In

    AVI '12: Proceedings of the International Working Conference on Advanced Visual Interfaces
    May 2012
    846 pages
    ISBN: 9781450312875
    DOI: 10.1145/2254556

    Sponsors

    • Consulta Umbria SRL
    • University of Salerno

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. anomaly detection
    2. data analysis
    3. data quality
    4. visualization

    Qualifiers

    • Research-article

    Conference

    AVI'12
    Sponsor:
    • University of Salerno

    Acceptance Rates

    Overall Acceptance Rate 128 of 490 submissions, 26%

    Article Metrics

    • Downloads (last 12 months): 176
    • Downloads (last 6 weeks): 20
    Reflects downloads up to 24 Nov 2024

    Cited By

    • (2025) “I Came Across a Junk”: Understanding Design Flaws of Data Visualization from the Public's Perspective. IEEE Transactions on Visualization and Computer Graphics 31:1 (393-403). DOI: 10.1109/TVCG.2024.3456341. Online publication date: Jan-2025
    • (2024) LucidScript: Bottom-Up Standardization for Data Preparation. Proceedings of the VLDB Endowment 17:12 (4317-4320). DOI: 10.14778/3685800.3685864. Online publication date: 1-Aug-2024
    • (2024) Cocoon: Semantic Table Profiling Using Large Language Models. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics (1-7). DOI: 10.1145/3665939.3665957. Online publication date: 14-Jun-2024
    • (2024) Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (1-22). DOI: 10.1145/3613904.3642943. Online publication date: 11-May-2024
    • (2024) JsonCurer: Data Quality Management for JSON Based on an Aggregated Schema. IEEE Transactions on Visualization and Computer Graphics 30:6 (3008-3021). DOI: 10.1109/TVCG.2024.3388556. Online publication date: Jun-2024
    • (2024) Interactive Table Synthesis With Natural Language. IEEE Transactions on Visualization and Computer Graphics 30:9 (6130-6145). DOI: 10.1109/TVCG.2023.3329120. Online publication date: Sep-2024
    • (2024) Dead or Alive: Continuous Data Profiling for Interactive Data Science. IEEE Transactions on Visualization and Computer Graphics 30:1 (197-207). DOI: 10.1109/TVCG.2023.3327367. Online publication date: 1-Jan-2024
    • (2024) A Heuristic Approach for Dual Expert/End-User Evaluation of Guidance in Visual Analytics. IEEE Transactions on Visualization and Computer Graphics 30:1 (997-1007). DOI: 10.1109/TVCG.2023.3327152. Online publication date: 1-Jan-2024
    • (2024) Towards Better Modeling With Missing Data: A Contrastive Learning-Based Visual Analytics Perspective. IEEE Transactions on Visualization and Computer Graphics 30:8 (5129-5146). DOI: 10.1109/TVCG.2023.3285210. Online publication date: Aug-2024
    • (2024) ShortcutLens: A Visual Analytics Approach for Exploring Shortcuts in Natural Language Understanding Dataset. IEEE Transactions on Visualization and Computer Graphics 30:7 (3594-3608). DOI: 10.1109/TVCG.2023.3236380. Online publication date: Jul-2024