Abstract
Identifying deviating patterns in XML repositories has important applications in data cleaning, fraud detection, and stock market analysis. Current methods detect data discrepancies by assessing whether a data item conforms to the expected distribution of its immediate neighborhood. This approach may miss interesting deviations that involve aggregated information. For example, the average number of transactions of a particular bank account may be exceptionally high compared with other accounts that have similar profiles. Such an incongruity can be revealed only by aggregating the appropriate data and analyzing the aggregated results within the associated neighborhood. This neighborhood is implicitly encapsulated in the XML structure. In addition, the hierarchical nature of the XML structure reflects the different levels of abstraction in the real world. This work presents a framework that detects incongruities in aggregate information. It uses the inherent characteristics of the XML structure to systematically aggregate leaf-level data and propagate the aggregated information up the hierarchy. The aggregated information is analyzed with a novel method that first clusters similar data and then, assuming a statistical distribution, identifies aggregate incongruities within each cluster. Experimental results indicate that the proposed approach is effective in detecting interesting discrepancies in a real-world bank data set.
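The pipeline outlined in the abstract (aggregate leaf-level data, group similar items, then test each aggregate against its group) can be illustrated with a minimal sketch. This is not the authors' implementation. The element names `bank`, `account`, and `transaction`, the `profile` attribute, the sample counts, and the z-score threshold of 2.0 are all hypothetical. Grouping by the `profile` attribute is a simple stand-in for the clustering step, a plain z-score stands in for the assumed statistical distribution, and only one level of the hierarchy is aggregated.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_sample():
    """Build a small, hypothetical XML tree: bank -> account -> transaction."""
    rows = [  # (account id, profile, number of transactions), all invented
        ("a1", "savings", 3), ("a2", "savings", 4), ("a3", "savings", 3),
        ("a4", "savings", 5), ("a5", "savings", 4), ("a6", "savings", 3),
        ("a7", "savings", 4), ("a8", "savings", 40),
        ("b1", "business", 38), ("b2", "business", 42),
        ("b3", "business", 40), ("b4", "business", 41),
    ]
    bank = ET.Element("bank")
    for acc_id, profile, n in rows:
        acc = ET.SubElement(bank, "account", id=acc_id, profile=profile)
        for _ in range(n):
            ET.SubElement(acc, "transaction")
    return bank


def find_incongruities(bank, threshold=2.0):
    """Return (account id, transaction count) pairs that deviate within their group."""
    # Step 1: aggregate leaf-level data (count transactions per account).
    groups = defaultdict(list)
    for acc in bank.findall("account"):
        count = len(acc.findall("transaction"))
        # Step 2 (stand-in for clustering): group accounts by profile.
        groups[acc.get("profile")].append((acc.get("id"), count))

    # Step 3: flag aggregates far from their group's mean.
    flagged = []
    for members in groups.values():
        counts = [c for _, c in members]
        if len(counts) < 2:
            continue  # the sample standard deviation needs at least two values
        mean = statistics.mean(counts)
        stdev = statistics.stdev(counts)
        if stdev == 0:
            continue  # all aggregates identical, nothing deviates
        for acc_id, count in members:
            if abs(count - mean) / stdev > threshold:
                flagged.append((acc_id, count))
    return flagged


flagged = find_incongruities(build_sample())
print(flagged)  # [('a8', 40)]
```

In this invented data, account `a8` is flagged because 40 transactions is unusual among savings accounts, while similar counts among business accounts are not flagged. The same aggregate value is judged differently depending on the neighborhood it is compared against, which is the point the abstract makes.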
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hsu, W., Lau, Q.P., Lee, M.L. (2009). Detecting Aggregate Incongruities in XML. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_54
DOI: https://doi.org/10.1007/978-3-642-00887-0_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00886-3
Online ISBN: 978-3-642-00887-0
eBook Packages: Computer Science, Computer Science (R0)