DOI: 10.1145/1284420.1284441
Article

XML version detection

Published: 28 August 2007

Abstract

The problem of version detection is critical in many important application scenarios, including software clone identification, Web page ranking, plagiarism detection, and peer-to-peer searching. A natural and commonly used approach to version detection relies on analyzing the similarity between files. Most of the techniques proposed so far rely on the use of hard thresholds for similarity measures. However, defining a threshold value is problematic for several reasons: in particular (i) the threshold value is not the same when considering different similarity functions, and (ii) it is not semantically meaningful for the user. To overcome this problem, our work proposes a version detection mechanism for XML documents based on Naïve Bayesian classifiers. Thus, our approach turns the detection problem into a classification problem. In this paper, we present the results of various experiments on synthetic data that show that our approach produces very good results, both in terms of recall and precision measures.
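To make the idea of replacing a hard similarity threshold with a classifier concrete, here is a minimal sketch in Python. It is not the authors' implementation: the two similarity features (tag-set overlap and text similarity), the Gaussian variant of Naive Bayes, the class name `TinyGaussianNB`, and the tiny training set are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def features(xml_a, xml_b):
    """Return two illustrative similarity scores in [0, 1] for a pair of XML strings.

    Both features are assumptions for this sketch, not the paper's feature set.
    """
    root_a, root_b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    tags_a = {el.tag for el in root_a.iter()}
    tags_b = {el.tag for el in root_b.iter()}
    # Jaccard overlap of element names (structure-oriented feature).
    tag_sim = len(tags_a & tags_b) / len(tags_a | tags_b)
    text_a = " ".join(t.strip() for t in root_a.itertext() if t.strip())
    text_b = " ".join(t.strip() for t in root_b.itertext() if t.strip())
    # Character-level similarity of the text content (content-oriented feature).
    text_sim = SequenceMatcher(None, text_a, text_b).ratio()
    return [tag_sim, text_sim]


class TinyGaussianNB:
    """Gaussian Naive Bayes over numeric features, written from scratch."""

    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            prior = len(rows) / len(X)
            params = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Small constant keeps the variance away from zero.
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-3
                params.append((mean, var))
            self.stats[label] = (prior, params)
        return self

    def predict(self, x):
        def log_posterior(label):
            prior, params = self.stats[label]
            score = math.log(prior)
            for value, (mean, var) in zip(x, params):
                score += -0.5 * math.log(2 * math.pi * var)
                score -= (value - mean) ** 2 / (2 * var)
            return score

        # Pick the class with the highest (unnormalized) log posterior.
        return max(self.stats, key=log_posterior)


# Hypothetical training pairs, already reduced to [tag_sim, text_sim] vectors.
X_train = [[0.9, 0.8], [1.0, 0.9], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y_train = ["version"] * 3 + ["not_version"] * 3
model = TinyGaussianNB().fit(X_train, y_train)

doc_v1 = "<book><title>XML</title><author>Ann</author></book>"
doc_v2 = "<book><title>XML 2</title><author>Ann</author></book>"
doc_other = "<invoice><total>42</total></invoice>"

print(model.predict(features(doc_v1, doc_v2)))
print(model.predict(features(doc_v1, doc_other)))
```

In this sketch the user never picks a cutoff such as "similarity above 0.8": the decision boundary comes from labeled example pairs, and the same classifier can combine similarity functions whose raw scores are not directly comparable.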


Published In

DocEng '07: Proceedings of the 2007 ACM symposium on Document engineering
August 2007
236 pages
ISBN:9781595937766
DOI:10.1145/1284420
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML
  2. classification
  3. similarity functions
  4. versioning

Qualifiers

  • Article

Conference

DocEng07: ACM Symposium on Document Engineering
August 28 - 31, 2007
Winnipeg, Manitoba, Canada

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Cited By

  • (2014) Temporal and multi-versioned XML documents. Information Processing and Management: an International Journal 50:1 (113-131). DOI: 10.1016/j.ipm.2013.08.003. Online publication date: 1-Jan-2014
  • (2012) Wikipedia Revision Graph Extraction Based on N-Gram Cover. Web-Age Information Management (29-38). DOI: 10.1007/978-3-642-33050-6_4. Online publication date: 2012
  • (2011) Automatic identification of ontology versions using machine learning techniques. Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part I (352-366). DOI: 10.5555/2008892.2008922. Online publication date: 29-May-2011
  • (2011) Automatic Identification of Ontology Versions Using Machine Learning Techniques. The Semantic Web: Research and Applications (352-366). DOI: 10.1007/978-3-642-21034-1_24. Online publication date: 2011
  • (2009) WSDL and UDDI extensions for version support in web services. Journal of Systems and Software 82:8 (1326-1343). DOI: 10.1016/j.jss.2009.03.001. Online publication date: 1-Aug-2009
  • (2008) Merging changes in XML documents using reliable context fingerprints. Proceedings of the eighth ACM symposium on Document engineering (52-61). DOI: 10.1145/1410140.1410151. Online publication date: 16-Sep-2008
