DOI: 10.1145/3395027.3419581
research-article
Open access

Change Detection on JATS Academic Articles: An XML Diff Comparison Study

Published: 29 September 2020

Abstract

XML is a well-established and widely used document format that serves as a core data container in collaborative writing suites and other modern information architectures. Extracting and analyzing the differences between two versions of an XML document is an active research topic that several groups have already addressed. This study compares 12 existing state-of-the-art and commercial XML diff algorithms by applying them to JATS documents in order to extract and evaluate the changes between two versions of the same academic article. Understanding the changes between two article versions matters not only at the level of data but also at the level of semantics. In our case, the consumers of change information are editorial teams, who are generally more interested in the semantics of a change than in the exact data changes. The algorithms are evaluated on the following aspects: suitability for detecting edits in both text and tree structure, execution speed, memory usage, and delta file size. The evaluation process is supported by a Python tool available on GitHub.
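
The three quantitative criteria named in the abstract (execution speed, memory usage, and delta file size) can be illustrated with a small measurement harness. The following is a minimal sketch under stated assumptions, not the authors' tool: `DiffMetrics`, `measure`, and `line_diff` are hypothetical names, and the stand-in differ is a plain line-based `difflib` diff rather than any of the 12 XML diff algorithms the study compares. Memory is sampled with `tracemalloc`, which only tracks Python heap allocations, so it would not capture the footprint of an external or native diff tool.

```python
import difflib
import time
import tracemalloc
from typing import Callable, NamedTuple, Tuple


class DiffMetrics(NamedTuple):
    seconds: float     # wall-clock time spent inside the differ
    peak_bytes: int    # peak Python heap allocation while diffing
    delta_bytes: int   # size of the UTF-8 encoded delta


def line_diff(old_xml: str, new_xml: str) -> str:
    """Stand-in differ (assumption): line-based unified diff, not XML-aware."""
    return "".join(
        difflib.unified_diff(
            old_xml.splitlines(keepends=True),
            new_xml.splitlines(keepends=True),
            fromfile="v1.xml",
            tofile="v2.xml",
        )
    )


def measure(
    differ: Callable[[str, str], str], old_xml: str, new_xml: str
) -> Tuple[str, DiffMetrics]:
    """Run any differ(old, new) -> delta callable and record the three metrics."""
    tracemalloc.start()
    start = time.perf_counter()
    delta = differ(old_xml, new_xml)
    seconds = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return delta, DiffMetrics(seconds, peak, len(delta.encode("utf-8")))


# Two toy versions of an article fragment (illustrative data only).
OLD = "<article>\n<title>XML Diff</title>\n<p>First version.</p>\n</article>\n"
NEW = "<article>\n<title>XML Diff</title>\n<p>Second version.</p>\n</article>\n"

delta, metrics = measure(line_diff, OLD, NEW)
print(delta)
print(metrics)
```

Because `measure` accepts any callable with the same signature, an XML-aware differ could be dropped in without changing the harness. Suitability for detecting text and tree edits, the fourth criterion, is a qualitative judgment that such a harness does not capture.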

Cited By

  • (2022) Semantics to the rescue of document‐based XML diff: A JATS case study. Software: Practice and Experience 52:6 (1496-1516). DOI: 10.1002/spe.3074. Online publication date: 12-Feb-2022

Published In

DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020
September 2020
130 pages
ISBN:9781450380003
DOI:10.1145/3395027
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. JATS
  2. XML diff
  3. academic publishing
  4. change control
  5. document comparison
  6. semantic diff

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DocEng '20
DocEng '20: ACM Symposium on Document Engineering 2020
September 29 - October 1, 2020
Virtual Event, CA, USA

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%
