Abstract
This paper presents a two-phase synchronization algorithm—tpsync, which combines content-defined chunking (CDC) with sliding block duplicated data detection methods. tpsync firstly partitions synchronized files into variable-sized chunks in coarse-grained scale with CDC method, locates the unmatched chunks of synchronized files using the edit distance algorithm, and finally generates the fine-grained delta data with fixed-sized sliding block duplicated data detection method. At the first-phase, tpsync can quickly locate the partial changed chunks between two files through similar files’ fingerprint characteristics. On the basis of the first phase’s results, small fixed-sized sliding block duplicated data detection method can produce better fine-grained delta data between the corresponding unmatched data chunks further. Extensive experiments on ASCII, binary and database files demonstrate that tpsync can achieve a higher performance on synchronization time and total transferred data compared to traditional fixed-sized sliding block method—rsync. Compared to rsync, tpsync reduces synchronization time by 12% and bandwidth by 18.9% on average if optimized parameters are applied on both. With signature cached synchronization method adopted, tpsync can yield a better performance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Ajtai, M., Burns, R., Fagin, R., Long, D., Stockmeyer, L.: Compactly encoding unstructured inputs with differential compression. Journal of the ACM (JACM) 49, 318–367 (2002)
Whitten, A.: Scalable Document Fingerprinting. In: The USENIX Workshop on E-Commerce (1996)
Hunt, J.W., Szymanski, T.G.: A fast algorithm for computing longest common subsequences. Communications of the ACM 20, 350–353 (1977)
Korn, D., Vo, K.: Engineering a differencing and compression data format, pp. 219–228 (2002)
MacDonald, J.: File system support for delta compression. Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA, Master thesis (May 2000)
Percival, C.: Naive differences of executable code, Draft Paper, http://www.daemonology.net/bsdiff
Trendafilov, D., Memon, N., Suel, T.: zdelta: An efficient delta compression tool. Department of Computer and Information Science, Polytechnic University Technical Report (2002)
Tridgell, A.: Efficient algorithms for sorting and synchronization. PhD thesis, Australian National University (1999)
Meunier, P., Nystrom, S., Kamara, S., Yost, S., Alexander, K., Noland, D., Crane, J.: ActiveSync, TCP/IP and 802.11 b Wireless Vulnerabilities of WinCE-based PDAs. In: Proceedings of Eleventh IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, WET ICE 2002, pp. 145–150 (2002)
Whitepapers, P.: Invasion of the data snatchers (1999), http://www.pumatech.com/enterprise/wp-1.html
Palm: Palm developer knowledge base manuals (1999), http://palmos.com/dev/support/docs/palmos/ReferenceTOC.html
Muthitacharoen, A., Chen, B., Mazieres, D.: A low-bandwidth network file system. In: Proceedings of the eighteenth ACM symposium on Operating systems principles, pp. 174–187. ACM, New York (2001)
Teodosiu, D., Bjorner, N., Gurevich, Y., Manasse, M., Porkka, J.: Optimizing file replication over limited bandwidth networks using remote differential compression. Technical report, Microsoft Corporation (2006)
Grune, D.: Concurrent Versions System, a method for independent cooperation. Report IR-114, Vrije University, Amsterdam (1986)
Collins-Sussman, B., Pilato, C., Pilato, C., Fitzpatrick, B.: Version control with subversion. O’Reilly Media, Inc., Sebastopol (2008)
Policroniades, C., Pratt, I.: Alternatives for detecting redundancy in storage systems data. In: Proceedings of the 2004 USENIX Annual Technical Conference, pp. 73–86 (2004)
Jain, N., Dahlin, M., Tewari, R.: Taper: Tiered approach for eliminating redundancy in replica synchronization. In: Proceedings of the 4th Usenix Conference on File and Storage Technologies (FAST 2005) (2005)
Denehy, T., Hsu, W.: Duplicate management for reference data. Research Report RJ10305, IBM (2003)
Kulkarni, P., Douglis, F., LaVoie, J., Tracey, J.: Redundancy elimination within large collections of files. In: The USENIX Annual Technical Conference, General Track, 59–72 (2004)
Quinlan, S., Dorward, S.: Venti: a new approach to archival storage. In: Proceedings of the FAST 2002 Conference on File and Storage Technologies, vol. 4 (2002)
Bindel, D., Chen, Y., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Weimer, W., Wells, C., et al.: Oceanstore: An extremely wide-area storage system. In: Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Citeseer, pp. 190–201 (2000)
Rabin, M.: Fingerprinting by random polynomials. Technical report, Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University (1981)
Bobbarjung, D., Jagannathan, S., Dubnicki, C.: Improving duplicate elimination in storage systems. ACM Transactions on Storage (TOS) 2, 424–448 (2006)
Levenshteiti, V.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics-Doklady, vol. 10 (1966)
Martin Pool, D.B.: librsync, http://librsync.sourceforge.net
Council, T.: TPC BenchmarkTM C Standard Specification (2005)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sheng, Y., Xu, D., Wang, D. (2010). A Two-Phase Differential Synchronization Algorithm for Remote Files. In: Hsu, CH., Yang, L.T., Park, J.H., Yeo, SS. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2010. Lecture Notes in Computer Science, vol 6081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13119-6_6
Download citation
DOI: https://doi.org/10.1007/978-3-642-13119-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13118-9
Online ISBN: 978-3-642-13119-6
eBook Packages: Computer ScienceComputer Science (R0)