Abstract
In this paper we present an approximate algorithm for detecting and filtering data dependencies with a sufficiently large distance between memory references. A sequence of the same operations (typically enclosed in a ‘for’ loop) can be replaced with a single SIMD operation if the distance between memory references is greater than or equal to the number of data processed in the SIMD register. Some loops that could not be vectorized on traditional vector processors, can still be parallelized for short SIMD execution. There are a number of approximate data-dependence tests that have been proposed in the literature but in all of them data dependency will be assumed when actually there is no such a dependence that could restrict parallelization related to the short SIMD execution model. By examining the properties of linear subscript expressions of possibly conflicting data references, our algorithm gives the green light to the parallelization process if some sufficient conditions regarding the dependence distance are met. Our method is based on the Banerjee test and checks the minimum and maximum distances between memory references within the iteration space rather than searching for the existence of an integer solution to the dependence equation. The proposed method extends the accuracy and applicability of the classical Banerjee test.
Similar content being viewed by others
References
Aliaga JI et al (2009) Toward the parallelization of GSL. J Supercomput 48:88–114. doi:10.1007/s11227-008-0207-z
Allen R, Kennedy K (2001) Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann, San Mateo
Arabnia HR, Oliver MA (1987) A transputer network for the arbitrary rotation of digitized images. Comput J 30(5):425–432
Banerjee U, Eigenman R, Nicolau A, Padua DA (1993) Automatic program parallelization. Proc IEEE 81(2):211–243
Banerjee U (1997) Dependence analysis: a book series on loop transformations for restructuring compilers. Kluwer Academic, Dordrecht
Bhandarkar SM, Arabnia HR (1995) The Hough transform on a reconfigurable multi-ring network. J Parallel Distrib Comput 24(1):107–114
Bhandarkar SM, Arabnia HR, Smith JW (1995) A reconfigurable architecture for image-processing and computer vision. Int J Pattern Recognit Artif Intell 9(2):201–229
Bik AJC, Girkar M, Grey PM, Tian XM (2002) Automatic intra-register vectorization for the Intel (R) architecture. Int J Parallel Program 30(2):65–98
Blume W, Eigenman R, Hoeflinger J, Padua DA, Petersen P, Rauchwerger L, Tu P (1994) Automatic detection of parallelism: a great challenge for high-performance computing. IEEE Parallel Distrib Process Technol 2(3):37–47
Blume W, Eigenman R (1998) Nonlinear and symbolic data dependence testing. IEEE Parallel Distrib Process Technol 9(12):1180–1194
Chuang YJ, Wu JL (2005) An efficient matrix-based 2-D DCT splitter and merger for SIMD instructions. IEICE Trans Inf Syst E88D(7):1569–1577. doi:0.1093/ietisy/e88-d.7.1569
Di Blas A, Jagota A, Hughey R (2005) Optimizing neural networks on SIMD parallel computers. Parallel Comput 31(1):97–115. doi:10.1016/j.parco.2004.11.002
Guo M et al (2009) Communication-free data alignment for arrays with exponential references in parallelizing compilers for scalable parallel systems. J Supercomput. doi:10.1007/s11227-009-0280-y
Hassaballah M et al (2008) A review of SIMD multimedia extensions and their usage in scientific and engineering applications. Comput J 51(6):630–649. doi:10.1093/comjnl/bxm099
Huang TC, Yang CM (2000) Data dependence analysis for array references. J Syst Softw 52:55–65
Kim S et al (2009) Region-based parallelization of irregular reductions on explicitly managed memory hierarchies. J Supercomput. doi:10.1007/s11227-009-0340-3
Krall A, Lelait S (2000) Compilation techniques for multimedia processors. Int J Parallel Program 28(4):347–361
Kurzak J et al (2009) Optimizing matrix multiplication for a short-vector SIMD architecture—CELL processor. Parallel Comput 35(3):138–150. doi:10.1016/j.parco.2008.12.010
Lee WC et al (2008) Algorithm and software optimization of variable block size motion estimation for H.264/AVC on a VLIW-SIMD DSP. J Signal Process Syst Signal Image Video Technol 51(3):289–302. doi:10.1007/s11265-007-0151-9
Lotfi S, Parsa S (2009) Parallel loop generation and scheduling. J Supercomput doi:10.1007/s11227-008-0262-5
Paver NC, Khan MH, Aldrich BC (2006) Optimizing mobile multimedia using SIMD techniques. Multimed Tools Appl 28(2):393–416
Petersen PM, Padua DA (1996) Static and dynamic evaluation of data dependence analysis techniques. IEEE Trans Parallel Distrib Syst 7(11):1121–1132
Pough W (1992) A practical algorithm for exact array dependence analysis. Commun ACM 35(8):102–114
Pough W (1992) Definitions of dependence distance. ACM Lett Program Lang Syst 1(3):261–265
Psarris K (2002) Program analysis techniques for transforming programs for parallel execution. Parallel Comput 28(3):455–469
Psarris K, Klappholz D, Kong X (2004) An experimental evaluation of data dependence analysis techniques. IEEE Trans Parallel Distrib Syst 15(3):196–213
Shahbahrami A et al (2008) Implementing the 2-D wavelet transform on SIMD-enhanced general-purpose processors. IEEE Trans Multimed 10(1):43–51. doi:10.1109/TMM.2007.911195
Sreraman N, Govindarajan R (2000) A vectorizing compiler for multimedia extensions. Int J Parallel Program 28(4):363–400
Wani MA, Arabnia HR (2003) Parallel edge-region-based segmentation algorithm targeted at reconfigurable MultiRing network. J Supercomput 25(1):43–62
Wolfe MJ, Banerjee U (1987) Data dependence and its application to parallel processing. Int J Parallel Program 16(2):137–178
Wolfe MJ, Tseng CW (1992) The power test for data dependence. IEEE Trans Parallel Distrib Syst 3(5):591–601
Xue J, Guo M, Wei D (2008) Improving the parallelism of iterative methods by aggressive loop fusion. J Supercomput 43:147–164. doi:10.1007/s11227-007-0124-6
Yang CT, Lai KC (2009) A directive-based MPI code generator for Linux PC clusters. J Supercomput. doi:10.1007/s11227-008-0258-1
Yang X et al (2009) Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor. J Supercomput 47:171–197. doi:10.1007/s11227-008-0186-0
Yuan FN et al (2006) An interactive 3D visualization system based on PC using Intel SIMD, 3D texturing and thinning techniques. Int J Pattern Recognit Artif Intell 20(3):393–416
Zhou J, Zeng G (2008) A general data dependence analysis for parallelizing compilers. J Supercomput 45:236–252. doi:10.1007/s11227-007-0168-7
Zhou J, Zeng G (2008) A general data dependence analysis for parallelizing compilers. ERRATUM. J Supercomput 45:253. doi:10.1007/s11227-008-0197-x
Zhou JP, Shi C (2008) Efficient SIMD optimization for media processors. J Zhejiang Univ-Sci A 9(4):524–530. doi:10.1631/jzus.A071203
Zima HP, Chapman BM (1990) Supercompilers for parallel and vector computers. Addison-Wesley, Reading
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Bulić, P., Dobravec, T. An approximate method for filtering out data dependencies with a sufficiently large distance between memory references. J Supercomput 56, 226–244 (2011). https://doi.org/10.1007/s11227-009-0364-8
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-009-0364-8