Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu

Speculative parallelization of multipath radiosity algorithm

2009

Speculative parallelization of multipath radiosity algorithm J. Puiggalı́,1 F. Castro,1 T. Jové,1, M. Sbert1 and J. L. Marzo1 1 Institut d’Informàtica i Aplicacions (IIiA) Universitat de Girona, 17071 Girona, Spain {albert.trias,joan.puiggali,francesc.castro,teodor.jove,mateu.sbert,joseluis.marzo}@udg.edu A. Trias, 1 reflections. The equation for the radiosity B(x) (emitted plus reflected light leaving point x) is: Abstract—In computer graphics, global illumination algorithms take into account not only the light that comes directly from the sources, but also the light interreflections. This kind of algorithms produce very realistic images, but at a high computational cost, especially when dealing with complex environments. Parallel computation has been successfully applied to such algorithms in order to make it possible to compute highly-realistic images in a reasonable time. We introduce here a speculation-based parallel solution for a global illumination algorithm in the context of radiosity, in which we have taken advantage of the hierarchical nature of such an algorithm. Index Terms—global illumination, radiosity, distributed environments, speculation, parallelization, computer clusters. B(x) = (1) where D stands for the set of all the surfaces in the scene, ρ(x) and E(x) respectively stand for the reflectance (fraction of the incoming light that is reflected) and emittance (emission radiosity if x is a light source) at point x, and F (x, y) stands for the form-factor between points x and y, a geometric term which includes the visibility between x and y. In practice, a discrete version of the radiosity equation is considered, corresponding to a discretization of the environment in small polygons called patches: Radiosity techniques [1], [2] aim at estimating the illumination for environments with diffuse reflecting surfaces in order to produce highly-realistic results. The multipath algorithm [3] is a Monte Carlo technique that solves the radiosity problem by using random lines which simulate the exchange of radiant power among the objects in the environment. In the present paper we introduce the parallelization of a variant [4] of the multipath algorithm in which the environment is structured in a hierarchy of subscenes, allowing the algorithm to be run at different levels, which involves a better exploitation of the random lines. We aim at taking advantage of the parallelism at the maximum, in order to accelerate the computations. The hierarchical nature of the algorithm to be parallelized has allowed us to apply a master/slave speculative parallelization architecture for computer clusters. A noticeable parallelization efficiency has been obtained in our experiments. The paper is organized as follows. In section II we present the previous work. Section III describes in detail the contributions presented in this article, namely the parallel strategy we have used. Section IV presents the results obtained in our experiments. Finally, section V summarizes the conclusions of the paper and presents some future lines of research. Bi = Ei + ρi Np X Fij Bj , (2) j=1 where Bi , Ei , and ρi stand respectively for the radiosity, emittance, and reflectance of patch i, Np stands for the number of patches in the scene, and Fij stands for the patch-to-patch form factor from patch i to patch j. B. The multipath algorithm The radiosity multipath algorithm was first described in [3]. It is a member of a family of methods called by different authors global Monte Carlo, global radiosity, or transillumination methods [5], [6], [7] which use random global lines (or directions) to transport energy. Global lines are independent of the surfaces or patches in the scene, contrary to local lines, and can take advantage of all their intersections with the scene. The multipath algorithm can be seen as a random walk whose transition probabilities are given by the form factors. Such a random walk is generated from a uniform density of global lines [8]. Each global line simulates the exchange of energy between several pairs of patches, contributing to several geometric paths (Fig. 1a). Let us briefly review the algorithm. Global lines are sequentially intersected with the scene. For each line, intersections are sorted by distance, resulting in a list of pairs of intersected patches. Each patch keeps two amounts: its accumulated outgoing power, and its unshot outgoing power. The patches in each pair along the intersection list exchange their unshot WORK A. Radiosity Radiosity [1], [2] is a global illumination algorithm that simulates the multiple reflections of light around a scene. The radiosity model assumes that the exiting radiance of a point is direction independent, which is valid to simulate diffuse SPECTS 2009 ρ(x)F (x, y)B(y)dy + E(x), D I. I NTRODUCTION II. P REVIOUS Z 89 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. power times the corresponding reflectances, and sum this amount to their corresponding accumulated power. If a patch is a source, we also keep a third quantity, the emission power per line exiting the source. Thus, if one of the patches of the pair is a source, we must add, at each exchange, its emission power per line to its unshot power. This “emitted per line” power is previously computed by dividing the total emission power of the patch by the forecast number of lines crossing the patch, which is proportional to the area of the patch [8]. B A MAIN SCENE A B SUBSCENES 2 4 3 SINGLE OBJECTS 1 Fig. 2. A 2-level hierarchy of subscenes. Source (a) (b) Fig. 1. Multipath algorithm. (a) A global line (the thick one) simulates two paths, indicated with the continuous stroke and the dashed stroke (b) A path can contribute to the emission of radiant power from several patches. In the figure, path 1-2-3-4 simulates logical paths 1-2-3-4, 2-3-4 and 3-4. C. The multipath algorithm using a hierarchy of subscenes A variant of the multipath algorithm was presented in [4], where a hierarchy of subscenes was used in order to run multipath not only for the whole scene but also for the subscenes, submitting each subscene to its own density of lines (referred to as locally global lines). This arised from the idea of better exploiting the lines by using more lines where they were more necessary. We have to note that such densities of lines were uniform in the context of each subscene but not in the context of the whole scene. Next we elaborate on the main features of this algorithm: 1) Hierarchy of subscenes and adaptive densities of lines: The hierarchy of subscenes is built using a clustering bottomup strategy [9]. Fig. 2 shows a simple example of a 2level hierarchy, although the algorithm allows any number of levels. Note in the example that a subscene can contain other subscenes and/or single objects. 2) Subdivision of the bounding boxes: virtual patches and angular regions: Each face of each bounding box (unless the box at the top level) acts as a virtual wall (VW), which is subdivided in a grid of virtual patches (VP), as seen in Fig. 3 (left). The directional hemisphere over each virtual patch is subdivided in angular regions (AR) (see Fig. 3 (right)). Each AR acts as an accumulator of undistributed incoming and outgoing power. 3) The pre-process and the reuse of lines: Each subscene is submitted into a uniform density [8] of locally global lines. For each line cast in a subscene S, a sorted-by-distance intersection list is computed, considering the intersections of the line against: • (a) Fig. 3. (a) Subdivision of the bounding box virtual wall in a grid of virtual patches. (b) Directional subdivision in angular regions over a virtual patch. • • Virtual walls of the inner subscenes of S (ignoring their interior). Virtual walls of S. This pre-process allows to obtain information about the geometry of the subscenes, estimating the transmittance for each AR. The transmittance TR for the angular region R represents the fraction of power entering subscene S in the directions in R that crosses S finding no obstacle in its way. Transmittances give an idea about the opaqueness of a subscene in each direction, allowing, with a moderate loss of accuracy, to ignore the interior of the subscenes when executing multipath at a higher level. The transmittance TR is estimated as TR ≈ ndR , nR (3) where ndR is the number of lines crossing S in the directions in R and hitting no object, and nR is the total number of lines crossing S in the directions in R. If nR = 0, TR is estimated by interpolation of the values of the neighboring regions. For each line, and after sorting by distance, the identifiers of the intersected patches and angular regions are stored. This Single objects inside S. SPECTS 2009 (b) 90 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. procedure MP(scene S) Compute power per line for each patch in S not included in any subscene of S for each AR in S Compute incoming power per line Set to 0 accumulated outgoing power endfor for each subscene Si of S for each AR in the box of Si Compute outgoing power per line Set to 0 accumulated incoming power per line endfor endfor for each line cast in S Faced patches and/or AR exchange power: * Patches contribute unshot power + power per line * AR of S contribute incoming power per line * AR of S1 ...Sk contribute outgoing power per line * Power leaving S is accumulated as outgoing power in the corresponding AR * Power crossing S1 ..Sk is accumulated as incoming power in the corresponding AR and attenuated by the corresponding transmittance endfor for each subscene Si of S MP(Si ) // RECURSIVE endfor endprocedure information is repeatedly used in the iterative process, avoiding to compute again the intersections. 4) The algorithm: The full algorithm is presented in Fig. 4. Note that recursive function MP (Fig. 5) deals with the exchange of power inside and between each subscene. Note also that a first-shot stage [10] is needed before applying multipath, in order to efficiently distribute the direct illumination. Thus, multipath only estimates the indirect illumination. Generate hierarchy of subscenes Subdivide each VW of the boxes in VP and AR for each subscene S (including the whole scene) Cast global lines and store intersection lists if S is not the whole scene for each AR in S Compute and store its transmittance endfor endif endfor First shot (computing direct illumination) for each iteration (4 or 5 are usually enough) MP(whole scene) // RECURSIVE FUNCTION endfor Fig. 4. The iterative algorithm. D. Master/Slave Speculative Parallelization Architecture for Computer Clusters (MSSPACC) Master/Slave Speculative Parallelization Architecture for Computer Clusters [11], [12], [13] achieves parallelism by using speculation in distributed environments, allowing the parellel execution of a sequential program in a computer cluster. It simulates the behavior of a superscalar system by implementing a form of parallelism called instructionlevel parallelism. These techniques try to break true data dependencies and control dependencies by means of using speculation of future data values and future branch results respectively. Speculation is based on the fact that the program behaviour is usually repetitive and consequently predictable, as demonstrated in studies of branch [14], memory dependencies, and data values [15]. These techniques can be classified into two types: • Software speculations: Compilers carry out the necessary coding. It cannot be applied dynamically [16], [17], [18]. • Hardware speculations: They require duplicated hardware elements, e.g. adding extra registers to store provisional values until they are definitive [15], [19], [20]. The above techniques allow the processor to divide program execution into several thread executions, and increase the program’s degree of parallelism. Moore’s law (processing power doubles each 18 months) and Gilder’s law (bandwidth triples each 12 months) show that the costs of information transmission and synchronization between workstations decrease faster than processing speed. These premises make it SPECTS 2009 Fig. 5. The recursive function MP. S1 ..Sk are the subscenes of S. possible the idea of transporting speculation techniques to a distributed environment composed by cheap workstations. The complete design of Master/Slave Speculative Parallelization Architecture for Computer Clusters system [11], [12], [13] consists of three subsystems: • • The parallelizing subsystem (Fig.6) transforms the original sequential program into the parallel format needed by the execution environment. The program is divided into blocks that can be executed in parallel. Either two or three programs (depending on the type of the original program) are generated as a result of the translation process: a master, a slave, and optionally a master/slave. The master manages the parallelism and the speculation of the system. The slave runs, at each of the nodes, the code of the blocks into which the sequential program has been divided. The master/slave program permits to reduce the master bottleneck by distributing the tasks to others nodes that work as a new master. The execution subsystem (see Fig.7) runs, by applying speculation, the parallelized applications in a computer cluster composed by a monoprocessor architecture using PVM (Parallel Virtual Machine). The execution environ- 91 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. data dependencies which limit parallel execution. We have tried to obtain a code as sequential as possible to get more flexibility defining the blocks. An example of a block where these transformations have been done (recursivity and dynamic memory have been removed) is shown in Fig. 9, while in Fig. 8 we have the orignial methods. Fig. 6. • procedure DividePolygons() for each polygon i polygon[i].divide() endfor endprocedure The parallelizing subsystem ment behaves like a superscalar processor, where the blocks are like the instructions into which the sequential program has been divided, and the nodes where the slave program is run are like the functional units. There exist the following data speculation mechanisms: Data Value Speculation [21]; Last Value Predictor [21]; Stride Predictor [22]; and Context-based Value Predictor [23]. Control dependencies are managed with branch prediction techniques based on a BTB (Branch Target Buffer) with 2-bit history [24]. If there are blocks executed with wrong speculated values or branch missprediction, such blocks are discarded and their execution is restarted from the last stable point. The simulation subsystem (see Fig.7) is developed in order to evaluate the impact of the technological evolution or the effect of using bigger computer clusters. The simulation is able to be run in a single workstation, using the information obtained from the monoprocessor execution (the trace of the program), and the cluster execution (the execution cost of the different blocks). procedure Polygon::divide() if area() > maximum area allowed for j = 0 to 3 addLeaf (division(j)) getLeaf (j).divide() endfor endif endprocedure Fig. 8. Polygons subdivision block Divide Polygons N = amount of polygons to divide i=0 while i < N if polygon[i].area() > maximum area allowed polygon[i].setLeafPointer(N) for j = 0 to 3 polygon[N ] = polygon[i].division(j) N =N +1 endfor endif i=i+1 endwhile endblock Fig. 9. Removing recursivity and dynamic memory from methods in Fig. 8 A. Basic blocks Fig. 7. The execution subsystem needs to divide in basic blocks the algorithm to be executed. In our case, such a division can be seen in Fig. 10, where: • Parallelizable loop means that the iterations can be executed at once using speculation on the induction variable. • Non-parallelizable loop means that there are dependencies that cannot be solved by speculation. Execution and Simulation Subsystem III. O UR CONTRIBUTION To apply the speculation techniques seen in section II-D to the multipath algorithm (section II-C) it has been needed to avoid the use of dynamic memory, because the cluster’s nodes do not use shared memory, and the objects positions depend on the node. We have also avoided recursivity because it produces SPECTS 2009 B. Algorithm modifications In a given iteration (see algorithm in Fig. 4), the dependencies because of the power accumulators in the AR bear that 92 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. is optimal near approximately 20 nodes, tending to decrease from this value on. Fig. 10. Block division for the master node. Fig. 11. Running time vs. number of nodes. Horizontal line shows the running time for the sequential implementation when generating the same number of lines. we can only run at once MP for the boxes of the same level. If power accumulated in the AR was not used until the next iteration, MP could be run at once for all the boxes. Thus, we decide to duplicate the AR accumulators of incoming and outgoing power, keeping the power accumulated in the current iteration and distributing it in the following one. For a given execution of multipath, increasing the performance of parallel execution involves breaking dependencies on the unshot power variables (see section II-B) between the global lines. In this sense, we add a new field for each patch that accumulates the unshot power of a whole multipath iteration in order to be distributed in the following one. One of the problems of such a modification is that the unshot power of the last iteration will be never distributed. Such a problem has been reduced by increasing the number of iterations, which reduces the amount of unresolved power after the last iteration. Since the MSSPAC has a limitation on the amount of embedded loops that can be managed, and it is not possible to run multipath iterations at once, we applied a loop-unrolling in the Multipath block, so that it is possible to run at once all the boxes and use several slaves for each box. We also reduced the hard disk operations by increasing the buffer size. Fig. 12. Speed-up of our parallel implementation regarding to the number of nodes. IV. R ESULTS In our experiments we have employed the scene office, which represents a room with 4 tables, a desk and five chairs. The light source is a lamp stuck on the center of the ceiling. Our experiments have been done in a cluster of Pentium IV with 3 Ghz and 1 Gbyte of RAM. The PCs, which are connected in a LAN, have a shared hard disk, where the global lines have been stored. As shown in Fig. 11, the use of the speculation techniques has reduced the cost with respect to the sequential implementation of the algorithm. Fig. 11 also shows the tendency of the cost to stabilize from a given number of nodes on. The speedup of the parallel implementation regarding to the number of nodes can be seen in Fig. 12, while the relative efficiency, that is, the speed-up divided by the number of nodes, is shown in Fig. 13, where it can be seen that the relative efficiency SPECTS 2009 Fig. 13. Relative efficiency of our parallel implementation regarding to the number of nodes. It is worth to be mentioned that, since speculation allows to obtain a higher degree of parallelization, the speed-up obtained with a parallel implementation without speculation would be lower than the one presented above. On the other hand, we have to remark that a slight bias has been introduced by the speculative parallel implementation, due to the undistributed 93 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. power after the last iteration (see section III-B). Such a bias is visually imperceptible, however, as we can see in the images in Fig. 14 and 15, in which we show some details of the scene office from a radiosity solution obtained with our parallel implementation. Fig. 14. execution of the algorithm. Such a model has allowed us to carry out empirical studies on the parallel implementation, for example on the selection of the size of the blocks (pieces in which we divide the program). Our implementation of the algorithm in the engine of speculation (MSSPACC) has allowed us to extract conclusions, but also new ideas in order to improve the performance by meeting some engine lacks. We have also proposed the implementation of the algorithm in a multiprocessor to compare an environment where the communication between the processors is based on the pass of messages (clusters) to an environment based on shared memory and the PVM (multiprocessor). This comparison will allow us to know the advantages and drawbacks of both models and to explore the possibility of mixing both models according to the problem evolution. ACKNOWLEDGMENTS This project has been funded in part with grants number TIN2007-68066-C04-01, TIN2008-06862-C04-02/TSI, and TEC2006-03883/TCM from the Spanish Government, partially supported by the European Union project N 216746 and Universitat de Girona research grant BR09/10 awarded to Albert Trias. Detail of scene office R EFERENCES Fig. 15. [1] M. Cohen and J. Wallace, Radiosity and Realistic Image Synthesis. Academic Press Professional, 1993. [2] F. Sillion and C. Puech, Radiosity and Global Illumination. Morgan Kaufmann Pubilishers, Inc., 1994. [3] M. Sbert, X. Pueyo, L. Neumann, and W. Purgathofer, “Global multipath monte carlo algorithms for radiosity,” The Visual Computer, pp. 47–61, 1996. [4] F. Castro, M. Sbert, and L. Neumann, “Fast multipath radiosity using hierarchical subscenes,” Computer Graphics Forum, vol.23(1),, pp. 43– 53, 2004. [5] M. Sbert, F. Pérez, and X. Pueyo, “Global monte carlo. a progressive solution,” Rendering Techniques ’95. Springer Wien New York, pp. 231– 239, 1995. [6] L. Neumann, “Monte carlo radiosity,” Computing, 55, pp. 23–42, 1995. [7] L. Szirmay-Kalos, T. Foris, L. Neumann, and B. Csebfalvi, “An analysis of quasi-monte-carlo integration applied to the transillumination radiosity method,” Proceedings of Eurographics 97, 1997. [8] L. Santaló, Integral Geometry and Geometric Probability. Ed. AddisonWesley, New York, 1976. [9] J. Goldsmith and J. Salmon, “Automatic creation of object hierarchies for ray tracing,” IEEE Computer Graphics and Applications, vol. 7, no. 5, pp. 14–20, 1987. [10] F. Castro, R. Martinez, and M. Sbert, “Quasi-monte-carlo and extended first shot improvements to the multipath method,” Proceedings SCCG ’98, (Budmerice, Slovakia), 1998. [11] J. Puiggalı́, T. Jové, S. Salanova, and J. Marzo, “Execution speed up using speculation techniques in computer clusters.” International Mediterranean Modelling Multiconference (IMM 06), pp. 561–568, 2006. [12] ——, “Execution limits of tls execution of sequencial programs on clusters.” International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS 06), pp. 12–19, 2006. [13] J. Puiggalı́, T. Jové, J. Segovia, and J. Marzo, “Master/slave speculative parallelization architecture for computer clusters.” ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2007), 2007. [14] J. E. Smith, “A study of branch prediction strategies.” In Proceedings of the 8th Annual Symposium on Computer Architecture., pp. 135–148, 1981. [15] C. B., G. Reinman, and D. Tullsen, “Selective value prediction.” In Proceedings of the 26th Annual International Symposium on Computer Architecture., 1999. Detail of scene office V. C ONCLUSIONS AND FUTURE WORK The main goal of this work has been to improve the performance of the multipath algorithm, a Monte Carlo radiosity method that simulates the illumination of an environment. Taking advantage of a space-division-based hierarchical version of multipath, it has been parallelized using speculation in a distributed Master/Slave Speculative Parallelization Architecture for Computer Clusters [11], [12], [13]. The obtained results show that such a goal has been achieved by applying data and control speculation. These techniques try to use the maximum number of nodes in parallel by increasing the inherent degree of parallelism of the algorithm, and thus obtaining better performance from the processors. We conclude that such a degree of parallelism depends on many factors, such as scene complexity, space division, direct lighting distribution, and number of Monte Carlo samples. One of the first tasks carried out was to obtain a mathematical model that represents the behavior of the SPECTS 2009 94 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. [16] C. Zilles and G. Sohi, “Master/slave speculative parallelization,” Proc. of the 35th Int. Symp. on Microarchitecture., 2002. [17] J. Steffan, C. Colohan, A. Zhain, and T. Mowry, “Improving value communication for thread-level speculation.” Intl. Symp. on HighPerformance Computer Architecture, 2002. [18] E. Larson and T. Austin, “Compiler controlled value prediction using branch predictor based confidence.” In Proceedings of the 33nd Annual ACM/IEEE International Symposium on Microarchitecture., 2000. [19] J. Gonzalez and A. Gonzalez, “The potencial of data value speculation to boosst ilp.” Proceedings of the 12th International Conference of Supercomputing., 1998. [20] Q. Jacobson, S. Bennett, N. Sharma, and J. Smith, “Control flow speculation in multiscalar processors.” In To Appear in Proceedings of the Third International Symposium on High-Performance Computer Architecture., 1997. [21] M. H. Lipasti, W. C. B., and J. P. Shen, “Value locality and data speculation.” in Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems., pp. 138–147, 1996. [22] Y. Sazeides, S. Vassiliadis, and J. Smith, “The performance potential of data dependence speculation and collapsing,” 9th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-29), 1996. [23] Y. Sazeides and J. Smith, “The predictability of data values,” 10th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1997. [24] T. Yeh and Y. Patt, “Two-level adaptive branch prediction.” Proceedings of the 24th ACM/IEEE International Symposium on Microarchitecture, pp. 51–61, 1991. SPECTS 2009 95 Authorized licensed use limited to: UNIVERSITAT DE GIRONA. Downloaded on April 23,2010 at 12:04:01 UTC from IEEE Xplore. Restrictions apply. View publication stats