This paper presents a many-core visual computing architecture code-named Larrabee, a new software rendering pipeline, a many-core programming model, and performance analysis for several applications. Larrabee uses multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as some fixed-function logic blocks. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads. It also greatly increases the flexibility and programmability of the architecture as compared to standard GPUs. A coherent on-die second-level cache allows efficient inter-processor communication and high-bandwidth local data access by CPU cores. In Larrabee, task scheduling is performed entirely in software rather than in fixed-function logic. The customizable software graphics rendering pipeline for this architecture uses binning in order to reduce required memory bandwidth, minimize lock contention, and increase opportunities for parallelism relative to standard GPUs. The Larrabee native programming model supports a variety of highly parallel applications that use irregular data structures. Performance analysis on those applications demonstrates Larrabee's potential for a broad range of parallel computation.
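As a rough illustration of what scheduling tasks in software rather than in fixed-function logic can look like, the sketch below has worker threads pull tasks from a shared queue until it is empty, so load balances dynamically. It is not Larrabee's scheduler, which the abstract does not detail: the function name `run_tasks`, the `num_workers` parameter, and the use of Python threads as stand-ins for cores are all assumptions made for illustration.

```python
import queue
import threading


def run_tasks(tasks, num_workers=4):
    """Run callables from a shared queue on worker threads.

    Hypothetical helper: results are returned in task order, regardless of
    which worker happened to execute each task.
    """
    work = queue.Queue()
    for index, task in enumerate(tasks):
        work.put((index, task))
    results = [None] * len(tasks)

    def worker():
        # Each worker repeatedly grabs the next available task; a worker
        # that finishes early simply takes more work (dynamic scheduling).
        while True:
            try:
                index, task = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Eight small illustrative tasks shared among three workers.
    squares = run_tasks([lambda n=n: n * n for n in range(8)], num_workers=3)
    print(squares)
```

Because the policy lives in ordinary code, it could be changed (for example, to prioritize certain tasks) without any hardware change, which is the flexibility the abstract attributes to software scheduling.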

Supplementary Material

MOV File (a18-seiler.mov)
22.57 MB

Cited By

Radl L, Steiner M, Parger M, Weinrauch A, Kerbl B, Steinberger M (2024). StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering. ACM Transactions on Graphics 43:4 (1-17). Online publication date: 19-Jul-2024.
https://dl.acm.org/doi/10.1145/3658187
El-Mesady A, Farahat T, El-Shanawany R, Romanov A, Sukhov A (2023). The Novel Generally Described Graphs for Cyclic Orthogonal Double Covers of Some Circulants. Lobachevskii Journal of Mathematics 44:7 (2638-2650). Online publication date: 28-Oct-2023.
https://doi.org/10.1134/S1995080223070132
Rao A, Sudarshan T (2023). Performance of Artificial Intelligence Applications on Embedded Parallel Platforms Using Python as the Language Tool. Soft Computing Applications (409-425). Online publication date: 27-Oct-2023.
https://doi.org/10.1007/978-3-031-23636-5_31

Index Terms

Larrabee: a many-core x86 architecture for visual computing
1. Computing methodologies
  1. Computer graphics
  2. Parallel computing methodologies

Recommendations

Larrabee: A Many-Core x86 Architecture for Visual Computing

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The ...
Larrabee: a many-core Intel® architecture for visual computing
CF '09: Proceedings of the 6th ACM conference on Computing frontiers

The ample supply of transistors provided by advancements in process technology, combined with the increased difficulty to exploit single thread performance, moved the industry to populate several cores on a single die. This talk presents Larrabee -- ...
Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures

Medical volumetric imaging requires high fidelity, high performance rendering algorithms. We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major ...

Reviews

Reviewer: Hector Yee

In the early years of computer graphics, software renderers were very popular on the personal computer. These renderers have recently been supplanted by graphics processing units (GPUs), which first took over fixed-function operations such as triangle setup and rasterization, and eventually grew to encompass the computation of transformation and lighting of geometry completely in hardware. Recent GPU technology enables the user to have limited customization of the shading of pixels and transformation of geometry by means of programmable graphics hardware. However, some kinds of operations, such as creation and manipulation of dynamic data structures (for example, linked lists and other irregular data structures), are still difficult to implement on graphics hardware, yet are important for many rendering problems. The Larrabee architecture, described by the authors, attempts to address issues such as these by implementing a multi-core general-purpose processor-based architecture, augmented with several vector units, as an alternative to the classic GPU model.

This paper is written in two parts, the first describing the hardware architecture and the second describing an implementation of a software renderer running on top of the architecture. The hardware is described as many in-order central processing units (CPUs), based on the Intel x86 architecture, connected by an interprocessor ring network for communication, with each having its own L2 cache. The hardware has additional fixed-function units that perform tasks such as texture filtering, which is difficult to implement efficiently in software. Almost everything else, such as shading and geometry transformation, is done in software. The Larrabee software renderer follows a sort-middle architecture, where polygons are binned for rendering and then each block is rendered at once, in order to use the CPU as much as possible while not saturating the bandwidth with too many simultaneous memory requests.

The authors show almost linear scale-up with the number of CPUs for applications such as game fluid simulation; applications such as rigid-body simulations do not scale up as well. It is interesting to see real-time rendering software come full circle from software to hardware and now back to software. I am eager to see the actual hardware in operation in the future. My only disappointment with the paper is the lack of comparison with existing GPUs in terms of performance on state-of-the-art games. The authors do, however, provide a detailed analysis of how each game uses the CPU and bandwidth of the architecture.

Online Computing Reviews Service
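The binning step that the review describes can be sketched as follows. This is a minimal illustration, not the paper's renderer: the function name `bin_triangles`, the 64-pixel tile size, the integer pixel coordinates, and the conservative bounding-box overlap test are all assumptions. Each triangle is appended to the bin of every screen tile its bounding box touches, so that a later pass can process one tile at a time using only that tile's list.

```python
from collections import defaultdict


def bin_triangles(triangles, width, height, tile_size=64):
    """Return {(tile_x, tile_y): [triangle indices]} by bounding-box overlap.

    Hypothetical helper; vertices are (x, y) pairs of integer pixel
    coordinates, and tile_size is an illustrative value.
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    for tri_id, verts in enumerate(triangles):
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        # Discard triangles whose bounding box lies entirely off screen.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Clamp the bounding box to the screen and convert to tile indices.
        tx0 = max(0, min(xs) // tile_size)
        tx1 = min(max_tx, max(xs) // tile_size)
        ty0 = max(0, min(ys) // tile_size)
        ty1 = min(max_ty, max(ys) // tile_size)
        # Appending in submission order keeps per-tile ordering stable.
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(tri_id)
    return dict(bins)


if __name__ == "__main__":
    # Three illustrative triangles on a 128x128 screen (four 64x64 tiles).
    triangles = [
        [(10, 10), (50, 10), (10, 50)],
        [(40, 40), (100, 40), (40, 100)],
        [(70, 70), (120, 70), (70, 120)],
    ]
    bins = bin_triangles(triangles, 128, 128)
    for tile in sorted(bins):
        print(tile, bins[tile])
```

Once the bins are built, each tile's list is independent of the others, so tiles can be handed to different cores and a tile's pixels can stay in that core's cache while its triangles are processed. The bounding-box test may place a triangle in a tile it does not actually cover; a real renderer would likely use a tighter test.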

Information

Published In

ACM Transactions on Graphics Volume 27, Issue 3

August 2008

844 pages

ISSN:0730-0301

EISSN:1557-7368

DOI:10.1145/1360612

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 August 2008

Published in TOG Volume 27, Issue 3

Qualifiers

Research-article

Bibliometrics

Total Citations: 434
Total Downloads: 15,615
Downloads (Last 12 months): 127
Downloads (Last 6 weeks): 10

Reflects downloads up to 02 Oct 2024

Citations

Cited By

Weibert A, Ahiataku S, Dawuni G, Aal K, Misaki K, Wulf V (2022). Looking Past the Miracle Box. Proceedings of the ACM on Human-Computer Interaction 7:GROUP (1-13). Online publication date: 29-Dec-2022.
https://dl.acm.org/doi/10.1145/3567565
Brown B, Laurier E, Vinkhuyzen E (2022). Designing Motion: Lessons for Self-driving and Robotic Motion from Human Traffic Interaction. Proceedings of the ACM on Human-Computer Interaction 7:GROUP (1-21). Online publication date: 29-Dec-2022.
https://dl.acm.org/doi/10.1145/3567555
Hedditch S, Vyas D (2022). Design Justice in Practice. Proceedings of the ACM on Human-Computer Interaction 7:GROUP (1-39). Online publication date: 29-Dec-2022.
https://dl.acm.org/doi/10.1145/3567554
Klimiankou Y, Serafini M, Xu H (2022). Towards practical multikernel OSes with MySyS. Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems (29-37). Online publication date: 23-Aug-2022.
https://dl.acm.org/doi/10.1145/3546591.3547525
Singh R, Bohra M, Hemrajani P, Kalla A, Bhatt D, Purohit N, Daneshtalab M (2022). Review, Analysis, and Implementation of Path Selection Strategies for 2D NoCs. IEEE Access 10 (129245-129268). Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3227460
Jiang X, Xia Y, Zhang X, Ma J (2022). Robust image matching via local graph structure consensus. Pattern Recognition 126:C. Online publication date: 1-Jun-2022.
https://dl.acm.org/doi/10.1016/j.patcog.2022.108588
Liu S, Song X, Ma Z, Ganaa E, Shen X (2022). MoRE. Pattern Recognition 126:C. Online publication date: 1-Jun-2022.
https://dl.acm.org/doi/10.1016/j.patcog.2022.108584
