research-article
Open access

Versatility of extended subwords and the matrix register file

Published: 29 May 2008

Abstract

Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise access and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers form a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, it is shown how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, it is shown that special-purpose instructions (SPIs), such as the sum-of-absolute-differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is sufficient neither for quarter-pixel resolution nor for the cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using at most 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
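The column-wise access described in the abstract can be illustrated with a small software model. The sketch below is only a toy, not the instruction set proposed in the paper: the class `MatrixRegisterFile`, its methods, and the helper `deinterleave_rgb` are hypothetical names introduced here, and the register width of three subwords is an assumption chosen so that one register holds one RGB pixel. It shows how writing registers row-wise and reading them column-wise gathers strided data (all R, all G, all B values) without explicit shuffle operations.

```python
class MatrixRegisterFile:
    """Toy model of a register file viewed as a matrix of subwords.

    Hypothetical illustration only: `rows` registers, each holding
    `cols` subwords. Real hardware would fix both dimensions.
    """

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.regs = [[0] * cols for _ in range(rows)]

    def write_row(self, r, subwords):
        # Conventional row-wise access: fill one whole register.
        if len(subwords) != self.cols:
            raise ValueError("register width mismatch")
        self.regs[r] = list(subwords)

    def read_row(self, r):
        # Conventional row-wise access: read one whole register.
        return list(self.regs[r])

    def read_column(self, c):
        # Column-wise access: subword c of every register.
        return [self.regs[r][c] for r in range(self.rows)]


def deinterleave_rgb(pixels):
    """Split interleaved [R0, G0, B0, R1, G1, B1, ...] into three planes.

    Assumption: one register per pixel, three subwords per register.
    """
    if len(pixels) % 3 != 0:
        raise ValueError("expected a whole number of RGB pixels")
    n = len(pixels) // 3
    mrf = MatrixRegisterFile(rows=n, cols=3)
    for i in range(n):
        mrf.write_row(i, pixels[3 * i:3 * i + 3])
    # Each column read returns data that is stride-3 in memory.
    return tuple(mrf.read_column(c) for c in range(3))


if __name__ == "__main__":
    r, g, b = deinterleave_rgb([10, 20, 30, 11, 21, 31])
    print(r, g, b)  # [10, 11] [20, 21] [30, 31]
```

Reading every column of a square instance of this model yields the transpose of the rows that were written, which is the operation the MRF was originally introduced for; the stride-3 case above is the kind of access pattern needed in color space conversion.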



    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 5, Issue 1
    May 2008
    182 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/1369396

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 May 2008
    Accepted: 30 November 2007
    Revised: 08 August 2007
    Received: 12 March 2007
    Published in TACO Volume 5, Issue 1

    Author Tags

    1. SIMD architectures
    2. SIMD programming
    3. multimedia standards

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2021) Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions. CCF Transactions on High Performance Computing. DOI: 10.1007/s42514-020-00057-2. Online publication date: 31-Mar-2021
    • (2014) A Comparative Study of Color Space Conversion on Homogeneous and Heterogeneous Multicore. Applied Mechanics and Materials 519-520 (724-728). DOI: 10.4028/www.scientific.net/AMM.519-520.724. Online publication date: Feb-2014
    • (2014) FT-Matrix: A Coordination-Aware Architecture for Signal Processing. IEEE Micro 34:6 (64-73). DOI: 10.1109/MM.2013.129. Online publication date: Nov-2014
    • (2013) Accelerating Color Space Conversion Using CUDA-Enabled Graphic Processing Units. Advanced Materials Research 716 (505-509). DOI: 10.4028/www.scientific.net/AMR.716.505. Online publication date: Jul-2013
    • (2013) Customized MMRF: Efficient Matrix Operations on SIMD Processors. Applied Mechanics and Materials 347-350 (1727-1731). DOI: 10.4028/www.scientific.net/AMM.347-350.1727. Online publication date: Aug-2013
    • (2012) CMRF: a Configurable Matrix Register File for accelerating matrix operations on SIMD processors. IEICE Electronics Express 9:4 (283-289). DOI: 10.1587/elex.9.283. Online publication date: 2012
    • (2012) Architectural Implications for SIMD Processors in the Wireless Communication Domain. Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (1199-1204). DOI: 10.1109/HPCC.2012.176. Online publication date: 25-Jun-2012
    • (2011) Color-Aware Instructions for Embedded Superscalar Processors. Journal of Signal Processing Systems 64:3 (335-350). DOI: 10.1007/s11265-010-0497-2. Online publication date: 1-Sep-2011
    • (2009) SIMD Architectural Enhancements to Improve the Performance of the 2D Discrete Wavelet Transform. Proceedings of the 2009 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools (497-504). DOI: 10.1109/DSD.2009.189. Online publication date: 27-Aug-2009
    • (2009) Performance Improvement of Multimedia Kernels by Alleviating Overhead Instructions on SIMD Devices. Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies (389-407). DOI: 10.1007/978-3-642-03644-6_31. Online publication date: 21-Aug-2009
