
CN115061731B - Shuffling circuit and method, chip and integrated circuit device - Google Patents

Shuffling circuit and method, chip and integrated circuit device

Publication number
CN115061731B
Authority
CN
China
Prior art keywords
data
result data
operation data
thread
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210717989.2A
Other languages
Chinese (zh)
Other versions
CN115061731A (en
Inventor
张春焱
李凯
于冰
张钰勃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd
Priority to CN202210717989.2A
Publication of CN115061731A
Application granted
Publication of CN115061731B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/80 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a shuffling circuit comprising a control circuit, an input selector, a shuffler, and an output selector, wherein the control circuit is configured to: divide m threads into n thread groups according to the maximum number k of threads that the shuffler can process in parallel, generate data correspondence information for each thread group, and send the data correspondence information to the input selector and the output selector, wherein the data correspondence information defines from the operation data of which one or more thread groups the result data of each thread group is obtained, and k, m and n are integers greater than or equal to 1. The control circuit may generate the data correspondence information based on the SIMD mode, or based on a result data index flag and an operation data index flag. The present disclosure also provides a data shuffling method that may be used for the shuffling circuit, and also relates to a chip including the shuffling circuit and an integrated circuit device including the chip.

Description

Shuffling circuit and method, chip and integrated circuit device
Technical Field
The present disclosure relates to the field of electrical technology, and in particular, to a shuffling circuit and a data shuffling method applicable to a plurality of SIMD modes, and also to a chip including the shuffling circuit and an integrated circuit device including the chip.
Background
A shuffling operation is a data processing method that redistributes data among multiple threads, enabling inter-thread data sharing and data reordering; it is therefore widely used in application programming interfaces (APIs) such as DX, CUDA, or VULKAN. Currently, different chips are based on different SIMD architectures, such as the SIMD32, SIMD64, or SIMD128 architecture. These SIMD architectures often need to support multiple APIs for compatibility, and thus need to accommodate multiple SIMD modes. However, the shuffling circuit is typically a fixed implementation designed for the SIMD architecture at hand, so its support for the different SIMD modes required by different APIs is unsatisfactory.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a shuffling circuit comprising a control circuit, an input selector, a shuffler and an output selector, wherein: the control circuit is configured to: divide m threads into n thread groups according to the maximum number k of threads that the shuffler can process in parallel, generate data correspondence information, and send the data correspondence information to the input selector and the output selector, wherein the data correspondence information defines from the operation data of which one or more thread groups the result data of each thread group is obtained, and k, m and n are integers greater than or equal to 1; the input selector is configured to: select the operation data of one or more corresponding thread groups from the n thread groups according to the received data correspondence information, and send the operation data to the shuffler sequentially in a predetermined order of the thread groups; the shuffler is configured to: sequentially receive the operation data of the one or more corresponding thread groups from the input selector, perform a shuffling operation on the received k operation data of each corresponding thread group, and output j shuffle output data, wherein j is an integer and 0 ≤ j ≤ k; the output selector is configured to: sequentially receive, from the shuffler, the shuffle output data obtained from the operation data of the one or more corresponding thread groups, and generate result data for each thread group based on the shuffle output data in accordance with the received data correspondence information.
According to some exemplary embodiments of the present disclosure, the control circuit is further configured to generate the data correspondence information based on the SIMD mode.
According to some exemplary embodiments of the present disclosure, in the shuffling circuit according to the first aspect of the present disclosure, m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of SIMD32 mode, SIMD64 mode and SIMD128 mode.
According to some exemplary embodiments of the disclosure, when the SIMD mode is a SIMD32 mode, the data correspondence information includes: obtaining result data of the first thread group from operation data of the first thread group; obtaining result data of the second thread group from the operation data of the second thread group; obtaining result data of the third thread group from the operation data of the third thread group; and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.
According to some exemplary embodiments of the present disclosure, when the SIMD mode is a SIMD64 mode, the data correspondence information includes: obtaining result data of the first thread group from operation data of the first thread group and the second thread group; obtaining result data of a second thread group from operation data of the first thread group and the second thread group; obtaining result data of a third thread group from operation data of the third thread group and the fourth thread group; and obtaining the result data of the fourth thread group from the operation data of the third thread group and the fourth thread group.
According to some example embodiments of the present disclosure, when the SIMD mode is a SIMD128 mode, the data correspondence information includes: obtaining result data of a first thread group from operation data of the first, second, third and fourth thread groups; obtaining result data of a second thread group from operation data of the first, second, third and fourth thread groups; obtaining result data of a third thread group from operation data of the first, second, third and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the first, second, third and fourth thread groups.
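The three SIMD-mode cases above follow one pattern, which can be sketched in Python. This is only an illustration under stated assumptions, not the circuit of the disclosure: the function name `correspondence_from_simd_mode` and the list-of-lists representation are hypothetical, and thread groups are numbered from 0 (so the "first thread group" is group 0).

```python
def correspondence_from_simd_mode(simd_mode, m=128, k=32):
    """Return, for each result thread group, the operation thread groups it reads.

    simd_mode is the thread count of the mode (32, 64 or 128 in the example
    embodiment); m and k are the thread total and shuffler width.
    Groups are numbered from 0 (an assumption made for this sketch).
    """
    if m % k or simd_mode % k or m % simd_mode:
        raise ValueError("m and simd_mode must be multiples of k, and m a multiple of simd_mode")
    n = m // k                 # number of thread groups
    span = simd_mode // k      # thread groups covered by one SIMD-mode block
    info = []
    for group in range(n):
        first = (group // span) * span   # first group of the block containing `group`
        info.append(list(range(first, first + span)))
    return info


if __name__ == "__main__":
    for mode in (32, 64, 128):
        print(mode, correspondence_from_simd_mode(mode))
```

With m = 128 and k = 32, SIMD32 mode maps each group to itself, SIMD64 mode pairs groups (0, 1) and (2, 3), and SIMD128 mode lets every group read all four groups, matching the three cases listed above.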
According to some exemplary embodiments of the present disclosure, the control circuit includes a result data index flag generator and an operation data index flag generator, wherein: the result data index flag generator is configured to: generate an n-bit result data index flag according to the validity of the m threads, wherein each bit in the result data index flag corresponds to the group of result data of one of the n thread groups; the operation data index flag generator is configured to: calculate the operation data index corresponding to each result data, so as to generate an n-bit operation data index flag for the group of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to the group of operation data of one of the n thread groups; and the control circuit is configured to: generate the data correspondence information based on the result data index flag and the operation data index flag.
According to some exemplary embodiments of the present disclosure, the control circuit is further configured to: determine the group of result data of each thread group corresponding to a bit with the value 1 in the result data index flag as a valid result data group, and, for each valid result data group, obtain the result data from the group of operation data of each thread group corresponding to a bit with the value 1 in the corresponding operation data index flag.
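The flag-based variant can be sketched in Python as follows. Several details are assumptions made for illustration rather than statements of the disclosure: a result group's bit is set when at least one of its threads is valid, each valid thread carries a hypothetical source index `src_index[t]` naming the operation data it reads, bit g of a flag stands for thread group g (numbered from 0), and all function names are invented for this sketch.

```python
def result_index_flag(valid, k):
    """n-bit flag: bit g is 1 if result group g has at least one valid thread (assumed rule)."""
    n = len(valid) // k
    flag = 0
    for g in range(n):
        if any(valid[g * k:(g + 1) * k]):
            flag |= 1 << g
    return flag


def operation_index_flags(src_index, valid, k):
    """One n-bit flag per result group: bit s is 1 if some valid thread of the
    group reads operation data belonging to thread group s."""
    n = len(valid) // k
    flags = [0] * n
    for t, src in enumerate(src_index):
        if valid[t]:
            flags[t // k] |= 1 << (src // k)
    return flags


def correspondence(result_flag, op_flags):
    """Map each valid result group to the operation groups flagged for it."""
    n = len(op_flags)
    return {
        g: [s for s in range(n) if (op_flags[g] >> s) & 1]
        for g in range(n)
        if (result_flag >> g) & 1
    }


if __name__ == "__main__":
    # Small illustrative sizes: m = 8 threads, k = 2, hence n = 4 groups.
    valid = [1, 1, 0, 0, 1, 0, 0, 0]
    src_index = [1, 2, 0, 0, 7, 0, 0, 0]
    rf = result_index_flag(valid, 2)
    of = operation_index_flags(src_index, valid, 2)
    print(bin(rf), [bin(f) for f in of], correspondence(rf, of))
```

Result groups whose flag bit is 0 are omitted from the mapping, so no shuffling period would be spent on them.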
According to some exemplary embodiments of the present disclosure, in a shuffling circuit whose control circuit includes a result data index flag generator and an operation data index flag generator, m has a value of 128, k has a value of 32, and n has a value of 4.
According to a second aspect of the present disclosure, there is provided a data shuffling method comprising: dividing m threads into n thread groups according to the maximum number k of threads that can be processed in parallel, wherein each thread group comprises k threads, and k, m and n are integers greater than or equal to 1; generating data correspondence information defining from the operation data of which one or more thread groups the result data of each thread group is obtained; selecting the operation data of one or more corresponding thread groups from the n thread groups according to the data correspondence information; performing a shuffling operation on the k operation data of each corresponding thread group and outputting j shuffle output data, wherein j is an integer and 0 ≤ j ≤ k; and generating result data for each thread group according to the data correspondence information, based on the shuffle output data obtained from the operation data of the one or more corresponding thread groups.
According to some exemplary embodiments of the present disclosure, the generating the data correspondence information includes: generating the data correspondence information based on the SIMD mode.
According to some exemplary embodiments of the present disclosure, in the data shuffling method according to the second aspect of the present disclosure, m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of SIMD32 mode, SIMD64 mode and SIMD128 mode.
According to some example embodiments of the present disclosure, the generating the data correspondence information based on SIMD mode includes: when the SIMD mode is a SIMD32 mode, the data correspondence information includes: obtaining result data of the first thread group from operation data of the first thread group; obtaining result data of the second thread group from the operation data of the second thread group; obtaining result data of the third thread group from the operation data of the third thread group; and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.
According to some example embodiments of the present disclosure, the generating the data correspondence information based on SIMD mode includes: when the SIMD mode is a SIMD64 mode, the data correspondence information includes: obtaining result data of the first thread group from operation data of the first thread group and the second thread group; obtaining result data of a second thread group from operation data of the first thread group and the second thread group; obtaining result data of a third thread group from operation data of the third thread group and the fourth thread group; and obtaining the result data of the fourth thread group from the operation data of the third thread group and the fourth thread group.
According to some example embodiments of the present disclosure, the generating the data correspondence information based on SIMD mode includes: when the SIMD mode is a SIMD128 mode, the data correspondence information includes: obtaining result data of a first thread group from operation data of the first, second, third and fourth thread groups; obtaining result data of a second thread group from operation data of the first, second, third and fourth thread groups; obtaining result data of a third thread group from operation data of the first, second, third and fourth thread groups; and obtaining the result data of the fourth thread group from the operation data of the first, second, third and fourth thread groups.
According to some exemplary embodiments of the present disclosure, the generating the data correspondence information includes: generating an n-bit result data index flag according to the validity of the m threads, wherein each bit in the result data index flag corresponds to the group of result data of one of the n thread groups; calculating the operation data index corresponding to each result data, so as to generate an n-bit operation data index flag for the group of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to the group of operation data of one of the n thread groups; and generating the data correspondence information based on the result data index flag and the operation data index flag.
According to some exemplary embodiments of the present disclosure, the generating the data correspondence information based on the result data index flag and the operation data index flag includes: determining the group of result data of each thread group corresponding to a bit with the value 1 in the result data index flag as a valid result data group; and, for each valid result data group, obtaining the result data from the operation data of each thread group corresponding to a bit with the value 1 in the corresponding operation data index flag.
According to a third aspect of the present disclosure, there is provided a SIMD architecture-based chip comprising a shuffling circuit provided according to the first aspect of the present disclosure and exemplary embodiments thereof.
According to some example embodiments of the disclosure, the chip is a GPU chip.
According to a fourth aspect of the present disclosure, there is provided an integrated circuit device comprising at least one chip provided according to the third aspect of the present disclosure and exemplary embodiments thereof.
Drawings
Specific embodiments of the present disclosure will be described in detail below with reference to the drawings so that more details, features, and advantages of the present disclosure can be more fully appreciated and understood; in the drawings:
FIG. 1 schematically illustrates a prior art shuffle circuit;
FIG. 2 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with one exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure;
FIGS. 4a, 4b, 4c schematically show the correspondence between sets of operation data and sets of result data of the shuffling circuit shown in FIG. 3 in different SIMD modes;
FIG. 5 schematically illustrates, in block diagram form, the structure of a shuffle circuit in accordance with another exemplary embodiment of the present disclosure;
FIGS. 6a, 6b, 6c, 6d schematically illustrate the operation of the shuffling circuit of FIG. 5 for one SIMD mode;
FIGS. 7a, 7b, 7c, and 7d schematically illustrate the operation of the shuffling circuit shown in FIG. 5 for another SIMD mode;
FIG. 8 schematically illustrates, in flow chart form, a data shuffling method according to an exemplary embodiment of the present disclosure;
FIG. 9 shows, in flow chart form, details of the data shuffling method shown in FIG. 8;
FIG. 10 shows, in flow chart form, details of the data shuffling method shown in FIG. 9;
FIG. 11 schematically illustrates, in block diagram form, the structure of a chip in accordance with one exemplary embodiment of the present disclosure; and
FIG. 12 schematically illustrates, in block diagram form, the structure of an integrated circuit device according to one exemplary embodiment of the present disclosure.
It should be understood that the matters shown in the drawings are merely illustrative and thus are not necessarily drawn to scale. Furthermore, the same or similar features are denoted by the same or similar reference numerals throughout the drawings.
Detailed Description
The following description provides specific details of various exemplary embodiments of the disclosure so that those skilled in the art may fully understand and practice the technical solutions according to the present disclosure.
First, some terms involved in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
Single Instruction Multiple Data (SIMD): in this disclosure, the term refers to an instruction capable of processing multiple data simultaneously. Thus, SIMD is able to obtain all operation data at once for operation, which makes it particularly suitable for data-intensive applications.
Shuffle operation: in this disclosure, the term refers to an operation that redistributes the data of a plurality of parallel threads in a predetermined manner, which enables data sharing and data reordering among threads.
Shuffling period: in this disclosure, the term refers to the time required for a shuffler to receive operation data and perform a shuffling operation on it, given the maximum number of threads it can process in parallel. For example, with the current state of the art, when a high frequency is pursued, the maximum number of threads that a shuffler can process in parallel is 32, and thus its shuffling period is the time required to receive the operation data of 32 threads and perform a shuffling operation. When a high frequency is not pursued, the maximum number of threads that the shuffler can process in parallel may also be 64 or 128, and its shuffling period is then the time required to receive the operation data of 64 or 128 threads, respectively, and perform the shuffling operation.
SIMD architecture: the term refers to an architecture that is capable of parallel processing of data for multiple threads based on SIMD approaches. In this disclosure, for a SIMD architecture, the architecture is distinguished by the maximum number of threads it can process in parallel. For example, a SIMD128 architecture refers to its ability to process data for 128 threads in parallel at maximum in a SIMD fashion. Thus, it should be understood that in this disclosure, a shuffling circuit based on SIMD128 architecture means that the shuffling circuit is capable of shuffling data from up to 128 threads.
SIMD mode: in this disclosure, the term refers to the correspondence between result data and operation data during a shuffling operation of data based on a shuffling circuit of a certain SIMD architecture. Specifically, when a shuffling circuit based on a certain SIMD architecture operates in a certain SIMD mode, a set of result data having the number corresponding to that SIMD mode is obtained only from a set of operation data having the same corresponding number.
For example, in the present disclosure, a shuffling circuit based on SIMD128 architecture and employing SIMD32 mode means that the shuffling circuit is capable of acquiring operation data from 128 parallel threads, with the following correspondence between the result data and the operation data: the 0th to 31st result data are obtained only from the 0th to 31st operation data, the 32nd to 63rd result data are obtained only from the 32nd to 63rd operation data, the 64th to 95th result data are obtained only from the 64th to 95th operation data, and the 96th to 127th result data are obtained only from the 96th to 127th operation data. Similarly, a shuffling circuit based on SIMD128 architecture and employing SIMD64 mode means that the shuffling circuit is capable of acquiring operation data from 128 parallel threads, with the following correspondence between the result data and the operation data: the 0th to 63rd result data are obtained only from the 0th to 63rd operation data, and the 64th to 127th result data are obtained only from the 64th to 127th operation data. Similarly, a shuffling circuit based on SIMD128 architecture and employing SIMD128 mode means that the shuffling circuit is capable of acquiring operation data from 128 parallel threads, with the following correspondence between the result data and the operation data: the 0th to 127th result data are obtained only from the 0th to 127th operation data.
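The index ranges in this example can be reproduced with a short Python sketch. The helper name `source_lane_range` is hypothetical and the function is only an illustration of the stated correspondence, assuming zero-based thread indices as used in the text.

```python
def source_lane_range(result_lane, simd_mode, m=128):
    """Range of operation data indices that a given result data index may read
    from when an m-thread circuit works in the given SIMD mode (32, 64 or 128)."""
    if not 0 <= result_lane < m:
        raise ValueError("result_lane must lie in [0, m)")
    if m % simd_mode:
        raise ValueError("m must be a multiple of simd_mode")
    start = (result_lane // simd_mode) * simd_mode  # first index of the enclosing block
    return range(start, start + simd_mode)


if __name__ == "__main__":
    for mode in (32, 64, 128):
        r = source_lane_range(40, mode)
        print(f"SIMD{mode}: result 40 reads operation data {r.start}..{r.stop - 1}")
```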
In consideration of the maximum number of threads that can be processed in parallel by the shuffler in the shuffle circuit, the correspondence relationship between the result data and the operation data in each SIMD mode needs to take into consideration the case where the operation data and the result data are respectively grouped according to the maximum number of threads that can be processed in parallel by the shuffler. Also taking as an example a shuffle circuit based on SIMD128 architecture and employing SIMD32 mode, if the maximum number of threads that can be processed in parallel by its shuffler is 32, the shuffle circuit divides operation data obtained from 128 parallel threads into 4 groups each including 32 operation data, and correspondingly, its result data is also divided into 4 groups each including 32 result data, and there is a correspondence between the result data and the operation data, that is: the first set of result data is obtained from the first set of operation data only, the second set of result data is obtained from the second set of operation data only, the third set of result data is obtained from the third set of operation data only, and the fourth set of result data is obtained from the fourth set of operation data only. The case of operating in SIMD64 and SIMD128 modes may be similarly considered.
Furthermore, it should be appreciated that for a shuffle circuit based on a certain SIMD architecture, it may actually operate in a plurality of different SIMD modes.
Referring to FIG. 1, a prior art shuffle circuit is schematically shown. As shown in FIG. 1, the shuffling circuit 10 is a shuffling circuit based on SIMD128 architecture, and thus it is capable of acquiring 128 operation data, i.e., operation data 0 to operation data 127, from 128 parallel threads, and correspondingly, outputting 128 result data, i.e., result data 0 to result data 127. The shuffle circuit 10 includes an input selector 12, a shuffler 13, and an output selector 14. The maximum number of threads that the shuffler 13 can process in parallel is 32, and therefore the input selector 12 divides the 128 operation data into 4 operation data groups, that is: a first operation data group 11-1 including operation data 0 through operation data 31, a second operation data group 11-2 including operation data 32 through operation data 63, a third operation data group 11-3 including operation data 64 through operation data 95, and a fourth operation data group 11-4 including operation data 96 through operation data 127. The input selector 12 selects corresponding operation data groups from these operation data groups and sends them in turn to the shuffler 13. The shuffler 13 sequentially receives the corresponding operation data groups, performs a shuffling operation on the 32 operation data in each received operation data group, and outputs j shuffle output data, where j is an integer and 0 ≤ j ≤ 32.
The output selector 14 sequentially receives, from the shuffler 13, the shuffle output data obtained from each corresponding operation data group to generate 128 result data, wherein the 128 result data are likewise divided into 4 result data groups, namely: a first result data group 15-1 comprising result data 0 to result data 31, a second result data group 15-2 comprising result data 32 to result data 63, a third result data group 15-3 comprising result data 64 to result data 95, and a fourth result data group 15-4 comprising result data 96 to result data 127.
Thus, to generate the result data of one result data group, the shuffle circuit 10 needs to traverse all four operation data groups. For example, to generate the result data included in the first result data group 15-1, the shuffling circuit 10 needs to shuffle all four operation data groups, and thus 4 shuffling cycles are required. Similarly, to generate the result data included in the second, third and fourth result data groups 15-2, 15-3 and 15-4, the shuffling circuit 10 needs to shuffle the operation data of all four operation data groups for each result data group. It can be seen that the shuffle circuit 10 requires 16 shuffling cycles in order to generate the result data included in the first, second, third and fourth result data groups 15-1, 15-2, 15-3, 15-4. However, in some SIMD modes, there is a specific correspondence between the result data of each result data group and the operation data of the operation data groups. For example, in SIMD32 mode, the result data of the first result data group 15-1 is obtained only from the operation data of the first operation data group 11-1, the result data of the second result data group 15-2 is obtained only from the operation data of the second operation data group 11-2, the result data of the third result data group 15-3 is obtained only from the operation data of the third operation data group 11-3, and the result data of the fourth result data group 15-4 is obtained only from the operation data of the fourth operation data group 11-4. Thus, in SIMD32 mode, only 4 of the 16 shuffling cycles that the shuffling circuit 10 spends to generate the result data of the first, second, third and fourth result data groups 15-1, 15-2, 15-3, 15-4 are active; the other 12 shuffling cycles generate no result data and are thus inactive.
Therefore, the shuffling circuit 10 based on the SIMD128 architecture has a problem that computational resources are wasted and processing efficiency is low when operating in the SIMD32 mode.
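The cycle arithmetic above can be checked with a small Python sketch. The function name `shuffling_cycles` is hypothetical. The SIMD32 figures (16 cycles, 4 active, 12 inactive) are the ones stated in the text; the SIMD64 and SIMD128 figures are derived here on the assumption that a cycle is active exactly when the operation data group it processes belongs to the same SIMD-mode block as the result data group being generated.

```python
def shuffling_cycles(simd_mode, m=128, k=32):
    """Return (traversed, active, inactive) shuffling cycles when every result
    group traverses all n operation groups, as in the circuit of FIG. 1."""
    n = m // k                          # number of operation/result data groups
    traversed = n * n                   # each of n result groups visits n operation groups
    active = n * (simd_mode // k)       # only groups inside the same SIMD block contribute
    return traversed, active, traversed - active


if __name__ == "__main__":
    for mode in (32, 64, 128):
        print(f"SIMD{mode}:", shuffling_cycles(mode))
```

Under these assumptions the wasted share is 12 of 16 cycles in SIMD32 mode, 8 of 16 in SIMD64 mode, and none in SIMD128 mode.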
Referring to fig. 2, a structure of a shuffling circuit according to an exemplary embodiment of the present disclosure is schematically shown in block diagram form. As shown in fig. 2, the shuffle circuit 100 includes an input selector 120, a shuffler 130, an output selector 140, and a control circuit 160. The shuffling circuit 100 is capable of acquiring m pieces of operation data from m parallel threads, and correspondingly, outputting m pieces of result data, where m is an integer greater than or equal to 1. Thus, the shuffle circuit 100 is a SIMD (m) architecture based shuffle circuit that is capable of operating in different SIMD modes.
The control circuit 160 divides the m threads into n thread groups according to the maximum number k of threads that the shuffler 130 can process in parallel, each thread group including k threads, where k, m, n are integers greater than or equal to 1. Accordingly, the operation data of the m threads is divided into n operation data groups corresponding to the n thread groups, that is: the first operation data group 110-1, the second operation data group 110-2, ..., the (n-1)th operation data group 110-(n-1), and the nth operation data group 110-n, wherein each operation data group includes k operation data. Likewise, the result data of the m threads is divided into n result data groups corresponding to the n thread groups, that is: the first result data group 150-1, the second result data group 150-2, ..., the (n-1)th result data group 150-(n-1), and the nth result data group 150-n, wherein each result data group includes k result data. Based on the n operation data groups and the n result data groups of the n thread groups, the control circuit 160 may generate data correspondence information and transmit the data correspondence information to the input selector 120 and the output selector 140. The data correspondence information defines from the operation data of which one or more thread groups the result data of each thread group is obtained. In other words, the data correspondence information defines from the operation data of which one or more operation data groups the result data in each result data group is obtained.
The input selector 120 selects operation data of one or more corresponding thread groups from the n thread groups according to the received data correspondence information, and sequentially transmits the operation data to the shuffler 130 in a predetermined order of the thread groups. It should be understood that in this disclosure, the order of thread groups refers to: in one aspect, the input selector 120 sends the set of operation data corresponding to one thread group to the shuffler 130, waits for the shuffler 130 to process the set of operation data, and then sends the set of operation data corresponding to the next thread group to the shuffler 130; on the other hand, the input selector 120 sequentially transmits all operation data sets corresponding to one result data set to the shuffler 130 in the manner described in the previous aspect, and sequentially transmits all operation data sets corresponding to the next result data set to the shuffler 130 after the shuffler 130 sequentially processes the operation data of the operation data sets in the manner described in the previous aspect.
The shuffler 130 sequentially receives the operation data of the one or more corresponding thread groups from the input selector 120, performs a shuffling operation on the received k operation data of each corresponding thread group, and outputs j shuffled output data, where j is an integer and 0 ≤ j ≤ k. The output selector 140 sequentially receives shuffle output data obtained from the operation data of the one or more corresponding thread groups from the shuffler 130, and generates result data for each thread group based on the shuffle output data according to the received data correspondence information. It should be understood that, because the input selector 120 selects the operation data of one or more corresponding thread groups from the n thread groups according to the received data correspondence information and sequentially transmits to the shuffler 130 in a predetermined order of the thread groups, the output selector 140 sequentially receives the shuffle output data from the shuffler 130, and generates a set of result data of the corresponding thread groups from the n thread groups based on the shuffle output data according to the data correspondence relationship. For example, the output selector 140 will sequentially generate the result data for each of the first result data set 150-1, the second result data set 150-2, … …, the (n-1) th result data set 150- (n-1), and the nth result data set 150-n shown in FIG. 2, respectively.
It should be appreciated that because the shuffling circuit 100 shown in fig. 2 generates data correspondence information by the control circuit 160, where the data correspondence information defines from which one or more thread groups of operation data the result data of each thread group is obtained, respectively, the shuffling circuit 100 can ensure that all of the used shuffling periods are used to generate the result data and thus are effective shuffling periods, thereby eliminating waste of computational resources and improving processing efficiency. Thus, the shuffle circuit 100 shown in FIG. 2 can operate efficiently for different SIMD modes, improving compatibility for different SIMD modes.
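As a rough functional model of this data path, the following Python sketch gathers result data one pair of (result data group, operation data group) at a time and counts the shuffling cycles used. It is a sketch under stated assumptions only: the function name shuffle_by_groups and the per-lane source index list src_index are hypothetical, the shuffling operation is modeled as a plain gather, and thread validity is ignored.

```python
def shuffle_by_groups(op_data, src_index, k):
    """Compute result[i] = op_data[src_index[i]] one group pair per cycle.

    Returns (result, cycles), where cycles counts the shuffler passes used.
    """
    m = len(op_data)
    assert m % k == 0 and len(src_index) == m
    n = m // k
    result = [None] * m
    cycles = 0
    for dst in range(n):                          # result data groups, in order
        lanes = range(dst * k, (dst + 1) * k)
        # Data correspondence information: operation groups this result group reads.
        needed = sorted({src_index[i] // k for i in lanes})
        for src in needed:                        # input selector: one group per cycle
            block = op_data[src * k:(src + 1) * k]    # k operation data for the shuffler
            for i in lanes:                       # shuffler yields j <= k outputs here
                if src_index[i] // k == src:
                    result[i] = block[src_index[i] % k]   # output selector fills the lane
            cycles += 1
    return result, cycles


data = list(range(100, 108))                      # m = 8 operation data
idx = [1, 0, 3, 2, 7, 6, 5, 4]                    # every lane reads inside its own group
out, cycles = shuffle_by_groups(data, idx, k=4)
print(out, cycles)
```

Only the operation data groups that a result data group actually reads from are sent to the shuffler in this model, which is the sense in which every counted cycle contributes result data.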
In some exemplary embodiments of the present disclosure, the control circuit 160 is capable of generating data correspondence information based on the SIMD mode in which the shuffling circuit actually operates. In other exemplary embodiments of the present disclosure, the control circuit 160 may be configured to generate the data correspondence information based on a result data index flag and an operation data index flag, wherein the result data index flag corresponds to each result data group, and the operation data index flag corresponds to each operation data group. These exemplary embodiments of the present disclosure will be respectively described in detail below.
Referring to fig. 3, a structure of a shuffling circuit according to another exemplary embodiment of the present disclosure is schematically shown in block diagram form. As shown in fig. 3, the shuffle circuit 200 includes an input selector 220, a shuffler 230, an output selector 240, and a SIMD mode control circuit 260. The shuffle circuit 200 is capable of acquiring 128 operational data from 128 parallel threads, and thus the shuffle circuit 200 is a SIMD128 architecture based shuffle circuit. The maximum number of threads that shuffler 230 can process in parallel is 32. Accordingly, SIMD mode control circuitry 260 divides 128 threads into 4 thread groups, each thread group comprising 32 threads. In other words, for the shuffling circuit 200, the values of the maximum number of threads k, the number of threads m, and the number of thread groups n that the shuffler described above can process in parallel are respectively: k has a value of 32, m has a value of 128, and n has a value of 4. Thus, the operation data of 128 threads is divided into four operation data groups corresponding to four thread groups, that is: a first operational data set 210-1, a second operational data set 210-2, a third operational data set 210-3, and a fourth operational data set 210-4. Correspondingly, the result data of 128 threads are divided into four result data sets corresponding to four thread groups, namely: a first result data set 250-1, a second result data set 250-2, a third result data set 250-3, and a fourth result data set 250-4. The SIMD mode control circuit 260 is capable of generating data correspondence information based on a SIMD mode, and the SIMD mode is one of a SIMD32 mode, a SIMD64 mode, and a SIMD128 mode.
Referring to fig. 4a, 4b and 4c in combination with fig. 3, fig. 4a, 4b and 4c schematically illustrate the correspondence between the operation data sets and the result data sets, respectively, when the shuffle circuit 200 shown in fig. 3 operates in different SIMD modes.
Fig. 4a schematically shows the correspondence between the operation data set and the result data set when the shuffling circuit 200 is operating in SIMD32 mode. As shown in fig. 4a, when the SIMD mode is the SIMD32 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: obtaining result data of the first result data group 250-1 only from the operation data of the first operation data group 210-1; obtaining result data of the second result data set 250-2 only from the operation data of the second operation data set 210-2; obtaining result data of the third result data group 250-3 only from the operation data of the third operation data group 210-3; and obtaining result data of the fourth result data group 250-4 only from the operation data of the fourth operation data group 210-4. Thus, the shuffle circuit 200, when operating in SIMD32 mode, requires 4 shuffle cycles to generate result data for 128 threads.
Fig. 4b schematically shows the correspondence between the operation data set and the result data set when the shuffling circuit 200 is operating in SIMD64 mode. As shown in fig. 4b, when the SIMD mode is the SIMD64 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: obtaining result data of the first result data group 250-1 from only the operation data of the first operation data group 210-1 and the second operation data group 210-2; obtaining result data of the second result data group 250-2 from only the operation data of the first operation data group 210-1 and the second operation data group 210-2; obtaining result data of the third result data group 250-3 from only the operation data of the third operation data group 210-3 and the fourth operation data group 210-4; and obtaining result data of the fourth result data group 250-4 from only the operation data of the third operation data group 210-3 and the fourth operation data group 210-4. Thus, the shuffle circuit 200, when operating in SIMD64 mode, requires 8 shuffle cycles to generate result data for 128 threads.
Fig. 4c schematically shows the correspondence between the operation data set and the result data set when the shuffling circuit 200 is operating in SIMD128 mode. As shown in fig. 4c, when the SIMD mode is the SIMD128 mode, the data correspondence information generated by the SIMD mode control circuit 260 includes: obtaining result data of the first result data group 250-1 from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; obtaining result data of the second result data group 250-2 from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; obtaining result data of the third result data group 250-3 from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4; and obtaining result data of the fourth result data group 250-4 from the operation data of the first operation data group 210-1, the second operation data group 210-2, the third operation data group 210-3, and the fourth operation data group 210-4. Thus, the shuffle circuit 200, when operating in SIMD128 mode, requires 16 shuffle cycles to generate result data for 128 threads.
With continued reference to fig. 3, the input selector 220, the shuffler 230, and the output selector 240 included in the shuffle circuit 200 are identical or similar in structure and function to the input selector 120, the shuffler 130, and the output selector 140, respectively, included in the shuffle circuit 100 shown in fig. 2, and thus are not described again here. It should be appreciated that in the shuffling circuit 200 shown in fig. 3, the SIMD mode control circuit 260 generates, according to the SIMD mode employed, data correspondence information defining from which one or more thread groups' operation data the result data of each thread group is obtained. This enables the shuffling circuit 200 to perform a shuffling operation with a fixed number of shuffling cycles for each SIMD mode (4, 8, and 16 shuffling cycles for the SIMD32, SIMD64, and SIMD128 modes, respectively), thereby reducing the number of shuffling cycles, eliminating waste of computational resources, and improving processing efficiency compared to the shuffling circuit in the related art.
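The correspondences of fig. 4a, 4b and 4c and the resulting cycle counts can be reproduced with a short Python sketch. The function names simd_mode_correspondence and shuffle_cycles are hypothetical, and the sketch assumes that k divides the SIMD width, that the SIMD width divides m, and that one shuffling cycle is spent per pair of result data group and operation data group.

```python
def simd_mode_correspondence(simd_width: int, m: int = 128, k: int = 32) -> dict[int, list[int]]:
    """Map each result data group (1-based) to the operation data groups it reads from."""
    if simd_width % k != 0 or m % simd_width != 0:
        raise ValueError("sketch assumes k divides simd_width and simd_width divides m")
    n = m // k                      # number of thread groups (4 for m = 128, k = 32)
    span = simd_width // k          # thread groups that make up one SIMD unit
    table = {}
    for dst in range(n):
        first = (dst // span) * span        # first group of the SIMD unit containing dst
        table[dst + 1] = [g + 1 for g in range(first, first + span)]
    return table


def shuffle_cycles(table: dict[int, list[int]]) -> int:
    """One shuffling cycle per (result group, operation group) pair."""
    return sum(len(sources) for sources in table.values())


for width in (32, 64, 128):
    table = simd_mode_correspondence(width)
    print(f"SIMD{width}: {table} -> {shuffle_cycles(table)} cycles")
```

Under these assumptions the cycle count equals n multiplied by the number of thread groups per SIMD unit, which matches the 4, 8 and 16 cycles stated for the three modes.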
Referring to fig. 5, a structure of a shuffling circuit according to another exemplary embodiment of the present disclosure is schematically shown in block diagram form. As shown in fig. 5, the shuffle circuit 300 includes an input selector 320, a shuffler 330, an output selector 340, and an index mark control circuit 360. The shuffling circuit 300 is capable of acquiring 128 operational data from 128 parallel threads, and thus the shuffling circuit 300 is a SIMD128 architecture based shuffling circuit. The maximum number of threads that shuffler 330 can process in parallel is 32. Accordingly, the index tag control circuit 360 divides the 128 threads into 4 thread groups, each thread group including 32 threads. In other words, for the shuffling circuit 300, the values of the maximum number of threads k, the number of threads m, and the number of thread groups n of parallel processing described above are respectively: k has a value of 32, m has a value of 128, and n has a value of 4. Thus, the operation data of 128 threads is divided into four operation data groups corresponding to four thread groups, that is: a first operational data set 310-1, a second operational data set 310-2, a third operational data set 310-3, and a fourth operational data set 310-4. Correspondingly, the result data of 128 threads are divided into four result data sets corresponding to four thread groups, namely: a first result data set 350-1, a second result data set 350-2, a third result data set 350-3, and a fourth result data set 350-4.
Index flag control circuit 360 includes a result data index flag generator 361 and an operation data index flag generator 362. The result data index flag generator 361 generates a 4-bit result data index flag dst_grp_mask according to the validity of 128 threads, wherein each bit of the result data index flag dst_grp_mask corresponds to one set of result data of one of the 4 thread groups, that is, corresponds to one of the four result data groups shown in fig. 5. The operation data index flag generator 362 calculates an operation data index corresponding to each result data to generate a 4-bit operation data index flag src_grp_mask for a set of result data (i.e., one of the four result data sets shown in fig. 5) of each of the 4 thread groups, wherein each bit of the operation data index flag src_grp_mask corresponds to a set of operation data of one of the n thread groups, i.e., corresponds to one of the four operation data sets shown in fig. 5. As a non-limiting example, the operation data index may be calculated from a formula corresponding to the shuffling operation. The operation data index indicates from which operation data a certain result data originates, and thus, based on the operation data index, a distribution of the result data with respect to the operation data can be obtained. Thus, the index flag control circuit 360 is able to generate data correspondence information, which defines from which one or more thread groups of operation data the result data of each thread group is obtained, respectively, based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. 
In one exemplary embodiment, the index flag control circuit 360 may determine a set of result data of one thread group corresponding to a bit of 1 in the result data index flag dst_grp_mask as valid result data groups, and for each valid result data group, obtain the result data from the operation data of the thread group corresponding to a bit of 1 in the corresponding operation data index flag src_grp_mask.
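One possible way to derive the two index flags is sketched below in Python. This is an assumption-laden illustration rather than the disclosed generator logic: the function name group_masks and the inputs valid and src_index are hypothetical, a result data group is assumed to be flagged when at least one of its threads is valid, and the flags are written as bit strings with the first group leftmost, matching the way values such as "1000" are written in this description.

```python
def group_masks(valid, src_index, k):
    """Return (dst_grp_mask, src_grp_masks) as bit strings, first group leftmost."""
    m = len(valid)
    assert m % k == 0 and len(src_index) == m
    n = m // k
    dst_bits = []
    src_masks = []
    for g in range(n):
        lanes = range(g * k, (g + 1) * k)
        active = [i for i in lanes if valid[i]]
        # Assumption: a result group is flagged if any of its threads is valid.
        dst_bits.append("1" if active else "0")
        # Operation data groups that the valid lanes of this result group read from.
        needed = {src_index[i] // k for i in active}
        src_masks.append("".join("1" if s in needed else "0" for s in range(n)))
    return "".join(dst_bits), src_masks


valid = [True] * 128
identity = list(range(128))                 # every lane reads its own operation data
print(group_masks(valid, identity, 32))
```

Here src_index stands in for the operation data index that the text says may be calculated from a formula corresponding to the shuffling operation.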
Further, it should be understood that the input selector 320, the shuffler 330, and the output selector 340 included in the shuffle circuit 300 are identical or similar in structure and function to the input selector 120, the shuffler 130, and the output selector 140 included in the shuffle circuit 100 shown in fig. 2, respectively, and thus, a detailed description thereof is omitted herein.
Referring to fig. 6a, 6b, 6c and 6d, and in combination with reference to fig. 5, the operation of the shuffle circuit 300 shown in fig. 5 for one SIMD mode is schematically illustrated in fig. 6a, 6b, 6c and 6d together.
Fig. 6a schematically shows a process of the shuffling circuit 300 obtaining result data of the first result data set 350-1. As shown in fig. 6a, 128 threads are active, the result data index flag dst_grp_mask is "1111", and therefore, the shuffling circuit 300 shuffles the result data of the result data group (i.e., the first result data group 350-1) corresponding to the first bit of the result data index flag dst_grp_mask (i.e., the bit marked with a bold square frame in fig. 6 a) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "1111", the operation data of the first operation data group 310-1 corresponding to the first bit of the operation data index flag src_grp_mask is first sent to the shuffler 330 to generate result data in the first result data group 350-1. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0111". The value of the first bit of the operation data index flag src_grp_mask becomes "0", meaning that the shuffling circuit 300 has fetched the result data in the first result data set 350-1 from the first operation data set 310-1. According to the operation data index flag src_grp_mask being "0111", the operation data of the second operation data group 310-2 corresponding to the second bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate the result data in the first result data group 350-1. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to have its value changed to "0011", which means that the shuffling circuit 300 has fetched the result data in the first result data set 350-1 from the second operation data set 310-2. 
According to the operation data index flag src_grp_mask being "0011", the operation data of the third operation data group 310-3 corresponding to the third bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate result data in the first result data group 350-1. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to a value of "0001", which means that the shuffling circuit 300 has fetched the result data in the first result data group 350-1 from the third operation data group 310-3. According to the operation data index flag src_grp_mask being "0001", the operation data of the fourth operation data group 310-4 corresponding to the fourth bit of the operation data index flag src_grp_mask is finally sent to the shuffler 330 to generate the result data in the first result data group 350-1. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to "0000", which means that the shuffling circuit 300 has shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. It can be seen that during the operation shown in fig. 6a, 4 shuffling cycles are required to generate the result data in the first result data set 350-1.
Fig. 6b schematically shows a process of the shuffling circuit 300 obtaining result data of the second result data set 350-2. As shown in fig. 6b, the result data index flag dst_grp_mask becomes "0111", which means that the shuffling circuit 300 has generated the result data in the first result data set 350-1, and therefore, the shuffling circuit 300 will generate the result data of the result data set (i.e., the second result data set 350-2) corresponding to the second bit of the result data index flag dst_grp_mask (i.e., the bit marked with a bold frame in fig. 6b) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "1111", the operation data of the first operation data group 310-1 corresponding to the first bit of the operation data index flag src_grp_mask is first sent to the shuffler 330 to generate result data in the second result data group 350-2. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0111". The value of the first bit of the operation data index flag src_grp_mask becomes "0", meaning that the shuffling circuit 300 has fetched result data in the second result data set 350-2 from the first operation data set 310-1. According to the operation data index flag src_grp_mask being "0111", the operation data of the second operation data group 310-2 corresponding to the second bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate the result data in the second result data group 350-2. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0011", which means that the shuffling circuit 300 has fetched the result data in the second result data set 350-2 from the second operation data set 310-2.
According to the operation data index flag src_grp_mask being "0011", the operation data of the third operation data group 310-3 corresponding to the third bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate result data in the second result data group 350-2. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to a value of "0001", which means that the shuffling circuit 300 has fetched the result data in the second result data group 350-2 from the third operation data group 310-3. According to the operation data index flag src_grp_mask being "0001", the operation data of the fourth operation data group 310-4 corresponding to the fourth bit of the operation data index flag src_grp_mask is finally sent to the shuffler 330 to generate the result data in the second result data group 350-2. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to "0000", which means that the shuffling circuit 300 has shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. It can be seen that during the operation shown in fig. 6b, 4 shuffling cycles are also required to generate the result data in the second result data set 350-2.
Fig. 6c schematically shows a process of the shuffling circuit 300 obtaining result data of the third result data set 350-3. As shown in fig. 6c, the result data index flag dst_grp_mask becomes "0011", which means that the shuffling circuit 300 has generated result data in the first result data set 350-1 and the second result data set 350-2, and therefore, the shuffling circuit 300 will generate result data of the result data set (i.e., the third result data set 350-3) corresponding to the third bit of the result data index flag dst_grp_mask (i.e., the bit marked with a bold square frame in fig. 6c) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "1111", the operation data of the first operation data group 310-1 corresponding to the first bit of the operation data index flag src_grp_mask is first sent to the shuffler 330 to generate result data in the third result data group 350-3. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0111". The value of the first bit of the operation data index flag src_grp_mask becomes "0", meaning that the shuffling circuit 300 has fetched result data in the third result data set 350-3 from the first operation data set 310-1. According to the operation data index flag src_grp_mask being "0111", the operation data of the second operation data group 310-2 corresponding to the second bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate the result data in the third result data group 350-3. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0011", which means that the shuffling circuit 300 has fetched the result data in the third result data set 350-3 from the second operation data set 310-2.
According to the operation data index flag src_grp_mask being "0011", the operation data of the third operation data group 310-3 corresponding to the third bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate result data in the third result data group 350-3. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to a value of "0001", which means that the shuffling circuit 300 has fetched the result data in the third result data group 350-3 from the third operation data group 310-3. According to the operation data index flag src_grp_mask being "0001", the operation data of the fourth operation data group 310-4 corresponding to the fourth bit of the operation data index flag src_grp_mask is finally sent to the shuffler 330 to generate the result data in the third result data group 350-3. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to "0000", which means that the shuffling circuit 300 has shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. It can be seen that during the operation shown in fig. 6c, 4 shuffling cycles are also required to generate the result data in the third result data set 350-3.
Fig. 6d schematically shows the process of the shuffle circuit 300 to obtain result data of the fourth result data set 350-4. As shown in fig. 6d, the result data index flag dst_grp_mask becomes "0001", which means that the shuffling circuit 300 has generated result data in the first result data group 350-1, the second result data group 350-2 and the third result data group 350-3, and therefore, the shuffling circuit 300 will generate result data of the result data group (i.e., the fourth result data group 350-4) corresponding to the fourth bit of the result data index flag dst_grp_mask (i.e., the one marked with a bold square frame in fig. 6 d) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "1111", the operation data of the first operation data group 310-1 corresponding to the first bit of the operation data index flag src_grp_mask is first sent to the shuffler 330 to generate result data in the fourth result data group 350-4. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0111". The value of the first bit of the operation data index flag src_grp_mask becomes "0", meaning that the shuffling circuit 300 has fetched result data in the fourth result data set 350-4 from the first operation data set 310-1. According to the operation data index flag src_grp_mask being "0111", the operation data of the second operation data group 310-2 corresponding to the second bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate the result data in the fourth result data group 350-4. 
Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to a value of "0011", which means that the shuffling circuit 300 has fetched the result data in the fourth result data set 350-4 from the second operation data set 310-2. According to the operation data index flag src_grp_mask being "0011", the operation data of the third operation data group 310-3 corresponding to the third bit of the operation data index flag src_grp_mask is then sent to the shuffler 330 to generate result data in the fourth result data group 350-4. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to a value of "0001", which means that the shuffling circuit 300 has fetched the result data in the fourth result data group 350-4 from the third operation data group 310-3. According to the operation data index flag src_grp_mask being "0001", the operation data of the fourth operation data group 310-4 corresponding to the fourth bit of the operation data index flag src_grp_mask is finally sent to the shuffler 330 to generate result data in the fourth result data group 350-4. Then, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask to "0000", which means that the shuffling circuit 300 has shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. It can be seen that during the operation shown in fig. 6d, 4 shuffling cycles are required to generate the result data in the fourth result data set 350-4.
It can be seen that during the operations schematically shown in fig. 6a, 6b, 6c and 6d, 16 shuffling cycles are required for the shuffle circuit 300 to generate 128 result data for 128 threads, and that all four operation data sets need to be traversed to generate result data for each result data set. Thus, during the operation schematically shown in fig. 6a, 6b, 6c and 6d, the SIMD128 architecture based shuffling circuit 300 operates in SIMD128 mode.
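The walk through fig. 6a to 6d can be traced with a small Python sketch. It only models the flag updates described above and is not the disclosed hardware: the function name run_masks is hypothetical, the flags are written as bit strings with the first group leftmost, and each cleared bit of src_grp_mask is counted as one shuffling cycle.

```python
def run_masks(dst_grp_mask: str, src_grp_masks: list[str]):
    """Trace flag consumption, one tuple per shuffling cycle.

    Each tuple is (result_group, operation_group, src_grp_mask_after),
    with 1-based group numbers.
    """
    trace = []
    for d, flag in enumerate(dst_grp_mask):
        if flag != "1":
            continue                        # no result data to generate for this group
        src = list(src_grp_masks[d])
        while "1" in src:
            s = src.index("1")              # leftmost pending operation data group
            src[s] = "0"                    # cleared once that group has been shuffled
            trace.append((d + 1, s + 1, "".join(src)))
    return trace


trace = run_masks("1111", ["1111"] * 4)
for step in trace[:4]:                      # the four cycles of fig. 6a
    print(step)
print(len(trace), "shuffle cycles")
```

With every src_grp_mask equal to "1111" the trace contains 16 entries, and the mask sequence for the first result data group is "0111", "0011", "0001", "0000", as in fig. 6a. A src_grp_mask with a single set bit would leave the inner loop after one cycle.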
Referring to fig. 7a, 7b, 7c and 7d, and in combination with reference to fig. 5, fig. 7a, 7b, 7c and 7d together schematically illustrate the operation of the shuffle circuit shown in fig. 5 for another SIMD mode.
Fig. 7a schematically shows a process of the shuffling circuit 300 obtaining result data of the first result data set 350-1. As shown in fig. 7a, 128 threads are active, the result data index flag dst_grp_mask is "1111", and therefore, the shuffling circuit 300 shuffles the result data of the result data group (i.e., the first result data group 350-1) corresponding to the first bit of the result data index flag dst_grp_mask (i.e., the bit marked with a bold square frame in fig. 7 a) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "1000", the operation data of the first operation data group 310-1 corresponding to the first bit of the operation data index flag src_grp_mask is sent to the shuffler 330 to generate result data in the first result data group 350-1. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0000", which means that the shuffling circuit 300 has already shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. The value of the first bit of the operation data index flag src_grp_mask becoming "0" means that the shuffling circuit 300 has fetched the result data in the first result data set 350-1 from the first operation data set 310-1, and subsequently the operation data index flag src_grp_mask becomes "0000", meaning that the shuffling circuit 300 no longer has to fetch the result data in the first result data set 350-1 from the other operation data sets. Thus, during the operation shown in FIG. 7a, 1 shuffling period is required to generate the result data in the first result data set 350-1.
Fig. 7b schematically shows a process of the shuffling circuit 300 obtaining result data of the second result data set 350-2. As shown in fig. 7b, the result data index flag dst_grp_mask becomes "0111", which means that the shuffling circuit 300 has generated the result data in the first result data set 350-1, and therefore, the shuffling circuit 300 will generate the result data of the result data set (i.e., the second result data set 350-2) corresponding to the second bit of the result data index flag dst_grp_mask (i.e., the bit marked with a bold frame in fig. 7b) based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask. Because the operation data index flag src_grp_mask is "0100", the operation data of the second operation data group 310-2 corresponding to the second bit of the operation data index flag src_grp_mask is sent to the shuffler 330 to generate result data in the second result data group 350-2. Subsequently, the operation data index flag processing section 370 processes the operation data index flag src_grp_mask so that its value becomes "0000", which means that the shuffling circuit 300 has already shuffled the operation data of all the corresponding operation data groups, without acquiring the operation data again. The value of the second bit of the operation data index flag src_grp_mask becoming "0" means that the shuffling circuit 300 has fetched the result data in the second result data set 350-2 from the second operation data set 310-2, and subsequently the operation data index flag src_grp_mask becomes "0000", meaning that the shuffling circuit 300 no longer has to fetch the result data in the second result data set 350-2 from the other operation data sets. Thus, during the operation shown in fig. 7b, 1 shuffling cycle is required to generate the result data in the second result data set 350-2.
Fig. 7c schematically shows how the shuffling circuit 300 obtains the result data of the third result data group 350-3. As shown in Fig. 7c, the result data index flag dst_grp_mask has become "0011", which indicates that the shuffling circuit 300 has already generated the result data in the first result data group 350-1 and the second result data group 350-2. Based on dst_grp_mask and the operation data index flag src_grp_mask, the shuffling circuit 300 therefore generates the result data of the result data group corresponding to the third bit of dst_grp_mask (the bit marked with a bold box in Fig. 7c), i.e., the third result data group 350-3. Because src_grp_mask is "0010", the operation data of the third operation data group 310-3, which corresponds to the third bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the third result data group 350-3. The operation data index flag processing section 370 then updates src_grp_mask so that its value becomes "0000". The third bit changing to "0" indicates that the shuffling circuit 300 has obtained the result data of the third result data group 350-3 from the third operation data group 310-3; the whole flag being "0000" indicates that all corresponding operation data groups have been shuffled and that no operation data from any other operation data group is needed for the third result data group 350-3. Thus, the operation shown in Fig. 7c takes 1 shuffling cycle to generate the result data in the third result data group 350-3.
Fig. 7d schematically shows how the shuffling circuit 300 obtains the result data of the fourth result data group 350-4. As shown in Fig. 7d, the result data index flag dst_grp_mask has become "0001", which indicates that the shuffling circuit 300 has already generated the result data in the first result data group 350-1, the second result data group 350-2 and the third result data group 350-3. Based on dst_grp_mask and the operation data index flag src_grp_mask, the shuffling circuit 300 therefore generates the result data of the result data group corresponding to the fourth bit of dst_grp_mask (the bit marked with a bold box in Fig. 7d), i.e., the fourth result data group 350-4. Because src_grp_mask is "0001", the operation data of the fourth operation data group 310-4, which corresponds to the fourth bit of src_grp_mask, is sent to the shuffler 330 to generate the result data in the fourth result data group 350-4. The operation data index flag processing section 370 then updates src_grp_mask so that its value becomes "0000". The fourth bit changing to "0" indicates that the shuffling circuit 300 has obtained the result data of the fourth result data group 350-4 from the fourth operation data group 310-4; the whole flag being "0000" indicates that all corresponding operation data groups have been shuffled and that no operation data from any other operation data group is needed for the fourth result data group 350-4. Thus, the operation shown in Fig. 7d takes 1 shuffling cycle to generate the result data in the fourth result data group 350-4.
It can be seen that, in the operations schematically shown in Figs. 7a, 7b, 7c and 7d, the shuffling circuit 300 needs 4 shuffling cycles to generate 128 result data for 128 threads. The result data of the first result data group 350-1 is obtained from the operation data of the first operation data group 310-1, the result data of the second result data group 350-2 from the operation data of the second operation data group 310-2, the result data of the third result data group 350-3 from the operation data of the third operation data group 310-3, and the result data of the fourth result data group 350-4 from the operation data of the fourth operation data group 310-4. Thus, in these operations, the shuffling circuit 300, which is based on the SIMD128 architecture, operates in SIMD32 mode.
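The flag updates walked through for Figs. 7a to 7d can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the function name `trace_shuffle` and the string encoding of the flags (leftmost character = first thread group, matching the way the masks are written above) are assumptions made for this example.

```python
def trace_shuffle(dst_grp_mask, src_grp_masks):
    """Return one (result_group, operation_group) pair per shuffling cycle.

    dst_grp_mask  -- string such as "1111"; leftmost character is the first group
    src_grp_masks -- dict: 0-based result group index -> its src_grp_mask string
    """
    cycles = []
    dst = list(dst_grp_mask)
    for d, dst_bit in enumerate(dst):
        if dst_bit != "1":
            continue  # no valid result data is needed for this group
        src = list(src_grp_masks[d])
        for s, src_bit in enumerate(src):
            if src_bit == "1":
                cycles.append((d, s))  # one shuffling cycle: group s feeds group d
                src[s] = "0"           # the consumed bit is cleared
        dst[d] = "0"                   # this result group is complete
    return cycles


# Flags as given for Figs. 7a-7d (SIMD32 mode on a SIMD128 architecture).
simd32 = trace_shuffle(
    "1111", {0: "1000", 1: "0100", 2: "0010", 3: "0001"}
)
print(simd32)       # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(len(simd32))  # 4 shuffling cycles
```

Each tuple corresponds to one of the four figures: result group d is produced from operation group s, and the total of four tuples matches the four shuffling cycles stated above.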
It should be appreciated that in the shuffling circuit 300 shown in Fig. 5, the index flag control circuit 360 includes a result data index flag generator 361 and an operation data index flag generator 362, and generates data correspondence information based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, wherein the data correspondence information defines, for each thread group, the one or more thread groups from whose operation data its result data is obtained. In this manner, the shuffling circuit 300 ensures that every shuffling cycle spent generating result data is a valid shuffling cycle, which eliminates invalid shuffling cycles, avoids wasting computational resources, and improves processing efficiency.
In addition, because the data correspondence information is generated based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask, the shuffling circuit 300 can determine the number of valid shuffling cycles more flexibly and is therefore better compatible with different SIMD modes. Taking as an example a shuffling circuit 300 that is based on the SIMD128 architecture and performs shuffling operations on 32 threads at a time, the number of shuffling cycles can vary with the actual situation from 1 to 16 when processing in SIMD128 mode, from 1 to 8 when processing in SIMD64 mode, and from 1 to 4 when processing in SIMD32 mode. Even for some SIMD modes of lower granularity, the shuffling circuit 300 is able to complete the shuffling operation in 1 to 4 shuffling cycles.
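The cycle ranges quoted above follow from simple counting if one assumes that each set bit of src_grp_mask, for each valid result data group, costs exactly one shuffling cycle. The sketch below makes that arithmetic explicit; `count_cycles` and the string encoding of the flags are hypothetical names and conventions chosen for this example.

```python
def count_cycles(dst_grp_mask, src_grp_masks):
    """Count shuffling cycles: one per set src bit of every valid result group."""
    return sum(
        src.count("1")
        for dst_bit, src in zip(dst_grp_mask, src_grp_masks)
        if dst_bit == "1"
    )


# Worst cases for n = 4 thread groups of k = 32 threads each.
print(count_cycles("1111", ["1111"] * 4))                      # 16 (SIMD128)
print(count_cycles("1111", ["1100", "1100", "0011", "0011"]))  # 8  (SIMD64)
print(count_cycles("1111", ["1000", "0100", "0010", "0001"]))  # 4  (SIMD32)

# Best case in any mode: one valid result group reading one operation group.
print(count_cycles("1000", ["1000", "0000", "0000", "0000"]))  # 1
```

Any pattern of active threads and source indexes between these extremes yields a cycle count inside the corresponding range.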
It should also be appreciated that the index flag control circuit 360 in the shuffling circuit 300 shown in Fig. 5 may equally be applied in the shuffling circuit 100 shown in Fig. 2, for example, in place of the control circuit 160. In this case, the result data index flag generator 361 generates an n-bit result data index flag dst_grp_mask according to the validity of the m threads, wherein each bit in the result data index flag dst_grp_mask corresponds to a set of result data of one of the n thread groups; the operation data index flag generator 362 calculates an operation data index corresponding to each result data to generate an n-bit operation data index flag src_grp_mask for the set of result data of each of the n thread groups, wherein each bit of the operation data index flag src_grp_mask corresponds to a set of operation data of one of the n thread groups; and the index flag control circuit 360 generates data correspondence information based on the result data index flag dst_grp_mask and the operation data index flag src_grp_mask.
Referring to fig. 8, a data shuffling method according to an exemplary embodiment of the present disclosure is schematically shown in the form of a flow chart. As shown in fig. 8, the data shuffling method 500 includes steps 510, 520, 530, 540 and 550:
at step 510, m threads are divided into n thread groups according to a maximum number k of threads capable of being processed in parallel, each thread group including k threads, where k, m and n are integers greater than or equal to 1;
at step 520, data correspondence information is generated, the data correspondence information defining, for each thread group, the one or more thread groups from whose operation data its result data is obtained;
at step 530, operation data of one or more corresponding thread groups is selected from the n thread groups according to the data correspondence information;
at step 540, a shuffling operation is performed on the k operation data of each corresponding thread group, and j shuffling output data are output, where j is an integer and 0 ≤ j ≤ k;
at step 550, result data for each thread group is generated, in accordance with the data correspondence information, based on the shuffling output data obtained from the operation data of the one or more corresponding thread groups.
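Steps 510 to 550 can be modelled in software roughly as below. This is a sketch under stated assumptions rather than the disclosed implementation: it assumes that the shuffling operation means "result thread t reads the operation data of thread src_index[t]", and the names `shuffle_by_groups`, `src_index` and `correspondence` are introduced only for this example. A small k is used for readability; the text uses k = 32.

```python
def shuffle_by_groups(operation_data, src_index, correspondence, k):
    """Model of steps 510-550; returns (result_data, shuffling_cycles)."""
    m = len(operation_data)
    n = m // k                              # step 510: n groups of k threads
    result = [None] * m
    cycles = 0
    for dst_group in range(n):
        # step 530: select only the operation groups named for this result group
        for src_group in correspondence.get(dst_group, []):
            base = src_group * k
            chunk = operation_data[base:base + k]   # k operation data
            cycles += 1
            # step 540: the shuffler emits j outputs, 0 <= j <= k
            for t in range(dst_group * k, (dst_group + 1) * k):
                s = src_index[t]
                if base <= s < base + k:
                    result[t] = chunk[s - base]     # step 550: place the result
    return result, cycles


# m = 8 threads, k = 2, n = 4; each thread reads its right-hand neighbour
# within a block of 4 threads (a "SIMD64-like" pattern scaled down).
data = [10, 11, 12, 13, 14, 15, 16, 17]
src_index = [1, 2, 3, 0, 5, 6, 7, 4]
correspondence = {0: [0, 1], 1: [0, 1], 2: [2, 3], 3: [2, 3]}  # step 520
result, cycles = shuffle_by_groups(data, src_index, correspondence, k=2)
print(result)  # [11, 12, 13, 10, 15, 16, 17, 14]
print(cycles)  # 8
```

Because the inner loop visits only the operation groups listed in the correspondence information, no cycle is spent on an operation group that cannot contribute to the result group being generated.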
By generating data correspondence information that defines, for each thread group, the one or more thread groups from whose operation data its result data is obtained, the data shuffling method 500 shown in Fig. 8 ensures that every shuffling cycle is used to generate result data and is therefore a valid shuffling cycle, which avoids wasting computing resources and improves processing efficiency.
According to an exemplary embodiment, step 520 of the data shuffling method 500 may further comprise generating the data correspondence information based on the SIMD mode. Further, as a non-limiting example, m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of SIMD32 mode, SIMD64 mode, and SIMD128 mode. Accordingly, the data correspondence information generated based on the SIMD mode may include the correspondence between the operation data groups and the result data groups in the different SIMD modes (e.g., SIMD32 mode, SIMD64 mode, or SIMD128 mode) as previously shown in Figs. 4a, 4b, and 4c. The data shuffling method 500 thus completes the shuffling operation in a fixed number of shuffling cycles for each SIMD mode, which, compared to the data shuffling method of the prior art, reduces the number of shuffling cycles, avoids wasting computational resources, and improves processing efficiency.
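The mode-based correspondence can be written as a small lookup, sketched here from the mappings recited for SIMD32, SIMD64 and SIMD128 modes in the claims. The function name `correspondence_for_mode` and the 0-based group numbering are assumptions made for this example; the cycle counts printed at the end assume one shuffling cycle per (result group, operation group) pair.

```python
def correspondence_for_mode(simd_width, m=128, k=32):
    """Map each result group (0-based) to the operation groups it reads from."""
    if simd_width % k or m % simd_width:
        raise ValueError("simd_width must be a multiple of k and divide m")
    n = m // k                 # number of thread groups
    span = simd_width // k     # thread groups covered by one SIMD block
    table = {}
    for dst_group in range(n):
        first = (dst_group // span) * span
        table[dst_group] = list(range(first, first + span))
    return table


for width in (32, 64, 128):
    table = correspondence_for_mode(width)
    total = sum(len(groups) for groups in table.values())
    print(f"SIMD{width}: {table} -> {total} cycles")
```

With m = 128 and k = 32 this gives 4 pairs for SIMD32 mode, 8 for SIMD64 mode and 16 for SIMD128 mode, which is what makes the cycle count fixed per mode in this embodiment.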
Referring to fig. 9, details of the data shuffling method shown in fig. 8 are shown in flow chart form. As shown in fig. 9, step 520 in the data shuffling method 500 shown in fig. 8 further includes steps 521, 522 and 523:
at step 521, an n-bit result data index flag is generated according to the validity of the m threads, wherein each bit of the result data index flag corresponds to a set of result data of one of the n thread groups;
at step 522, an operation data index corresponding to each result data is calculated to generate an n-bit operation data index flag for the set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;
at step 523, the data correspondence information is generated based on the result data index flag and the operation data index flag.
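Steps 521 and 522 can be illustrated with the following sketch. It rests on two assumptions that the text does not spell out: a result group's bit is set when at least one of its threads is active, and a bit of the operation data index flag is set when some active thread of that result group reads from the corresponding operation group. The name `make_index_flags` and the per-thread `src_index` list are hypothetical.

```python
def make_index_flags(active, src_index, k):
    """Return (dst_grp_mask, [src_grp_mask per result group]) as bit strings.

    active    -- list of booleans, one per thread (thread validity)
    src_index -- list of ints, thread whose operation data each thread reads
    k         -- threads per thread group
    """
    m = len(active)
    n = m // k
    dst = ["0"] * n
    src = [["0"] * n for _ in range(n)]
    for t in range(m):
        if not active[t]:
            continue                      # inactive threads need no result data
        d = t // k                        # result group of thread t
        dst[d] = "1"                      # step 521
        src[d][src_index[t] // k] = "1"   # step 522
    return "".join(dst), ["".join(bits) for bits in src]


# m = 8, k = 2: every thread is active and reads its own operation data.
dst_mask, src_masks = make_index_flags([True] * 8, list(range(8)), k=2)
print(dst_mask)   # 1111
print(src_masks)  # ['1000', '0100', '0010', '0001']
```

The printed flags have the same shape as those of the SIMD32 example in Figs. 7a to 7d, scaled down from k = 32 to k = 2.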
Referring to fig. 10, details of the data shuffling method shown in fig. 9 are shown in flow chart form. As shown in fig. 10, step 523 shown in fig. 9 further includes steps 523a and 523b:
at step 523a, the set of result data of each thread group corresponding to a bit with a value of 1 in the result data index flag is determined to be a valid result data group;
at step 523b, for each valid result data group, result data is obtained from the operation data of each thread group corresponding to a bit with a value of 1 in the corresponding operation data index flag.
As described in detail above, through the steps shown in Figs. 9 and 10, the data shuffling method 500 not only ensures that every shuffling cycle spent generating result data is a valid shuffling cycle, which eliminates invalid shuffling cycles, avoids wasting computational resources, and improves processing efficiency, but also allows the number of valid shuffling cycles to be determined more flexibly, making the method better compatible with different SIMD modes.
Referring to Fig. 11, the structure of a chip according to an exemplary embodiment of the present disclosure is schematically shown in the form of a block diagram. As shown in Fig. 11, the chip 600 includes a shuffling circuit 610, wherein the shuffling circuit 610 may be any of the shuffling circuits 100, 200 and 300 shown in Figs. 2, 3 and 5 of the present disclosure. It should be appreciated that the chip 600 may be any suitable type of chip, including but not limited to a GPU chip, a CPU chip, and the like.
Referring to fig. 12, a structure of an integrated circuit device according to an exemplary embodiment of the present disclosure is schematically shown in block diagram form. As shown in fig. 12, the integrated circuit device 700 includes the chip 600 shown in fig. 11. It should be appreciated that the integrated circuit device 700 may be any suitable type of integrated circuit device including, but not limited to, an integrated graphics card, a stand-alone graphics card, an image processing device, and the like.
It should be appreciated that the shuffling circuits 100, 200, 300 provided in accordance with the exemplary embodiments of the present disclosure shown in Figs. 2, 3, and 5 can each be implemented in the form of any suitable hardware circuit. These hardware circuits may be implemented using any suitable technology known in the art or a combination thereof including, by way of non-limiting example: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
It should also be appreciated that all or part of the steps of the data shuffling method provided in accordance with the exemplary embodiments of the present disclosure as shown in fig. 8, 9 and 10 may be implemented by a list of executable instructions in addition to the shuffling circuits 100, 200, 300 as shown in fig. 2, 3 and 5. The list of executable instructions may be embodied in any suitable computer readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
The terminology used in the present disclosure is for the purpose of describing embodiments only and is not intended to be limiting of the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this disclosure, specify the presence of stated features, but do not preclude the presence or addition of one or more other features. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. It will be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various features, these features should not be limited by these terms. These terms are only used to distinguish one feature from another feature.
Unless defined otherwise, all terms (including technical and scientific terms) used in this disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and/or the present specification and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the description of the present disclosure, the terms "one embodiment," "some embodiments," "examples," "particular examples," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In the present disclosure, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this disclosure, as well as features of various embodiments or examples, may be combined by those skilled in the art provided that they do not contradict one another.
It should be understood that the various steps of the methods shown in the flowcharts or otherwise described herein are merely exemplary and do not imply that the steps of the illustrated or described methods must be performed in the order shown or described. Rather, the various steps of the methods shown in the flowcharts or otherwise described herein may be performed in a different order than in the present disclosure, or may be performed simultaneously. Furthermore, the methods represented in the flowcharts or otherwise described herein may include other additional steps as desired.
Although the present disclosure has been described in detail in connection with some exemplary embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A shuffle circuit adapted for use in a plurality of SIMD modes, comprising a control circuit, an input selector, a shuffler and an output selector, wherein:
the control circuit is configured to: divide m threads into n thread groups according to the maximum number k of threads that the shuffler can process in parallel, generate data correspondence information, and send the data correspondence information to the input selector and the output selector, wherein the data correspondence information defines, for each thread group, the one or more thread groups from whose operation data its result data is obtained, k, m and n are integers greater than or equal to 1, and the data correspondence information reflects the SIMD mode currently used by the shuffling circuit;
the input selector is configured to: select one or more corresponding thread groups from the n thread groups according to the received data correspondence information, and send their operation data to the shuffler sequentially in a preset order of the thread groups;
the shuffler is configured to: sequentially receive the operation data of the one or more corresponding thread groups from the input selector, perform a shuffling operation on the received k operation data of each corresponding thread group, and output j shuffling output data, wherein j is an integer and 0 ≤ j ≤ k;
the output selector is configured to: sequentially receive, from the shuffler, the shuffling output data obtained from the operation data of the one or more corresponding thread groups, and generate result data for each thread group based on the shuffling output data in accordance with the received data correspondence information.
2. The shuffling circuit of claim 1, wherein the control circuit is further configured to generate the data correspondence information based on the SIMD mode.
3. The shuffling circuit of claim 2, wherein m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of SIMD32 mode, SIMD64 mode, and SIMD128 mode.
4. A shuffle circuit according to claim 3, wherein when the SIMD mode is a SIMD32 mode, the data correspondence information includes:
obtaining result data of the first thread group from operation data of the first thread group;
obtaining result data of the second thread group from the operation data of the second thread group;
obtaining result data of the third thread group from the operation data of the third thread group;
and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.
5. A shuffle circuit according to claim 3, wherein when the SIMD mode is a SIMD64 mode, the data correspondence information includes:
obtaining result data of the first thread group from operation data of the first thread group and the second thread group;
obtaining result data of a second thread group from operation data of the first thread group and the second thread group;
obtaining result data of a third thread group from operation data of the third thread group and the fourth thread group;
and obtaining the result data of the fourth thread group from the operation data of the third thread group and the fourth thread group.
6. A shuffle circuit according to claim 3, wherein when the SIMD mode is a SIMD128 mode, the data correspondence information includes:
obtaining result data of a first thread group from operation data of the first, second, third and fourth thread groups;
obtaining result data of a second thread group from operation data of the first, second, third and fourth thread groups;
obtaining result data of a third thread group from operation data of the first, second, third and fourth thread groups;
and obtaining the result data of the fourth thread group from the operation data of the first, second, third and fourth thread groups.
7. The shuffling circuit of claim 1, wherein the control circuit comprises a result data index flag generator and an operation data index flag generator, and wherein:
the result data index flag generator is configured to: generate an n-bit result data index flag according to the validity of the m threads, wherein each bit in the result data index flag corresponds to a set of result data of one of the n thread groups;
the operation data index flag generator is configured to: calculate an operation data index corresponding to each result data to generate an n-bit operation data index flag for the set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;
the control circuit is configured to: generate the data correspondence information based on the result data index flag and the operation data index flag.
8. The shuffling circuit of claim 7, wherein the control circuit is further configured to: determine the set of result data of each thread group corresponding to a bit with a value of 1 in the result data index flag to be a valid result data group, and, for each valid result data group, obtain the result data from the set of operation data of each thread group corresponding to a bit with a value of 1 in the corresponding operation data index flag.
9. The shuffle circuit of claim 8, wherein m has a value of 128, k has a value of 32, and n has a value of 4.
10. A method of data shuffling for multiple SIMD modes, comprising:
dividing m threads into n thread groups according to the maximum number k of threads capable of being processed in parallel, wherein each thread group comprises k threads, and k, m and n are integers greater than or equal to 1;
generating data correspondence information defining from which one or more thread groups of operation data the result data of each thread group is obtained, respectively, the data correspondence information reflecting a SIMD mode currently used by the data shuffling method;
selecting operation data of one or more corresponding thread groups from the n thread groups according to the data correspondence information;
performing a shuffling operation on the k operation data of each corresponding thread group and outputting j shuffling output data, wherein j is an integer and 0 ≤ j ≤ k;
based on shuffle output data obtained from the operational data of the one or more corresponding thread groups, result data for each thread group is generated from the data correspondence information.
11. The data shuffling method of claim 10, wherein the generating data correspondence information comprises: generating the data correspondence information based on the SIMD mode.
12. The data shuffling method of claim 11, wherein m has a value of 128, k has a value of 32, n has a value of 4, and the SIMD mode is one of SIMD32 mode, SIMD64 mode, and SIMD128 mode.
13. The data shuffling method of claim 12, wherein the generating the data correspondence information based on the SIMD mode comprises: when the SIMD mode is a SIMD32 mode, the data correspondence information includes:
obtaining result data of the first thread group from operation data of the first thread group;
obtaining result data of the second thread group from the operation data of the second thread group;
obtaining result data of the third thread group from the operation data of the third thread group;
and obtaining the result data of the fourth thread group from the operation data of the fourth thread group.
14. The data shuffling method of claim 12, wherein the generating the data correspondence information based on the SIMD mode comprises: when the SIMD mode is a SIMD64 mode, the data correspondence information includes:
obtaining result data of the first thread group from operation data of the first thread group and the second thread group;
obtaining result data of a second thread group from operation data of the first thread group and the second thread group;
obtaining result data of a third thread group from operation data of the third thread group and the fourth thread group;
and obtaining the result data of the fourth thread group from the operation data of the third thread group and the fourth thread group.
15. The data shuffling method of claim 12, wherein the generating the data correspondence information based on the SIMD mode comprises: when the SIMD mode is a SIMD128 mode, the data correspondence information includes:
obtaining result data of a first thread group from operation data of the first, second, third and fourth thread groups;
obtaining result data of a second thread group from operation data of the first, second, third and fourth thread groups;
obtaining result data of a third thread group from operation data of the first, second, third and fourth thread groups;
and obtaining the result data of the fourth thread group from the operation data of the first, second, third and fourth thread groups.
16. The data shuffling method of claim 10, wherein the generating data correspondence information comprises:
generating an n-bit result data index flag according to the validity of the m threads, wherein each bit in the result data index flag corresponds to one group of result data of one thread group in the n thread groups;
calculating an operation data index corresponding to each result data to generate an n-bit operation data index flag for the set of result data of each of the n thread groups, wherein each bit of the operation data index flag corresponds to a set of operation data of one of the n thread groups;
and generating the data correspondence information based on the result data index flag and the operation data index flag.
17. The data shuffling method of claim 16, wherein the generating the data correspondence information based on the result data index flag and the operation data index flag comprises:
determining the set of result data of each thread group corresponding to a bit with a value of 1 in the result data index flag to be a valid result data group;
for each valid result data set, result data is obtained from the operation data of the thread group corresponding to the bit of 1 in the corresponding operation data index flag.
18. A chip based on SIMD architecture comprising a shuffling circuit as claimed in any of claims 1 to 9.
19. The chip of claim 18, wherein the chip is a GPU chip.
20. An integrated circuit device comprising at least one chip according to claim 18 or 19.
CN202210717989.2A 2022-06-23 2022-06-23 Shuffling circuit and method, chip and integrated circuit device Active CN115061731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210717989.2A CN115061731B (en) 2022-06-23 2022-06-23 Shuffling circuit and method, chip and integrated circuit device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210717989.2A CN115061731B (en) 2022-06-23 2022-06-23 Shuffling circuit and method, chip and integrated circuit device

Publications (2)

Publication Number Publication Date
CN115061731A CN115061731A (en) 2022-09-16
CN115061731B true CN115061731B (en) 2023-05-23

Family

ID=83201983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210717989.2A Active CN115061731B (en) 2022-06-23 2022-06-23 Shuffling circuit and method, chip and integrated circuit device

Country Status (1)

Country Link
CN (1) CN115061731B (en)




Also Published As

Publication number Publication date
CN115061731A (en) 2022-09-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant