Nothing Special   »   [go: up one dir, main page]

CN101694627B - Compiler system based on TCore configurable processor - Google Patents

Compiler system based on TCore configurable processor Download PDF

Info

Publication number
CN101694627B
CN101694627B CN200910070942.6A CN200910070942A CN101694627B CN 101694627 B CN101694627 B CN 101694627B CN 200910070942 A CN200910070942 A CN 200910070942A CN 101694627 B CN101694627 B CN 101694627B
Authority
CN
China
Prior art keywords
compiler
code
architecture
register
configurable processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200910070942.6A
Other languages
Chinese (zh)
Other versions
CN101694627A (en
Inventor
魏继增
郭炜
史再峰
王正华
刘壮丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN200910070942.6A priority Critical patent/CN101694627B/en
Publication of CN101694627A publication Critical patent/CN101694627A/en
Application granted granted Critical
Publication of CN101694627B publication Critical patent/CN101694627B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a compiler system based on a configurable processor, adopting an intermediate language format to carry out the organization of a compiler and finishing the distribution and instruction scheduling of a register by a linear trace and a genetic algorithm, wherein the compiler comprises an SUIF2, a first-level intermediate machine language SUIF IR, an extensible compiling framework MACHSUIF, a second-level intermediate machine language SUIFVM, a consolidated extension interface Machine Library Interface, a serial configurable processor code Sequential Code, an architecture parser, a code generator, a configurable architecture description file ADT and a parallel configurable processor code. The invention enables the compiler to efficiently finish the generation of codes and solves the long-term problem that the compiler based on a TTA architecture processor has low efficiency through the combination of the front end of the compiler based on an intermediate formate and the rear end of the compiler based on two heuristic algorithms of the linear trace and the genetic algorithm.

Description

Compiler system based on configurable processor
Technical field
Configurable microprocessor in the embedded system of field of computer technology of the present invention particularly relates to the development technique of the compiler in the configurable microprocessor.
Background technology
Current, embedded system is used widely at numerous areas.And embedded microprocessor has determined the performance of embedded system to a great extent.The main flow microprocessor all is based on the design of general purpose as ARM, MIPS, PowerPC etc., and it is powerful, support multiple computing and data type.The design of this versatility has limited its possibility in specific area exploitation concurrency on the contrary.
Configurable microprocessor has overcome above-mentioned shortcoming, and it had both kept the dirigibility of processor also to make performance near application-specific design specialized integrated circuit (ASIC) at particular task custom instruction collection and processor architecture.This configurable processor needs the software-hardware synergism design environment of a robotization, comprises collaborative design environment, compiler, isa simulator, design space detector etc.China starts late at this area research, but lacks the configurable processor of industrialization, therefore needs the expensive IP kernel of payment and related software mandate expense.
The Tcore microprocessor be the VLSI of University Of Tianjin design with applied research designed have an independent intellectual property right trigger architecture (Transport Triggered Architecture based on transmission, TTA) configurable microprocessor, and finished flow under the 0.13 μ m GSMC technology, the work dominant frequency can reach 200MHZ, has good industrial prospect.
The maximum characteristics of Tcore configurable microprocessor are to compare other microprocessors can control in thinner granularity, and in the architecture level as seen the data transmission details thereby has perfect performance/cost ratio.But how problems such as it also exists user's hand-coding code difficulty big, and compiler efficient is low design high efficiency compiler and produce the key that efficient code is based on the design of TTA architecture microprocessor always.Compiler is divided into front end and rear end two parts, and traditional compiler front-end based on the TTA architecture processor mainly is to realize by GNU GCC technical organization, as the MoveFramework of Delft university.Adopt this mode, the TTA instruction of serial is expressed as binary file format: this formatting flexibility is very low, lack the program profile information, even there is the program profile information, also by independent body, the contact of shortage and program itself certainly will cause and can't generate high-quality instruction level parallelism code in the compilation process of rear end.Two tasks are mainly finished in compiler front-end, rear end: register distributes and instruction scheduling.Register allocation method adopts the graph coloring method that Chaitin proposes more at present, but the method produces a not conflict graph of pigmentable through regular meeting, thereby causes having a large amount of temporary variables to be injected into internal memory, has increased the time of access variable.And aspect instruction scheduling, Moveframework adopts the list scheduling method, and its core concept is based on data dependency graph topological structure and dispatches, and does not consider global optimum in the process of scheduling, and the code quality that this algorithm generates is relatively poor.In addition, it lacks the processing for resource deadlock and streamline collision problem.Integral linear programming (Integer Linear Programming) algorithm provides a unified method to distribution and the software bypass of instruction scheduling, register, a plurality of problems is incorporated into one optimizes in the model.Though this method can generate the high-quality code in theory, successfully solved the multi-objective optimization question of register allocation optimized and parallel two the conflicting targets of optimizing scheduling of instruction to a certain extent.But because the restriction relation complexity of setting up often is difficult in and obtains use in the actual technique of compiling.Experiment shows that this method only is applicable to code optimization in the small-scale fundamental block.
Summary of the invention
The present invention is intended to for overcoming above-mentioned prior art problem, and proposes a kind of compiler system based on the Tcore processor, carries out the tissue of compiler by the intermediate language form, uses linear sweep and genetic algorithm to finish register and distributes the and instruction scheduling.
The present invention proposes a kind of compiler based on configurable processor, and this compiler comprises SUIF2, first order middle machine language SUIF IR, can expand compiler framework MACHSUIF, second level middle machine language SUIFVM, unified expansion interface Machine Library Interface, serial Tcore processor code Sequential Code, architecture resolver Architecture parser, code generator code generator, configurable processor architecture description document ADT, parallel ConfigurableProcessor code, wherein:
SUIF2 describes the higher level lanquage of using and is converted into first order middle machine language SUIF IR, and outputs it to and can expand compiler framework MACHSUIF and carry out subsequent treatment; Can expand compiler framework MACHSUIF is converted into first order middle machine language SUIF IR towards the machine language level intermediate representation of the virtual machine of its inner definition, i.e. second level middle machine language SUIFVM; Can expand compiler framework MACHSUIF and comprise unified expansion interface Machine Library Interface, provide the interface of seeking unity of standard with the required machine architecture of extending user by this interface for the user, and second level middle machine language SUIFVM has been converted into serial configurable processor code Sequential Code; Serial configurable processor code Sequential Code is imported into the code generator code generator of compiler back-end; Resolve the architecture parameter that configurable processor architecture description document ADT obtains by architecture resolver Architecture parser; Each architecture parameter uses the XML descriptive language to organize; Utilize the architecture parameter, compiler back-end is finished register distribution, instruction scheduling, and the code conversion of serial configurable processor is parallel configurable processor code Parallel Code.
Described architecture parameter comprises total line number, highway width, the delay of register number, functional unit function unit.
Described register distributes employing linear sweep algorithm, and this algorithm may further comprise the steps:
Register distribution and program rewriting are carried out simultaneously, and direct scanning sequence after the data-flow analysis is whenever running into a variable V a, interrupt routine is rewritten, and begins to carry out register and distributes, and determines to distribute certain register or it is overflowed for it, determines directly rewriting source program according to distributing then; Idiographic flow may further comprise the steps: the source code of lining by line scan is when running into a variable V aJudge this variable V aWhether occur for the first time;
If variable V aBe to occur for the first time, the variable that life interval in the active tabulation has been finished removes, and the said variable that has finished refers to that those life cycles are in variable V here aThe variable that has finished before occurring for the first time, these variablees do not need to take any register again;
Judged whether the residue register;
If any the residue register, give variable V aDistribute a register R j
If do not remain register, then this register is set at and overflows;
If variable V aNot to occur for the first time, then for having distributed register R iVariable R a, carrying out source program and rewrite, said source program is rewritten here, refers to denotation of variable V in the source program a, replace with corresponding register flag R iConcrete steps comprise: 1. search variable-register mapping table, find out the register label of this variable correspondence; 2. carry out V a-R iCharacter string is replaced;
For the variable V of having delivered to internal memory a, judge and next will carry out read operation or write operation, step 208;
If read operation then is written into this register again;
If write operation, then register is set at and overflows.
Described instruction scheduling adopts genetic algorithm, and this algorithm may further comprise the steps:
Coding adopts the mode of parallel encoding vector, and namely each the element vi among 1 * n-dimensional vector v represents the scheduling numbering of i bar operation in the serial code.In order to realize dynamic allocation of resources, we have designed a FU dynamic assignment vector w, wi represents the FU resource number of a complete operation of a FU (same FU depends on all operations of one-shot transmission), v, w carries out genetic manipulation simultaneously, each genetic manipulation dynamically generates the time-delay relation by the analysis to w, then v is carried out the cross and variation operation and occurs up to optimum solution;
The generation of initial population
Generate initial population at random, the integer of random number in [1, n/s] is interval, n is serial operation length, s is the total line number.The generation of population has improved the search efficiency to solution space greatly at random;
Select
The employing elite keeps, the strategy that inferior position is eliminated, and population is arranged from small to large by fitness, selects the parent individuality to carry out genetic manipulation at random in 20% the individuality in the past, replaces the individuality of random choose in the back 20% with the result of genetic manipulation;
Intersect
For general cross method such as PMX, OX, CX, the solution after the intersection is not feasible solution.This is because the discrete distribution of solution space determines.At ideal mathematics model, constraint condition all is linear relationship, and solution space is a convex set under this ideal model; Character by convex set has:
sup poseΨis a convex set , ∀ α , β ∈ Ψ
∃ λ ( 0 ≤ λ ≤ 1 ) , such that
λα+(1-λ)β∈Ψ
Thus, we use the method for linear crossing, and the individuality after intersecting so still satisfies constraint condition.This method exists a non-integer problem namely may produce non-integral coding vector by the method for linear crossing.Suppose x, y, z are the individual coding vector of population,
Make z=λ x+ (1-λ) y (0≤λ≤1), x, y are the parent individuality, and z is their offspring individuals after intersecting.The element of the coding vector z that obtains like this may not be integer.The method that we take to round downwards addresses this problem.
Suppose
Figure GSB00000948722900043
And i, j ∈ [1, n], j>i, i, the minimum time-delay of j bar transmission operation room is δ (δ N +), then
z i=λx i+(1-λ)y i (1)
z j=λx j+(1-λ)y j (2)
z j-z i=λ(x j-x i)+(1-λ)(y j-y i) (3)
(3) 〉=δ is obviously arranged, and we need proof
Again by (3) as can be known
z j≥δ+z i
Figure GSB00000948722900045
(4) formula must be demonstrate,proved;
Obtain new individual coding vector z so two individual coding vector x, y intersect, make
Figure GSB00000948722900047
(0≤λ≤1), the λ in the formula is a parameter of carrying out interlace operation, and the interleaved mode that adopts herein is exactly
Figure GSB00000948722900048
(0≤λ≤1), the traditional interleaved mode that adopts for genetic algorithm.Here the value of λ is 1/2, and this is in order to make filial generation can balancedly obtain the information of two parents, for interlace operation in the genetic algorithm the value that generally adopts;
Variation
For each hereditary individual schedule coding vector x, add up dutycycle of each row, preferentially select the forerunner to rely on to be 0 operation, with its scheduling numbering change (preferentially selecting the little numbering of dutycycle).
Described instruction scheduling adopts genetic algorithm, and is further comprising the steps of:
The interval of two transmission operation rooms of compression.
Time-delay δ in the described minimum matrix of delaying time between the startup of the result of the current transmission operation of regulation and the same operation of next bar=-(fu_delay-1), fu_delay is the inherent delay of this functional unit, indication cycle, unit are individual.
Described instruction scheduling adopts genetic algorithm, and is further comprising the steps of:
When deadlock and resource contention took place, the method for the time-delay in the mandatory provision serial code between adjacent two function unit manipulations was avoided, two operation o1, and o2 shares same FU, and is defined in arbitrarily continuous o1 in the serial operation, the time-delay of o2 operation room; Suppose the trigger operation of o1, the serial numbering of the operand operation of o2 is respectively i, j (i<j), stipulate aij=1 in the time-delay matrix, to avoid conflict.
Compared with prior art, the present invention can provide powerful Software tool support for the Tcore configurable microprocessor, make compiler completion code generation efficiently, by based on the compiler front-end of intermediate form be combined based on the compiler back-end of linear sweep and these two kinds of heuritic approaches of genetic algorithm, solved long-term puzzlement based on the problem of the compiler inefficiency of TTA architecture processor.
Description of drawings
Fig. 1 is the configuration diagram of compiler system based on the Tcore processor proposed by the invention;
Fig. 2 be compiler system based on the Tcore processor proposed by the invention the register algorithm flow chart [
Fig. 3 cause distributing in the conflict situations that occurs and the solution for the linear sweep of compiler system based on the Tcore processor proposed by the invention and program rewriting before control flow instance graph;
Fig. 4 causes distributing in the conflict situations that occurs and the solution for the linear sweep of compiler system based on the Tcore processor proposed by the invention and instance graph is flowed in the control of program rewriting after finishing;
Fig. 5 is the test result synoptic diagram that the register of compiler system based on the Tcore processor proposed by the invention distributes.
Embodiment
The present invention adopts intermediate form to preserve and organize a large amount of program profile information in the compiler front-end of Tcore microprocessor, carry out more complicated code analysis and code and transform, use the abundant framework information of XML linguistic organization to finish the retargetable design of compiler back-end.Based on the Tcore compiler framework of this thought tissue as shown in Figure 1.This framework adopts by the SUIF2 of Stanford university exploitation and MACHSUIF structure compiler front-end, the higher level lanquage that SUIF2101 is responsible for using is described and is converted into first order middle machine language (being expressed as SUIF IR) 102, and outputs it to and can expand compiler framework (MACHSUIF) 103 and carry out subsequent treatment.Can expand compiler framework (MACHSUIF) and be one and be convenient to construct compiler back-end that extendible compiler framework utilizes it can construct machine language level intermediate form rapidly.MACHSUIF is converted into SUIF IR towards the machine language level intermediate representation of the virtual machine of its inner definition, i.e. second level middle machine language 104 (expression SUIFVM).Unified interface (the Machine Library Interface) 105 of the significant components expansion of MACHSUIF provides the interface of unified standard with the required machine architecture of extending user for the user, as x86, alpha, Tcore etc., and SUIFVM is converted into serial Tcore processor code (Sequential Code) 106; It is code generator (code generator) 108 that the Tcore processor code 106 of serial is imported into compiler back-end.
Resolve the various architecture parameters that Tcore architecture description document ADT (Architecture Description of Tcore) obtains by architecture resolver (Architecture parser) 107, comprise total line number, highway width, register number, functional unit (function unit, delay FU) etc.Each architecture parameter uses the XML descriptive language to organize, and resolves by the libxml storehouse.Utilize these parameter compiler back-ends 110 to finish register and distribute (Register Allocation), resource to distribute (Resource Allocation), instruction scheduling (Instruction Scheduling), bypass work such as (ByPass), serial Tcore code conversion is parallel Tcore code (Parallel Code) 109.
The resource distribution is to have adopted a resource constraint table, the current resource that can use is put into table distribute on demand, and resource is removed from the resource constraint table after distributing, and the resource of having used discharges back the resource constraint table after being used to complete.
The present invention has adopted the front end compiler based on intermediate form to substitute traditional GUN GCC compiler, preserves and organize abundant program profile information, generates efficient code to support back-end compiler.Defective at conventional back end method that compiler adopts, using the linear scan technique processing register instead distributes, experimental results show that, the method applied in the present invention utilizes less register can generate high-quality code, can successfully be applied to improve programming efficiency greatly in the Tcore processor compiler based on TTA.Compiler back-end adopts genetic algorithm to realize instruction scheduling, compare by this algorithm with traditional list scheduling heuritic approach and to finish the parallel codes that instruction scheduling can produce better quality, and by minimum time-delay matrix reduction for the processing of resource contention and streamline conflict, the instruction level parallelism degree of the test case 90% or more is above classic method.
Specific algorithm involved in the present invention is described as follows:
(1) register distributes
1, important definition
If V is the set of all variablees, I is the set of all instructions.Provide as giving a definition:
Definition 1Location (a)={ V i| V i∈ V, a ∈ I, and V iBe existing state };
Definition 2Live (V i)=[a, b], namely
Figure GSB00000948722900061
Wherein, a=start (V i), and b=end (V i), right ∀ m , n ∈ I , M<a and n>b, V i ∉ Location ( m ) , And V i ∉ location ( n ) .
Definition 3lives={V i| V iIf ∈ V is start (V i)<start (V j), V then i<V j}
Definition 4active={V i| V i∈ V, V are just taking certain register, if end (V i)<end (V j), V then i<V j}
2, the details of algorithm is described
The linear sweep algorithm can be finished the statistics to the life interval of aleatory variable in the program by classical du-chain data-flow analysis.If exist the overlapping region between any two variablees, then same physical register can not be distributed to simultaneously this two variablees.The linear sweep algorithm can be distributed to variable with physical register as much as possible.If give m the register that be configured to of processor, if certain some place in program has n variable simultaneously overlapping, and n>m, then have at least n-m variable must deliver to memory cache this moment, with correct dependence between the assurance program, thereby guarantee program correctness.
Vector lives presses all variablees in the vertical sequential storage code of starting point in variable life interval; Vector active takies the variable of certain register at current point by the vertical sequential storage of the terminating point in life interval.The overlapping number of range of variables only just changes when the starting point that runs into certain range of variables and terminating point.Therefore only scan and just need consider how to distribute this variable when occurring a new variable in the program.
Each variable in the scanning lives vector when scanning a new variables V in order, turns to scanning active vector, does not have overlapping variable to remove from active with this new variables V in those life intervals.Because the end point of the variable that is removed early than the starting point of variable V, makes its register that takies be released, d/d free register can be redistributed to other variable.Distribute a register to variable V then, and by the Cahn-Ingold-Prelog sequence rule of active V is put into the active vector.
The length of active be important judge index a: length (act) during algorithm is realized=| active|.Known max (length (act))=m, and m=num (reg) by its definition.Worst case is that length (act)=m in this case, must have a variable to be overflowed (spill) when meeting a new variable and need distribute register.What is called is overflowed, and refers to that the register number divides timing that variable directly is saved in memory cache inadequately.In the time of need reading this and overflow variable, the process with its extraction from internal memory is reload.
3, algorithm improves and implementation procedure
Traditional method is the two-pass scan method: first pass scans the live vector fast, determines which variable to distribute register, and which variable spills into internal memory; Second time scanning sequence rewritten program according to allocation result.Distribution soon and rewriting divide to be carried out respectively for twice.
The shortcoming of this method is to need record and keep variable all storage track in register and in the internal memory, has increased the space complexity of compiling.
Here above-mentioned classic method is improved, register distribution and program rewriting are carried out simultaneously.Direct scanning sequence after the data-flow analysis is whenever running into a variable V a, interrupt routine is rewritten, and begins to carry out register and distributes, and determines to distribute certain register or it is overflowed for it, determines directly rewriting source program according to distributing then.Idiographic flow is as shown in Figure 2: the source code of lining by line scan, and when running into a variable V a, step 201; Judge this variable V aWhether step 202 appears for the first time;
If variable V aBe to occur for the first time, the variable that life interval in the active tabulation has been finished removes, and (the said variable that has finished refers to that those life cycles are in variable V here aThe variable that has finished before occurring for the first time, these variablees do not need to take any register again), step 203;
Judged whether the residue register, step 204;
If any the residue register, give variable V aDistribute a register R j, step 205;
If do not remain register, then this register is set at and overflows step 206;
If variable V aNot to occur for the first time, then for having distributed register R iThe variable V of (step 207) a, carry out source program and rewrite, (said source program is rewritten here, refers to denotation of variable V in the source program a, replace with corresponding register flag R iConcrete steps comprise: 1. search variable-register mapping table, find out the register label of this variable correspondence; 2. carry out V a-R iCharacter string is replaced.) step 207;
For the variable V of having delivered to internal memory a, judge and next will carry out read operation or write operation, step 208;
If read operation then is written into this register, step 209 again;
If write operation, then register is set at and overflows step 210.
The linear sweep algorithm is regarded source program as an integral body and is operated, and it is advantageous that baroque control cleanlinessization, has reduced the complexity of graph coloring register allocation.Yet with all fundamental blocks all linear sweep distribute register, not fully with respect to the control stream of program.This way is not considered the follow-up relation of forerunner between each fundamental block, the conflict that has caused the allocation result between each fundamental block to exist potentially fully.In order to keep the correctness of Program Semantics, after distribution/rewriting process finishes, need to travel through each bar and control stream, solving the same variable distribution condition in different fundamental blocks that produces because of linear sweep does not have linking to cause inconsistent problem.This problem be the present invention proposes the solution of inserting storage instruction between two fundamental blocks, as shown in Figure 3, Figure 4.Fig. 3 is the example of the control flow graph before a register distribution and the program rewriting.Include a plurality of variablees among the figure, but only the situation of variable v1 is listed.Fig. 4 distributes for register variable and rewrites situation after finishing.According to the linear precedence of block (1,2,3,4), the distribution condition of variable v1 as shown in the figure, simultaneously v1 aside marks in top and the bottom of each piece.Storage instruction between block1 and the block2 is shown in ist7.Bilingual among Fig. 3, Fig. 4: block: expression fundamental block; Ist: presentation directives; V: expression variable; R: expression register.
Here be procedure division an acyclic digraph (DAG), all nodes of DAG be configured to a matrix store relation between each fundamental block.
There is a control stream in Ergodic Matrices when judging between fundamental block block1 and the block2, and block1 is predecessor block, when block2 is successor block, scans the situation of each variable.If there is certain variable, it is inconsistent in the memory location at the top of the bottom of block1 and block2, this moment need according to its read-write situation in block2 determine whether should to insert storage instruction with and the insertion position, with the correct dependence between the assurance program.The variable that transmits between fundamental block need be known them in the memory location of predecessor block bottom and in the memory location at successor block top.For the ease of statistics, adopt the list structure storage.
Experimental result
The test procedure of this compiler uses the C64+ series DSP test case of TI, has provided the wherein test result of four groups of test cases: dotproc (16 * 16) (dot product), fir_r8, idct_8 * 8 and maxval (16) (maximal value).This testing needle compiles the instruction set of Tcore processor with 4 buses.The specific use register of the processing usefulness when this processor architecture requires 3 registers are arranged as spill and reload, total the number of registers must not be less than 5.As a comparison, distribute different register numbers to compare to every group of program respectively.Test result as shown in Figure 5.
Experimental observation is 5 from the register number.As described in Figure, along with increasing of giving register number, the instruction strip number of generating code reduces gradually, and the description code quality improves gradually, and flooding code tails off gradually.When distributing 6 registers, the instruction strip number of code significantly reduces.That is to say that under given register situation seldom, the probability that variable overflows just can even not need to overflow very for a short time.
Empirical tests, linear sweep algorithm can be applied to save register resources simultaneously in the Tcore compiler efficiently.
(2). instruction scheduling
1, instruction coding
The instruction scheduling problem be need search out one the mapping S:V → 1 ..., n}, V are the linear sets of serial transmission operation, and n is the set of instruction scheduling time point, and S need satisfy constraints such as certain data dependence relation and resource restriction.We operate encode (transmission operates in the value under the mapping S) to each serial transmission, the coding of representing the serial transmission operation with a 1 * n-tuple, here handled is serial code, and namely delegation is with regard to an instruction, and always total n is capable to establish code, respectively the code of corresponding row is carried out mark with 1~n, mapping S:V → 1 ..., n} has defined a mapping S in other words, the content of this S is the element in the instruction set V, i.e. each instruction is carried out mark with 1~n respectively.Here said resource restriction refers to:
1) number of buses restriction: current highway width has limited the operand that can walk abreast and carry out.
2) functional unit restriction: the functional unit of current residual has limited the current operation that can carry out.
Data dependence relation refers to:
1) data stream relies on: the value in certain storage address or the register is to be generated by instruction 1, and is used by instruction 2.(being called for short write-read relies on)
2) antidependence: the value in certain storage address or the register is instructed 1 to use, and then is revised by instruction 2.(being called for short read-write relies on)
3) data output relies on: the value in certain storage address or the register is to be generated by instruction 1, and then is revised by instruction 2.(dependence write in abbreviation)
Except last relation of plane, trigger structure for transmission and also have some other dependences:
1) triggering-result relies on: the triggering transmission that guarantees an operation always was called before the number of results operand transmission of same operation.
2) operand-triggering relies on: the operand transmission that guarantees an operation all was called before the trigger action transmission of same operation.
And following steps are the processes that this coding are optimized scheduling.
2, the time-delay of instruction coding
The delay of instruction coding is essential adopting genetic algorithm to carry out instruction scheduling, because must satisfy certain restrictive condition when carrying out instruction scheduling, and an important restrictive condition is exactly the delay relation that must satisfy them between the instruction and instruction.Must interlude when so-called two instructions that postpone to refer to have dependence are carried out, for example, the time-delay of instruction 1 and instruction 2 be 2 refer to instruct 2 can only be after instruction 1 be carried out two time intervals or more of a specified duration could the execution.
For a customizable TTA processor, the time-delay of each operation is by a functional unit (function unit, FU) determining, is 2 cycles such as the basic time-delay of adder unit, and two all after dates after expression add_t is triggered could be got the result of this operation in add_r.
Defined the minimum time-delay that the N * N of minimum time-delay matrix A=(aij) represents to transmit operation room.Wherein aij represents that serial operation is numbered i and serial operation and is numbered minimum time-delay between the j (i<j).(i, j) expression is numbered i, the time-delay between the j serial operation hereinafter to use Mindelay.
3, the feasibility adjustment of parallel scheduling coding
Judge it is the key of algorithm success for the feasibility of a scheduling scheme arbitrarily.The present invention has adopted a kind of adjustment scheme based on minimum time-delay matrix.Shown in algorithm 1,2:
Algorithm 1
Figure GSB00000948722900101
Algorithm 2
ScheduledNumber (j) judges whether the two satisfies minimum delay requirement, if do not satisfy be adjusted to meet the demands till.Adjustment through algorithm 1 can be satisfied the time-delay constraint to the parallel encoding vector arbitrarily.
Algorithm 2 is that a parallel encoding vector that obtains arbitrarily almost can not be as the input of genetic algorithm through poor effect after the adjustment of algorithm 1 to the replenishing of algorithm 1.The main effect of algorithm 2 is intervals of two transmission operation rooms of compression, the coding relative compact that obtains like this, thus improved the efficient of genetic algorithm greatly.Here said compression refers to the instruction compression that allows compression at interval to just in time satisfying the position that all postpone condition.
4, conflict and deadlock avoids
Conflict and deadlock mainly are owing to unreasonable the taking to system resource produces.
1) generation of conflict and deadlock
Conflict mainly is resource contention and streamline conflict, and resource contention is the public same functional unit of different operating and the conflict that produces, and the streamline conflict is the continuous use of same functional unit, and the result exports and the conflict that causes continuously.
Deadlock mainly results from the list scheduling algorithm, and transmission operation o1 and o2 take operand and the trigger register of same FU respectively.The o1 operation can not be called because operation o2 needs the trigger register of same functional unit, and it is because o1 has taken the operand register of this functional unit that o2 can not be called.
2) avoidance strategy
Deadlock and resource contention are all caused by shared same functional unit, therefore, can take similar method to handle, the mode that the present invention takes is that the method for the time-delay between two function unit manipulations adjacent in the mandatory provision serial code is avoided, such as two operation o1, o2 shares same FU, and we are defined in arbitrarily continuous o1 in the serial operation, the time-delay of o2 operation room.Suppose the trigger operation of o1, the serial numbering of the operand operation of o2 is respectively i, and (i<j), we stipulate aij=1 to j in the time-delay matrix, and collision problem just can be avoided like this.Experiment shows that this mode effect under some fc-specific test FC use-case is not fine.
For the conflict of streamline, the method success of minimum time-delay avoided this problem.Time-delay δ in minimum time-delay matrix between the trigger of the result operation of the current transmission operation of regulation and the same operation of next bar=-(fu_delay-1), fu_delay is the inherent delay of this functional unit.
5, the realization of genetic algorithm
Method for collision avoidance above adopting is carried out genetic algorithm in desirable model based.Coding of the present invention adopts the mode of parallel encoding vector, and namely each the element vi among 1 * n-dimensional vector v represents the scheduling numbering of i bar operation in the serial code.In order to realize dynamic allocation of resources, we have designed a FU dynamic assignment vector w, wi represents the FU resource number of a complete operation of a FU (same FU depends on all operations of a trigger transmission), v, w carries out genetic manipulation simultaneously, each genetic manipulation dynamically generates the time-delay relation by the analysis to w, then v is carried out the cross and variation operation and occurs up to optimum solution.
1) generation of initial population
Generate initial population at random, the integer of random number in [1, n/s] is interval, n is serial operation length, s is the total line number.The generation of population has improved the search efficiency to solution space greatly at random.
2) select
The employing elite keeps, the strategy that inferior position is eliminated, and population is arranged from small to large by fitness, selects the parent individuality to carry out genetic manipulation at random in 20% the individuality in the past, replaces the individuality of random choose in the back 20% with the result of genetic manipulation.Adopt this strategy will accelerate convergence of algorithm speed greatly.But can make algorithm be absorbed in " precocity " phenomenon.
3) intersect
For general cross method such as PMX, OX, CX, the solution after the intersection is not feasible solution with very big probability.This is because the discrete distribution of solution space determines.At ideal mathematics model, constraint condition all is linear relationship, and solution space is a convex set under this ideal model.Character by convex set has:
sup poseΨis a convex set , ∀ α , β ∈ Ψ
∃ λ ( 0 ≤ λ ≤ 1 ) , such that
λα+(1-λ)β∈Ψ
Thus, we use the method for linear crossing, and the individuality after intersecting so still satisfies constraint condition.This method exists a non-integer problem namely may produce non-integral coding vector by the method for linear crossing.Suppose x, y, z are the individual coding vector of population,
Make z=λ x+ (1-λ) y (0≤λ≤1), x, y are the parent individuality, and z is their offspring individuals after intersecting.The element of the coding vector z that obtains like this may not be integer.The method that we take to round downwards addresses this problem.
Suppose And i, j ∈ [1, n], j>i, i, the minimum time-delay of j bar transmission operation room is δ (δ ∈ N +), then
z i=λx i+(1-λ)y i (1)
z j=λx j+(1-λ)y j (2)
z j-z i=λ(x j-x i)+(1-λ)(y j-y i) (3)
(3) 〉=δ is obviously arranged, and we need proof
Figure GSB00000948722900124
Again by (3) as can be known
z j≥δ+z i
Figure GSB00000948722900125
Figure GSB00000948722900126
(4) formula must be demonstrate,proved!
Obtain new individual coding vector z so two individual coding vector x, y intersect, make
Figure GSB00000948722900127
(0≤λ≤1), the λ in the formula is a parameter of carrying out interlace operation, and the interleaved mode that adopts herein is exactly
Figure GSB00000948722900131
(0≤λ≤1), the traditional interleaved mode that adopts for genetic algorithm.Here the value of λ is 1/2, and this is in order to make filial generation can balancedly obtain the information of two parents, for interlace operation in the genetic algorithm the value that generally adopts.
4) variation
For each hereditary individual schedule coding vector x, add up dutycycle of each row, preferentially select the forerunner to rely on to be 0 operation, with its scheduling numbering change (preferentially selecting the little numbering of dutycycle).
6, experiment effect
Instruction scheduling still adopts the standard testing use-case of TI C64+ series DSP to carry out statistical study respectively, as shown in table 1, the scheduling result of genetic algorithm and MOVEframework list scheduling is compared, serial code bar number in the table is that MACHSUIF generates, and the parallel codes bar number after scheduling is handled in bypass has been listed in back two.Pipeline.c (streamline conflict test procedure), vecsum.c (16 bit vectors and), maxval.c (asking vectorial peaked test procedure) and dotpro.c (16 sites are long-pending through the unroll processing) are for to write several test cases targetedly to this algorithm.
Table 1 test program set
Figure GSB00000948722900132

Claims (2)

1. compiler based on configurable processor, this compiler comprises SUIF2, first order middle machine language SUIF IR, can expand compiler framework MACHSUIF, second level middle machine language SUIFVM, unified expansion interface Machine Library Interface, serial configurable processor code Sequential Code, architecture resolver Architecture parser, code generator code generator, configurable processor architecture description document ADT, parallel configurable processor code, wherein:
SUIF2 describes the higher level lanquage of using and is converted into first order middle machine language SUIF IR, and outputs it to and can expand compiler framework MACHSUIF and carry out subsequent treatment; Can expand compiler framework MACHSUIF is converted into first order middle machine language SUIF IR towards the machine language level intermediate representation of the virtual machine of its inner definition, i.e. second level middle machine language SUIFVM; Can expand compiler framework MACHSU IF and comprise unified expansion interface Machine Library Interface, provide the interface of seeking unity of standard with the required machine architecture of extending user by this interface for the user, and second level middle machine language SUIFVM has been converted into serial configurable processor code Sequential Code; Serial configurable processor code Sequential Code is imported into the code generator codegenerator of compiler back-end; Resolve the architecture parameter that configurable processor architecture description document ADT obtains by architecture resolver Architecture parser; Each architecture parameter uses the XML descriptive language to organize; Utilize the architecture parameter, compiler back-end is finished register distribution, instruction scheduling, and the code conversion of serial configurable processor is parallel configurable processor code Parallel Code.
2. the compiler based on configurable processor as claimed in claim 1 is characterized in that, described architecture parameter comprises total line number, highway width, the delay of register number, functional unit function unit.
CN200910070942.6A 2009-10-23 2009-10-23 Compiler system based on TCore configurable processor Expired - Fee Related CN101694627B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910070942.6A CN101694627B (en) 2009-10-23 2009-10-23 Compiler system based on TCore configurable processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910070942.6A CN101694627B (en) 2009-10-23 2009-10-23 Compiler system based on TCore configurable processor

Publications (2)

Publication Number Publication Date
CN101694627A CN101694627A (en) 2010-04-14
CN101694627B true CN101694627B (en) 2013-09-11

Family

ID=42093600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910070942.6A Expired - Fee Related CN101694627B (en) 2009-10-23 2009-10-23 Compiler system based on TCore configurable processor

Country Status (1)

Country Link
CN (1) CN101694627B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799419B (en) * 2012-09-05 2014-10-22 无锡江南计算技术研究所 Register writing conflict detection method and device, and processor
CN104731555A (en) * 2013-12-23 2015-06-24 中兴通讯股份有限公司 Method and device for avoiding conflict among registers
US9825884B2 (en) 2013-12-30 2017-11-21 Cavium, Inc. Protocol independent programmable switch (PIPS) software defined data center networks
US10050833B2 (en) * 2014-06-19 2018-08-14 Cavium, Inc. Method of reducing latency in a flexible parser and an apparatus thereof
US9635146B2 (en) 2014-06-19 2017-04-25 Cavium, Inc. Method of using bit vectors to allow expansion and collapse of header layers within packets for enabling flexible modifications and an apparatus thereof
US10616380B2 (en) 2014-06-19 2020-04-07 Cavium, Llc Method of handling large protocol layers for configurable extraction of layer information and an apparatus thereof
CN104503732A (en) * 2014-12-30 2015-04-08 中国人民解放军装备学院 One-dimensional eight-point IDCT (inverse discrete cosine transform) parallelism method for Feiteng processor
CN105005638B (en) * 2015-06-04 2018-06-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 A kind of High Level Synthesis dispatching method based on linear delay model
CN106020922B (en) * 2016-05-30 2019-01-08 湖南科技大学 The instruction dispatching method of idle beat is filled with the execution packet of jump target basic block
CN107590535A (en) * 2017-09-08 2018-01-16 西安电子科技大学 Programmable neural network processor
CN110147236B (en) * 2019-04-30 2023-01-31 创新先进技术有限公司 Code compiling method and device
CN110221838B (en) * 2019-05-28 2020-10-27 中国科学院高能物理研究所 Method for carrying out automatic program design optimization based on genetic algorithm and directed acyclic graph
CN112214222B (en) * 2020-10-27 2021-11-19 华中科技大学 Sequential structure for realizing feedforward neural network in COStream and compiling method thereof
CN113553061B (en) * 2021-09-22 2021-12-17 西安芯瞳半导体技术有限公司 Method and device for improving execution performance of source program and computer storage medium
CN113791770B (en) * 2021-11-15 2022-06-21 北京壁仞科技开发有限公司 Code compiler, code compiling method, code compiling system, and computer medium
WO2023142005A1 (en) * 2022-01-28 2023-08-03 Intel Corporation Concept for handling memory spills
CN116841618B (en) * 2023-07-04 2024-02-02 上海耀芯电子科技有限公司 Instruction compression method and system, decompression method and system of TTA processor

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1525323A (en) * 2003-02-24 2004-09-01 ���µ�����ҵ��ʽ���� Processor and compiler for creating program for the processor
CN1656446A (en) * 2002-05-24 2005-08-17 皇家飞利浦电子股份有限公司 Configurable processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1656446A (en) * 2002-05-24 2005-08-17 皇家飞利浦电子股份有限公司 Configurable processor
CN1525323A (en) * 2003-02-24 2004-09-01 ���µ�����ҵ��ʽ���� Processor and compiler for creating program for the processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JP特开2004-34660A 2004.02.05

Also Published As

Publication number Publication date
CN101694627A (en) 2010-04-14

Similar Documents

Publication Publication Date Title
CN101694627B (en) Compiler system based on TCore configurable processor
Wingate et al. Lightweight implementations of probabilistic programming languages via transformational compilation
US5339420A (en) Partitioning case statements for optimal execution performance
CN104424128A (en) Variable-length instruction word processor system and method
Vollmer et al. Compiling tree transforms to operate on packed representations
Winterstein et al. Separation logic-assisted code transformations for efficient high-level synthesis
Goumas et al. An efficient code generation technique for tiled iteration spaces
Giering et al. Recomputations in reverse mode AD
CN100414514C (en) Memory access instruction vectorization
CN103760965B (en) A kind of algorithm source program energy conservation optimizing method of energy constraint embedded system
Schanen Semantics driven adjoints of the message passing interface
CN116627396A (en) Polyhedral model nested cyclic transformation dynamic solving acceleration method
Gutiérrez et al. Automatic parallelization of irregular applications
CN103140853A (en) Method and apparatus for using entropy in ant colony optimization circuit design from high level systhesis
Nicolau et al. Register allocation, renaming and their impact on fine-grain parallelism
Ciesielski et al. Data-flow transformations using Taylor expansion diagrams
Ottoni et al. Improving offset assignment through simultaneous variable coalescing
Rho et al. Compilation of a functional language for the multithreaded architecture: Davrid
Hamilton et al. Compile-time garbage collection by necessity analysis
Brockmeyer et al. Low power storage cycle budget distribution tool support for hierarchical graphs
Touati et al. Advanced Backend Code Optimization
Gomm et al. On the design of parallel programs for machines with distributed memory
JP2000020482A (en) Loop parallelizing method
Ghodrat et al. Control flow optimization in loops using interval analysis
Färber Terms for Efficient Proof Checking and Parsing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130911

Termination date: 20211023

CF01 Termination of patent right due to non-payment of annual fee