
CN101833437A - Device and method for a microprocessor - Google Patents

Publication number
CN101833437A
Authority
CN
China
Prior art keywords
instruction
queue
byte
line
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201010185635A
Other languages
Chinese (zh)
Other versions
CN101833437B (en
Inventor
Thomas C. McDonald (汤玛斯·C·麦当劳)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Via Technologies Inc
Original Assignee
Via Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/572,024 external-priority patent/US8335910B2/en
Application filed by Via Technologies Inc filed Critical Via Technologies Inc
Publication of CN101833437A publication Critical patent/CN101833437A/en
Application granted granted Critical
Publication of CN101833437B publication Critical patent/CN101833437B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The invention provides an apparatus and method for a microprocessor. The apparatus extracts instructions from an instruction byte stream in a microprocessor whose instruction set architecture has variable-length instructions. The apparatus comprises: a first queue having entries, each of which stores a line of instruction bytes received from an instruction cache; a plurality of decoders that generate a corresponding start/end mark for each instruction byte of a line in the first queue; a second queue having entries, each of which stores a line of instruction bytes received from the first queue together with the corresponding start/end marks received from the decoders; and control logic that detects a condition, loads the first line and its corresponding start/end marks into the second queue without shifting the first line out of the first queue, and extracts a plurality of instructions from the first line in the second queue for subsequent processing by the microprocessor.

Description

Apparatus and method for a microprocessor
Technical field
The present invention relates to the field of microprocessors, and in particular to extracting instructions from an instruction byte stream in a microprocessor whose instruction set architecture has variable-length instructions.
Background art
A microprocessor includes one or more execution units that perform the actual execution of instructions. A superscalar microprocessor can issue multiple instructions to the execution units in each clock cycle, which raises throughput, that is, the average number of instructions per clock cycle. However, the instruction fetch and decode functions at the top of the microprocessor pipeline must supply an instruction stream to the execution units at a sufficient rate to use the execution units effectively and raise throughput. The x86 architecture makes this task harder because its instructions are not of fixed length; the length of each instruction varies, as described in detail below. An x86 microprocessor must therefore include a large amount of logic to process the incoming instruction byte stream and determine where each instruction begins and ends. Consequently, there is a need to increase the rate at which an x86 microprocessor can parse the instruction byte stream into individual instructions.
Summary of the invention
According to one aspect of the invention, an apparatus for a microprocessor is provided for extracting instructions from an instruction byte stream of the microprocessor, whose instruction set architecture has variable-length instructions. The apparatus comprises: a first queue having a plurality of entries, each of which stores a line of instruction bytes received from an instruction cache; a plurality of decoders that generate a corresponding start/end mark for each instruction byte of a line of instruction bytes in the first queue; a second queue having a plurality of entries, each of which stores a line of instruction bytes received from the first queue and the corresponding start/end marks received from the decoders; and control logic configured to: detect a condition in which the length of an instruction has not yet been determined because the beginning portion of the instruction is in a first line of instruction bytes in the first queue, while the remainder of the instruction is in a second line of instruction bytes that has not yet been loaded into the first queue from the instruction cache; in response to detecting the condition, load the first line and its corresponding start/end marks into the second queue without shifting the first line out of the first queue; and extract a plurality of instructions from the first line in the second queue, according to the corresponding start/end marks, for subsequent processing by the microprocessor, wherein the extracted instructions do not include the instruction whose length is undetermined.
According to another aspect of the invention, a method is provided for use in a microprocessor having variable-length instructions, for providing instructions from an instruction byte stream supplied by an instruction cache. The microprocessor includes a first queue that receives lines of instruction bytes from the instruction cache; a decoder that generates a corresponding start/end mark for each instruction byte of a line of instruction bytes in the first queue; and a second queue that receives the lines of instruction bytes from the first queue and the start/end marks from the decoder. The method comprises: detecting a condition in which the length of an instruction has not yet been determined because the beginning portion of the instruction is in a first line of instruction bytes in the first queue, while the remainder of the instruction is in a second line of instruction bytes that has not yet been loaded into the first queue from the instruction cache; in response to detecting the condition, loading the first line and its corresponding start/end marks into the second queue without shifting the first line out of the first queue; and extracting a plurality of instructions from the first line in the second queue, according to the corresponding start/end marks, for subsequent processing by the microprocessor, wherein the extracted instructions do not include the instruction whose length is undetermined.
Brief description of the drawings
FIG. 1 is a block diagram of a microprocessor according to an embodiment of the invention.
FIG. 2 is a block diagram of the L stage of the instruction formatter of FIG. 1.
FIG. 3 shows the accumulated prefix information 238 of FIG. 2.
FIG. 4 shows the operation of the microprocessor of FIG. 1.
FIG. 5 is a block diagram of portions of the L stage and M stage of the instruction formatter of FIG. 1.
FIG. 6 is a flowchart of the operation of the microprocessor elements shown in FIG. 5 to extract instructions (up to three in one embodiment) from the instruction byte stream without incurring a time penalty, regardless of the number of prefix bytes in the instructions.
FIG. 7 is a block diagram of a portion of the instruction formatter of FIG. 1.
FIG. 8a and FIG. 8b are flowcharts of the operation of the portion of the instruction formatter of FIG. 7.
FIG. 9 is a detailed block diagram of the mux queue of FIG. 5.
FIG. 10 is a block diagram of a portion of the M stage of the instruction formatter of FIG. 1.
FIG. 11 is a block diagram of the M-stage control logic of FIG. 5.
FIG. 12 is a flowchart of the operation of a portion of the M stage of the instruction formatter of FIG. 1.
FIG. 13 shows the contents of the mux queue of FIG. 5 in two successive clock cycles, illustrating the operation of the M stage.
FIG. 14 shows the contents of the mux queue of FIG. 5 in two successive clock cycles, illustrating the operation of the M stage.
FIG. 15 shows, in the manner of FIG. 14, the instruction formatter extracting and sending out three instructions containing up to 40 instruction bytes in a single clock cycle.
FIG. 16 shows the BTAC of FIG. 1 making a bad prediction that causes the microprocessor to branch erroneously, that is, the branch taken indicator of FIG. 1 is true, but not for the opcode of an instruction.
FIG. 17 shows the signals that make up the ripple logic output.
FIG. 18 is a flowchart of the operation of the microprocessor of FIG. 1.
FIG. 19 is a detailed block diagram of the length decoder of FIG. 2.
FIG. 20 shows the arrangement of the 16 length decoders.
FIG. 21 is a flowchart of the operation of the length decoders of FIG. 20.
[Description of main reference numerals]
100 microprocessor; 102 instruction cache
104 x86 instruction byte queue (XIBQ); 106 instruction formatter
108 formatted instruction queue (FIQ); 112 instruction translator
114 translated instruction queue; 116 register alias table
118 reservation stations; 122 execution units
124 retire unit; 126 fetch unit
128 branch target address cache (BTAC); 132 instruction bytes
134 instruction bytes; 136 x86 instruction stream
142 current fetch address; 144 adder
146 predicted target address; 148 executed target address
152 next sequential fetch address; 154 branch taken indicator
202 length decoders; 204 ripple logic
208 control logic; 212 output of length decoders
214 output of ripple logic; 218 operand and address sizes
222 instruction length; 224 decoded any prefix indicator
226 decoded LMP indicator; 228 susceptible to LMP indicator
229 prefix information; 232 start bit
234 end bit; 236 valid bit
238 accumulated prefix information; 252 default operand and address sizes
302 OS; 304 AS
306 REX present; 308 REX.W
312 REX.R; 314 REX.X
316 REX.B; 318 REP
322 REPNE; 324 LOCK
326 segment override present; 328 encoded segment override [2:0]
332 any prefix present; 402-414 steps
502 mux queue; 504 I1 mux
506 I2 mux; 508 I3 mux
512 M-stage control logic; 514 control signal
516 control signal; 518 control signal
524 first instruction I1; 526 second instruction I2
528 third instruction I3; 534, 536, 538 valid indicators
602-608 steps; 702 XIBQ control logic
802-824 steps; 1002 accumulated prefix array
1004 instruction byte array; 1102 subtractor
1104 partial LEN; 1106 remaining LEN1
1108 byte position END1; 1112 byte position END0
1114 mux; 1116 adder
1118 register; 1122 instruction length LEN1
1201-1222 steps; 1702 bad BTAC bit
1802-1816 steps; 1902 programmable logic array (PLA)
1904 adder; 1906 mux
1912 eaLen value; 1914 control signal
1916 immLen value; 1918 eaLen value
2102-2116 steps
Detailed description of the embodiments
FIG. 1 is a block diagram of a microprocessor 100 according to an embodiment of the invention. The microprocessor 100 includes a pipeline made up of multiple stages or functional units, including a four-stage instruction cache 102, an x86 instruction byte queue (XIBQ) 104, an instruction formatter 106 (which comprises three stages: L, M, and F), a formatted instruction queue 108, an instruction translator 112, a translated instruction queue 114, a register alias table 116, reservation stations 118, execution units 122, and a retire unit 124. The microprocessor 100 also includes a fetch unit 126 that provides a current fetch address 142 to the instruction cache 102 to select a line of instruction bytes 132 to be provided to the XIBQ 104. The microprocessor 100 also includes an adder 144 that increments the current fetch address 142 to produce a next sequential fetch address 152, which is fed back to the fetch unit 126. The fetch unit 126 also receives a predicted target address 146 from a branch target address cache (BTAC) 128. Finally, the fetch unit 126 receives an executed target address 148 from the execution units 122.
The XIBQ 104 is a queue of entries, each of which holds 16 bytes of data from the instruction cache 102. Each XIBQ 104 entry also holds pre-decoded information associated with the data bytes, which is generated as the data bytes flow from the instruction cache 102 to the XIBQ 104. The cache data from the XIBQ 104 is a stream of instruction bytes 134 in the form of 16-byte blocks, and it is not known where the x86 instructions begin or end within the stream or within a block. The instruction formatter 106 determines the first and last byte of each instruction in the stream, thereby separating the byte stream into a stream of x86 instructions 136, which is supplied to and stored in the formatted instruction queue 108 for processing by the rest of the microprocessor 100 pipeline. When a reset occurs, or when a flow control instruction (for example a jump, a subroutine call, or a return from subroutine) is executed or predicted, the reset address or branch target address is supplied to the instruction formatter 106 as an instruction pointer, which enables the instruction formatter 106 to determine the first byte of the first valid instruction in the current 16-byte block of the instruction stream. The instruction formatter 106 can then determine where the next instruction starts by adding the length of the first target instruction to its start position. The instruction formatter 106 repeats this process until another flow control instruction is executed or predicted.
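The start-plus-length walk described above can be illustrated with a minimal Python sketch. It is not the patented circuit: the function name `walk_block` and the `lengths_at` table (the length an instruction would have if it began at each byte position) are hypothetical, introduced only for illustration.

```python
def walk_block(lengths_at, first_start, block_size=16):
    """Find instruction start positions in one block by repeated start + length.

    lengths_at[i] is a hypothetical, precomputed length of the instruction
    that would begin at byte i. first_start is the position given by the
    reset or branch target address.
    """
    starts = []
    pos = first_start
    while pos < block_size:
        length = lengths_at[pos]
        if length < 1:
            raise ValueError("an instruction must be at least one byte long")
        starts.append(pos)
        pos += length  # the next instruction begins right after this one
    # Number of bytes by which the last instruction runs into the next block.
    return starts, pos - block_size
```

For example, with every instruction three bytes long and the first one starting at byte 2, `walk_block([3] * 16, 2)` returns the starts `[2, 5, 8, 11, 14]` and a spill of one byte into the following block.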
The BTAC 128 also provides branch taken indicators 154 to the XIBQ 104, one for each instruction byte 132 that the instruction cache 102 provides to the XIBQ 104. A branch taken indicator 154 indicates whether the BTAC 128 predicts that a branch instruction is present in the line of instruction bytes 132 provided to the XIBQ 104; if so, the fetch unit 126 selects the predicted target address 146 provided by the BTAC 128. Specifically, the BTAC 128 outputs a true branch taken indicator 154 for the first byte of the branch instruction (even if that first byte is a prefix byte) and a false branch taken indicator 154 for the other bytes of the instruction.
The microprocessor 100 is an x86 architecture microprocessor. A microprocessor is considered an x86 architecture microprocessor if it can correctly execute most application programs designed to run on an x86 microprocessor, and an application program executes correctly if its expected results are obtained. One characteristic of the x86 architecture is that the instructions in its instruction set architecture are of variable length, rather than of fixed length as in some instruction set architectures. Moreover, for a given x86 opcode, the presence or absence of prefixes before the opcode may affect the instruction length. In addition, the length of some instructions may be a function of the default operand and/or address size under the current operating mode of the microprocessor 100 (for example, the D bit of the code segment descriptor, or whether the microprocessor 100 is operating in IA-32e or 64-bit mode). Finally, an instruction may include a length-modifying prefix that selects an address/operand size other than the default. For example, the operand size (OS) prefix (0x66), the address size (AS) prefix (0x67), and the REX.W bit (bit 3) of the REX prefix (0x4x) can be used to change the default address/operand size. Intel calls these length-changing prefixes (LCP); this specification calls them length-modifying prefixes (LMP). The format and length of x86 instructions are well known; for details see Chapter 2 of the IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M, June 2006.
According to pages 3-21 to 3-23 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual, March 2009 (available at http://www.intel.com/Assets/PDF/manual/248966.pdf): "When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughout of the machine pipeline generally cannot hide LCP penalties."
FIG. 2 is a block diagram of the L stage of the instruction formatter 106 of FIG. 1. The instruction formatter 106 includes a plurality of length decoders 202 whose outputs 212 are coupled to a plurality of ripple logic blocks 204, respectively; the outputs 214 of the ripple logic 204 are coupled to control logic 208 and are also provided to the M stage of the instruction formatter 106. In one embodiment, the length decoders 202 generate their outputs 212 during the first phase of a two-phase clock signal of the microprocessor 100, and the ripple logic 204 generates its outputs 214 during the second phase.
The length decoders 202 receive the instruction bytes 134 from the XIBQ 104. In one embodiment, each XIBQ 104 entry is 16 bytes wide, so there are 16 corresponding length decoders 202, denoted 0 through 15 as shown in FIG. 20. Each length decoder 202 receives and decodes its corresponding instruction byte from the bottom entry of the XIBQ 104. In addition, each length decoder 202 receives and decodes the next three adjacent instruction bytes. The last three length decoders 202 receive one or more of these instruction bytes from the next-to-bottom entry of the XIBQ 104 (if the next-to-bottom entry is invalid, the last three length decoders 202 must wait and produce valid output in the next clock cycle). The length decoders 202 are described in detail with respect to FIG. 19. This arrangement enables the length decoders 202 to determine and output the instruction length 222 of the instructions in the bottom entry of the XIBQ 104. In one embodiment, the instruction length 222 is the number of bytes in the instruction excluding prefix bytes, that is, the number of bytes from the opcode through the last byte of the instruction. Specifically, the instruction length 222 of an instruction is the length output by the length decoder 202 that corresponds to the first instruction byte of the instruction.
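As a rough illustration of the byte windows described above, the following Python sketch builds the four-byte window seen by each of the 16 length decoders. The function name `decoder_windows` and the use of `None` to mean "this decoder must wait" are assumptions made for the example, not part of the patent.

```python
def decoder_windows(bottom, next_to_bottom=None, width=16, lookahead=3):
    """Return the byte window examined by each length decoder.

    bottom is the 16-byte bottom XIBQ entry. next_to_bottom is the following
    entry, or None if that entry is not yet valid. Decoder i sees its own byte
    plus the next `lookahead` bytes; a window is None when it needs bytes
    that have not arrived.
    """
    combined = bottom + (next_to_bottom or b"")
    windows = []
    for i in range(width):
        needed_from_next = i + lookahead - (width - 1)
        if needed_from_next > 0 and next_to_bottom is None:
            windows.append(None)  # one of the last three decoders: must wait
        else:
            windows.append(combined[i:i + lookahead + 1])
    return windows
```

With only the bottom entry available, decoders 0 through 12 have complete windows and decoders 13 through 15 do not, which matches the "last three length decoders" case in the text.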
To generate the instruction length 222, the length decoders 202 also use the operand and address sizes 218 received from the control logic 208. The control logic 208 outputs an operand and address size 218 for each instruction byte 134, which it determines from the current default operand and address sizes 252 of the microprocessor 100 and from the outputs 214 of the ripple logic 204. If the outputs 214 of the ripple logic 204 indicate that an instruction has no LMP, the control logic 208 outputs the default operand and address sizes to the corresponding length decoders 202 for each byte of the instruction. If, however, the outputs 214 indicate that the instruction has one or more LMPs, the control logic 208 modifies the default operand and address sizes 252 for each byte of the instruction and outputs the resulting operand and address sizes 218 to the corresponding length decoders 202. The control logic 208 makes this modification according to the values of the OS 302, AS 304, and REX.W 308 bits, which are included in the accumulated prefix information 238 of the outputs 214 of the ripple logic 204, as shown in FIG. 3.
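The way the OS, AS, and REX.W bits modify the default sizes can be sketched as follows. This is a simplified model of the general x86 rules, not the control logic 208 itself; the function name `effective_sizes` and its parameters are hypothetical, and the sketch ignores instructions whose operand size defaults to 64 bits in 64-bit mode.

```python
def effective_sizes(mode_bits, os_prefix=False, as_prefix=False, rex_w=False):
    """Return (operand_bits, address_bits) after applying LMP bits.

    mode_bits is the default size selected by the operating mode: 16 or 32
    (the D bit of the code segment descriptor) or 64 for 64-bit mode.
    """
    if mode_bits == 64:
        # REX.W takes precedence over the 0x66 operand size prefix.
        operand = 64 if rex_w else (16 if os_prefix else 32)
        address = 32 if as_prefix else 64
    else:
        # In 16-bit and 32-bit modes each prefix selects the other size.
        other = 32 if mode_bits == 16 else 16
        operand = other if os_prefix else mode_bits
        address = other if as_prefix else mode_bits
    return operand, address
```

For instance, `effective_sizes(32, os_prefix=True)` gives a 16-bit operand size with a 32-bit address size, which is why an instruction with a 32-bit immediate becomes two bytes shorter when preceded by 0x66.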
As shown in FIG. 2, the output 212 of each length decoder 202 includes the instruction byte 134, the instruction length 222, a decoded any prefix indicator 224, a decoded LMP indicator 226, a susceptible to LMP indicator 228, and prefix information 229.
The decoded any prefix indicator 224 is true if the byte decoded by the length decoder 202 corresponds to any x86 prefix (whether or not it is an LMP); otherwise it is false.
The decoded LMP indicator 226 is true if the byte decoded by the length decoder 202 corresponds to any x86 LMP, namely the OS prefix (0x66), the AS prefix (0x67), or a REX.W prefix (0x48-0x4F); otherwise it is false.
The susceptible to LMP indicator 228 is false if the byte decoded by the length decoder 202 is an opcode byte whose instruction length is not affected by an LMP (for example, the OS prefix is mandatory for some SIMD instructions and therefore cannot change their length); otherwise it is true.
The prefix information 229 comprises a plurality of bits that indicate whether the instruction byte has the value of one of the various x86 prefixes. These bits are similar to the accumulated prefix information 238 shown in FIG. 3. However, the prefix information 229 output by a length decoder 202 indicates only a single prefix, namely the prefix value of the single corresponding instruction byte decoded by that length decoder 202. In contrast, because the ripple logic 204 accumulates the prefix information 229 provided by all the length decoders 202, the accumulated prefix information 238 indicates all the prefixes present in the instruction.
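A minimal sketch of the two per-byte prefix indicators follows. It classifies a byte by value alone, without knowing where the byte sits in its instruction. The prefix values are the standard x86 encodings; the function name `classify` and the `long_mode` flag are assumptions for the example.

```python
LEGACY_PREFIXES = {
    0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65,  # segment overrides
    0x66, 0x67,                          # operand size (OS), address size (AS)
    0xF0, 0xF2, 0xF3,                    # LOCK, REPNE, REP
}

def classify(byte, long_mode=True):
    """Return (decoded_any_prefix, decoded_lmp) for a single byte value.

    REX prefixes (0x40-0x4F) exist only in 64-bit mode; outside it these
    values are ordinary opcodes, so long_mode gates them here.
    """
    is_rex = long_mode and 0x40 <= byte <= 0x4F
    any_prefix = byte in LEGACY_PREFIXES or is_rex
    lmp = byte in (0x66, 0x67) or (long_mode and 0x48 <= byte <= 0x4F)
    return any_prefix, lmp
```

Because the classification looks only at the value, `classify(0x67)` reports an LMP even when the byte is really a displacement or immediate byte; resolving that is left to later logic.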
As shown in FIG. 2, the output 214 of each ripple logic block 204 includes the instruction byte 134, a start bit 232, an end bit 234, a valid bit 236, and accumulated prefix information 238. The output 214 of each ripple logic block 204 is also fed to the next adjacent ripple logic block 204. In one embodiment, the 16 ripple logic blocks 204 are organized as four logic blocks, each of which processes four instruction bytes and their associated information. Each ripple logic block 204 also outputs the corresponding instruction byte.
The start bit 232 is true if the byte processed by the ripple logic 204 is the opcode byte of an instruction (that is, the first byte of the instruction that is not a prefix byte). The instruction formatter 106 increments a pointer past all the prefix bytes so that, when the pointer points to a non-prefix byte, it points to the opcode byte of the instruction.
The end bit 234 is true if the byte processed by the ripple logic 204 is the last byte of an instruction; otherwise it is false.
Beginning with the first of the 16 valid bits 236 output by the ripple logic 204 and continuing until the first unprocessed LMP is encountered, each valid bit 236 is true.
The accumulated prefix information 238 is shown in FIG. 3 and discussed above. The control logic 208 uses the accumulated prefix information 238 together with the valid bits 236 to decide whether to use the default operand and address sizes 252 or to modify them.
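The start and end bit definitions above can be made concrete with a small sequential sketch. It assumes the prefix flags have already been resolved correctly and that a length (opcode through last byte) is known at each opcode position; the real ripple logic works in parallel on speculative inputs. The function name `ripple_marks` is hypothetical.

```python
def ripple_marks(is_prefix, length_from_opcode):
    """Compute per-byte start and end bits for one line of instruction bytes.

    is_prefix[i] is True if byte i is a prefix byte. length_from_opcode[i]
    is the byte count from the opcode at i through the instruction's last
    byte; it is only read at opcode positions.
    """
    n = len(is_prefix)
    start = [False] * n
    end = [False] * n
    pos = 0
    while pos < n:
        if is_prefix[pos]:
            pos += 1          # step the pointer past prefix bytes
            continue
        length = length_from_opcode[pos]
        if length < 1:
            raise ValueError("opcode byte needs a length of at least 1")
        start[pos] = True     # start bit marks the opcode byte
        last = pos + length - 1
        if last >= n:
            break             # instruction continues in the next line
        end[last] = True      # end bit marks the last byte
        pos = last + 1
    return start, end
```

In a six-byte example with one prefix byte followed by a two-byte and a three-byte instruction, the start bits land on bytes 1 and 3 and the end bits on bytes 2 and 5.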
Note that the outputs 212 of the length decoders 202 are speculative. That is, they are generated without knowledge of where the associated instruction byte lies within its instruction. In particular, the prefix-related indicators 224/226/228/229 are generated on the assumption that the byte is a valid prefix, and that assumption may be wrong. A byte may happen to have the value of a prefix and yet actually be, for example, a displacement byte whose value equals that of an LMP. For example, 0x67 is the value of the AS prefix, which is an LMP. However, an address displacement byte, an immediate data value byte, a Mod R/M byte, or an SIB byte of an instruction is not a prefix byte but may have the value 0x67. Only when all the LMPs in the current block of instruction bytes have been processed is it certain that the outputs 212 and 214 for all bytes in the block are correct.
If no LMP is decoded in any instruction byte of the XIBQ 104 entry in the current clock cycle, the L stage generates the ripple logic 204 outputs 214 (in particular the start bits 232 and end bits 234) for the entire entry in a single clock cycle. If one or more LMPs are decoded in the current XIBQ 104 entry, the number of clock cycles required to generate ripple logic 204 outputs 214 with correct start bits 232 and end bits 234 is N+1, where N is the number of instructions in the current XIBQ 104 entry that have at least one LMP. The L stage does this regardless of how many prefixes any instruction in the entry has, as shown in the flowchart of FIG. 4. The control logic 208 includes state that indicates which bytes of the current block of instruction bytes have been processed and which have not. This state enables the control logic 208 to generate the valid bit 236 and the operand and address sizes 218 for each instruction byte. Processing a block of instruction bytes that contains instructions with LMPs is iterative: in the first clock cycle, the instruction length 222, start bit 232, and end bit 234 of the first instruction that contains an LMP may be incorrect; in the next clock cycle, the instruction length 222, start bit 232, and end bit 234 of that first instruction and of any adjacent instructions that do not contain an LMP will be correct; and in each subsequent clock cycle, those of the next instruction that contains an LMP and of its adjacent instructions that do not contain an LMP will be correct. In one embodiment, the state comprises a 16-bit register that indicates whether each associated instruction byte has been processed.
[Marking start and end bytes for instructions that contain an LMP]
FIG. 4 illustrates the operation of the microprocessor 100 of FIG. 1. Flow begins at step 402.
In step 402, the control logic 208 outputs the default operand and address sizes 218 to the length decoders 202. Flow proceeds to step 404.
In step 404, during the first phase of the clock cycle, the length decoders 202 decode the instruction bytes of the bottom entry of the XIBQ 104 according to the operand and address sizes 218 provided by the control logic 208 and generate their outputs 212. As described above, for each instruction byte of the bottom entry of the XIBQ 104, the output 212 of the length decoder 202 includes the instruction length 222 and the prefix-related indicators 224/226/228/229 (FIG. 2). Flow proceeds to step 406.
In step 406, during the second phase of the clock cycle, the ripple logic 204 generates its outputs 214 from the outputs 212 of the length decoders 202. As described above, the outputs 214 of the ripple logic 204 include the start bit 232, the end bit 234, the valid bit 236, and the accumulated prefix information 238 (FIG. 3). Flow proceeds to step 408.
In step 408, the control logic 208 examines the outputs 214 of the ripple logic 204 to determine whether any instruction in the bottom entry of the XIBQ 104 still contains an unprocessed LMP (length-modifying prefix). If so, flow proceeds to step 412; otherwise, flow proceeds to step 414.
In step 412, the control logic 208 updates its internal state and the operand and address sizes according to the accumulated prefix information 238 provided by the ripple logic 204. Flow returns to step 404, where the instruction bytes of the bottom entry are processed again using the new operand and address sizes.
In step 414, the control logic 208 determines that the instruction bytes of the bottom entry have been fully processed, shifts the entry out of the XIBQ 104, and sends it to the M stage together with the ripple logic 204 output 214 corresponding to each instruction byte 134. Specifically, as described above, the outputs 214 of the ripple logic 204 include the start bits 232 and end bits 234, which mark the boundaries of each instruction within the instruction stream provided by the instruction cache 102. This enables the M and F stages of the instruction formatter 106 to process the instruction stream further and place individual instructions into the FIQ (formatted instruction queue) 108 for processing by the instruction translator 112. Flow ends at step 414.
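The loop through steps 404 to 414 can be modeled with a short cycle-counting sketch. It abstracts away the actual decoding and only tracks which instructions in the entry still have an unprocessed LMP; the function name `l_stage_cycles` and the list-of-booleans input are assumptions made for illustration.

```python
def l_stage_cycles(instr_has_lmp):
    """Count clock cycles the L stage spends on one XIBQ entry.

    instr_has_lmp has one boolean per instruction in the entry, True if the
    instruction contains at least one length-modifying prefix.
    """
    pending = [i for i, has_lmp in enumerate(instr_has_lmp) if has_lmp]
    cycles = 0
    while True:
        cycles += 1            # steps 404 and 406: decode, then ripple
        if not pending:        # step 408: no unprocessed LMP remains
            return cycles      # step 414: entry is shifted out to the M stage
        pending.pop(0)         # step 412: apply the sizes for the first
                               # instruction with an unprocessed LMP, retry
```

An entry with no LMP takes one cycle, and an entry in which two instructions carry an LMP takes three, consistent with the N+1 figure given in the text.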
As described above, if the instruction bytes contain no LMP (length-modifying prefix), the L stage generates the start bits 232 and end bits 234 for an entire XIBQ (x86 instruction byte queue) 104 entry in a single clock cycle. If one or more instructions in the XIBQ 104 entry have an LMP, the number of clock cycles required to generate the start bits 232 and end bits 234 becomes N+1, where N is the number of instructions in the XIBQ 104 entry that contain at least one LMP. The L stage achieves this regardless of how many prefixes an instruction contains.
[Prefix accumulation] for efficiently processing instructions that contain multiple prefix bytes
The x86 architecture allows an instruction to contain from 0 to 14 prefix bytes. This makes it difficult for the front end of the pipeline to process the instruction byte stream, and in the past, instructions containing a large number of prefix bytes incurred timing penalties. In the Intel® 64 and IA-32 Architectures Optimization Reference Manual (March 2009, page 12-5), Intel states of the ATOM microarchitecture: "Instructions with more than three prefixes will generate an MSROM transfer, causing a two-clock-cycle delay in the front end." Another study, The microarchitecture of Intel and AMD CPU's by Agner Fog, Copenhagen University College of Engineering (last updated May 5, 2009, page 93; available for download at www.agner.org/optimize/microarchitecture.pdf), states: "Instructions with multiple prefixes take extra time to decode. The instruction decoder on the P4 can handle only one prefix per clock cycle. On the P4, an instruction with multiple prefixes takes one clock cycle per prefix to decode," and "The instruction decoder on the P4E can handle two prefixes per clock cycle. Thus, an instruction with up to two prefixes can be decoded in a single clock cycle, while an instruction with three or four prefixes is decoded in two clock cycles. This capability was added to the P4E because many instructions in 64-bit mode have two prefixes (for example, an operand size prefix and a REX prefix)."
In contrast, embodiments of the invention can process all of the prefix bytes that the architecture allows in an instruction (up to 14) without incurring additional delay, regardless of the number of prefix bytes, provided the prefixes are not LMPs (length-modifying prefixes). If a prefix is an LMP, each instruction containing one or more LMPs incurs one additional clock cycle of processing time, as described above. The embodiments achieve this because length decoders 202 generate prefix information 229, and ripple logic 204 accumulates the prefix information 229 onto the opcode byte of the instruction to produce accumulated prefix information 238, as described in detail below.
Fig. 5 is a block diagram of part of the L stage and of the M stage (mux stage) of instruction formatter 106 of Fig. 1. The M stage includes a mux queue 502. In one embodiment, mux queue 502 has four entries, each storing 16 bytes. The next empty entry of mux queue 502 receives the corresponding output 214 of ripple logic 204 (Fig. 2), which includes the instruction bytes 134, start bits 232, end bits 234, and accumulated prefix information 238.
The M stage also includes M-stage control logic 512, which receives the start/end bits 232/234 from the bottom entry of mux queue 502 and (in one embodiment) from the first ten bytes of the next-to-bottom entry (NTBE) of mux queue 502. Based on the start/end bits 232/234, M-stage control logic 512 controls three sets of mux logic: I1 mux 504, I2 mux 506, and I3 mux 508. I1 mux 504 outputs a first instruction I1 524 to the F stage of instruction formatter 106; I2 mux 506 outputs a second instruction I2 526 to the F stage; and I3 mux 508 outputs a third instruction I3 528 to the F stage. In addition, M-stage control logic 512 outputs three valid indicators 534/536/538 that indicate whether the corresponding first, second, and third instructions 524/526/528 are valid. The M stage can thus extract up to three formatted instructions from the instruction stream and provide them to the F stage in a single clock cycle. In other embodiments, the M stage can extract and provide more than three formatted instructions to the F stage in a single clock cycle. Each of the three instructions 524/526/528 comprises the corresponding instruction bytes 134 with the prefix bytes replaced by the corresponding accumulated prefix information 238. In other words, each instruction 524/526/528 comprises the opcode byte, the remaining instruction bytes, and the accumulated prefix information 238. Each mux 504/506/508 receives information 214 (excluding the start bits 232 and end bits 234) from the bottom entry of mux queue 502 and (in one embodiment) from the first ten bytes of the NTBE of mux queue 502, in order to select and output its instruction 524/526/528.
Fig. 6 is a flowchart of the operation of the elements of microprocessor 100 shown in Fig. 5 in extracting instructions (up to three in one embodiment) from the instruction byte stream without incurring a delay, regardless of the number of prefix bytes in an instruction. As described above, ripple logic 204 accumulates prefix information 229 onto the opcode byte of an instruction to produce accumulated prefix information 238. Flow begins at step 602.
In step 602, during the first phase of a clock cycle, length decoders 202 decode the stream of instruction bytes 134 to produce output 212 (Fig. 2), in particular prefix information 229, in a manner similar to step 404. Flow proceeds to step 604.
In step 604, during the second phase of the clock cycle, ripple logic 204 uses prefix information 229 to determine which byte of each instruction in the stream is the opcode byte (that is, the first non-prefix byte). Ripple logic 204 also accumulates the prefix information 229 of all the prefix bytes of the instruction (up to 14) onto the opcode byte to produce accumulated prefix information 238. Specifically, ripple logic 204 begins accumulating prefix information 229 at the first prefix byte of an instruction and continues byte by byte until it detects the opcode byte. At that point it stops accumulating, so that the accumulated prefix information 238 of the current instruction does not carry over into the next instruction. Ripple logic 204 then begins accumulating prefix information 229 again at the first prefix byte of the next instruction and stops at its opcode byte. This process repeats for each instruction in the stream. Ripple logic 204 uses the other outputs 212 of length decoders 202 to perform the accumulation. For example, as described above, ripple logic 204 uses instruction length 222 to determine the first byte of each instruction, which may be a prefix byte, in order to begin the accumulation. Ripple logic 204 also uses the other information 224/226/228 to determine the position of the opcode byte (indicated by start bit 232), which is the first byte of an instruction that has no prefixes, and the position of the last byte of the instruction (indicated by end bit 234). Flow proceeds to step 606.
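The accumulation performed in step 604 can be modeled in software as follows. This is a minimal sketch under stated assumptions: the `DecodedByte` record, the `ripple` function, and the bit-mask encoding of prefix information are hypothetical and only stand in for the length decoder outputs and ripple logic described above; the real logic operates on all bytes of an entry in parallel rather than in a loop.

```python
from dataclasses import dataclass

@dataclass
class DecodedByte:
    first: bool        # first byte of an instruction (may be a prefix byte)
    is_prefix: bool    # byte looks like a prefix byte
    prefix_bits: int   # hypothetical bit-mask for this prefix, 0 if none

def ripple(decoded):
    """Return one (start_bit, accumulated_prefix) pair per byte.

    The start bit is set on the opcode byte (first non-prefix byte), and
    the OR of all preceding prefix bits is attached to that same byte.
    """
    out = []
    acc = 0
    seeking_opcode = False
    for b in decoded:
        if b.first:                  # new instruction: restart accumulation
            acc = 0
            seeking_opcode = True
        start, attached = False, 0
        if seeking_opcode:
            if b.is_prefix:
                acc |= b.prefix_bits     # keep accumulating prefix info
            else:
                start, attached = True, acc  # opcode byte found
                seeking_opcode = False       # stop; nothing carries over
                acc = 0
        out.append((start, attached))
    return out
```

For a stream holding an instruction with two prefix bytes followed by a one-byte instruction, the sketch marks the third and fourth bytes as start bytes and attaches the combined prefix mask only to the third.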
In step 606, the instruction bytes 134 and their corresponding start/end bits 232/234 and accumulated prefix information 238 are loaded into the next available entry of mux queue 502. In one embodiment, steps 602, 604, and 606 are performed in a single clock cycle (assuming the instructions contain no LMP (length-modifying prefix)). Flow proceeds to step 608.
In step 608, during the next clock cycle, M-stage control logic 512 controls muxes 504/506/508 to extract up to three instructions. In other words, the M stage extracts instructions without additional delay regardless of the number of prefix bytes. Once muxed, each of the instructions 524/526/528 is fed to the F stage. Specifically, the M stage extracts the opcode byte and subsequent bytes of each instruction along with its accumulated prefix information 238. The F stage decodes instructions 524/526/528 according to instruction type, certain possible exception conditions, pairability, and other characteristics in order to begin translating them. The F stage and instruction translator 112 make use of the accumulated prefix information 238. Flow ends at step 608.
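The extraction in step 608 can be illustrated with a short software model. It is a sketch only: the function name `extract`, its list-based inputs, and the integer prefix mask are assumptions made for illustration, whereas the described M stage uses parallel mux logic.

```python
def extract(stream, start_bits, end_bits, acc_prefix, limit=3):
    """Extract up to `limit` formatted instructions.

    Each result is (non-prefix bytes from the opcode byte to the end byte,
    accumulated prefix information stored with the opcode byte).
    Prefix bytes never appear in the result because the start bit
    marks the opcode byte, not the first prefix byte.
    """
    instructions = []
    start = None
    for i in range(len(stream)):
        if start_bits[i]:
            start = i
        if end_bits[i] and start is not None:
            instructions.append((bytes(stream[start:i + 1]), acc_prefix[start]))
            start = None
            if len(instructions) == limit:
                break
    return instructions
```

Because the prefix bytes are skipped rather than stripped one at a time, the amount of work in this model does not depend on how many prefix bytes precede each opcode byte.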
This embodiment differs from conventional designs. As described above, ripple logic 204 is more complex than conventional logic: the start bit 232 it generates points to the opcode byte of the instruction rather than, as is conventional, to the first byte of the instruction (which may be a prefix byte), and it generates accumulated prefix information 238. As a result, instructions can be extracted without delay regardless of the number of prefix bytes (except for LMPs (length-modifying prefixes), as described above). By contrast, the conventional approach marks the actual first byte of the instruction as the first byte, so if the instruction contains prefix bytes, a prefix byte is marked as the first byte of the instruction. When an instruction contains multiple prefix bytes, conventional mux logic incurs a delay in order to strip the prefix bytes.
[Early release of cache data when only part of an instruction is present] using start/end marks
Fig. 7 is a block diagram of part of instruction formatter 106 of Fig. 1. In Fig. 1, instruction cache 102 provides instruction bytes 132 to XIBQ 104. In one embodiment, instruction formatter 106 includes pre-decode logic (not shown) that pre-decodes the instruction bytes 132 from instruction cache 102; the pre-decoded information is loaded into XIBQ 104 together with the instruction bytes 132. Instruction formatter 106 includes XIBQ control logic 702, which controls the loading and shifting out of XIBQ 104 entries.
Length decoders 202 and ripple logic 204 (Fig. 2) receive instruction bytes 134 from XIBQ 104 and produce output 214, which is provided to mux queue 502 of Fig. 5 and to M-stage control logic 512 of instruction formatter 106. M-stage control logic 512 controls the loading and shifting out of mux queue 502 entries. Mux queue 502 provides the information 214 in its entries to muxes 504/506/508 and to M-stage control logic 512, which in turn controls muxes 504/506/508, as described above.
A problem arises when: (1) the bottom entry of XIBQ 104 contains valid instruction bytes but the NTBE does not; (2) only part of an instruction (for example, its first or second byte) is in the bottom entry; and (3) that partial instruction does not provide enough information for length decoders 202/ripple logic 204 to determine the instruction length 222 (and the start/end bits 232/234), that is, some of the instruction's bytes are in the NTBE. For example, suppose the start bit 232 of byte 15 (the last byte) of the XIBQ 104 bottom entry is true and the value of that byte is 0x0F. In x86 instructions, a first non-prefix byte of 0x0F indicates an extended opcode, so the instruction type must be determined from the bytes that follow. In other words, the instruction length cannot be determined from the 0x0F byte alone (in some cases, bytes up to the fifth byte may be needed to determine the instruction length). However, it may take some time for instruction cache 102 to provide the next line of cache data to XIBQ 104; for example, a miss may occur in instruction cache 102 or in the instruction translation lookaside buffer (TLB). A scheme is therefore needed that proceeds without waiting for the remaining instruction bytes. Moreover, in some cases microprocessor 100 must have the instructions that precede the unknown-length instruction, so if those instructions are not processed, microprocessor 100 will wait indefinitely. A way of proceeding is therefore needed.
Fig. 8 is a flowchart of the operation of the part of instruction formatter 106 shown in Fig. 7. Flow begins at step 802.
In step 802, XIBQ control logic 702 detects that an instruction at the end of the bottom entry of XIBQ 104 spans into the next line of the instruction cache data stream, that the portion of the instruction in the XIBQ 104 bottom entry is insufficient for length decoders 202/ripple logic 204 to determine the instruction length (and the start/end bits 232/234), and that the subsequent instruction bytes needed to determine the instruction length are not yet in the XIBQ 104 NTBE, that is, the XIBQ 104 NTBE is invalid or empty. Flow proceeds to step 804.
In step 804, M-stage control logic 512 loads the output 214 that ripple logic 204 generated for the XIBQ 104 bottom entry into mux queue 502. However, the bottom entry of XIBQ 104 is not shifted out, because the end bit 234 of the unknown-length instruction still has to be determined. In other words, the bytes of the unknown-length instruction that are in the XIBQ 104 bottom entry must be retained so that the instruction length and end bit can be determined when the remaining bytes of the instruction arrive in XIBQ 104. Flow proceeds to step 806.
In step 806, the output 214 loaded in step 804 reaches the bottom entry of mux queue 502. M-stage control logic 512 then extracts all of the instructions and sends them to the F stage, except for the unknown-length instruction. However, M-stage control logic 512 does not shift out the bottom entry of mux queue 502, because the end bit 234 of the unknown-length instruction is not yet known and the remaining bytes of the instruction are not yet available. M-stage control logic 512 knows that the unknown-length instruction exists because the instruction has no valid end bit 234. That is, a valid start bit 232 points to the first byte of the instruction, but no valid end bit 234 points to a byte of the bottom entry of mux queue 502, and the NTBE is invalid. Flow proceeds to step 808.
In step 808, M-stage control logic 512 stalls mux queue 502 until the NTBE is filled with valid output 214. Flow proceeds to step 812.
In step 812, XIBQ 104 finally receives a line of instruction bytes 132 from instruction cache 102 and loads it into the NTBE. This line of instruction bytes 132 includes the remaining bytes of the unknown-length instruction. Flow proceeds to step 814.
In step 814, length decoders 202/ripple logic 204 generate the instruction length 222 and start/end bits 232/234 for the unknown-length instruction. In one embodiment, XIBQ control logic 702 uses the instruction length 222 to calculate the number of remaining bytes of the unknown-length instruction (the bytes located in the NTBE loaded into XIBQ 104 in step 812). This remaining byte count is used in step 818 below to determine the position of the end bit 234. Flow proceeds to step 816.
In step 816, XIBQ control logic 702 shifts out the bottom entry. However, M-stage control logic 512 does not load the ripple logic 204 output 214 corresponding to that bottom entry, because it was already placed in mux queue 502 in step 804. Flow proceeds to step 818.
In step 818, length decoders 202/ripple logic 204 process the new XIBQ 104 bottom entry (that is, the cache data received in step 812), and M-stage control logic 512 loads the ripple logic 204 output 214, which includes the end bit 234 of the unknown-length instruction, into the NTBE of mux queue 502. Flow proceeds to step 822.
In step 822, M-stage control logic 512 extracts the unknown-length instruction (and any other instructions that can be extracted) from the bottom entry and NTBE of mux queue 502 and sends them to the F stage. Flow proceeds to step 824.
In step 824, M-stage control logic 512 shifts out the bottom entry of mux queue 502. Flow ends at step 824.
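The behaviour of the M stage in steps 806 and 808 can be sketched as follows. This is an illustrative model only: the function `m_stage_early_release` and its boolean-list inputs are assumptions, and the limit of three instructions per clock cycle is ignored for brevity.

```python
def m_stage_early_release(start_bits, end_bits, ntbe_valid):
    """Model of steps 806/808 for one bottom entry of the mux queue.

    Returns (complete, stall):
      complete - (start, end) byte positions of instructions that can be
                 sent to the F stage now
      stall    - True if an instruction has a start bit but no end bit
                 and the NTBE is not yet valid, so the entry must be kept
    """
    complete = []
    open_start = None
    for i, (s, e) in enumerate(zip(start_bits, end_bits)):
        if s:
            open_start = i
        if e and open_start is not None:
            complete.append((open_start, i))
            open_start = None
    unknown_length_pending = open_start is not None
    stall = unknown_length_pending and not ntbe_valid
    return complete, stall
```

In the 0x0F example above, an instruction ending at byte 14 is released immediately, while the instruction that starts at byte 15 keeps the entry in place until the NTBE becomes valid.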
As described above, instruction formatter 106 of this embodiment solves the foregoing problem: even when some of the information associated with the XIBQ (x86 instruction byte queue) 104 bottom entry is not yet available, it releases the information (instruction bytes, start/end bits, and accumulated prefix information) from the L stage as early as possible for the instructions whose information is available.
[Prefix accumulation] to improve instruction extraction
Fig. 9 is a detailed block diagram of mux queue 502 of Fig. 5. In the embodiment of Fig. 9, mux queue 502 has four entries: the bottom entry (BE), the next-to-bottom entry (NTBE), the second-from-bottom entry (SFBE), and the third-from-bottom entry (TFBE). Each entry of mux queue 502 holds 16 bytes, and each byte position stores an instruction byte together with its start bit 232, end bit 234, and accumulated prefix information 238. As shown, the bytes of the BE are labeled 0 to 15 and the bytes of the NTBE are labeled 16 to 31; these labels are also used in Fig. 10. The bytes of the SFBE are labeled 32 to 47.
Fig. 10 is a block diagram of part of the M stage of instruction formatter 106 of Fig. 1. Fig. 10 shows the accumulated prefix array 1002 and the instruction byte array 1004 of mux queue 502. The information in accumulated prefix array 1002 and instruction byte array 1004 is actually stored in the BE and NTBE of mux queue 502. However, the mux queue 502 information is provided over wires to the select circuitry (dynamic logic in one embodiment), which comprises muxes 504/506/508 of Fig. 5. Fig. 10 shows only I1 mux 504, but I2 mux 506 and I3 mux 508 receive the same inputs as I1 mux 504. The instruction muxes 504/506/508 are 16:1 muxes. As shown in Fig. 10, the inputs of I1 mux 504 are labeled 0 to 15. Each input of I1 mux 504 receives eleven instruction bytes and the accumulated prefix information 238 corresponding to the lowest-order byte of the eleven instruction bytes received. The lowest-order byte is the byte of instruction byte array 1004 whose byte number corresponds to the input number of I1 mux 504. For example, input 8 of I1 mux 504 receives bytes 8 to 18 of mux queue 502 (that is, bytes 8-15 of the BE and bytes 16-18 of the NTBE) and the accumulated prefix information 238 corresponding to byte 8. I1 mux 504 receives eleven instruction bytes because, although an x86 instruction may be up to 15 bytes long, it contains at most eleven non-prefix bytes. The embodiments described above extract and pass only the non-prefix bytes to the rest of the pipeline (that is, they strip the prefix bytes and replace them with accumulated prefix information 238), which greatly reduces the decoding workload of subsequent pipeline stages and lets microprocessor 100 realize various benefits.
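The wiring of the 16:1 instruction mux inputs described above can be expressed compactly. The following is a minimal sketch; the function name `mux_input` and the list-based representation of the BE, NTBE, and accumulated prefix array are illustrative assumptions.

```python
def mux_input(be, ntbe, acc_prefix, n, width=11):
    """Data presented at input `n` (0..15) of an instruction mux.

    be, ntbe   - 16-byte bottom entry and next-to-bottom entry
    acc_prefix - accumulated prefix information for each BE byte
    Returns the eleven bytes n..n+10 of the concatenated entries and the
    accumulated prefix information of the lowest-order byte (byte n).
    """
    if not 0 <= n <= 15:
        raise ValueError("mux input number must be in 0..15")
    window = (be + ntbe)[n:n + width]
    return window, acc_prefix[n]
```

With this layout, input 15 reaches bytes 15 to 25, so the eleven-byte window never extends beyond the first ten bytes of the NTBE.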
Fig. 11 is a block diagram of M-stage control logic 512 of Fig. 5. M-stage control logic 512 includes a 2:1 mux 1114 that generates instruction length LEN1 1122, the length of the instruction passing through instruction formatter 106 as first instruction I1 524 of Fig. 5. Instruction length LEN1 1122 is piped down the pipeline and processed together with first instruction I1 524. Mux 1114 selects either the output of subtractor 1102 or the output of adder 1116, depending on whether a partial-length condition existed in the previous clock cycle. Mux 1114 is controlled by a register 1118 that stores a bit indicating whether a partial-length condition existed in the previous clock cycle, as described in detail with respect to Figs. 12 to 14. If the partial-length condition occurred, mux 1114 selects the output of adder 1116; otherwise, mux 1114 selects the output of subtractor 1102. The first input of adder 1116 is the remaining length of the instruction, denoted remaining LEN1 1106, which is described in detail with respect to Figs. 12 to 14. M-stage control logic 512 also includes other logic (not shown) that calculates remaining LEN1 1106 from the end bit 234 of first instruction I1 524, which mux queue 502 provides to M-stage control logic 512. The second input of adder 1116 is the partial length of the current instruction, denoted partial LEN 1104, which is supplied by a register loaded in the previous clock cycle, as described in detail with respect to Fig. 12. Subtractor 1102 subtracts the byte position within mux queue 502 of the end bit 234 of the previous instruction (END0 1112) from the byte position within mux queue 502 of the end bit 234 of first instruction I1 524 (END1 1108). It should be noted that although Fig. 11 shows M-stage control logic 512 performing arithmetic operations, M-stage control logic 512 need not use conventional adders/subtractors and may instead be implemented with combinational logic. For example, in one embodiment the bits are operated on in decoded form; for instance, the subtraction can be performed with Boolean AND-OR operations. The subtractors (not shown) used to calculate the lengths of second instruction I2 526 and third instruction I3 528 are similar to the subtractor for first instruction I1 524, but subtract END1 from END2 and END2 from END3, respectively. Finally, the current offset into the mux queue 502 entry is determined by selecting the byte after the last byte of the last instruction muxed out by muxes 504/506/508.
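The selection made by mux 1114 can be modeled arithmetically. This is a sketch under the assumption that ordinary integer arithmetic is acceptable for illustration; as noted above, the hardware may use decoded-form combinational logic instead. The function and parameter names are hypothetical.

```python
def len1(partial_condition, end1, end0, partial_len, remaining_len1):
    """Length of the first instruction I1, following the Fig. 11 data flow.

    partial_condition - bit held in register 1118 (partial length seen in
                        the previous clock cycle)
    end1, end0        - byte positions of the end bits of I1 and of the
                        previous instruction
    partial_len       - prefix bytes already shifted out (partial LEN)
    remaining_len1    - bytes of I1 still present in the mux queue
    """
    subtractor_out = end1 - end0               # subtractor 1102
    adder_out = partial_len + remaining_len1   # adder 1116
    return adder_out if partial_condition else subtractor_out  # mux 1114
```

For instruction b of Fig. 13, a partial length of 14 and a remaining length of 1 give a total length of 15; for an instruction with no partial-length condition, the length is simply the distance between consecutive end bytes.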
Fig. 12 is a flowchart of the operation of part of the M stage of instruction formatter 106 of Fig. 1. Flow begins at step 1201.
In step 1201, a new clock cycle begins, and M-stage control logic 512 examines the BE and NTBE (Fig. 9) of mux queue 502. Flow proceeds to step 1202.
In step 1202, M-stage control logic 512 controls muxes 504/506/508 to send the instructions in the BE, and possibly the NTBE, of mux queue 502 to the F stage of instruction formatter 106. As described above, in one embodiment the M stage can extract three instructions per clock cycle. Because an x86 instruction can be from one to 15 bytes long, the bottom entry of mux queue 502 may hold from one to 16 x86 instructions. Consequently, multiple clock cycles may be needed to extract all the instructions of the BE of mux queue 502. Furthermore, depending on whether the last byte of the BE is a prefix byte, an end byte, or another type of byte, an instruction may span the BE and NTBE; M-stage control logic 512 therefore operates differently when extracting instructions and shifting out the BE of mux queue 502, as described in detail below. In addition, M-stage control logic 512 calculates the length of each instruction extracted and sent; in particular, it uses the logic of Fig. 11 to calculate the length of first instruction I1 524 (instruction length LEN1 1122 of Fig. 11). If a partial-length condition existed in the previous clock cycle (described in detail in step 1212), M-stage control logic 512 uses the stored partial LEN 1104 to calculate instruction length LEN1 1122; otherwise, it uses subtractor 1102 (Fig. 11). Flow proceeds to step 1204.
In step 1204, M-stage control logic 512 determines whether all instructions that end in the BE have been sent to the F stage. In one embodiment, the M stage can extract and send at most three instructions to the F stage per clock cycle. Therefore, if the M stage extracts three instructions from the bottom entry and the start bit 232 of at least one more instruction is still in the bottom entry, the additional instruction must be extracted in the next clock cycle. If all instructions that end in the BE have been sent to the F stage, flow proceeds to step 1206; otherwise, flow proceeds to step 1205.
In step 1205, M-stage control logic 512 does not shift out the BE, so that in the next clock cycle it can extract and send more instructions from the BE. Flow returns to step 1201 for the next clock cycle.
In step 1206, M-stage control logic 512 determines whether the last byte of the BE is a prefix byte or a non-prefix byte. If the last byte of the BE is a non-prefix byte, flow proceeds to step 1216; if it is a prefix byte, flow proceeds to step 1212.
In step 1212, M-stage control logic 512 calculates the partial length of the instruction whose prefix bytes occupy the end of the BE, that is, the number of prefix bytes from the end byte of the previous instruction through the last byte (byte 15) of the BE. This calculation is performed by arithmetic logic (not shown) of M-stage control logic 512. For example, in Fig. 13 the partial length of instruction b is 14. The prefix bytes between an end byte and the next start byte are in a "no-man's land"; they are effectively redundant in mux queue 502 because their content is already present in the accumulated prefix information 238, which is stored in mux queue 502 with the opcode byte of the instruction. Thus, if the last byte of the BE is a prefix byte and all other instructions in the BE have been extracted in this clock cycle, M-stage control logic 512 can shift out the BE (step 1214), because the prefix bytes are redundant (their information is accumulated onto the opcode byte in the next 16-byte line of the stream) and M-stage control logic 512 saves the number of prefix bytes shifted out of mux queue 502 (in partial-length register 1104 of Fig. 11). On the other hand, if the last byte of the BE is a non-prefix byte that has not yet been extracted or sent, M-stage control logic 512 cannot shift it out of mux queue 502 (see step 1222). Flow proceeds to step 1214.
In step 1214, M-stage control logic 512 controls mux queue 502 to shift out the BE. Flow returns to step 1201 for the next clock cycle.
In step 1216, M-stage control logic 512 determines whether the last byte of the BE is the end byte of an instruction, that is, whether its end bit 234 is true. If so, flow proceeds to step 1214; otherwise, flow proceeds to step 1218.
In step 1218, M-stage control logic 512 determines whether the NTBE is valid. If the end byte of the last instruction extracted is the last byte (byte 15) of the BE, or the instruction spans into the NTBE and the NTBE is valid, M-stage control logic 512 shifts out the BE; otherwise, it retains the BE until the next clock cycle. If the NTBE is valid, flow proceeds to step 1214; otherwise, flow proceeds to step 1222.
In step 1222, M-stage control logic 512 does not shift out the BE. This is because the actual bytes (that is, the non-prefix bytes) of the instruction span the BE and NTBE, and the NTBE is invalid. In this case, M-stage control logic 512 cannot determine the instruction length, because the end bit 234 of the instruction cannot be obtained from the invalid NTBE. Flow returns to step 1201 for the next clock cycle, to wait for the NTBE to be filled with valid data.
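The decisions of steps 1204 through 1222 can be summarized as a small decision function. It is a sketch only: the function name, the four boolean inputs, and the string results are illustrative assumptions that condense the flowchart of Fig. 12.

```python
def be_shift_decision(all_sent, last_is_prefix, last_is_end, ntbe_valid):
    """Decide what to do with the bottom entry (BE) at the end of a cycle.

    all_sent       - every instruction ending in the BE went to the F stage
    last_is_prefix - byte 15 of the BE is a prefix byte
    last_is_end    - byte 15 of the BE has its end bit set
    ntbe_valid     - the next-to-bottom entry holds valid data
    """
    if not all_sent:
        return "hold"                     # step 1205
    if last_is_prefix:
        return "save_partial_and_shift"   # steps 1212 and 1214
    if last_is_end:
        return "shift"                    # step 1216 to step 1214
    if ntbe_valid:
        return "shift"                    # step 1218 to step 1214
    return "hold"                         # step 1222
```

In this model the only case in which a fully drained BE is retained is a non-prefix last byte that belongs to an instruction continuing into an invalid NTBE.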
Fig. 13 shows the contents of mux queue 502 of Fig. 5 in two consecutive clock cycles to illustrate the operation of the M stage. The first set of mux queue 502 contents is for the first clock cycle, cycle 0, and the second set is for the second clock cycle, cycle 1. Only the bottom three entries are shown. In Fig. 13, "S" denotes a start byte (start bit 232 is true), "E" denotes an end byte (end bit 234 is true), and "P" denotes a prefix byte (as indicated by accumulated prefix information 238). Four instructions are denoted a, b, c, and d, and their start, end, and prefix bytes are shown. The byte numbers shown correspond to Fig. 9, for example bytes 0 to 47, which lie in the BE, NTBE, and SFBE of mux queue 502.
At the beginning of cycle 0, byte 1 of the BE holds the end byte Ea of instruction a, and bytes 2 to 15 of the BE hold 14 prefix bytes Pb of instruction b. Because instruction b begins in the BE but its start byte is in the NTBE rather than the BE, its partial length is calculated as 14 bytes. The contents of the NTBE and SFBE are invalid; that is, the x86 instruction byte queue 104 and length decoders 202/ripple logic 204 have not yet provided the cache data of the instruction stream and its associated information (for example, start bits 232, end bits 234, and accumulated prefix information 238) to any entry other than the BE.
During cycle 0, M-stage control logic 512 examines the contents of the BE and NTBE (step 1201 of Fig. 12) and sends instruction a to the F stage (step 1202). M-stage control logic 512 also calculates the length of instruction a, which equals the difference between the end byte position of instruction a and the end byte position of the previous instruction. Then, because all instructions that end in the BE (instruction a) have been sent (step 1204) and the last byte (byte 15) of the BE is a prefix byte (step 1206), M-stage control logic 512 calculates the partial length of instruction b as 14 bytes and stores it in the partial LEN 1104 register (step 1212). Finally, M-stage control logic 512 shifts the BE out of mux queue 502 (step 1214).
Because the BE was shifted out in step 1214 during cycle 0 and the ripple logic 204 output 214 for another 16 bytes of the stream was shifted in, at the beginning of cycle 1 the BE contains: the start byte (Sb) and end byte (Eb) of instruction b at byte 0 (that is, instruction b has only a single non-prefix byte); five prefix bytes (Pc) of instruction c at bytes 1 to 5; the start byte (Sc) of instruction c at byte 6; the end byte (Ec) of instruction c at byte 8; the start byte (Sd) of instruction d at byte 9; and the end byte (Ed) of instruction d at byte 15.
During cycle 1, the M-stage control logic 512 examines the contents of the BE and NTBE (step 1201) and sends instructions b, c, and d to the F stage (step 1202). In addition, the M-stage control logic 512 computes the following: the length of instruction b (LEN1 1122) (step 1202), which is 15 bytes in this example and equals the partial LEN 1104 (fourteen bytes in this example) plus the remaining length of instruction b (one byte in this example); the length of instruction c (eight bytes in this example), which equals the difference between the end byte position of instruction c and the end byte position of instruction b; and the length of instruction d (seven bytes in this example), which equals the difference between the end byte position of instruction d and the end byte position of instruction c. Furthermore, because all instructions ending in the BE (instructions b, c, and d) have been sent (step 1204), the last byte of the BE (byte 15) is not a prefix byte (step 1206), and the last byte of the BE is an end byte (step 1216), the M-stage control logic 512 shifts the BE out of the mux queue 502 (step 1214).
According to the embodiment shown in Figure 13, by accumulating the prefix information 238 of instruction b onto its opcode and by storing the partial LEN 1104 of instruction b, the instruction formatter 106 can shift the BE containing the prefix bytes of instruction b out of the mux queue 502 and then extract and send up to three instructions in the next clock cycle. Without the accumulated prefix information 238 and the stored partial LEN 1104, this would not be possible (that is, instructions c and d could not be extracted and sent in the same cycle as instruction b, but would have to wait for a later clock cycle). Keeping the functional units of the microprocessor supplied with enough instructions to process can improve the utilization of the microprocessor 100 resources.
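The length arithmetic of the Figure 13 example can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not the hardware implementation; the function names `partial_len_after` and `lengths_in_line` are hypothetical, and the sketch assumes that the instruction preceding the line ended on the last byte of the prior line.

```python
LINE_BYTES = 16  # bytes per queue entry in the source's examples


def partial_len_after(last_end):
    """Bytes left in the line after the last end byte (assumed to all be prefix bytes)."""
    return LINE_BYTES - 1 - last_end


def lengths_in_line(end_positions, partial_len=0):
    """Lengths of the instructions whose end bytes lie at end_positions (ascending).

    Assumes the previous instruction ended on the last byte of the prior line,
    so the first instruction's bytes in this line start at byte 0.
    """
    lengths = []
    prev_end = -1
    for index, end in enumerate(end_positions):
        length = end - prev_end
        if index == 0:
            length += partial_len  # add the prefix bytes counted before the shift
        lengths.append(length)
        prev_end = end
    return lengths


# Cycle 0 of Figure 13: instruction a ends at byte 1, bytes 2..15 are prefixes of b.
partial = partial_len_after(1)
# Cycle 1 of Figure 13: end bytes of b, c and d at bytes 0, 8 and 15.
print(partial, lengths_in_line([0, 8, 15], partial))  # 14 [15, 8, 7]
```

The stored partial length is what lets the first line be released early: once it is saved, the length of instruction b can be completed from the next line alone.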
Figure 14 shows the contents of the mux queue 502 of Figure 5 over two consecutive clock cycles to illustrate the operation of the M stage. The example of Figure 14 is similar to that of Figure 13; however, the positions of the instructions and the timing with which they enter and leave the mux queue 502 differ.
At the start of cycle 0, the BE holds the end byte (Ea) of instruction a at byte 1 and 14 prefix bytes (Pb) of instruction b at bytes 2 through 15. Because instruction b starts in the BE but its start byte is in the NTBE, the partial LEN 1104 is calculated as 14. The NTBE contains: the start byte (Sb) and end byte (Eb) of instruction b at byte 16 (that is, excluding its prefix bytes, instruction b is a single byte); five prefix bytes (Pc) of instruction c at bytes 17 through 21; the start byte (Sc) of instruction c at byte 22; the end byte (Ec) of instruction c at byte 27; three prefix bytes (Pd) of instruction d at bytes 28 through 30; and the start byte (Sd) of instruction d at byte 31. The SFBE contains the end byte (Ed) of instruction d at byte 41 and the start byte (Se) of instruction e at byte 42.
During cycle 0, the M-stage control logic 512 examines the contents of the BE and NTBE (step 1201 of Figure 12) and sends instruction a to the F stage (step 1202). In addition, the M-stage control logic 512 computes the length of instruction a, which equals the difference between the end byte position of instruction a and the end byte position of the previous instruction. Next, because all instructions ending in the BE (instruction a) have been sent (step 1204) and the last byte of the BE (byte 15) is a prefix byte (step 1206), the M-stage control logic 512 computes the partial length of instruction b as fourteen bytes and stores it in the partial LEN 1104 register (step 1212). Finally, the M-stage control logic 512 shifts the BE out of the mux queue 502 (step 1214).
Because step 1214 shifted the BE out during cycle 0, cycle 1 begins with the BE holding the contents that the NTBE held in cycle 0 and the NTBE holding the contents that the SFBE held in cycle 0.
During cycle 1, the M-stage control logic 512 examines the contents of the BE and NTBE (step 1201) and sends instructions b, c, and d to the F stage (step 1202). In addition, the M-stage control logic 512 computes the following: the length of instruction b (LEN1 1122) (step 1202), which is 15 bytes in this example and equals the partial LEN 1104 (fourteen bytes in this example) plus the remaining length of instruction b (one byte in this example); the length of instruction c (11 bytes in this example), which equals the difference between the end byte position of instruction c and the end byte position of instruction b; and the length of instruction d (fourteen bytes in this example), which equals the difference between the end byte position of instruction d and the end byte position of instruction c. Furthermore, because all instructions ending in the BE (instructions b, c, and d) have been sent (step 1204), the last byte of the BE (byte 15) is not a prefix byte (step 1206), the last byte of the BE is not an end byte (step 1216), and the NTBE is valid (step 1218), the M-stage control logic 512 shifts the BE out of the mux queue 502 (step 1214).
According to the embodiment shown in Figure 14, the instruction formatter 106 can extract and send out, in a single clock cycle, three instructions containing up to 40 instruction bytes, as shown in Figure 15.
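As a check on the Figure 14 numbers, the three lengths and their total can be worked out from the byte positions given in the example (positions use the cycle-0 numbering, in which the NTBE occupies bytes 16 through 31 and the SFBE bytes 32 through 47). The notation below is only illustrative:

```latex
\begin{aligned}
\mathrm{len}(b) &= \underbrace{14}_{\text{partial LEN 1104}} + \underbrace{1}_{\text{byte 16}} = 15 \\
\mathrm{len}(c) &= 27 - 16 = 11 \\
\mathrm{len}(d) &= 41 - 27 = 14 \\
\mathrm{len}(b) + \mathrm{len}(c) + \mathrm{len}(d) &= 15 + 11 + 14 = 40
\end{aligned}
```

The total of 40 bytes matches the maximum quoted for three instructions extracted in one clock cycle.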
[Detection, marking, and accumulation of bad branch predictions for fast processing of the instruction stream]
Referring again to Figure 1, when the fetch unit 126 outputs the current fetch address 142 to fetch a line of instruction bytes from the instruction cache 102 and provide it to the XIBQ 104, the BTAC 128 also receives the current fetch address 142. If the current fetch address 142 hits in the BTAC 128, a branch instruction was previously executed at that fetch address; the BTAC 128 can therefore predict whether a branch instruction will be taken and, if so, also supplies the predicted target address 146. Specifically, the BTAC 128 makes its prediction before the microprocessor 100 extracts the branch instruction from the instruction byte stream or decodes it. Consequently, the branch instruction predicted by the BTAC 128 may not be present in the fetched cache line of instruction bytes; that is, the BTAC 128 has made a bad prediction, causing the microprocessor 100 to branch erroneously. Note that a bad prediction is not the same as an incorrect prediction. Because program execution is dynamic (for example, the condition code or status data on which a branch instruction depends may change), all branch predictors can inherently mispredict. A bad prediction here, however, means that the BTAC 128 made its prediction for a different cache line, or for the same cache line whose contents have since changed. As described in U.S. Patent 7,134,005, these situations arise for several reasons: the BTAC 128 stores only a partial address tag rather than the full address tag, which causes tag aliasing; the BTAC 128 stores only virtual address tags rather than physical address tags, which causes virtual aliasing; and self-modifying code. When this happens, the microprocessor 100 must ensure that the badly predicted instruction, and the subsequent instructions erroneously fetched because of it, are not sent on.
If the branch taken indication 154 (Figure 1) for an instruction byte is true but the byte is not in fact the first byte of an instruction, as shown in Figure 16, the BTAC 128 has made a bad prediction and caused the microprocessor 100 to branch erroneously. As noted above, a true branch taken indication 154 from the BTAC 128 means that the BTAC 128 considers the instruction byte to be the first byte (that is, the opcode) of a branch instruction, and the fetch unit 126 branches to the predicted target address 146 supplied by the BTAC 128.
One way to detect a bad BTAC prediction is to wait until each instruction has been extracted from the instruction byte stream and its length is known, and then scan the non-first bytes of each instruction to check whether any of their branch taken indications 154 is true. However, this approach is too slow: it requires a great deal of masking and shifting, and the per-byte results must be combined with a logical OR, which creates timing problems.
To avoid these timing problems, embodiments of the invention accumulate the information provided by the branch taken indications 154 as part of the processing performed by the ripple logic 204, and use the accumulated information after the M stage extracts instructions. Specifically, the ripple logic 204 detects the condition and passes an indicator along to the last byte of the instruction, so that only a single byte, the last byte of the instruction, needs to be examined. When the M stage extracts an instruction, it determines whether the instruction is bad, that is, whether the instruction should be included in the instruction stream and continue down the pipeline.
Figure 17 shows the signals that make up the ripple logic 204 output 214. The ripple logic 204 output signals shown in Figure 17 are similar to those of Figure 2, with the addition of a bad BTAC bit 1702 for each instruction byte, described in detail below. In addition, the ripple logic 204 output includes: a signal that, when true, indicates that the corresponding instruction byte is the first byte of a branch instruction predicted by the BTAC 128 but predicted not to be taken; and another signal indicating that the previous byte is the end byte of an instruction.
Figure 18 is a flowchart of the operation of the microprocessor 100 of Figure 1. Flow begins at step 1802.
In step 1802, the BTAC (branch target address cache) 128 predicts that a branch instruction is present in the cache line specified by the current fetch address 142 provided by the fetch unit 126, and that the branch will be taken. The BTAC 128 also predicts the target address 146 of the branch instruction. Consequently, the XIBQ 104 receives a first line of 16 instruction bytes from the instruction cache 102 at the current fetch address 142, and then receives a second line of 16 instruction bytes from the instruction cache 102 at the predicted target address 146. Flow proceeds to step 1804.
In step 1804, the XIBQ 104 stores each branch taken indication 154 (Figure 1) along with its corresponding instruction byte of the two lines received in step 1802. Flow proceeds to step 1806.
In step 1806, the length decoders 202 and the ripple logic 204 process the first line of instruction bytes and detect the condition in which an instruction byte has a true branch taken indication 154 but is not the first byte of an instruction, which is the error condition shown in Figure 16. In other words, the ripple logic 204 knows which bytes of the 16-byte line are first bytes of instructions, because it needs this information to set the end bits 234. Accordingly, the ripple logic 204 checks the true branch taken indications 154 against the first non-prefix byte of each instruction and thereby detects this condition. Flow proceeds to step 1808.
In step 1808, upon detecting a true branch taken indication 154 on a byte that is not the first byte of an instruction, the ripple logic 204 sets the bad BTAC bit 1702 of that instruction byte to true. In addition, the ripple logic 204 propagates the true bad BTAC bit 1702 from that byte position to the remaining bytes of the 16-byte line. Furthermore, if the end byte of the instruction does not appear in the first line of instruction bytes, the ripple logic 204 updates a state (for example, a flip-flop, not shown in the drawings) to indicate that a bad BTAC 128 prediction was detected for an instruction in the current line. Then, when the ripple logic 204 processes the second line of instruction bytes, because the state is true, the ripple logic 204 sets the bad BTAC bit 1702 for every byte of the second line. Flow proceeds to step 1812.
In step 1812, for the first and second lines of instruction bytes, the mux queue 502 stores the ripple logic 204 output 214, including the bad BTAC bits 1702, along with each instruction byte. Flow proceeds to step 1814.
In step 1814, the M-stage control logic 512 finds an instruction byte whose bad BTAC bit 1702 is true and whose end bit 234 is also true (that is, it detects a bad BTAC 128 prediction). The M-stage control logic 512 therefore does not send the offending instruction or any subsequent instruction to the F stage, which it does by clearing the corresponding valid bits 534/536/538. However, if an instruction precedes the offending instruction, that instruction is valid and is sent to the F stage. As noted above, propagating the true bad BTAC bit 1702 to the end byte of the offending instruction lets the M-stage control logic 512 examine only a single byte, namely the byte marked by the end bit 234, which significantly eases the timing constraints. Flow proceeds to step 1816.
In step 1816, the microprocessor 100 invalidates the erroneous entry in the BTAC 128. In addition, the microprocessor 100 flushes the entire contents of the XIBQ 104 and the mux queue 502 and causes the fetch unit 126 to update the current fetch address 142 so that instruction bytes are fetched again from the location where the BTAC 128 made the bad prediction. On the re-fetch, the BTAC 128 does not make the bad prediction because the bad entry has been invalidated; that is, on the re-fetch the BTAC 128 predicts that no branch is taken. In one embodiment, step 1816 is performed in the F stage of the instruction formatter 106 and/or in the instruction translator 112. Flow ends at step 1816.
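The propagation of the bad BTAC bit 1702 (steps 1806 to 1814) can be modeled in software as follows. This is a behavioral sketch under simplifying assumptions, not the circuit: the function names are hypothetical, each line is represented by three lists of 16 booleans (first-byte flag, end bit 234, branch taken indication 154), and prefix bytes are not modeled separately.

```python
def ripple_bad_btac(first, end, taken, carry_in=False):
    """Return (bad_bits, carry_out) for one line of instruction bytes.

    A byte with a taken indication that is not the first byte of an
    instruction sets the bad bit, which then ripples to the rest of the line.
    carry_out models the state that marks every byte of the next line as bad
    when the offending instruction does not end in this line.
    """
    bad_bits = []
    bad = carry_in
    ended_while_bad = False
    for is_first, is_end, is_taken in zip(first, end, taken):
        if is_taken and not is_first:
            bad = True
        bad_bits.append(bad)
        if bad and is_end:
            ended_while_bad = True
    return bad_bits, bad and not ended_while_bad


def sendable_end_positions(end, bad_bits):
    """End positions of instructions that may be sent on (M-stage check).

    Only end bytes are examined; the first bad one blocks itself and
    everything after it.
    """
    sendable = []
    for position, (is_end, is_bad) in enumerate(zip(end, bad_bits)):
        if is_end:
            if is_bad:
                break
            sendable.append(position)
    return sendable


def flags(positions, size=16):
    """Helper: list of booleans that is True at the given byte positions."""
    return [i in positions for i in range(size)]


# Instructions occupy bytes 0-2, 3-7 and 8-15; the BTAC flags byte 5,
# which lies in the middle of the second instruction.
first = flags({0, 3, 8})
end = flags({2, 7, 15})
taken = flags({5})
bad_bits, carry = ripple_bad_btac(first, end, taken)
print(sendable_end_positions(end, bad_bits), carry)  # [2] False
```

In the example only the instruction ending at byte 2 is sent on; the instruction containing byte 5 and everything after it is withheld, which mirrors the clearing of the valid bits 534/536/538 described in step 1814.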
[Efficient determination of x86 instruction length]
Determining the length of an x86 instruction is complicated; it is described in Chapter 2 of the Intel IA-32 Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M. The total instruction length is the sum of: the number of prefix bytes (if any), the number of opcode bytes (1, 2, or 3), whether a ModR/M byte is present, whether a SIB byte is present, the length of the address displacement (if any), and the length of the immediate data (if any). The following characteristics or requirements of x86 instructions affect the determination of length (other than prefixes):
The number of opcode bytes is:
3, if the first two bytes are 0F 38/3A
2, if the first byte is 0F and the second byte is not 38/3A
1, otherwise
Whether a ModR/M byte is present depends on the opcode, as follows:
If the opcode is three bytes, the ModR/M byte is mandatory
If the opcode is one or two bytes, the opcode bytes must be examined
Whether a SIB byte is present depends on the ModR/M byte.
Whether a displacement is present depends on the ModR/M byte.
The displacement size depends on the ModR/M byte and the current address size (AS).
Whether immediate data is present depends on the opcode bytes.
The size of the immediate data depends on the opcode bytes, the current operand size (OS), the current AS, and the REX.W prefix; specifically, the ModR/M byte does not affect the immediate data size.
If there is no ModR/M byte, then there is no SIB byte or displacement.
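The first rule in the list above, the number of opcode bytes, is simple enough to state as code. The sketch below covers only that rule; the function name `opcode_byte_count` is hypothetical, and the two arguments are assumed to be the first two non-prefix bytes of the instruction.

```python
def opcode_byte_count(byte0, byte1):
    """Number of opcode bytes, given the first two non-prefix bytes."""
    if byte0 == 0x0F:
        # 0F 38 and 0F 3A introduce three-byte opcodes; any other 0F xx is two bytes.
        return 3 if byte1 in (0x38, 0x3A) else 2
    return 1


print(opcode_byte_count(0x0F, 0x38))  # 3
print(opcode_byte_count(0x0F, 0xAF))  # 2
print(opcode_byte_count(0x90, 0x00))  # 1
```

Whether a ModR/M byte or immediate data follows still depends on the specific opcode, which would require an opcode table that is not reproduced here.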
For the purpose of determining instruction length, the opcode and ModR/M bytes of an instruction take only five forms:
Opcode
0F + opcode
Opcode + ModR/M
0F + opcode + ModR/M
0F + 38/3A + opcode + ModR/M
Figure 19 is a detailed block diagram of a length decoder 202 of Figure 2. Figure 2 shows 16 length decoders 202; Figure 19 shows a representative length decoder 202, denoted n. As shown in Figure 2, each length decoder 202 corresponds to one byte of the instruction byte stream 134. That is, length decoder 0 corresponds to instruction byte 0, length decoder 1 corresponds to instruction byte 1, and so on through length decoder 15, which corresponds to instruction byte 15. Each length decoder 202 includes a programmable logic array (PLA) 1902, a 4:1 multiplexer 1906, and an adder 1904.
The PLA 1902 receives the address size (AS), operand size (OS), and REX.W value 218 shown in Figure 2. AS denotes the address size, OS denotes the operand size, and the REX.W value indicates the presence of a REX.W prefix. The PLA 1902 also receives its corresponding instruction byte 134 (denoted n) and the next higher-order instruction byte 134 (denoted n+1). For example, PLA 3 1902 receives instruction bytes 3 and 4.
The PLA 1902 generates an immLen value 1916, which is provided to a first input of the adder 1904. The immLen value 1916 is between 1 and 9 inclusive and is the sum of the number of opcode bytes and the size of the immediate data (0, 1, 2, 4, or 8). In determining the immLen value 1916, the PLA 1902 assumes that the two instruction bytes 134 are the first two opcode bytes of an instruction, and generates the immLen value 1916 from the two opcode bytes (or from one opcode byte if the first is not 0F), the address size (AS), the operand size (OS), and the REX.W value 218.
The PLA 1902 generates an eaLen value 1912, which is provided to the multiplexers 1906 of the three next lower-order length decoders 202. The eaLen value 1912 is between 1 and 6 inclusive and is the sum of the number of ModR/M bytes (the PLA assumes that a ModR/M byte is present), the number of SIB bytes (0 or 1), and the displacement size (0, 1, 2, or 4). In determining the eaLen value 1912, the PLA 1902 assumes that the first instruction byte 134 is a ModR/M byte, and generates the eaLen value 1912 from the ModR/M byte and the address size (AS) 218.
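To make the eaLen calculation concrete, the following sketch computes the same sum (ModR/M byte, optional SIB byte, displacement) in software. The patent does not give the PLA equations, so the rules below are taken from the standard IA-32 ModR/M and SIB encoding rather than from this document; the function name `ea_len` and its parameters are hypothetical. The second argument stands for the byte after the ModR/M byte, which matters only when that byte is a SIB byte.

```python
def ea_len(modrm, next_byte, addr_size=32):
    """Bytes used by ModR/M + SIB + displacement, assuming `modrm` is a ModR/M byte.

    addr_size is 16, 32 or 64; 64-bit addressing uses the same ModR/M/SIB
    layout as 32-bit addressing for length purposes.
    """
    mod = modrm >> 6
    rm = modrm & 0b111
    if mod == 0b11:
        return 1  # register operand: no SIB, no displacement
    if addr_size == 16:
        if mod == 0b00:
            disp = 2 if rm == 0b110 else 0  # rm=110 is a bare disp16
        elif mod == 0b01:
            disp = 1
        else:
            disp = 2
        return 1 + disp  # 16-bit addressing has no SIB byte
    sib = 1 if rm == 0b100 else 0
    if mod == 0b00:
        if rm == 0b101:
            disp = 4  # bare disp32 (RIP-relative in 64-bit mode)
        elif sib and (next_byte & 0b111) == 0b101:
            disp = 4  # SIB with base=101 and mod=00 carries a disp32
        else:
            disp = 0
    elif mod == 0b01:
        disp = 1
    else:
        disp = 4
    return 1 + sib + disp


print(ea_len(0xC0, 0x00))      # register form
print(ea_len(0x84, 0x24))      # SIB + disp32, the maximum
print(ea_len(0x06, 0x00, 16))  # 16-bit disp16 form
```

The result always falls between 1 and 6, which agrees with the range stated for the eaLen value 1912.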
One input of the multiplexer 1906 receives a value of zero. The other three inputs of the multiplexer 1906 receive the eaLen values 1912 from the three next higher-order PLAs 1902. The multiplexer 1906 selects one of its inputs to provide as the eaLen value 1918 output, which is in turn provided to a second input of the adder 1904. In one embodiment, to reduce propagation delay, the multiplexer 1906 is omitted and each eaLen value 1912 is fed to the adder 1904 as a tri-state wired-OR signal.
The adder 1904 sums the immLen value 1916 and the selected eaLen value 1918 to produce the final instruction length 222 shown in Figure 2.
The PLA 1902 generates a control signal 1914 to control the multiplexer 1906, based on which of the five forms above it detects, as follows:
1. For the following instruction forms, which have no ModR/M byte, select the zero value:
Opcode only, or
0F + opcode
2. For the following instruction form, select PLA n+1:
Opcode + ModR/M
3. For the following instruction form, select PLA n+2:
0F + opcode + ModR/M
4. For the following instruction form, select PLA n+3:
0F + 38/3A + opcode + ModR/M
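The selection rule above, followed by the addition performed by the adder 1904, can be sketched as below. This is an illustrative model only: the function names are hypothetical, `has_modrm` and `imm_len` are supplied by the caller because deriving them requires an opcode table that the text does not reproduce, and the wrap-around of PLA 15 into the adjacent line is not modeled.

```python
def select_ea_len(n, opcode_bytes, has_modrm, ea_lens):
    """Model of the 4:1 multiplexer of length decoder n.

    ea_lens[i] is the eaLen value produced by PLA i, which assumes that
    instruction byte i is a ModR/M byte.
    """
    if not has_modrm:
        return 0  # forms "opcode" and "0F + opcode": select the zero input
    # The ModR/M byte follows the opcode bytes, so it sits at n+1, n+2 or n+3.
    return ea_lens[n + opcode_bytes]


def instruction_length(imm_len, selected_ea_len):
    """Model of the adder: opcode and immediate bytes plus addressing bytes."""
    return imm_len + selected_ea_len


# Example: 81 84 24 78 56 34 12 EF BE AD DE (ADD r/m32, imm32 with SIB + disp32).
ea_lens = [1] * 16  # placeholder values for PLAs that are not selected
ea_lens[1] = 6      # PLA 1 sees ModR/M 0x84 and SIB 0x24: 1 + 1 + 4
imm_len = 1 + 4     # one opcode byte plus a 4-byte immediate
print(instruction_length(imm_len, select_ea_len(0, 1, True, ea_lens)))  # 11
```

Because every PLA computes its value speculatively and in parallel, the only serial work left per byte position is one multiplexer selection and one addition.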
Figure 20 shows the arrangement of the 16 length decoders 202. PLA 15 1902 receives instruction byte 15 and instruction byte 0 of the previous line, and multiplexer 15 1906 receives the eaLen values 1912 of three PLAs 1902 (not shown) that examine instruction bytes 0/1, 1/2, and 2/3 of the previous line, respectively.
An advantage of having each PLA 1902 examine only two bytes at a time is that it greatly reduces the number of minterms required, and therefore the size of the logic on the die. The design strikes a balance between reducing the number of minterms and keeping the total delay within what the timing requirements allow.
Figure 21 is a flowchart of the operation of the length decoders 202 of Figure 20. Flow begins at step 2102.
In step 2102, for each instruction byte 134 from the XIBQ 104, the corresponding PLA 1902 examines two instruction bytes 134: its own instruction byte 134 and the next instruction byte 134. For example, PLA 3 1902 examines instruction bytes 3 and 4. Flow proceeds to steps 2104 and 2106 concurrently.
In step 2104, each PLA 1902 assumes that the two instruction bytes 134 are the first two opcode bytes of an instruction and generates the immLen value 1916 from these two instruction bytes 134, the operand size (OS), the address size (AS), and the REX.W value. Specifically, the immLen value 1916 is the sum of the number of opcode bytes (1, 2, or 3) and the size of the immediate data (0, 1, 2, 4, or 8). Flow proceeds to step 2114.
In step 2106, each PLA 1902 assumes that the first instruction byte 134 is a ModR/M byte, generates the eaLen value 1912 from the ModR/M byte and the address size (AS), and provides the eaLen value 1912 to the three next lower-order multiplexers 1906. Specifically, the eaLen value 1912 is the sum of the number of ModR/M bytes (1), the number of SIB bytes (0 or 1), and the size of the displacement (0, 1, 2, or 4). Flow proceeds to step 2108.
In step 2108, each multiplexer 1906 receives the zero input and the eaLen values 1912 from the three next higher-order PLAs 1902. For example, the multiplexer 1906 of length decoder 3 receives the eaLen values 1912 from PLAs 4, 5, and 6 1902. Flow proceeds to step 2112.
In step 2112, each PLA 1902 generates the control signal 1914 to its corresponding multiplexer 1906 to select one of the inputs according to the five forms described above. Flow proceeds to step 2114.
In step 2114, each adder 1904 adds the immLen value 1916 to the eaLen value 1918 selected by the multiplexer 1906 to produce the instruction length 222. Flow proceeds to step 2116.
In step 2116, if an LMP is present, the L stage spends one additional clock cycle for each instruction that contains an LMP, as described above with reference to the earlier drawings, particularly Figures 1 through 4.
The foregoing describes only embodiments of the invention and is not intended to limit the scope of the claims. Equivalent changes or modifications made by those skilled in the computer arts without departing from the spirit disclosed by the invention should all fall within the scope of the claims. For example, software can be used to enable the function, fabrication, modeling, simulation, description, and/or testing of the disclosed apparatus and methods. This can be accomplished with programming languages (for example, C or C++), hardware description languages (HDL) including Verilog HDL and VHDL, or other available programs. Such software can be placed in a computer-usable medium, such as a semiconductor, magnetic disk, or optical disc (for example, CD-ROM or DVD-ROM). Embodiments of the disclosed apparatus and methods may be included in an intellectual property core (IP core), such as a microcontroller core (for example, embodied in HDL), and transformed into hardware in the production of integrated circuits. Moreover, embodiments of the disclosed apparatus and methods may be implemented as a combination of hardware and software. Therefore, the scope of the invention is not limited to any of the illustrative embodiments, but should be defined by the claims and their equivalents. Specifically, the invention may be implemented in a microprocessor device that may be used in a general-purpose computer. Finally, those skilled in the art can use the disclosed concepts and specific embodiments as a basis for designing or modifying other structures to achieve the same purposes without departing from the scope of the claims.

Claims (12)

1. An apparatus for a microprocessor, for extracting instructions from a stream of instruction bytes in the microprocessor, the instruction set architecture of the microprocessor having variable-length instructions, the apparatus comprising:
a first queue having a plurality of entries, each entry configured to store a line of instruction bytes received from an instruction cache;
a plurality of decoders configured to generate a corresponding start/end mark for each instruction byte of the lines of instruction bytes of the first queue;
a second queue having a plurality of entries, each entry configured to store a line of instruction bytes received from the first queue and the corresponding start/end marks received from the decoders; and
control logic configured to:
detect a condition in which the length of an instruction is not yet determined because an initial portion of the instruction is in a first line of the lines of instruction bytes of the first queue, while the remainder of the instruction is in a second line of instruction bytes that has not yet been loaded into the first queue from the instruction cache;
in response to detecting the condition, load the first line and the corresponding start/end marks into the second queue without shifting the first line out of the first queue; and
extract, according to the corresponding start/end marks, a plurality of instructions from the first line in the second queue for subsequent processing by the microprocessor, wherein the plurality of extracted instructions does not include the instruction whose length is undetermined.
2. The apparatus of claim 1, wherein the control logic is further configured to:
after extracting the plurality of instructions, refrain from shifting the first line out of the second queue until the instruction whose length was undetermined has been extracted from the second queue.
3. The apparatus of claim 2, wherein the plurality of entries of the second queue includes a bottom entry, and wherein the control logic is configured to:
extract the plurality of instructions from the first line in response to detecting that the first line and the corresponding start/end marks have reached the bottom entry of the second queue.
4. The apparatus of claim 3, wherein the plurality of entries of the first queue includes a bottom entry and a next-to-bottom entry, and wherein the control logic is configured to:
shift the first line out of the bottom entry of the first queue in response to the second line, which contains the remainder of the instruction, being loaded into the next-to-bottom entry of the first queue.
5. The apparatus of claim 4, wherein the plurality of entries of the second queue further includes a next-to-bottom entry, and wherein the control logic is configured to:
in response to the decoders generating the start/end marks corresponding to the second line, load the second line and the corresponding start/end marks into the next-to-bottom entry of the second queue; and
extract the instruction whose length was previously undetermined from the second queue.
6. The apparatus of claim 5, wherein the control logic is further configured to:
shift the first line out of the second queue after the instruction whose length was previously undetermined has been extracted from the second queue.
7. A method for a microprocessor having variable-length instructions, for providing instructions from a stream of instruction bytes supplied by an instruction cache, the microprocessor including a first queue configured to receive a plurality of lines of instruction bytes from the instruction cache; a decoder configured to generate a corresponding start/end mark for each instruction byte of the lines of instruction bytes of the first queue; and a second queue configured to receive the lines of instruction bytes from the first queue and the start/end marks from the decoder, the method comprising:
detecting a condition in which the length of an instruction is not yet determined because an initial portion of the instruction is in a first line of the lines of instruction bytes of the first queue, while the remainder of the instruction is in a second line of instruction bytes that has not yet been loaded into the first queue from the instruction cache;
in response to detecting the condition, loading the first line and the corresponding start/end marks into the second queue without shifting the first line out of the first queue; and
extracting, according to the corresponding start/end marks, a plurality of instructions from the first line in the second queue for subsequent processing by the microprocessor, wherein the plurality of extracted instructions does not include the instruction whose length is undetermined.
8. The method of claim 7, further comprising:
after extracting the plurality of instructions, refraining from shifting the first line out of the second queue until the instruction whose length was undetermined has been extracted from the second queue.
9. The method of claim 8, wherein the second queue includes a bottom entry, the method further comprising:
performing the extracting of the plurality of instructions from the first line in response to detecting that the first line and the corresponding start/end marks have reached the bottom entry of the second queue.
10. The method of claim 9, wherein the first queue includes a bottom entry and a next-to-bottom entry, the method further comprising:
shifting the first line out of the bottom entry of the first queue in response to the second line, which contains the remainder of the instruction, being loaded into the next-to-bottom entry of the first queue.
11. The method of claim 10, wherein the second queue further includes a next-to-bottom entry, the method further comprising:
in response to the decoder generating the start/end marks corresponding to the second line, loading the second line and the corresponding start/end marks into the next-to-bottom entry of the second queue; and
extracting the instruction whose length was previously undetermined from the second queue.
12. The method of claim 11, further comprising:
shifting the first line out of the second queue after the instruction whose length was previously undetermined has been extracted from the second queue.
CN 201010185635 2009-05-19 2010-05-19 Device and method for a microprocessor Active CN101833437B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US17961609P 2009-05-19 2009-05-19
US61/179,616 2009-05-19
US22829609P 2009-07-24 2009-07-24
US61/228,296 2009-07-24
US12/572,024 US8335910B2 (en) 2009-05-19 2009-10-01 Early release of cache data with start/end marks when instructions are only partially present
US12/572,024 2009-10-01

Publications (2)

Publication Number Publication Date
CN101833437A true CN101833437A (en) 2010-09-15
CN101833437B CN101833437B (en) 2013-06-26

Family

ID=42717517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010185635 Active CN101833437B (en) 2009-05-19 2010-05-19 Device and method for a microprocessor

Country Status (1)

Country Link
CN (1) CN101833437B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102446137A (en) * 2010-10-08 2012-05-09 群联电子股份有限公司 Data write-in method, memory controller and memory storage device
US9158697B2 (en) 2011-12-28 2015-10-13 Realtek Semiconductor Corp. Method for cleaning cache of processor and associated processor
US9501397B2 (en) 2010-09-23 2016-11-22 Phison Electronics Corp. Data writing method, memory controller, and memory storage apparatus

Citations (6)

Publication number Priority date Publication date Assignee Title
EP0498654A2 (en) * 1991-02-08 1992-08-12 Fujitsu Limited Cache memory processing instruction data and data processor including the same
US5809272A (en) * 1995-11-29 1998-09-15 Exponential Technology Inc. Early instruction-length pre-decode of variable-length instructions in a superscalar processor
US5948100A (en) * 1997-03-18 1999-09-07 Industrial Technology Research Institute Branch prediction and fetch mechanism for variable length instruction, superscalar pipelined processor
US6209079B1 (en) * 1996-09-13 2001-03-27 Mitsubishi Denki Kabushiki Kaisha Processor for executing instruction codes of two different lengths and device for inputting the instruction codes
CN1625731A (en) * 2002-01-31 2005-06-08 Arc国际公司 Configurable data processor with multi-length instruction set architecture
US20090119485A1 (en) * 2007-11-02 2009-05-07 Qualcomm Incorporated Predecode Repair Cache For Instructions That Cross An Instruction Cache Line

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
EP0498654A2 (en) * 1991-02-08 1992-08-12 Fujitsu Limited Cache memory processing instruction data and data processor including the same
US5809272A (en) * 1995-11-29 1998-09-15 Exponential Technology Inc. Early instruction-length pre-decode of variable-length instructions in a superscalar processor
US6209079B1 (en) * 1996-09-13 2001-03-27 Mitsubishi Denki Kabushiki Kaisha Processor for executing instruction codes of two different lengths and device for inputting the instruction codes
US5948100A (en) * 1997-03-18 1999-09-07 Industrial Technology Research Institute Branch prediction and fetch mechanism for variable length instruction, superscalar pipelined processor
CN1625731A (en) * 2002-01-31 2005-06-08 Arc国际公司 Configurable data processor with multi-length instruction set architecture
US20090119485A1 (en) * 2007-11-02 2009-05-07 Qualcomm Incorporated Predecode Repair Cache For Instructions That Cross An Instruction Cache Line

Cited By (5)

Publication number Priority date Publication date Assignee Title
US9501397B2 (en) 2010-09-23 2016-11-22 Phison Electronics Corp. Data writing method, memory controller, and memory storage apparatus
CN102446137A (en) * 2010-10-08 2012-05-09 群联电子股份有限公司 Data write-in method, memory controller and memory storage device
CN102446137B (en) * 2010-10-08 2015-12-09 群联电子股份有限公司 Method for writing data, Memory Controller and memorizer memory devices
US9158697B2 (en) 2011-12-28 2015-10-13 Realtek Semiconductor Corp. Method for cleaning cache of processor and associated processor
TWI579695B (en) * 2011-12-28 2017-04-21 瑞昱半導體股份有限公司 Method for cleaning cache of processor and associated processor

Also Published As

Publication number Publication date
CN101833437B (en) 2013-06-26

Similar Documents

Publication Publication Date Title
US8769539B2 (en) Scheduling scheme for load/store operations
US4860199A (en) Hashing indexer for branch cache
CN101558388B (en) Data cache virtual hint way prediction, and applications thereof
US6157994A (en) Microprocessor employing and method of using a control bit vector storage for instruction execution
JP6849274B2 (en) Instructions and logic to perform a single fused cycle increment-comparison-jump
CN100495325C (en) Method and system for on-demand scratch register renaming
US8838938B2 (en) Prefix accumulation for efficient processing of instructions with multiple prefix bytes
JPH07334361A (en) Microprocessor device with pipeline for processing of instruction and apparatus for generation of program counter value used in it
CN101529378B (en) A system and method for using a working global history register
JPH0785223B2 (en) Digital computer and branch instruction execution method
CN101002169A (en) Microprocessor architecture
CN104335168A (en) Branch prediction preloading
US5860154A (en) Method and apparatus for calculating effective memory addresses
CN100468323C (en) Pipeline type microprocessor, device and method for generating early stage instruction results
CN102200905A (en) Microprocessor with compact instruction set architecture
CN101833437B (en) Device and method for a microprocessor
US6799266B1 (en) Methods and apparatus for reducing the size of code with an exposed pipeline by encoding NOP operations as instruction operands
CN101535947A (en) Twice issued conditional move instruction, and applications thereof
CN101833436B (en) Device and method suitable for a microprocessor
CN101853148B (en) Device and method adaptive to microprocessor
CN101853151B (en) Device and method adaptive to microprocessor
CN101819517B (en) Device and method suitable for microprocessor
CN101887358B (en) Device and method suitable for a microprocessor
JP2001100997A (en) Parallel processing processor
Kalmath et al. Implementation of 32-bit ISA five-stage pipeline RISC-V processor core

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant