Nothing Special   »   [go: up one dir, main page]

Abstract - The Choice of A Platform, Software, ASIC Or: Ntroduction Algorithm Analysis and Implementation

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Reconfigurable Implementation for the AES Algorithm

Raoel Ashruf, Georgi Gaydadjiev, Stamatis Vassiliadis


Computer Engineering Laboratory,
Electrical Engineering Department,
Delft University of Technology,
Delft, The Netherlands
rmas5, Georgi  @Dutepp0.ET.TUDelft.NL

Abstract— The choice of a platform, software, ASIC or In the early days, ASICs (Application Specific Integrated
FPGA, is driven by several aspects, such as algorithm per- Circuits) where used to implement such systems. Consid-
formance, cost and flexibility. Although ASIC has the high- ering the variety of algorithms and the fact that many new
est performance and the lowest unit cost, it has no flexibil- standards are developed continuously, the next logical step
ity at all. While software has the most flexibility of all, the
was to use reconfigurable hardware for implementing such
performance is very low. The MOLEN architecture, devel-
oped at the Techinical University of Delft, open up new per- algorithms. Substantial amount of work has been reported
spective for this problem. The architecture is based on a on cryptography implementations based on FPGAs (Field
cooperation between a general purpose core processor and Programmable Gate Arrays). Such systems, however, are
reconfigurable hardware, for example a FPGA. Since cryp- idle for the time a new hardware configuration is being
tographic algorithms are relative frequently upgraded, this loaded. This fact is unacceptable for some of the systems,
high flexibility is desperately needed. There are several rea- e.g. transaction verification systems. Architectures con-
sons for upgrading such as when the algorithm is broken
sisting of reconfigurable hardware and one or more general
or there is a better algorithm or even in case where an al-
gorithm independent protocol is needed. To put it briefly purpose processors are considered as a good candidates for
the  -coded processor put a new perspective on great per- speeding up the performance and are capable of reconfig-
formance and high flexibility. In this paper we investigate uration without execution interrupting. The MOLEN  -
several Rijndael (AES) implementations based on the Molen coded processor [3], designed at the Delft University of
 -coded processor. Two types of FPGAs, Xilinx and Altera, Technology is one example for such an architecture. This
are evaluated in order to produce realistic results. In addi- paper concentrates on the speed performance of the AES
tion, a modified simulator based on the SimpleScalar Toolset encryption algorithm mapped on the  -coded processor.
(v3.0)is used to estimate the performance potential. The pro-
The performance results are compared to the ANSI C soft-
filing of the AES code is done based on different method-
ologies. In this paper an analogy is drawn between. The ware implementation of AES as presented in [1]. Since the
VHDL descriptions produced, are enhanced with standard realization of the used architecture was still not finalized at
interface making them reusable for other research projects the time this paper was written, the whole architecture is
based on the same architecture and dealing with similar data simulated in a modified Simple Scalar cycle accurate sim-
processing problems. The AES running on the top of Molen ulator (v3.0) [2].
performs equaly good as other FPGA based solutions. The
The organization of this paper is as follows. Section
main advantage of our solution is that reconfiguration from
one keysize to another is done faster than on pure FPGA im-
II describes the AES algorithm and analyzes different al-
plementations. Our solution is applicable in a wide range of gorithm parts eligible for FPGA implementation. Sec-
devices - staring from mainframes and going down to very tion III introduces the MOLEN  -coded architecture used
limited architectures, e.g. smart card microcontrollers. for the implementation. Section IV shows the perfor-
Keywords— Reconfigurable Processors, VHDL, FPGA, mance numbers of our AES implementation based on the
AES MOLEN processor. Section V concludes the discussion
and presents the future developments to be completed.
I. I NTRODUCTION
II. AES ALGORITHM ANALYSIS AND
Since privacy issues and network security are emerg-
IMPLEMENTATION
ing due to the wide internet penetration, the research
in cryptography and application of cryptographical algo- The Advanced Encryption Standard (AES), also known
rithms is increasing. Not only the algorithm reliability, as the Rijndael algorithm, is a block cipher that can en-
but also the speed performance and implementation flex- crypt data blocks of 128, 192 or 256 bits using symmetric
ibility are considered as major factors for improvement. keys of 128, 192 or 256 bits. The main idea behind in-

169
troducing AES was to replace the DES (and triple DES) the initialization step, the first round will be initiated. The
algorithm used for more than 20 years now. Due to its first round transformation of this step will be the byte sub-
performance, security, efficiency, ease of implementation stitution (ByteSub) where 16 transformed data bytes are
and flexibility Rijndael was chosen for AES standard in the substituted. This is most probably fetching data out of the
year 2001. According to the National Institutes of Stan- memory or inside the register file, which will also take a
dards and Technology a computer that could crack 56-bit limited amount of clock cycles. The next round transfor-
DES key in just one second would need 149 trillion years mation inside the round is ShiftRows. This is a simple re-
to do the same with a 128-bit AES key. Our initial in- ordering of the bytes, which isn’t really explicitly needed
vestigation will only focus on the 128 bit implementation, and can be performed together with the next transforma-
however, the same techniques are applicable also for big- tion, that is MixColumn. Mixcolumn is based on matrix
ger key sizes. multiplication in GF( 465 ), which execution will definitely
cost significant amount of cycles. Profiling of the algo-
A. The algorithm rithm shows that this stage will cost 4948605 processor
Rijndael is designed to use only simple byte (8-bits) op- cycles while encrypting a 4k input data. The last round
erations. This makes Rijndael implementations possible transformation is again simple arithmetic operation not in-
even on very simple microcontrollers. The encryption of a teresting for hardware implementation. The following en-
data block is composed of an initial XOR step with a tem- cryption process is based on repetition of the transforma-
porally key, several round transformations (depending on tions described above. For the deciphering mode only the
the key or block size) and an additional round performed Inverse Mixcolumn (InvMixcolumn) transformation, will
at the end with one round step omitted. In case of 128 bits take significant amount of clock cycles making it a good
block size, there are 9 rounds, each involving the follow- candidate for implementation in hardware. To summarize
ing four transformations (also known as round transforma- it, only the Mixcolumn and the Inverse Mixcolumn trans-
tions):
 formation are illegible candidates for implementing it in
hardware.
  !
C. The hardware implementation of Mixcolum and In-
$#%&(')+*-,
" verse Mixcolumn
 1, 3

.0// /2 The MixColumn transformation is a based on matrix


multiplication in 798:465 . Every 16 input bytes are multi-
For the final step, the Mixcolumn transformation is not plied by the following matrix:
performed. For data decryption, however, different trans- ;< > >6? >A@B>A@ DFEE
<
formations are used for the first three round transforma- < >A4 @C> >6? >A@ E
4
tions, respectively InvByteSub, InvShiftRow and InvMix- >A@C>A@B> >6? (1)
= 4 G
Column. In addition, the round transformations are ap- >6? >A@B>A@B>
plied in an inverse sequence inside one round. This all 4
makes the encryption and the decryption procedures differ- The multiplications are performed in 798IHJ4 5K domain.
ent, enforcing separate performance investigation for both Multiplication by a constant in 798IHJ45 K will result in xor-
directions. ing the bits of the input byte in a particular way. For ex-
ample multiplication by the constant ”03” in 7 8IHJ45 K is
B. Performance analysis depicted in figure 1.
First we analyze each transformations based on how For the inverse Mixcolumn transformation all columns
many execution cycles it would take on the proposed archi- are multiplied by a different matrix, which is depicted be-
tecture. This done to estimate which parts of the algorithm low. Please note, that the matrix elements are denoted as
should be implemented in hardware. As described earlier, hexadecimal values.
;< >ML
the AES algorithm consists of many different transforma-
< > >MN >6O D EE
tions. The first step in ciphering mode is the initialization < >6O E
>ML > >MN
step. In this step the input data is xor-ed with the 128-bit
>MN >6O >ML > (2)
key. Since the proposed architecture will contain 4 ALUs, = G
> >MN >6O >ML
it would just take 1 execution cycle to perform this oper-
ation considering the data is in place. This all makes this This matrix is the inverse matrix of the one used in Mix-
step not interesting for hardware implementation. After column. The primary difference, compared to Mixcolumn,

170
A. The Architecture
The machine organization is shown in figure 2. The In-
struction Buffer is an additional instruction cache, which
stores the instruction that are fetched from the memory.
The arbiter partially decodes these instructions in order to
determine if it should be issued by the core processor (CP)
or the reconfigurable hardware unit. The needed data are
fetched from the general-purpose registers (GPRs) or the
data cache. The results are written back to the same GPRs
or the data cache, while the control register (CR) stores
other status information. The reconfigurable unit consists
of a custom configured unit (CCU) and a  -code unit.
This allows hardware reconfiguration by firmware via an
extension of the classical microcode. There are two new
instructions, which allow partial and complete % reconfigu-
ration of the CCU. These are Y -set and -set instructions.
%
The Y -set instructions configures the CCU only partially,
while the -set instructions does it totally. For actually
executing microcode onZthe#Z%&1 
configured CCU, there is an
extra instruction called . These new instructions
Fig. 1. Multiplication by constant 3 in PRQ SUTVXW
point to the memory location where the reconfiguration or
execution microcode is stored. The sequencer is respon-
sible for the microinstruction execution sequence for the
are the bigger hexadecimal values of the matrix coeffi- CCU, while the p-CONTROL STORE unit only store the
cients. Multiplication by these constant elements of the microcodes. The sequencer uses a residence table to de-
Galois Field leads to more complex dependency. Since termine whether the microcode is already cached in the
more gates are used the Inverse Mixcolumn transforma- Y -CONTROL STORE unit. The table consists the most
tion has a longer critical path compared to the Mixcolumn frequently used translations and keeps track of these trans-
transformation. This is the main property that the decryp- lations.
tion process is more time consuming than the encryption
process.

III. T HE MOLEN  - CODED PROCESSOR

The MOLEN [3] architecture is based on a cooperation


between a general purpose core processor and reconfig-
urable hardware capable of partial reconfiguration. A new
hardware/software co-design methodology forms the ba-
sis of the MOLEN approach. In this methodology,  -
code is utilized that controls the reconfiguration and execu-
tion processes of the reconfigurable hardware (unit). The
MOLEN approach can support an infinite number of im-
plementations on the FPGA structure as long as they fit on
the available FPGA. Consequently, a storage unit (called
 -code unit) is present on-chip that permanently stores
frequently used  codes and temporarily stores less fre-
quently used  -code in order to diminish their loading
times. This all provides a suitable platform for exploring
the hardware/software paradigm to gain the best algorithm Fig. 2. Block Diagram of the MOLEN architecture
performance.

171
TABLE I
T HE SPEED OF THE AES IN SOFTWARE AND ON THE  - CODED PROCESSOR . (BARRING UNFORESEEN CIRCUMSTANCES )

Encryption Decryption
(ANSI C) Software based 68.75 kb/s 82.75 kb/s
 -coded processor with Xilinx FPGA 33.75 kb/s 34.25 kb/s
 -coded processor with Altera FPGA 33.50 kb/s 34.25 kb/s
Gain with the Altera FPGA 51% 59 %
Gain with the Xilinx FPGA 51% 59%

B. The advantage of using this architecture for Mixcolumn on the Altera FPGA was determined on65
The choice of a platform (software, ASICs or FPGAs) is MHz.
driven by several aspects such as algorithm performance, Since the realization of the Molen  -coded processor
costs and flexibility. It is known that ASICs have the high- was still under development, the whole architecture was
est speed performance while its flexibility is very limited. simulated in a modified version of sim-outorder simula-
For software it is quite the opposite. Since cryptographic tor from the Simple Scalar (V3.0) [3] toolset. We as-
algorithms are relative frequently upgraded, not only the sumed that the general purpose processor of the proposed
high flexibility is needed, but also the speed performance machine organization is clocked on 1 GHz. The sim-
is desperately needed. These needs can be fulfilled by a ulator configuration was as follow: 4 integer ALUs, 1
FPGA. The only disadvantage of FPGAs is the unit-cost, integer MULT/Div-unit, 4 FP adders, 1 FP MULT/Div-
which is dependent of the FPGA size. Since the MOLEN unit, 2 memory ports (R/W), 1 mixcolumn-unit and 1
architecture is based on a combination of software and inverseMixcolumn-unit. A 4kbyte text file with a 128-bit
FPGA, we get high flexibility, good performance and re- key in ECB mode [1] was used as a benchmark. With the
duced cost. help of microcode annotation and the synthesis results the
speed performance results were obtained. The normalized
IV. T HE PERFORMANCE RESULTS results of the simulations are presented in table I.
Both the Mixcolumn and the Inverse Mixcolumn trans- V. C ONCLUSION AND FUTURE WORK
formations are simulated in Modelsim (v5.5) and synthe-
sized with Leonardo spectrum (v2001-1a32) CAD tools. In this paper, we have presented the speed performace
The VHDL sources were synthesized for both Altera of the AES on the Molen  -coded processor. The  -
EPEX20k 400FC672 and Xilinx Spartan2 2s15cs144 FP- coded processor allows pieces of the algorithm to execute
GAs devices. The results are presented in table II. directly on hardware. The whole architecture was simu-
lated in a modifed sim-outorder simulator of the Simple
TABLE II Scalar (V3.0) [3]. The obtained speed gain on this archi-
T HE CRITICAL PATH DELAY & USED CLB S OF M IXCOLUMN tecture was 51% for encryption mode and 59% for decryp-
AND I NVERSE M IXCOLUMN tion mode.
When the Molen realization is finalized, the results of
Altera Xilinx this paper are to be vallidated on real hardware. In addition
Mixcolumn 15.24 ns 12.73ns to this, future work will concentrate on speed performance
Inverse Mixcolumn 21.78 ns 14,82ns for other ciphering algorithms on the Molen  -coded pro-
Used CLBs for Mixcolumn 1.35% 6.94% cessor. Anoter important aspect for future research is the
Used CLBs for InvMixcolumn 3.41% 18.23% vulnerability of algorithms on the Molen  -coded proces-
sor.
Due to the bigger hexadecimal values of the matrix co-
efficients an increased amount of CLBs is used for the R EFERENCES
Inverse Mixcolumn implementation compared to Mixcol- [1] J. Daemen and V. Rijmen. The design of rijndael, 2002.
umn. The syntehsis results vallidate the assumption, that [2] M. Austin E. Larson and D. Erns. The simplescalar tool set, version
Inverse Mixcolumn is more compute intensive. Since the 2.0, doug burger. http://www.simplescalar.com
[3] S. Vassiliadis S.Cotofana and S. Wong. The molen [3\ -coded
critical path delays are known, we were able to calculate
processor. In The 11th International Conference on Field Pro-
the maximum frequencies for both Mixcolumn and Inverse grammable Logic and Application (FPL), Augustus 2001.
Mixcolumn. For example, the maximum clock frequency

172

You might also like