Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu
JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 1 Hardware Implementation of four byte per clock RC4 algorithm arXiv:1401.2727v1 [cs.AR] 13 Jan 2014 Rourab Paul, Amlan Chakrabarti, Senior Member, IEEE, and Ranjan Ghosh, Abstract—In the field of cryptography till date the 2-byte in 1-clock is the best known RC4 hardware design [1], while 1-byte in 1-clock [2], and the 1-byte in 3 clocks [3][4] are the best known implementation. The design algorithm in[2] considers two consecutive bytes together and processes them in 2 clocks. The design [1] is a pipelining architecture of [2]. The design of 1-byte in 3-clocks is too much modular and clock hungry. In this paper considering the RC4 algorithm, as it is, a simpler RC4 hardware design providing higher throughput is proposed in which 6 different architecture has been proposed. In design 1, 1-byte is processed in 1-clock, design 2 is a dynamic KSA-PRGA architecture of Design 1. Design 3 can process 2 byte in a single clock, where as Design 4 is Dynamic KSA-PRGA architecture of Design 3. Design 5 and Design 6 are parallelization architecture design 2 and design 4 which can compute 4 byte in a single clock. The maturity in terms of throughput, power consumption and resource usage, has been achieved from design 1 to design 6. The RC4 encryption and decryption designs are respectively embedded on two FPGA boards as co-processor hardware, the communication between the two boards performed using Ethernet. Keywords—Computer Society, IEEEtran, journal, LATEX, paper, template. ✦ 1 I NTRODUCTION R C4 is a widely used stream cipher whose algorithm is very simple. It has withstood the test of time in spite of its simplicity. The RC4 was proposed by Ron Rivest in 1987 for RSA Data Security and was kept as trade secret till 1994 when it was leaked out [5]. Today RC4 is a part of many network protocols, e.g. SSL, TLS, WEP, WPA and many others. There were many cryptanalysis to look into its key weaknesses[5] [6] followed by many new stream ciphers [7] [8] . RC4 is still the popular stream cipher since it is executed fast and provides high security. (Reference needed) There exist hardware implementations of some of the stream ciphers in the literature [9] [10] [11]. Since about 2003 when FPGA technology has been matured to provide cost effective solutions, many researchers started hardware implementation of RC4 as a natural fall out [3] [4]. The FPGA technology turns out to be attractive since it provides soft • • M. Shell is with the Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332. E-mail: see http://www.michaelshell.org/contact.html J. Doe and J. Doe are with Anonymous University. Manuscript received April 19, 2005; revised January 11, 2007. core processor having design specific functional capability of a main processor [12] along with reconfigurable logic blocks that can be synthesized to a desired custom coprocessor, embedded memories and IP cores. One can design RC4 algorithm totally as an executable code for the soft core processor (main processor) only or in custom coprocessor hardware operated by the main processor. Because of the system overhead, any single instruction if executed in the main processor takes at least 3 clocks, while the identical one when executed in a coprocessor takes 2 or 1-clock as the latter is customized to handle the specific task. Besides the clock advantage, the coprocessor based design makes the system throughput faster by another fold since it is executed in parallel with the main processor. In this paper, RC4 algorithm is considered as it is and exploiting conventional VHDL features a design methodology is proposed processing of RC4 algorithm in two different hardware approaches. 1st architecture named as design 1 and design 2, which can execute 1 byte in single clock, the 2nd approach can execute 2 bytes in single clock which is named as design 3 and design 4 and lastly these two architectural approaches (design 2 and design 4) has been accommodated in coprocessor environment named as design 5 and design 6 to increase the system throughput. The said design is implemented in a custom coprocessor functioning in parallel with a main processor (Xilinx Spartan3E XC3S500e-FG320 and Virtex5 LX110t FPGA architecture) followed by secured data communication between two FPGA boards through their respective Ethernet ports each of the two boards performs RC4 encryption and decryption engines separately. The performance of our design in terms of number of clocks proved to be better than the previous works [1], [2], [3], [4]. The dynamic KSA-PRGA architecture(design 3 and design 4) is introduced to reduce system power and resource utilization. 1.1 Existing Work In the year 2004 P. Kitsos et al [3] has tried a 3 byte per clock architecture of RC4 algorithm. At the first clock of PRGA the i and j is computed. In next clock S[i] and S[j] has been extracted from RAM and added. At the same step the addition has been stored in register t. At third clock swapping process has been done as well the value of t register used to address the key S[t]. The swapping and Z calculation is going on in the 3rd clock. When the Z is being computed t can not address the value at the swapped s-box, because swapped s-box will be updated after a clock pulse. This timing distribution of task can give error while S[i] + S[j] = i or j. In 2003 Matthews, Jr. [4] has proposed another 3 byte clock architecture where KSA and PRGA both have 3 clock for each iteration. On 2008 Matthews again proposed a new 1 byte per clock pipelined architecture in article Jr.[11]. Lastly S. Sen Gupta et al has proposed 2 architectures, such that 1 byte per clock [2] and 2 byte per clock [1] architecture. Article [2] is an excellent loop unrolled architecture where 2 byte is executing in 2 consecutive clocks. Article [1] is pipelined architecture of 1 byte clock architecture where 4 byte can be executed in 2 clocks. 1.2 Our Contribution We have proposed 6 different architectures for RC4 algorithm to increase the throughput compromising with resource and power consumption and trying to search this most optimized architecture in terms of throughput, resource usage and power. The main contribution of this paper is described as a various design approaches. Design 1: We have designed a RC4 architecture with 1 byte per clock throughput. This design has not been done using any loop unrolling method. It is a dual edge sensitive clock architecture, incorporate with pipelining concept. Design 2: This design is identical with the design 1 which was consuming more power and more resource to have 1 byte per clock throughput. Keeping these points in mind we were searching some optimizing technique. We noticed that the KSA and PRGA processes are very much identical to each other. In design 1 when the PRGA starts the KSA does not have any significance of occupying the FPGA dice according to the algorithm concern. The existence of both of the PRGA and KSA process causes large silicon area leading with more power consumption but in recent days the FPGA vendors like XILINX, ALTERA does not have any provision to cut down the Vcc voltage line of those MOSFETs which are unnecessary after its execution like KSA process. If the said provision can be accommodated on FPGA platform it can save large power consumption and resources. Recently XILINX has launched a partial reconfiguration tool which can modify hardware in runtime but without this tool we can use some smart trick to modify the KSA process into PRGA dynamically. This means the same hardware can be utilized by PRGA process, previously which was used by KSA process. In this design 2 we have introduced some logic with the algorithm which can transform the KSA process(after its execution) into PRGA process. The design 2 architecture is named as Dynamic KSA-PRGA architecture(DKP architecture). Design 2 has reduced huge power consumption as well as resource usage without compromising of throughput. The results of power consumption and resource usage supporting our claim. Design 3: To increase The throughput of this architecture we used a loop unrolling method which is mainly motivated from [2] but having a unique different hardware approach. This design 3 architecture can execute 2 byte in a single clock which means, comparing with the previous designs, the throughput is doubled without contributing large power and usage. Design 4: Design 4 is a DKP form of design 3 to save the wattage and slice-LUTs. Design 5: To have more throughput we used 4 parallel co-processor where each is coprocessor executing the design 2 architecture, parallel with a main processor. The main processor is responsible for data receiving and transmitting part. The data has been accused from PS2 port, RS232 port and Ethernet port. Only the 4 co-processors can be accommodate for parallel processing, because the data bus of microblaze processor is 32 bit, so coprocessors can communicate only 4 keys(1 key from each co-processor) with main processor at a time. Design 6: In design 6 we used 2 parallel coprocessor where each coprocessor is executing the design 4 architecture, parallel with a main processor. Only the 2 co-processor can be accommodate for parallel processing, because the data bus of microblaze processor is 32 bit, so co-processors can communicate only 4 keys(2 key from each co-processor) with main processor at a time. The other hardwares of design 6 is similar to design 5. January 11, 2007 2 RC4 A LGORITHM RC4 has a S-Box S[N], N = 0 to 255 and a secret key, key[l] where l is typically between 5 and 16, used to scramble the S-Box [N]. It has two sequential processes, namely KSA (Key Scheduling Algorithm) and PRGA (Pseudo Random Generation Algorithm) which are stated below in figure 1. 1.N = 256; 2.for i = 0 to (N-1) //Initialization module 3. S[i] = i; // Identity permutation 4. K[i] = key[i % l]; 5. end for; 6. j = 0; //Storage module 7. for i = 0 to (N-1) 8. j = (j + S[i] + K[i]) % N; 9. swap (S[i], S[j]); 10. end for; 1. 2. 3. 4. 5. 6. 7. 8. 9. N = 256; i = j = 0; while (TRUE)//Generating Key stream Z i = (i + 1) % N; j = (j + S[i]) % N; swap (S[i] , S[j]); t = (S[i] + S[j]) % N output Z = S[t]; end while; PRGA Process KSA Process Fig. 1. RC4 Algorithm 3 H ARDWARE I MPLEMENTATION OF 1BYTE 1- CLOCK DESIGN (D ESIGN 1 AND D ESIGN 2) Figure 2 shows key operations performed by the main processor in conjunction with a coprocessor till the ciphering of the last text character. The hardware for realizing RC4 algorithm comprises of KSA and PRGA units, which are designed in the coprocessor as two independent units, and the XOR operation is designed to be done in the main processor. The central idea of the present embedded system implementation of one RC4 byte in 1-clock is the hardware design of a storage block shown in Figure 3, which is used in the KSA as well as in the PRGA units. The storage block contains a common S-Box connected to dual select MUX-DEMUX combination and executes the swap operation following line 9 of Algorithm 1 and line 6 of Algorithm 2, in order to update the S-Box. The swap operation in hardware is explained in the following sub-section. 3.1 Storage Block Updating the S-Box The storage block consists of a register bank containing 256 numbers of 8-bit data representing the S-Box(register bank), 256:2 MUX, 2:256 DEMUX and 256 D flip-flops. Each of the MUX and DEMUX is so designed, that with 2-select inputs i, j (8 bit width) Start Main processor Co processor in turn are written to the jth and ith locations of the register bank. The updated S-Box is ready during the next falling edge of the same clock pulse, if called for. i Plain text from RS232 Request for Z Hold processor until key stream is acknowledged RC4 Algorithm 8 bit 8 bit CLK Register Bank XOR 8 bit Z S[0] Y0 D-FF 8 bit I0 8 bit S[1] Y1 S[i] Cipher Text D-FF 8 bit Y3 I1 S[2] 8 bit I2 D-FF 8 bit S[3] 8 bit To LAN DUAL SELECTING INPUT DEMUX Receiver UDP Buffer Y4 S[5] Y5 S[j] S[j] 8 bit Y254 Y255 can address the two register data S[i]andS[j] at the same time (The VHDL complier has merged 2 256:1 MUX to design this single 256:2 MUX, The same thing is happened for 2:256 DEMUX). The hardware design of the storage block updating the S-Box is shown in figure 3 . For swapping, the same S-Box is accessed by the KSA unit with its MUX0-DEMUX0 combination and also by the PRGA unit with its MUX2-DEMUX2 combination and it is also accessed by another MUX3 in PRGA for the generation of the key stream Z. The storage block swaps S[i] and S[j] and thereby updates the S-Box. For swapping, S[i] and S[j] ports of MUX are connected to S[j] and S[i] ports of DEMUX respectively. This storage block has thus 3 input ports (i, j and CLK), and 2 inout ports (S[i] of MUX, S[j] of DEMUX and S[j] of MUX, S[i] of DEMUX). The storage block of PRGA unit provides 2 output ports from its 2 inout ports which are fed to an adder circuit with MUX3. During the falling edge of a clock pulse, S[i] and S[j] values corresponding to ith and jth locations of the register bank are read and put on hold to the respective D flip-flops. During the rising edge of the next clock pulse, the S[i] and S[j] values are transferred to the MUX outputs and instantly passed to the S[j] and S[i] ports of the DEMUX respectively and DUAL SELECTING INPUT MUX S[4] 8 bit 8 bit Fig. 2. Functioning of the main processor with the co processor S[i] 8 bit 8 bit Y2 S[254] S[255] I254 D-FF 8 bit 8 bit I255 D-FF 8 bit 8 bit 8 bit j Fig. 3. Storage Block updating the S-Box 3.2 Design of the KSA Unit following Algorithm Figure 4 shows a schematic diagram of design of the KSA unit. Initially the S-Box is filled with identity permutation of i whose values change from 0 to 255 as stated in line 3 of initialization module. The l-bytes of secret key are stored in the K[256] array as given in line 4. The KSA unit does access its storage block with i being provided by a one round of MOD 256 up counter, providing fixed 256 clock pulses and j being provided by a 3-input adder (j, S[i], and K[i]) following the line 8 of the storage module, where j is clock driven, S[i] is MUX0 driven chosen from the S-Box and K[i] is MUX1 driven chosen from the K-array. The S-Box is scrambled by the swapping operation stated in line 9 using MUX0-DEMUX0 combinations in the storage block. The KSA operation takes one initial clock and subsequent 256 clock cycles. 3.3 Design of the PRGA Unit following Algorithm Figure 5 shows a schematic diagram of the design of the PRGA unit. The PRGA unit does Clock One Round MOD 256 Up Counter Clock One Round MOD 256 Up Counter Output i S[j] Clock i S[i] S[j] MUX 1 256:1 Storage Box Clock S[j] S[i] j S[i] S[j] S[i] Clock D-FF MUX 3 256:1 Storage Box K[i] K[i] j Adder (j+S[i]+K[i]) Clock D-FF Adder (j+S[i]) S B O X Adder S[i]+S[j] Fig. 4. Schematic Design of the KSA Unit access the storage block with i being provided by a MOD 256 up counter (line 4) and j being given by a 2-input (j and S[i]) adder following the line 5 where j is clock driven and S[i] is MUX2 driven chosen from the S-Box. With updated j and current i, the swapping of S[i] and S[j] is executed following line 6 using MUX2DEMUX2 combination of the storage block. Following the line 7, S[i] and S[j] give a value of t based on which the key stream Z is selected from the S-Box using MUX 3 (line 8). Fig. 5. Schematic Design of the PRGA Unit CLK Initialize i =0, j =0 0 0 Start Counter i 1=1 j =(j +S[i ]) 1 1 0 %256 temp=i 1 Swap S[temp] & S[j ]1 i =2 2 Z1executed j =(j +S[i ]) 2 2 1 %256 2 temp=i 3.4 Swap S[temp] & S[j ] 2 Z2 executed Continue...... Timing analysis of the PRGA Operation Let φi denotes the ith clock cycle for i ≥ 0. It is assumed that the PRGA unit starts when clock cycle is φ0 . It is observed that a signal value gets updated during a falling edge of a clock cycle if it is changed during the rising edge of the previous clock cycle. The symbol ↔ indicates swap operation. The clock-wise PRGA operations are shown in Figure 6. Timing Analysis of PRGA The MOD 256 up counter shown in Figure 5 is so designed that i starts from 1, goes up to 255 and then it repeats from 0 to 255 for each 256 subsequent clock cycles. Fig. 6. Clock-wise description of PRGA operation • Rising edge of φ1 : j1 =(j0 + S[i1]) % 256; temp=i1 . • Falling edge of φ1 : i2 →2; S[temp] ↔ S[j1 ]; • Rising edge of φ2 : j2 = (j1 + S[i2 ])% 256; temp=i2 , Z1 = (S[i1 ] + S[j1 ]) % 256 = 1st Key. • Falling edge of φ2 : i3 →3; S[temp] ↔S[j2 ]; • Rising edge of φ3 : j3 = (j2 + S[i3 ])%256, temp=i3 , Z2 = (S[i2 ] + S[j2 ]) % 256 = 2nd Key stream. • Rising edge of φ0 : Initialize j0 =0 and i0 =0. • Falling edge of φ0 : Start counter i1 =1. i counter The series continues generating successive keys (Zs). If the text characters are n and n > 254, i = 0 after the first round and the clock repeats (n+2) times. After generating Zn during the rising edge of φn+1 , PRGA stops. For generating Zn, the PRGA requires (n+2) clocks and its throughput per byte is (1+2/n). i S[i] S[i] Storage Box OUTPUT S[j] S[j] Z S[0] S[1] S[2] 3.5 Dynamic KSA-PRGA Architecture (Design 2) If we look into the KSA and PRGA process in Figure 1 we can find both of these processes are almost identical. Line 8,9 of KSA process and line 5, 6 of PRGA process are selfsame, except the extra addend secret key (K[i]) in KSA process. The key execution(Z) is also an additional process in PRGA. In section 3 design we have used two separate sequential blocks for both KSA and PRGA. As these processes are separated and executed sequentially, the algorithm core is contributing few extra static power, and dynamic power with some additional resource usage. To overcome these flaws we are proposing a single process dynamic KSA-PRGA architecture where KSA process itself is transformed to PRGA process after 257 clocks. Due to the reutilization of LUTs, slices and FF-pairs the proposed architecture can reduce significant resource usage which has led to low power design. 3.5.1 Hardware Overview For the dynamic KSA-PRGA architecture few additional hardware has been added and some significant alteration of previous normal structural design has been done. The whole changes has been described below, Instead of using two separated ′ i − counter′ for KSA and PRGA process we have used a single ′ i − counter′ in our new architecture. Previous ′ i − counter′ for KSA is a single round MOD256 up counter and PRGA ′ i − counter′ was continuous MOD-256 up counter which has a counting range of 0 to 255 skipping the initial 0 count in first round. For both of these processes the dynamic architecture uses a simple j MUX 256:1 D-FF prga_en S[254] ADDER1 S[255] S[j] K[i] 0 MUX 2:1 S[i] j S[i] ADDER2 Enable Fig. 7. Dynamic KSA-PRGA continuous MOD-256 up counter which starts counting from count 0 and ends with 255 to finish the entire KSA process. After finishing the KSA process the counter is again reset to 0 count. At this 0 count the PRGA process is initialized (Initialized j by 0) and from the next successive count it starts the key execution process. A signal named as prga enable(prga_en) is employed to detect the KSA process status. When KSA has been completed prga enable has been latched high. During the KSA process prga enable stands 0 logic state to pass the K[i] to Adder1 through the 2:1 MUX but when prga enable goes high, 0 value will be forwarded to Adder1. So, at that time the Adder1 is started to execute (j + S[i]) instead of (j + S[i] + K[i]) as stated in PRGA algorithm. At the same time the prga enable signal enables the Adder2 circuit to execute the Z(keys). These two activation on main hardware block has altered the KSA process to PRGA process. Figure 7 shows the architectural block diagram of dynamic KSA-PRGA. Resource usage and power consumption of dynamic KSA-PRGA architecture has been compared with previous architecture in table ?? and ?? and Figure ?? is the graphical representation of table ?? and ??. i2 TABLE 1 Two unrolled loops i1 8 bit Steps 1 2 1st iteration i1 = i0 + 1 j1 = j0 + S0 [i1 ] 3 4 SwapS0 [i1 ] ↔ S0 [j1 ] Z1 = S1 [S0 [i1 ] + S0 [j1 ]] 2nd iteration i2 = i1 + 1 = i0 + 2 j2 = j1 + S1 [i2 ] = j0 + S0 [i1 ] + S1 [i2 ] SwapS1 [i2 ] ↔ S1 [j2 ] Z2 = S2 [S1 [i2 ] + S1 [j2 ]] 8 bit 8 bit 8 bit CLK Register Bank 8 bit D-FF S[0] Y0 8 bit I0 8 bit S[1] Y1 S[i1] D-FF S[i2] Y3 I1 S[2] 8 bit I2 D-FF S[j1] QUAD SELECTING INPUT DEMUX Y4 8 bit 8 bit S[3] S[i2] 8 bit 8 bit S[i1] 8 bit 8 bit Y2 8 bit QUAD SELECTING INPUT MUX S[4] 8 bit S[j1] S[5] Y5 8 bit 8 bit 4 H ARDWARE I MPLEMENTATION OF 2BYTE 1- CLOCK DESIGN (D ESIGN 3 AND D ESIGN 4) S[j2] 8 bit S[j2] 8 bit 8 bit Y254 Y255 8 bit S[254] D-FF S[255] D-FF Storage Block Updating the S-Box Figure 8 shows a schematic diagram of design of the storage box unit which is identical with figure 3. The only difference between the s-box of 1 byte 1 clock architecture and s-box of 2 byte 1 clock architecture is number of ports. This storage block also consists of a register bank containing 256 numbers of 8-bit data representing the S-Box (register bank), 256:1 MUX, 1:256 DEMUX and 256 D flip-flops. Here two sets of i and j can address 4 s-box element at a time. The value of S[i1 ], S[i2 ], S[j1 ] and S[j2 ] are updating by a 4 input DEMUX followed by a 4 input MUX via Swap Controlling block. 4.2 KSA unit of 2 byte per clock architecture 4.2.1 j1 and j2 Generation of KSA unit In this proposed architecture we need to compute j1 and j2 at the same clock instance. j1 computation circuit is same as the previous KSA architecture shown in 4 but in the j2 circuitry we need to adopt few tricks to compute the j2 at that same clock while j1 is executing. At the j computation j0 has been initiated by 0 value. According to the conventional RC4 algorithm j1 is executed at the step 2 of table 1. In figure 9 Adder7 and Adder 8 is responsible to add j0 +S0 [i1 ]+K[i1 ] to evaluate the value of j1 . For the j2 computation we again need to see 1st column step2 of table 8 bit 8 bit 8 bit I255 8 bit 8 bit 8 bit 8 bit j1 This approach is mainly motivated from reference [saurav senguptas paper] which is based on loop unrolling method. 4.1 I254 j2 Fig. 8. Storage Block updating the S-Box 1 where the j2 computation may be divided into the following two cases j2 = j0 + S0 [i1 ] + S1 [i2 ] + 2keys  j0 + S0 [i1 ] + S0 [i2 ] + 2keys if i2 6= j1 ; j0 + S0 [i1 ] + S0 [i1 ] + 2keys i2 = j1 . (1) It has been noted that the only differentiation from S0 to S1 is the swap process, so it is very obvious to check if i2 is equal to either of i1 or j1 . As i2 never be equal to i1 (differ only by 1 modulo 256) we only need to check i2 and j1 which will decide whether S0 [i2 ] or S0 [i2 ] will be passed as a 3rd operand of addition process. Figure 9 shows the block diagram of j1 and j2 generator. Adder 1 is adding the two successive secret keys which is addressed by i1 and i2 . Adder 2 is summing k[i1 ] + K[i2 ] with j0 and passing this value to Adder3 and adder4. Adder3 and adder4 is adding k[i1 ] + K[i2 ] + j0 with S[i2 ] and S[i1 ] respectively. Adder5 and Adder 6 is responsible to execute j0 + S0 [i1 ] + S0 [i2 ] + K[i1 ] + K[i2 ] and j0 + S0 [i1 ] + S0 [i1 ] + K[i1 ] + K[i2 ]. 4.2.2 Swap Controlling block In line 9 of figure 1 a swapping process is updating s-box in each clock cycle. In 2 byte per clock architecture 2 loop iteration has been unrolled in a single loop iteration. Here two successive swap process should be accumulated in a single iteration. To replace this two j0 i2 K[i1] Comparator1 1 if eql receive S[i1 ], S[i2 ], S[j1 ] and S[j2 ] from MUX and has 4 output to fed the swapped data to DEMUX unit. The Pictorial presentation is depicted on figure 10. Adder8 S[i1] Adder1 Adder7 K[i2] j1 K[i1] K[i1]+K[i2] S[i2] Adder3 One round MOD 256 Up Counter Clock Adder5 Adder2 j0 j0+S[i1]+S[i1]+K[i1]+K[i2] j0+S[i1]+S[i2]+K[i1]+K[i2] i1 j2 i2 2 to 1 MUX1 S[i1] Adder4 S[i1] S[i1] S[i2] S[i2] Adder6 Clock S[i1] Storage Box S[j1] MUX 1 256:1 S[j1] Fig. 9. j1 and j2 genarator of KSA unit TABLE 2 Different cases for the data movement in the swap operation. # Condition 1 i2 6= j1 &j2 6= i1 &j2 6= j1 i2 6= j1 &j2 6= i1 &j2 = j1 2 3 i2 6= j1 &j2 = i1 &j2 6= j1 4 i2 6= j1 &j2 = i1 &j2 = j1 i2 = j1 &j2 6= i1 &j2 6= j1 5 6 7 8 i2 = j1 &j2 6= i1 &j2 = j1 i2 = j1 &j2 = i1 &j2 6= j1 i2 = j1 &j2 = i1 &j2 = j1 Register to register Data movement S0 [i1 ] → S0 [j1 ], S0 [j1 ] → S0 [i1 ], S0 [i2 ] → S0 [j2 ], S0 [j2 ] → S0[i2 ] S0 [i1 ] → S0 [i2 ], S0 [i2 ] → S0 [j1 ] = S0 [j2 ], S0 [j1 ] → S0 [i1 ] S0 [i1 ] → S0 [j1 ], S0 [i2 ] → S0 [i1 ] = S0 [j2 ], S0 [j1 ] → S0 [i2 ] S0 [i1 ] → S0 [i2 ], S0 [i2 ] → S0 [i1 ] = S0 [j1 ] = S0 [j2 ] S0 [i1 ] → S0 [j2 ], S0 [j2 ] → S0 [j1 ] = S0 [i2 ], S0 [j1 ] → S0 [i1 ] S0 [i1 ] → S0 [j1 ] = S0 [i2 ] = S0 [j2 ], S0 [j1 ] → S0 [i1 ] No data Movement Impossible swap processes by a single swap process we need to check all possible relation of i1 , i2 , j1 and j2 . We need to check if i2 and j2 can be equal to i1 or j1 . Among the all possible correlations we can skip the combination of i1 and i2 as i2 = i1 + 1. For the remaining 3 cases, such that i2 with j1 , j2 with i1 and j2 with j1 we need 3 comparator circuits (8 bit) to compare each of their relation in every clock cycle. Table 2 is showing the other remaining combination of of 4 indexes. Following the data transfer of 2 we have designed a hardware behavioral model of swap controlling unit which has 4 input port to S[j2] S[j2] j1 j2 K[i1] K[i2] Clock D-FF j0 j1 j2 generator SWAP Controller Fig. 10. KSA unt of 2 byte per Clock Hardware 4.3 PRGA unit of 2 byte per clock architecture PRGA unit of 2 byte per clock architecture is very much smiler with the PRGA unit of 1 byte per clock architecture. Figure 11 shows a schematic diagram of the design of the PRGA unit. The Counter circuit generates i1 and i2 continuously and addresses S[i1 ] and S[i2 ] at at the same instance. The j1 and j2 is updated by the respective i1 and i2 via j1 and j2 generator and latches the S[j1 ] and S[j2 ]. After generating S[i1 ], S[i2 ], S[j1 ] and S[j2 ], all of these signals has been fed to Swap Controlling Block. This Swap Block has decided which signal will be transfer to which address of S-box through the DEMUX circuit. 4.3.1 j1 and j2 Generation of PRGA unit This is a very simple circuit built by 4 adder blocks, 1 comparator and 1 MUX(2:1). As we seen in figure 12 j1 computed very easily by Adder9 circuit but for the j2 computation we again need to see 2nd column step2 of table 1 where the j2 computation may be divided into Clock TABLE 3 Different cases for the data movement for the Z2 computation. One round MOD 256 Up Counter i1 i2 S[i1] S[i1] Z1 S[i2] S[i2] Clock Storage Box # Condition 1 i2 6= j1 &j2 6= i1 j2 6= j1 i2 6= j1 &j2 6= i1 &j2 = j1 i2 6= j1 &j2 = i1 &j2 6= j1 i2 6= j1 &j2 = i1 &j2 = j1 i2 = j1 &j2 6= i1 &j2 6= j1 i2 = j1 &j2 6= i1 &j2 = j1 i2 = j1 &j2 = i1 &j2 6= j1 Z1 Z2 generator S[j1] S[j1] Z2 S[j2] S[j2] j1 Clock 2 j2 D-FF j0 j1 j2 generator 3 4 SWAP Controller 5 Fig. 11. PRGA unit of 2 byte per Clock Hardware 6 7 the following two cases j2 = j0 + S0 [i1 ] + S1 [i2 ]  j0 + S0 [i1 ] + S0 [i2 ] if i2 6= j1 ; j0 + S0 [i1 ] + S0 [i1 ] i2 = j1 . (2) These two possible values from the two cases has been passed to MUX2 circuit through Adder10 and Adder11. The selecting input of the MUX2 is connected with Comparator5. The j1 and i2 have connected with the input of Comparator5. The Comparator5 takes the decision which value of j2 will be passed to Adder12. Adder12 is computing the final j2 by adding j0 with it. j0 S[i1] j1 Adder9 j1 i2 Comparator 1 if eql 5 S[i2] Register to register Data movement S1 [i2 ] = S0 [i2 ], S1 [j2 ] = S0 [j2 ] S1 [i2 ] = S0 [i2 ], S1 [j2 ] = S0 [i1 ] S1 [i2 ] = S0 [i2 ], S1 [j2 ] = S0 [j1 ] S1 [i2 ] = S0 [i2 ], S1 [j2 ] = S0 [j1 ] S1 [i2 ] = S0 [i1 ], S1 [j2 ] = S0 [j2 ] S1 [i2 ] = S0 [i1 ], S1 [j2 ] = S0 [i1 ] S1 [i2 ] = S0 [i1 ], S1 [j2 ] = S0 [j1 ] 4.3.3 Z1 &Z2 Generator block In step 4 of Table 1, we got Z1 = S1 [i1 ] + S1 [j1 ]. (3) As S0 and S1 has been differentiated by a swap process, the value of Z1 can be evaluated like, Z1 = S0 [j1 ] + S0 [i1 ]. (4) Thus the Z1 can be computed by Adder20 of figure 13 by adding S0 [j1 ] and S0 [i1 ]. The output of Adder20 is connected with a 256:2 MUX2 (Merge of a two 256:1 MUX) which is used to retrieve the appropriate data from s-box. Now the Z2 computation is involved as Adder10 Z2 = S2 [S2 [i2 ] + S2 [j2 ]] = S2 [S1 [j2 ] + S1 [i2 ]]. (5) 2 to 1 MUX2 Adder12 Adder11 j2 j0 S[i1] Fig. 12. j1 and j2 generator of PRGA unit 4.3.2 Swap Controlling block The Swap controlling circuit of PRGA process is identical with the swap circuit of KSA, as described in section 4.2.2. As we used loop unrolling method, we need to jump directly from S0 state to S2 state. Here while we will compute Z2 , a swap of the 1st loop already has been computed so this important issue should be kept in our mind during the Z2 computation. Where as the 2nd swap process could not make effect on Z2 according to equation 5. So Considering all possible cases of i1 , i2 , j1 and j2 we are trying the compute the value of S1 [i2 ] and S1 [j2 ] in terms of S0 state. The table 3 is showing all the details. The circuit of the Z2 computation can be realized using a 8:1 MUX (named as MUX3). 3 comparator circuit is connected with the 3 selecting input of MUX3. Comparator 6,7 and 8 comparing (i) i2 and j1 , (ii) j2 and i1 , (iii) j2 and j1 . The output of MUX3 is connected with the selecting input of MUX4 to address the proper appropriate element of s-box as Z2 key. The hardware description of Z2 computation is shown in figure 13. • Rising edge of φ1 : j1 =(j0 + S0 [i1 ]) % 256; j2 = j0 + S0 [i1 ] + S1 [i2 ], temp=i1 , temp_next=i2 . • Falling edge of φ1 : i3 =3; and i4 = 4, Swap Occurred; • Rising edge of φ2 : j3 = (j2 + S2 [i3 ])% 256; j4 = (j2 + S2 [i3 ] + S3 [i4 ])%256, temp=i3 , temp_next=i4 , Z1 = S1 (S0 [i1 ] + S0 [j1 ]) %256, Z2 = S2 [S1 [i2 ] + S1 [j2 ]]% 256. • Falling edge of φ2 : i5 =5; and i6 = 6, Swap Occurred; • Rising edge of φ3 : j5 = (j4 + S4 [i5 ])% 256; j6 = (j4 + S2 [i5 ] + S5 [i6 ])%256, temp=i5 , temp_next=i6 , Z3 = (S2 [i3 ] + S2 [j3 ])% 256, Z4 = S3 [S2 [i4 ] + S2 [j4 ]% 256. S[i2] S[j2] Adder13 S box S[i2] Adder14 256:2 MUX4 S[i1] 8 to 1 MUX3 Z1 Z2 S[i1] Adder20 S[j1] i2 j1 Comp6 1 if eql j2 i1 Comp7 1 if eql j2 j1 Comp8 1 if eql CLK Fig. 13. Z1 and Z2 generator of PRGA unit 4.4 Timing Analysis of PRGA of 2 byte per clock design Again a MOD 256 up counter shown in figure 11 is designed which is very much identical with the up counter 5 but instead of single output it has two consecutive outputs named as i1 and i2 . That i1 starts from 1, goes up to 255 skipping all the even value between the said range and it repeats again until the plaintext sequence has been stopped. The i2 starts from 2, goes up to 254 skipping all the odd value between the said range and then it repeats again from 0 to 254 skipping the same odd numbers. This means the said counter generates two consecutive counting values at the same clock instance. The details timing analysis has been shown below. • Rising edge of φ0 : Initialize j0 = 0 and i0 =0. • Falling edge of φ0 : Start counter i1 =1 and i2 =2. Start Counter i1 = 1 i2 = 2 Initialize i0 = 0 j0 = 0 i3 = 3 i4 = 4 i5 = 5 i6 = 6 i7 = 7 i8 = 8 Swap Z1 , Z 2 executed j1 = (j0 + S0 [i1 ] %256 j2 = (j1 + S1 [i2 ] %256 Swap Z3 , Z4 executed j3 = (j2 + S2 [i3 ] %256 j4 = (j3 + S3 [i4 ] %256 Swap Z5 , Z6 executed j5 = (j4 + S4 [i5 ] %256 j6 = (j5 + S3 [i6 ] %256 4.5 Dynamic KSA-PRGA Architecture (Design 4) Taking the advantage of selfsame architecture of KSA and PRGA process, the same approach has been adopted in the 2 byte per clock architecture. The KSA process in transformed into PRGA process taking the acknowledgment of ′ prga enable′ (prga_en) signal. Only the counter of said design is slightly different from 1 byte dynamic KSA-PRGA architecture of section 3.5. Two outputs of the said counter i1 and i2 count two consecutive counts starts from count ′ 0′ and ′ 1′ value respectively following an initialization clock of KSA process. At the 129th clock i1 and i2 reaches at count 254 and 255 respectively. At the 130th clocks i1 , i2 and j1 , j2 are initialized by 0 and prga enable goes high for the successive PRGA process. Since swap is not required for this initialization state, swap process has been stopped for this single clock duration. After this initialization again the i1 and i2 counts has been started with ′ 1′ and ′ ′ 2 counts. This consecutive counts is going on while we need key streams form PRGA process. The timing diagram is shown in 4.4 of Storage Block unit, Z1 and Z2 generator has been already discussed and shown in section 4.1 and figure 8, section 4.3.3figure 13 respectively. The swap control unit also has been discussed in section 4.3.2. Clock i counter j0 i2 i1 Comparator 1 if eql i2 Adder8 S[i1] Adder7 K[i1] 2:1 MUX4 S[i1] K[i1] ’0’ S[i1] S[i2] j1 Z1 S[i2] Z1 Z2 generator Storage Box Z2 2 to 1 MUX3 j2 prga_en prga_en ’0’ 2:1 MUX1 K[i1] 2:1 MUX2 S[j1] S[j2] S[j2] K[i1]+K[i2]+j0+S[i1]+S[i1](for KSA)/ j0+S[i1]+S[i2](for PRGA) K[i2] K[i1] S[j1] j1 K[i1]+K[i2]+j0+S[i1]+S[i2](for KSA)/ j0+S[i1]+S[i2](for PRGA) ’0’ K[i2] j2 S[i1] prga_en j1 j2 generator prga_en S[i2] S[i2] j0 K[i] Adder3 Adder5 Adder1 SWAP Controller Adder2 prga_en j0 Adder4 S[i1] Adder6 S[i1] Fig. 15. Dynamic KSA-PRGA of 2 byte per Clock Hardware 5 Fig. 14. j1 and j2 Generator of Dynamic KSAPRGA of 2 byte per Clock Hardware 4.5.1 Hardware overview The hardware of j1 and j2 generator is shown in figure 14. Bu MUX1 and MUX2 the prga enable signal decides whether K[i1 ] + K[i2 ] or ′ 0′ will be passed to Adder1. If prga enable goes low then K[i1 ] + K[i2 ] will be added with signal j0 + S[i1 ] + S[i1 ] (for i2 = j1 ) and with signal j0 +S[i1 ]+S[i2 ] (for i2 6= j1 ). Otherwise ′ 0′ will be added with the said signal for both of the mentioned conditions using the Adder2, Adder3, Adder4, Adder5, Adder6 blocks.Among these two values of j2 , which one will passes as a final j2 output is decided by Comparator circuit which has two input i2 and j1 . The j1 computation is smiler with section 3.5 architecture figure 7. The top level hierarchy of Dynamic KSA-PRGA architecture is shown in figure 15. The functionality and pictorial representation 1 BYTE PER CLOCK ARCHITECTURE AS 4 CO - PROCESSOR (D ESIGN 5) The back ground of this approach is ”how many bytes can be generated in a single clock?”. The more loop unrolling can make the circuit more complex in terms of latency and critical path, so we thought if we can accommodate four RC4 co-processor communicating with main processor to en-cipher 4 plain text character coming from the ethernet (or from other interfaces) at a single clock. 4 co-processor environment has been chose as we have 32 bit main processor and 32 bus availability. Here at an instant only 4 Z can be communicated with main processor through bus. See the figure 16 where 4 separate RC4 Co-processors work breaking the secret keys into same sizes (different size can also be accommodated according to the requirement of user). The four separated RC4 blocks generate 4 keys (Z) at a single pulse. So we can say the architecture can generate 4 bytes at single clock. After Secret Key K1 K0 computing of Zs, they are passed to the main processor to XOR those Zs (keys) with the plain text coming from ethernet. When the ciphering has been done, again those ciphers have been send to the unsecured ethernet environment. clk Co Processor2 Co Processor 1 Synchronization Signal Clock Controll Circuit 6 2 BYTE PER CLOCK ARCHITECTURE AS 2 CO - PROCESSOR (D ESIGN 6) Synchronization Signal Z0 Z1 PLB bus(24 to 31) PLB bus(16 to 23) Z3 Z2 PLB bus(8 to 15) Synchronization Signal PLB bus(0 to 7) Co-Processor Interface Contoller bus_clk Control Signal PLB bus(0 to 31) The same 4 byte per clock throughput can be achieved, if we used two 2 byte per clock architecture instead of using four 1 byte per clock architecture. here in figure 17 two 2 byte per clock architecture can generate 4 keys at a single clock. The hardware usage and power consumption of proposed four different architecture is shown in table 6 and 5. Main Clock C[n-1]=P[n-1] XOR Z0 C[n-2]=P[n-2] XOR Z1 Plain Text from Ethernet Main Processor C[n-3]=P[n-3] XOR Z2 C[n-4]=P[n-4] XOR Z3 Cipher text backs to Ethernet Secret Key K0 clk Co Processor 1 K2 Co processor 2 Co Processor3 Synchronization Signal Z0 PLB bus(24 to 31) Clock Controll Circuit K1 Z1 Synchronization Signal PLB bus(16 to 23) Z2 PLB bus(8 to 15) K3 Synchronization Signal Co-Processor4 Z3 Fig. 17. 2 RC4 co-processor with main processor TABLE 4 Critical path analysis Architectures Synchronization Signal Critical Path (ns) PLB bus(0 to 7) Co-Processor Interface Contoller bus_clk 1 byte per Clock 4.127 2 byte per Clock 5.15 Control Signal PLB bus(0 to 31) Main Clock Plain Text from Ethernet C[n-1]=P[n-1] XOR Z0 C[n-2]=P[n-2] XOR Z1 C[n-3]=P[n-3] XOR Z2 Main Processor C[n-4]=P[n-4] XOR Z3 Cipher text backs to Ethernet Fig. 16. 4 RC4 co-processor with main processor 7 C RITICAL PATH A NALYSIS Here the critical path analysis 8 R ESULTS AND IMPLEMENTATION The four kind of architetcure have been proposed, such as, 1)1 byte per clock, 2)1 byte per clock Dynamic KSA-PRGA(DKP) clock, 3)2 byte per clock. and finally 4)2 byte per clock Dynamic KSA-PRGA(DKP) clock. The hardware usage and power consumption of these proposed four different architectures are shown in table 6 and 5. All of the architecture has been imported in parallel processing environment where main processor handling the data interface part and co-processor are taking care of RC4 algorithm part. We have seen the trade off between throughput, power consumption and throughput, resource usage. We have also observed how the power consumption and resource usage increases with the increasing of the number of co-processor in tables 7, 8, 9, and 10. TABLE 5 Power consumption of Proposed Architectures Milli watt Design 1 design 2 Design 3 Design 4 Design 5 Design 6 Static Power 970.8 914.87 976.34 953.94 1004.74 1016.96 Dynamic Power 206.4 79.85 446.38 109.91 159.92 194.61 Total Power 1177.2 994.72 1422.71 1063.85 1164.66 1211.58 TABLE 7 Hardware usage of 1 byte per clock architecture for different number of co-processor Milli watt Static Power Dynamic Power Total Power Only Algorithm Core 914.87 1 CoProcessor 2 Co-Processor 3 Co-Processor 1000.45 1002.35 1004.12 4 Co-Processor Design 5 1004.74 052.86 83.25 117.39 148.95 159.92 994.72 1083.69 1119.74 1153.07 1164.66 TABLE 8 Power consumption of 1 byte per clock architecture for different number of co-processor 80 Milli watt 70 Slice Registers Slice LUTS Slice LUTF/ F pairs Only Algorithm Core 4139 1 CoProcessor 2 Co-Processor 3 Co-Processor 5602 9740 13842 4 Co-Processor Design 5 17964 12560 13986 26013 38217 49545 4132 14510 26396 38782 49886 60 TABLE 9 Hardware usage of 2 byte per clock architecture for different number of co-processor 50 40 a small bar a medium bar a large bar TABLE 6 Hardware usage of Proposed Architectures # Design Design Design Design Design Design 1 2 3 4 5 6 Slice LUTs 4173 2094 4178 2123 17964 5682 14588 5383 33115 15452 49545 36958 Fully used LUTs FF pairs 4149 2078 4168 2104 49886 37420 All the designs has been implemented on Virtex(ML505, lx110t) and Spartan 3E(XC3S500e) FPGA board using ISE 14.4 and EDK 14,4 tool. The power and resource result is shown for Virtex board. The co processor has been coded by VHDL language and the main processor functionality has projected by system.c language. Two Xilinx FPGA Spartan3E (XC3S500e-FG320) Milli watt Static Power Dynamic Power Total Power Only Algorithm Core 953.94 109.91 1063.85 1 CoProcessor 1011.457 158.48 1173.05 2 Co-Processor Design 6 1016.96 194.61 1211.58 TABLE 10 Power consumption of 2 byte per clock architecture for different number of co-processor Milli watt Slice Registers Slice LUTs Slice LUT-F/ F pairs Only Algorithm Core 2123 15452 2104 1 CoProcessor 3386 16821 17420 2 Co-Processor Design 6 5682 36958 37420 boards, each with RC4 encryption and decryption engines separately and shown in Figure 18, are connected through Ethernet ports and to respective hyper terminals (DTE) through RS 232 ports. A PPENDIX A P ROOF OF THE F IRST Z ONKLAR E QUA TION Appendix one text goes here. A PPENDIX B Appendix two text goes here. ACKNOWLEDGMENTS The authors would like to thank... R EFERENCES Fig. 18. Experimental setup of FPGA based secured data communication [1] 9 [2] C OMPARISON WITH E XISTING D E - SIGNS In table 11 we have compared the throughput of our design with existing work till date. In section 3 we have seen a 1 byte per clock architecture where KSA is executing within 257 clock after an initial lag of 1 clock. The PRGA is computing 1 byte after a lag of 2 initial clock. The KSA process in section 4 named as of 2 byte per clock architecture can iterate 256 loops within 128 clock after an initial clock delay and the PRGA of that said architecture can execute 2 bytes in single clock after a lag of 2 initial clocks. From the table 11 we can see Ref [3], [4] are 3 byte per clock design. Article [11], [2] and design 1, design 2 have same throughput of 1 byte per clock but more sophisticatedly article [2] and design1, design 2 have slightly better throughput as mention in the said table 11 where as article [1] and design 3, design 4 have same throughput of 2 byte per clock. More specifically our proposed design 3 and design 4 has slightly better throughput than [1]. 10 C ONCLUSION The conclusion goes here. [3] [4] [5] [6] [7] [8] [9] S. Gupta, A. Chattopadhyay, K. Sinha, S. Maitra, and B. Sinha, “High-performance hardware implementation for rc4 stream cipher,” Computers, IEEE Transactions on, vol. 62, no. 4, pp. 730–743, 2013. S. Sen Gupta, K. Sinha, S. Maitra, and B. Sinha, “One byte per clock: A novel rc4 hardware,” in Progress in Cryptology - INDOCRYPT 2010, ser. Lecture Notes in Computer Science, G. Gong and K. Gupta, Eds. Springer Berlin / Heidelberg, 2010, vol. 6498, pp. 347– 363, 10.1007/978-3-642-17401-8_24. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-17401-8_24 P. Kitsos, G. Kostopoulos, N. Sklavos, and O. Koufopavlou, “Hardware implementation of the rc4 stream cipher,” in Circuits and Systems, 2003 IEEE 46th Midwest Symposium on, vol. 3, dec. 2003, pp. 1363 – 1366 Vol. 3. D. Jr., in System and method for a fast hardware implementation of RC4, vol. Patent No. 6549622, Campbell, CA„ April, 2003. S. Paul and B. Preneel, “A new weakness in the rc4 keystream generator and an approach to improve the security of the cipher,” in FSE, 2004, pp. 245–259. S. Maitra and G. Paul, “Analysis of rc4 and proposal of additional layers for better security margin,” in Progress in Cryptology - INDOCRYPT 2008, ser. Lecture Notes in Computer Science, D. Chowdhury, V. Rijmen, and A. Das, Eds. Springer Berlin / Heidelberg, 2008, vol. 5365, pp. 27–39, 10.1007/978-3-540-89754-5_3. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-89754-5_3 T. Good and M. Benaissa, “Hardware results for selected stream cipher candidates,” in of Stream Ciphers 2007 (SASC 2007), Workshop Record, 2007, pp. 191–204. G. J. Q. P. Leglise, FX. Standaert, in Efficient implementation of recent stream ciphers on reconfigurable hardware devices, vol. Proc. of the 26th Symposium on Information Theory in the BeneluxâĂİ, Benelux, May 19-20, 2005, p. pp. 261 âĂŞ 268. G. K. N. S. M. Galanis, P. Kitsos and C. Goutis, in Comparison of the Hardware Implementation of stream ciphers, vol. 2, No. 4, Int. Arab J. of Information Technology, May 19-20, 2005, p. pp. 267 âĂŞ 274. TABLE 11 Comparison of various performance metrics with existing designs Features KSA KSA per byte PRGA for N byte PRGA per byte RC4 for N byte Per byte output from RC4 Number of Clock Cycles Ref. [2] [1] Ref [3], [4] Ref [11] 1+256 1+1/256 Design 1 & Design 2 1 + 256= 257 1 + 1/256 Design 3 & Design 4 1+128=129 1/2+ 1/256 256X3=768 3 3 + 256 = 259 1 + 3/256 1 + 256= 257 1 + 1/256 3n 3+n 2+n n/2+2 2+n 2+n/2 3 1 + 3/n 1 + 2/n 1/2+2/n 1 + 2/n 1/2+2/n 3n+768 259+(3+n) 257+(2+n) 257+(2+n/2) 257+(2+n) 129+(2+n/2) 3+768/n 1 + 262/n 1 + 259/n 1/2 + 259/n 1 + 259/n 1/2+131/n [10] M. Galanis, P. Kitsos, G. Kostopoulos, N. Sklavos, O. Koufopavlou, and C. Goutis, “Comparison of the hardware architectures and fpga implementations of stream ciphers,” in Electronics, Circuits and Systems, 2004. PLACE ICECS 2004. Proceedings of the 2004 11th IEEE International PHOTO Conference on, dec. 2004, pp. 571 – 574. HERE [11] D. Jr., in Methods and apparatus for accelerating ARC4 processing, US Patent No.7403615, Morgan Hill, CA, July, 2008. [12] X. M. S. Processor. http://www.xilinx.com/tools/microblaze.htm. Rourab Paul Biography text here. PLACE PHOTO HERE Amlan Chakrabarti Biography text here. PLACE PHOTO HERE Ranjan Ghosh Biography text here.