WO2024115951A1 - Reconfigurable convolution neural network processor - Google Patents
- Publication number
- WO2024115951A1 (PCT/IB2022/061641)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- pixels
- instructions
- memory
- rcnnp
- matrix
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- a width registry is used to configure the width of a feature image which is used by a caching method described below.
- the program counter (PC) 410 is configured to maintain addresses of the instructions of the instruction set. In some examples, the program counter 410 maintains the address of the next instruction to be executed. In some examples, the program counter 410 increases the address pointer by 4 bytes automatically when an instruction is in execution to access the address of the next instruction. This automatic increment of the address may be changed by flow control instructions. The flow control instructions may alter the value of the PC 410 according to a user-specified value in the instruction. The PC 410 communicates a newly calculated address to the instruction cache 412 to execute the instruction at a defined memory location.
- the instruction cache 412 includes a Block Random Access Memory (BRAM).
- the BRAM is used to store the instructions read from the external memory.
- the instruction cache 412 fetches a set of instructions starting from an address provided by the PC 410 in the external memory and stores them in the BRAM for fast access. According to the address given by the PC 410, the instruction cache 412 communicates a relevant instruction stored in the BRAM to the instruction handler 404.
- the instruction cache 412 maintains an address mapping between the BRAM and the external memory. When the PC 410 issues an address which is not stored in the BRAM, the instruction cache 412 fetches a set of instructions starting with that address and stores them in the BRAM. Due to this mechanism, instruction execution is faster than fetching the instructions directly from the external memory 450: direct instruction access from the external memory 450 takes approximately 20-25 clock cycles, whereas accessing the same instruction from the instruction cache 412 takes two clock cycles.
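A minimal behavioural sketch of this caching scheme follows; the 16-instruction block size and the single-block BRAM are illustrative assumptions, while the hit and miss latencies follow the figures above.

```python
# Toy model of the instruction cache described above. The block size
# (16 instructions) and the single-block BRAM are assumptions for
# illustration; the latencies follow the text (2 cycles on a hit,
# ~20-25 cycles for a direct external-memory access, 20 used here).

class InstructionCache:
    BLOCK = 16          # instructions fetched per miss (assumed)
    HIT, MISS = 2, 20   # clock cycles

    def __init__(self, external_memory):
        self.mem = external_memory   # address -> instruction word
        self.base = None             # external address mapped into the BRAM
        self.bram = []

    def fetch(self, pc):
        in_bram = self.base is not None and self.base <= pc < self.base + 4 * self.BLOCK
        if not in_bram:
            # Miss: refill the BRAM with a block starting at the issued address.
            self.base = pc
            self.bram = [self.mem[pc + 4 * i] for i in range(self.BLOCK)]
        return self.bram[(pc - self.base) // 4], (self.HIT if in_bram else self.MISS)

mem = {addr: f"instr@{addr}" for addr in range(0, 256, 4)}
icache = InstructionCache(mem)
print(icache.fetch(0))   # ('instr@0', 20) -> miss fills the BRAM
print(icache.fetch(4))   # ('instr@4', 2)  -> hit
```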
- Pooling unit 414 may be configured to calculate the output of pooling layers in the CNN. In some examples, there may be at least three types of pooling operations: max pooling, min pooling, and average pooling. In max pooling, pooling layers determine a maximum pixel value of a batch (or a group of pixels). Max pooling may be used in determining brighter pixels of the image. In min pooling, pooling layers determine a minimum pixel value of a batch. Min pooling may be used in determining sensitive or lighter pixels of the image. In average pooling, pooling layers determine an average pixel value of a batch. Average pooling may be used in smoothing the image. The pooling unit 414 is configured to communicate the determined values to the appropriate registry defined by the instruction.
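A minimal sketch of the three pooling modes (illustrative only; the batch shape and values are arbitrary):

```python
# Minimal sketch of the three pooling modes described above, applied to a
# 2x2 batch of pixels.

def pool(batch, mode):
    flat = [p for row in batch for p in row]
    if mode == "max":   # brighter pixels
        return max(flat)
    if mode == "min":   # lighter / sensitive pixels
        return min(flat)
    if mode == "avg":   # smoothing
        return sum(flat) / len(flat)
    raise ValueError(mode)

batch = [[12, 20], [8, 16]]
print(pool(batch, "max"), pool(batch, "min"), pool(batch, "avg"))  # 20 8 14.0
```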
- Convolution unit 416 is configured to perform a convolution operation. In some examples, the convolution unit 416 multiplies floating-point values. The convolution unit 416 requires two sets of floating-point values to carry out the convolution operation. A first set of values is referred to as weights (also referred to as a filter), and a second set of values is pixels of the image.
- FIG. 5B shows an example convolution operation where weights 552 are multiplied with the pixels of the image 554 in a convolution operation 556 to generate an output 558. The output may be determined as the sum of the element-wise products of the weights and the pixels, i.e., output = Σi Σj W(i,j) × P(i,j).
- the convolution unit 416 communicates the output of the calculation to an appropriate registry defined by the instruction.
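A minimal sketch of this single-window multiply-accumulate (the weights and pixel values are arbitrary examples, not from the disclosure):

```python
# Minimal sketch of the single-window convolution described above: the
# weights (filter) are multiplied element-wise with the pixels under the
# window and the products are accumulated.

def convolve_window(weights, pixels):
    assert len(weights) == len(pixels)
    return sum(w * p
               for w_row, p_row in zip(weights, pixels)
               for w, p in zip(w_row, p_row))

weights = [[0.0, 1.0, 0.0],
           [1.0, -4.0, 1.0],
           [0.0, 1.0, 0.0]]
pixels  = [[10.0, 10.0, 10.0],
           [10.0, 50.0, 10.0],
           [10.0, 10.0, 10.0]]
print(convolve_window(weights, pixels))  # -160.0
```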
- the data cache 418 is configured to store values for convolution operation.
- the data cache 418, for example, uses BRAM memory to store rows of image pixel values obtained from the external memory.
- the data cache 418 accesses the external memory one time through the memory handler 402 for fetching image pixel values and storing the image pixel values in the BRAM.
- the number of memory access cycles is reduced because the image pixel values do not have to be fetched from the external memory 450 every time.
- the data cache may use a special purpose registry “width” to arrange the values on the BRAM.
- the data cache 418 loads the first n image pixel values starting from a given address into the first row of the BRAM, the next n values into the second row, and so on.
- the data cache 418 is coded such that a first matrix, for example, of R × S pixels of the matrix of P × Q pixels (the original image matrix) is stored in R sequential rows of the parallel memory such that the S pixels of each of the R rows of the first matrix are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, where R is equal to or less than P, and S is equal to or less than Q.
- the memory handler is configured to replace pixels of the first matrix with pixels of a second matrix of T × U pixels of the matrix of P × Q pixels in T sequential rows of the parallel memory such that the U pixels of each of the T rows of the second matrix are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, where T is equal to or less than R, and U is equal to or less than S.
- each of the pixels of the corresponding row R is replaced with pixels of the corresponding row of the P × Q matrix.
- This operation continues till a defined number of pixels of the image are fetched from the external memory and stored or replaced in the data cache.
- the whole image may be loaded from external memory into the data cache instead of obtaining the pixels in matrices.
- the data cache 418 accesses the external memory using a 64-bit wide data bus with a burst transaction length of 16.
- the RCNNP can access 32 image values within 16 clock cycles plus 20-25 clock cycles of waiting time. Therefore, the clock cycles required to load the whole image of 125 × 125 pixels into the BRAM may be calculated as shown in Table 3.
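Table 3 is not reproduced in this excerpt; the sketch below gives a rough best-case estimate under the stated bus parameters. The 32-bit value size is an inference (a 64-bit bus times a burst length of 16 gives 128 bytes, i.e., 32 values per burst), and the 20-cycle wait is the best case of the stated 20-25 range.

```python
# Rough best-case estimate of the image-load cost under the bus parameters
# stated above: a 64-bit bus with burst length 16 delivers 128 bytes per
# burst, i.e., 32 values per burst assuming 32-bit pixel values.

import math

PIXELS = 125 * 125          # 15,625 values
PER_BURST = 32              # values per burst transaction (from the text)
BURST_CYCLES = 16           # clock cycles of data transfer per burst
WAIT = 20                   # best-case wait per access (20-25 in the text)

bursts = math.ceil(PIXELS / PER_BURST)    # 489 burst transactions
cycles = bursts * (BURST_CYCLES + WAIT)   # 17,604 cycles, best case
print(bursts, cycles)
```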
- the instruction handler is configured to perform pixel extraction for a convolution operation based on a first set of instructions.
- the first set of instructions includes processing the image data in the data cache at a defined address to specify a convolutional window having M rows x N columns of pixels of the image data in each memory cycle, where M is equal to or less than P and N is less than or equal to Q.
- the RCNNP extracts one or more pixels under a defined N column of the M x N pixels in each clock cycle to obtain at least a subset of pixels from M rows X N columns of pixels in T clock cycles, where T is less than or equal to N.
- the number of pixels may be based on the extraction requirement.
- the number of pixels to be extracted may be equal to a number of rows. In another embodiment, the number of pixels to be extracted may be based on a dilation rate. Other embodiments not described here are contemplated herein.
- the convolution unit 416 performs the convolution operation on the extracted pixels. Further, the instruction handler 404 defines an address for the next convolutional window based on the first set of instructions for pixel extraction and performs the extraction step. This process continues as defined in the first set of instructions. To elaborate, consider FIG. 7, where a convolution window is defined for a 3 × 3 matrix of pixels.
- the instruction handler 404 processes the image data in the data cache at a defined address, that is, ‘0, 0’ to specify a convolutional window having a matrix of 3 rows X 3 columns of pixels of the image data in each memory cycle.
- the RCNNP 400 extracts three pixels in a defined column of the 3 × 3 pixels in each clock cycle to obtain at least a subset of pixels from 3 rows × 3 columns of pixels in 3 clock cycles as shown in FIG. 7. As illustrated in FIG. 7, pixels V1, V19, and V37, corresponding to first column 702 of the matrix, are extracted in 1 clock cycle.
- the RCNNP extracts at least one pixel in a defined column of the 3 x 3 pixels in each clock cycle to obtain at least a subset of pixels from 3 rows x 3 columns of pixels in 3 clock cycles as per the dilation rate.
- once the pixels corresponding to the first column are obtained, one or more pixels from a second column 704 are obtained in a similar manner. This process continues till the required pixels of the convolution window are extracted. For example, as illustrated in FIG. 7, pixels V2, V20, and V38, corresponding to the second column 704 of the matrix, are extracted in 1 clock cycle, and pixels V3, V21, and V39, corresponding to a third column 706 of the matrix, are extracted in 1 clock cycle.
- pixels of 3 X 3 matrix 708 are obtained in one memory cycle and three clock cycles.
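A behavioural sketch of this column-per-clock read-out, assuming one parallel memory per window row (a simplification of the data cache 418):

```python
# Behavioural sketch of the column-per-clock extraction described above:
# with M parallel memories each holding one row of the window, an M x N
# window is read out in N clock cycles (one column per cycle), after the
# single memory access that filled the cache.

def extract_window(parallel_rows, col0, n_cols):
    """parallel_rows: list of M row buffers (one per parallel memory)."""
    window, clocks = [[] for _ in parallel_rows], 0
    for c in range(col0, col0 + n_cols):
        for m, row in enumerate(parallel_rows):   # all M reads happen in
            window[m].append(row[c])              # the same clock cycle
        clocks += 1
    return window, clocks

rows = [[1, 2, 3, 4], [19, 20, 21, 22], [37, 38, 39, 40]]  # V1.., V19.., V37..
win, clocks = extract_window(rows, 0, 3)
print(win, clocks)   # [[1, 2, 3], [19, 20, 21], [37, 38, 39]] 3
```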
- the instruction handler 404 defines the address for the next convolutional window based on the first set of instructions for pixel extraction. This process continues till the required pixels are extracted for the image data as per the CNN requirements. This process significantly saves time and data transfer as explained below.
- the data cache 418 saves 7,993,620 ns on a single filter calculation on a 125 × 125 image; at the 10 ns clock period noted below, this is approximately 799,362 clock cycles, or 88.06% of the 907,740 DMA clock cycles. According to the above calculations, the data cache 418 achieves an 88.06% clock cycle reduction rate compared to the DMA on loading image values into the convolution unit 416.
- the data caching as described herein reduces clock cycles that it takes to access convolution memory values by 88.06% while increasing the speed of the RCNNP 400. In other words, the RCNNP 400 can use its resources for performing other operations that otherwise would have been wasted on obtaining redundant pixels.
- each filter is multiplied by the image value matrix defined by a convolution window that is fetched from the image.
- the RCNNP architecture as disclosed herein was implemented on a Xilinx Zynq 7010 Field Programmable Gate Array (FPGA) device.
- the clock signal was 100 MHz, and as a result a single clock cycle is 10 ns long.
- the RCNNP used the external RAM to store instructions and data values.
- the RCNNP 400 can also be implemented in other processing technologies not described here, but contemplated herein.
- FIG. 8 illustrates a pixel extraction technique.
- the technique described in FIG. 8 performs memory fetching of an image having a 6 × 18 pixel matrix from an external memory to an internal memory in an improvised manner.
- the internal memory fetches all the pixels of three rows of values from the external memory each time the position of the sliding window moves one row down. Therefore, this memory architecture fetches redundant rows from the external memory.
- the data cache fetches a total of nine rows from the external memory instead of fetching two new rows.
- pixels corresponding to first three rows are obtained as shown in 804.
- the processor fetches pixels corresponding to rows 2-4, as shown in 806.
- pixels of row 2 and row 3 were already obtained in the previous cycle and thus making pixel extraction corresponding to these rows redundant.
- row 3 and row 4 are obtained as shown in 808, again making the pixel extraction corresponding to these rows redundant.
- a total of nine rows are extracted from the external memory instead of fetching two new rows. Consequently, there are unwanted redundant memory cycles, clock cycles and redundant pixels obtained, which makes the CNN process inefficient and redundant.
- FIG. 9 illustrates an application of the disclosed method of memory extraction of the disclosure on an image having 6 x 18 pixel matrix from an external memory to the data cache 418 internal memory.
- the data cache 418 fetches all the pixels of the first three rows of the image from the external memory in the first cycle, as shown in 904.
- the data cache 418 replaces the pixels of the first row in the data cache 418 with pixels of a fourth row from the external memory as shown in 906.
- the data cache 418 replaces the pixels of the third row in the data cache with pixels of the sixth row. This is demonstrated in step 908.
- the time taken to fetch the redundant rows is eliminated, and there are significant time savings due to the method of the disclosure.
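A minimal sketch of this replace-oldest-row scheme, assuming a three-row cache as in FIG. 9 (the bookkeeping of which physical slot holds which image row is simplified):

```python
# Sketch of the row-replacement scheme of FIG. 9: when the window slides
# down one row, only the oldest cached row is overwritten with the next
# image row, instead of re-fetching all three rows.

class RowCache:
    def __init__(self, image, window_rows=3):
        self.image = image
        self.rows = [list(image[r]) for r in range(window_rows)]  # initial fetch
        self.next_row = window_rows
        self.oldest = 0          # slot holding the oldest cached row

    def slide_down(self):
        """Fetch exactly one new row, replacing the oldest one in place."""
        self.rows[self.oldest] = list(self.image[self.next_row])
        self.oldest = (self.oldest + 1) % len(self.rows)
        self.next_row += 1

image = [[r * 18 + c + 1 for c in range(18)] for r in range(6)]  # 6 x 18 pixels
cache = RowCache(image)
cache.slide_down()   # row 1 is replaced by row 4; rows 2-3 are reused
```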
- the data cache 418 can be configured to support any X × Y pixel matrices. Regardless of the configuration, the RCNNP 400 may take a single memory access time and X clock cycles to fetch the X × Y pixel matrix from the data cache. This is explained in FIG. 10, where columns of pixels are extracted in each clock cycle, totaling X clock cycles and one memory cycle to perform the data extraction.
- the disclosure also supports any size of strides.
- the data cache is configured to save a starting memory address of the pixels in buffered memory. After finishing the memory fetching, the memory accesses in the data cache 418 use the absolute memory addresses to fetch the memory matrices. As a result of having the absolute memory address for the pixels, the data cache 418 can support any size of strides along vertical and/or horizontal directions.
- the RCNNP 400 and the method of the disclosure support different dilation rates. Dilation may refer to the expansion of input by inserting holes between its consecutive pixel elements. The dilation is also part of convolution, but it involves pixel skipping to cover a larger area of input. The dilation rate may refer to a spacing of gaps between pixel elements in a feature map on which a convolution filter is applied.
- the RCNNP 400 is configured to support any dilation rate through an introduction of a dilation rate parameter to the instructions. This parameter can be set and/or varied through the instructions.
- the RCNNP 400 automatically applies the dilation and outputs convolution matrices according to the dilation rate given by the parameter.
- FIG. 11 shows a 5 × 5 pixel convolution window 1102 on which a dilation rate of two is applied, and the output of the extraction is shown as a 3 × 3 matrix 1104.
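A one-line sketch of the dilated extraction of FIG. 11 (the pixel indices are arbitrary placeholder values):

```python
# Sketch of dilated extraction: with a dilation rate of 2, a 5 x 5 window
# yields the 3 x 3 matrix of every second pixel, as in FIG. 11.

def extract_dilated(window, rate):
    return [row[::rate] for row in window[::rate]]

window5x5 = [[r * 5 + c for c in range(5)] for r in range(5)]
print(extract_dilated(window5x5, 2))
# [[0, 2, 4], [10, 12, 14], [20, 22, 24]]
```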
- the RCNNP 400 and the method of the disclosure support padding.
- Padding may refer to addition of pixels to an image when it is being processed by a kernel.
- the data cache 418 can be configured to support padding.
- the RCNNP 400 is configured to support any padding size through an introduction of a parameter in instructions and architecture. The padding can be set and/or varied through the instructions.
- FIG. 12 shows an architecture of the modified RCNNP 1200 after the implementation of the additional general cache 1220.
- the architecture of the modified RCNNP 1200 includes a memory handler 1202, an instruction handler 1204, an arithmetic and logic unit (ALU) 1206, registries 1208, a program counter 1210, an instruction cache 1212, a pooling unit 1214, a convolution unit 1216, a data cache 1218 and the general cache 1220.
- the functions of the memory handler 1202, the instruction handler 1204, the arithmetic and logic unit (ALU) 1206, the registries 1208, the program counter 1210, the instruction cache 1212, the pooling unit 1214, the convolution unit 1216, and the data cache 1218 are substantially similar to the functions of the elements of FIG. 4. Thus, the functionalities are not repeated herein for the sake of brevity.
- the general cache (GC) 1220 is used for storing values from the registries 1208, the convolution unit 1216 and the pooling unit 1214. To support the general cache 1220, some additional instructions were added to the ISA to handle this new cache memory.
- step 1302 executing, by the instruction handler 404 of the RCNNP 400, a first set of instructions based on an Instruction Set Architecture (ISA), corresponding to a first CNN architecture for operating the RCNNP.
- step 1304 receiving and storing, by the memory handler 402 of the RCNNP 400, image data comprised of a matrix of P × Q pixels from an external memory and storing the image data in parallel memories of the data cache 418.
- step 1306 processing, by the instruction handler 404, the image data in the data cache 418 at a defined address to specify a convolutional window having M rows x N columns of pixels of the image data in each memory cycle.
- step 1308 extracting, by the instruction handler 404, at least one pixel in a defined N column of the M × N pixels in each clock cycle to obtain at least a subset of pixels from M rows × N columns of pixels in T clock cycles, where T is less than or equal to N.
- step 1310 defining, by the instruction handler 404, the address for next convolutional window based on the subset of instructions for pixel extraction and perform extraction until the image data is obtained;
- step 1312 performing the convolution operation, by the convolution module, on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
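A self-contained toy walk-through of steps 1306-1312 on a small image follows; all names and sizes are illustrative placeholders, not the RCNNP's API, and the ISA-level steps 1302-1304 are represented by plain Python setup.

```python
# Toy walk-through of the FIG. 13 flow (steps 1306-1312) on a 4 x 4 image
# with a 3 x 3 window. Purely illustrative.

def windows(image, m, n):
    """Steps 1306/1310: visit each convolution-window address in turn."""
    for r in range(len(image) - m + 1):
        for c in range(len(image[0]) - n + 1):
            yield r, c

def extract(image, r, c, m, n):
    """Step 1308: one column of m pixels per clock cycle, n cycles total."""
    return [[image[r + i][c + j] for j in range(n)] for i in range(m)]

def convolve(window, weights):
    """Step 1312: multiply-accumulate of weights and extracted pixels."""
    return sum(w * p for wr, pr in zip(weights, window) for w, p in zip(wr, pr))

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
weights = [[1, 0, -1]] * 3
for r, c in windows(image, 3, 3):
    print(convolve(extract(image, r, c, 3, 3), weights))
```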
- the present disclosure provides a RCNNP that is a robust and efficient processor.
- the RCNNP is configured to be used for, inter alia, image processing.
- the RCNNP 400 is supported with an ISA for handling various operations including processor operation, arithmetic and logical calculations, CNN-related functions, and the like.
- the RCNNP 400 includes a data cache configured for fetching neural network-related memory values in a substantially smaller amount of time.
- the RCNNP includes an instruction caching technique that reduces the average clock cycles it takes to access the instructions on memory, for example, from 23 clock cycles to 4 clock cycles.
- the reconfigurable hardware architecture in RCNNP 400 enables for implementation of different size CNNs using the same processor.
- the ISA based instructions increase an efficiency of the RCNNP 400 by 19.45% for CNN calculations such as convolution, pooling and fully-connected layers.
Abstract
A Reconfigurable Convolution Neural Network Processor (RCNNP) includes a data cache comprising parallel memories configured to store image data, a memory handler to obtain and store image data comprised of a matrix of P × Q pixels from an external memory to the data cache, and an instruction handler to perform pixel extraction for a convolution operation. The instruction handler is configured to: process the image data in the data cache at a defined address to specify a convolutional window having M rows × N columns of pixels of the image data in each memory cycle; extract at least one pixel in a defined N column of the M × N pixels in each clock cycle to obtain at least a subset of pixels from M rows × N columns of pixels in T clock cycles; define the address for next convolutional window for pixel extraction and perform the extraction.
Description
RECONFIGURABLE CONVOLUTION NEURAL NETWORK PROCESSOR
TECHNICAL FIELD
The present disclosure relates to a reconfigurable convolution neural network processor, and more particularly, to a system and method for performing memory operations for a reconfigurable convolution neural network processor.
BACKGROUND
In recent years, processing images using machine learning techniques has gained a lot of importance. This is because image processing can be performed with less human intervention and with greater accuracy. Particularly, the use of deep learning techniques that may involve multi-layer neural networks has been on the rise. Among the various deep learning techniques used for image processing, the convolutional neural network (CNN) has been a popular choice due to the development of learning multi-layer (deep) neural networks of four or more layers. To process the images, a CNN requires extracting information from the image frames, which is then sent for the convolution operation. FIG. 1 illustrates a conventional CNN operation for character recognition, where an exemplary handwritten character image of 28 × 28 × 1 is taken as input 122. The input is processed with convolution operation 102 (5 × 5) by applying kernel padding to each convolution window into n1 channels (24 × 24 × n1) 124, which is further processed to n1 channels (12 × 12 × n1) 126 via max-pooling (2 × 2) step 104. The n1 channels (12 × 12 × n1) are subjected to convolution where further kernel padding (5 × 5) is applied in step 106 to obtain n2 channels (8 × 8 × n2) 128, which are further processed to downsample feature maps to obtain n2 channels of (4 × 4 × n2) 130 by application of max pooling (2 × 2) in step 108. The output of the max-pooling (modified feature map) is provided to a flattening layer which performs a flattening process. The flattening process involves converting all the resultant 2-dimensional arrays from pooled feature maps into a single long continuous linear vector. In an aspect, the flattening layers convert the two-dimensional arrays from the pooled layer feature maps to a single-dimensional linear vector of data, i.e., convert the retained feature maps with selected columns rearranged into a single-dimensional linear vector. The single-dimensional linear vector is communicated to a dense layer. The dense layer is a layer that is used in the later stages of the CNN. A dense layer includes a layer that is deeply connected with its preceding layer, meaning that the neurons of the layer are connected to every neuron
of its preceding layer. The output of the dense layer 132 is communicated to fully connected neural network 112 that classifies the image based on the output from the convolutional layers. The conventional CNN is not described here in detail as the steps of the conventional CNN are well known. The CNN defines and uses a convolutional window to extract the information from the image frames. The CNN 'slides' the convolutional window on the image frames and generates a number of small images. To elaborate, the convolution operation involves multiplying an image value matrix with a filter that has weights for convolution layers. The image value matrix is obtained by the convolution window. To calculate a next value in the layer, the convolution window is slid by a single pixel, for example, and the corresponding image value matrix under the new convolution window is obtained. This image value matrix is multiplied with the filter to obtain the convolution output. The above steps are continued till the convolution window is moved covering the last image pixel in the image. This is demonstrated in exemplary FIG. 2, where an image of 6 × 18 is processed using convolution windows. Convolution window 202 is a 3 × 3 image value matrix that includes pixels V1, V2, V3, V19, V20, V21, V37, V38, and V39, where each symbol denotes the value of the corresponding pixel. Once the pixels of the convolution window 202 are extracted, the convolution window 202 is slid by one pixel. The convolution window 202 in the new position would include V2, V3, V4, V20, V21, V22, V38, V39 and V40. These pixels under the convolution window in the new position are extracted. This process of sliding and extracting continues until the pixel V108 is covered. In FIG. 2, exemplary three matrices 204-208 that are extracted using convolution windows are shown. As observed, except for the first matrix 204, all other matrices contain a redundancy of two italicized columns of values. This redundancy costs a CNN processor redundant clock cycles. To elaborate, the image is stored in internal memory from an external memory. The pixels are stored in the internal memory consecutively one by one. For example, the pixels of the first row are stored consecutively, followed by pixels of the second row stored consecutively adjacent to the pixels of the first row, and so on, until all the pixels of the image are stored. FIG. 3 illustrates how the pixels are stored in the internal memory. In some examples, according to how the values are stored, it takes at least three memory accesses to load a single matrix into the processor for processing. For example, to extract the first matrix 204, the CNN processor has to use one memory cycle and three clock cycles of time to extract the three pixels of the first row. Here each pixel would require a clock cycle. Next, the CNN processor extracts the pixels of the second row by accessing the memory location of the first pixel of the second row and then extracting
the three pixels of the second row. For extracting the pixels of the second row, the CNN processor would require one memory cycle and three clock cycles of time. This process continues till all the pixels of the matrix are extracted. For a 3 × 3 convolution window, the CNN processor would take three memory cycles and nine clock cycles. Also, when the convolution window is slid by one pixel, it takes another three memory accesses and nine clock cycles, and fetches six redundant image pixel values. Except for the first matrix, the CNN processor extracts six redundant pixels for every convolution window, which involves redundant memory and clock cycles. To explain mathematically, if an image has 125 × 125 pixels and the convolution window is 3 × 3, there would be 123 × 123 sliding matrices to fetch from the external memory. Each matrix may take three memory accesses, and hence the total number of memory accesses is determined as follows:
123 × 123 × 3 = 45,387 memory accesses.
Each memory access may take approximately 20 - 25 clock cycles. Considering a best-case scenario, the memory access may take 20 clock cycles. Therefore, total clock cycles used by Direct Memory Access (DMA) to load convolution layer matrices for a single filter can be calculated as:
45,387 × 20 = 907,740 clock cycles.
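For illustration, this arithmetic can be reproduced with a short Python sketch (assuming the best-case 20-cycle access latency stated above):

```python
# Illustrative sketch of the DMA cost arithmetic above (best-case latency).
IMAGE_H = IMAGE_W = 125          # image size in pixels
WIN = 3                          # 3x3 convolution window
LATENCY = 20                     # best-case clock cycles per memory access

positions = (IMAGE_H - WIN + 1) * (IMAGE_W - WIN + 1)   # 123 x 123 positions
accesses = positions * WIN                              # one access per window row
cycles = accesses * LATENCY

print(positions)   # 15129
print(accesses)    # 45387
print(cycles)      # 907740
```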
Consequently, there is a plethora of redundant memory cycles, clock cycles and redundant pixels, which makes the CNN process inefficient.
Furthermore, hardware for CNN in conventional systems is customized. In other words, if a different CNN is to be implemented on the same hardware, the hardware is required to be reconfigured. The process of reconfiguring the hardware is cumbersome, time-consuming, and inefficient. As a result, the conventional systems for CNN may not be usable for different CNN implementations, making it expensive.
SUMMARY
In one aspect of the present disclosure, a Reconfigurable Convolution Neural Network Processor (RCNNP) is disclosed. The RCNNP includes a data cache, a memory handler, an instruction handler and a convolution unit. The data cache includes parallel memories configured to store image data. The memory handler is coupled to the data cache, and is configured to obtain image data comprised of a matrix of P × Q pixels from an external memory and store the image data in the parallel memories of the data cache. The instruction handler is coupled to the memory handler and the data cache, and is configured to perform pixel
extraction for a convolution operation based on a first set of instructions to process the image data in the data cache at a defined address to specify a convolutional window having M rows × N columns of pixels of the image data in each memory cycle, wherein M is equal to or less than P and N is less than or equal to Q, extract at least one pixel in a defined N column of the M × N pixels in each clock cycle to obtain at least a subset of pixels from M rows × N columns of pixels in T clock cycles, wherein T is less than or equal to N, and define the address for the next convolutional window based on the first set of instructions for pixel extraction and perform extraction until the image data is obtained. The convolution unit is coupled to the data cache and the instruction handler, and is configured to perform a convolution operation on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
In another aspect of the present disclosure, a method for a Reconfigurable Convolution Neural Network Processor (RCNNP) is disclosed. The method includes executing, by an instruction handler of the RCNNP, a first set of instructions based on an Instruction Set Architecture (ISA), corresponding to a first CNN architecture for operating the RCNNP. The method also includes receiving and storing, by a memory handler of the RCNNP, image data comprised of a matrix of P × Q pixels from an external memory and storing the image data in parallel memories of a data cache. In addition, the method also includes executing, by the instruction handler of the RCNNP, a subset of instructions of the first set of instructions to perform a pixel extraction for a convolution operation. The pixel extraction includes processing the image data in the data cache at a defined address to specify a convolutional window having M rows × N columns of pixels of the image data in each memory cycle, extracting at least one pixel in a defined N column of the M × N pixels in each clock cycle to obtain at least a subset of pixels from M rows × N columns of pixels in T clock cycles, wherein T is less than or equal to N, and defining the address for the next convolutional window based on the subset of instructions for pixel extraction and performing extraction until the image data is obtained. The method further includes performing the convolution operation, by a convolution module, on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
These and other aspects and features of non-limiting embodiments of the present disclosure will now become apparent to those skilled in the art upon review of the following description of specific non-limiting embodiments of the disclosure in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
A better understanding of embodiments of the present disclosure (including alternatives and/or variations thereof) may be obtained with reference to the detailed description of the embodiments along with the following drawings, in which:
FIG. 1 illustrates an exemplary convolution neural network;
FIG. 2 illustrates an exemplary pixel extraction process for convolution operation;
FIG. 3 illustrates an example process of storing an image in memory;
FIG. 4 is an architecture of reconfigurable convolution neural network processor, according to one or more embodiments of the present disclosure;
FIG. 5A illustrates an instruction structure associated with instruction set architecture, according to certain embodiments of the present disclosure;
FIG. 5B illustrates an example convolution operation where weights are multiplied with pixels values of an image in a convolution operation, according to certain embodiments of the present disclosure;
FIG. 6 illustrates an exemplary implementation of a convolutional neural network (CNN), according to certain embodiments of the present disclosure;
FIG. 7 illustrates a data cache memory transfer operation, according to certain embodiments of the present disclosure;
FIG. 8 illustrates a technique of memory fetching of an image from an external memory to an internal memory, according to certain embodiments of the present disclosure;
FIG. 9 illustrates an exemplary application of a method for memory extraction of an image from an external memory to a data cache, according to certain embodiments of the present disclosure;
FIG. 10 illustrates an exemplary application of the method of memory extraction of an image with a different size of convolution window, according to certain embodiments of the present disclosure;
FIG. 11 illustrates a convolution window on which a dilation rate is applied, according to certain embodiments of the present disclosure;
FIG. 12 illustrates a modified architecture of reconfigurable convolution neural network processor, according to one or more embodiments of the present disclosure; and
FIG. 13 illustrates a process flow for pixel extraction for performing convolution in RCNNP, according to one or more embodiments.
It should be appreciated by those skilled in the art that any diagram herein represents conceptual views of illustrative systems embodying the principles of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to specific embodiments or features, examples of which are illustrated in the accompanying drawings. Wherever possible, corresponding or similar reference numbers will be used throughout the drawings to refer to the same or corresponding parts. Moreover, references to various elements described herein are made collectively or individually when there may be more than one element of the same type. However, such references are merely exemplary in nature. It may be noted that any reference to elements in the singular may also be construed to relate to the plural and vice-versa without limiting the scope of the disclosure to the exact number or type of such elements unless set forth explicitly in the appended claims.
The terminologies and/or phrases used herein are for the purpose of describing particular embodiments only and are not intended to be limiting to the invention. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Aspects of the present disclosure are directed towards a reconfigurable convolution neural network processor.
FIG. 4 is an architecture of reconfigurable convolution neural network processor (RCNNP) 400, according to one or more embodiments of the present disclosure. The architecture 400 includes memory handler 402, instruction handler 404, arithmetic and logic unit (ALU) 406, registries 408, program counter 410, instruction cache 412, pooling unit 414, convolution unit 416, and data cache 418.
Memory handler 402 is configured to manage all transactions between an external memory 450 (for example, Random Access Memory (RAM)) and the RCNNP 400. The memory handler 402 may include a Finite State Machine (FSM). The FSM determines a type of transaction that should be performed on priority among various transaction types. In an example, the transaction types may include instruction reads, data memory reads, and data memory writes.
The RCNNP 400 is configured to operate on a set of instructions. The set of instructions is based on a custom Instruction Set Architecture (ISA). In an embodiment, the ISA includes four types of instructions: registry access instructions, memory access instructions, conditional instructions and CNN operation instructions. The registry access instructions are associated with operations relating to registries 408, values in the registries 408, and modifications to the values in registries 408. The memory access instructions are configured to save values of registries to memory locations on RAM and vice-versa. The conditional instructions are configured to control a flow of instruction executions. The CNN operation instructions are associated with CNN operations. Table 1 summarizes the functions of the instruction types. Each instruction may be stored in 4 bytes (32 bits).
The ISA can be used for developing CNN programs. The ISA has a rich set of instructions that can also be used for developing an operating system for a desired CNN program containing an embedded system. The robustness of the ISA allows the RCNNP 400 to design different CNNs. In some examples, the operating system developed by using the ISA can be used for implementing and executing CNNs. In some examples, the ISA allows users to program the RCNNP 400 to compute their own CNN using built-in resources. The instructions of the ISA may be classified into instructions that work with both registries and constant values and instructions that work with registries only. In some implementations, the instructions that work with both registries and constant values include six bits of opcode, five bits for carrying a registry address, another five bits for carrying another registry address, and sixteen bits of constant payload. In some implementations, the instructions that work with registries only include six bits of opcode, five bits for carrying a registry address, another five bits for carrying another registry address, a third five bits for carrying yet another registry address and the remaining bits for payload. The instruction types are shown in FIG. 5A, where instruction 502 represents the instructions that work with both registries and constant values, and instruction 504 represents the instructions that work only with
registries. Furthermore, the instructions may be also classified by function to constant bound, registry bound or memory bound. Table 2a to Table 2i provides some exemplary instructions that include constant bound instructions, registry bound instructions and memory bound instructions along with their definitions.
In addition to the aforementioned instructions, there are additional slots provided for custom instructions.
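For illustration only, the following Python sketch packs and unpacks the two 32-bit instruction formats described above. The field widths (6-bit opcode, 5-bit registry addresses, 16-bit or 11-bit payload) come from the description; the example opcode value, field order, and helper names are assumptions, not part of the disclosed ISA.

```python
# Sketch of the two 32-bit instruction formats described above.
# Field order and opcode values are illustrative assumptions.

def encode_const(opcode: int, rd: int, rs: int, const: int) -> int:
    """Pack a registry+constant instruction: 6 | 5 | 5 | 16 bits."""
    assert 0 <= opcode < 64 and 0 <= rd < 32 and 0 <= rs < 32 and 0 <= const < 65536
    return (opcode << 26) | (rd << 21) | (rs << 16) | const

def encode_reg(opcode: int, rd: int, rs: int, rt: int, payload: int = 0) -> int:
    """Pack a registry-only instruction: 6 | 5 | 5 | 5 | 11 bits."""
    assert 0 <= opcode < 64 and all(0 <= r < 32 for r in (rd, rs, rt))
    return (opcode << 26) | (rd << 21) | (rs << 16) | (rt << 11) | (payload & 0x7FF)

def decode_opcode(word: int) -> int:
    """Recover the 6-bit opcode from a 32-bit instruction word."""
    return (word >> 26) & 0x3F

# Example: a hypothetical "add constant" instruction with opcode 0b000010.
word = encode_const(0b000010, rd=3, rs=1, const=100)
assert decode_opcode(word) == 0b000010
```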
The ISA is handled by the instruction handler 404. The ISA includes a versatile and robust instruction set. In some embodiments, the ISA is used for creating an operating system to operate the RCNNP 400. In one or more embodiments, a CNN may be implemented by developing a CNN program as a program code 602, which is converted to ISA-based RCNNP machine code 606 using an RCNNP language converter program 604, as shown in FIG. 6. Further, a memory of a trained neural network 608 and the machine code 606 are input to the RCNNP 612 for implementing a CNN. The RCNNP 612 may perform an image processing operation on input image data 610 based on the implemented CNN. The CNN architecture is stored in a program code using the ISA. The architecture of the CNN may be changed using a software program written using the ISA. Therefore, the RCNNP may be upgraded for different CNNs without modifying the hardware. In other words, the existing hardware architecture may be used with different CNN architectures without changing the hardware configuration. In some examples, the RCNNP may be used as a stand-alone processor. Using the ISA, a stand-alone operating system may be generated for handling all RCNNP-related tasks with no support sought from an external processor. In one or more embodiments, the ISA allows programs to have fewer instructions for performing a given task in comparison with conventional computer programs. Due to this compactness, the RCNNP 400 requires less storage capacity to store a program than conventional processors.
The instruction handler 404 is configured to operate the RCNNP 400 using the instructions that are based on the ISA. The instructions may be associated with the operation of the RCNNP 400 and/or CNN operations. The instruction handler 404 includes circuitry to operate the RCNNP 400 based on the instructions. In some implementations, the instruction handler 404 is configured to fetch the instructions stored in the external memory or the instruction cache 412 sequentially and execute the operations in the sequence defined therefor. The sequential flow may be modified using conditional instructions.
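For illustration only, a minimal Python sketch of this sequential fetch-execute flow follows; the fetch and execute callbacks are hypothetical stand-ins for the instruction cache and the instruction handler circuitry, and the 4-byte increment matches the program counter behavior described below.

```python
def run(fetch, execute, pc=0, halted=lambda: False):
    """Sequential fetch-execute loop. `fetch(pc)` returns a 32-bit
    instruction word; `execute(word, pc)` returns a branch target for
    conditional (flow control) instructions, or None to fall through."""
    while not halted():
        word = fetch(pc)
        target = execute(word, pc)
        pc = target if target is not None else pc + 4  # next 4-byte instruction
```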
The Arithmetic and Logic Unit (ALU) 406 is configured to perform all logical calculations, arithmetic calculations, and comparison operations. The operations include a logical ‘AND’ operation, logical ‘OR’ operation, logical ‘NOT’ operation, addition operation, subtraction operation, bit shift operation, equal comparison operation, less-than comparison operation, and greater-than comparison operation. In some implementations, these operations are applied to 32-bit wide values. The ALU 406 is configured to receive values from the registries specified by the instructions and store the results of the calculation in the registry specified by the instructions.
In some implementations, the registries 408 are 32 bits wide and comprise 32 high-speed memory locations which can be accessed within a single clock cycle. The registries 408 are used for storing temporary values for fast access. The registries 408 may include general-purpose registries and special-purpose registries. General-purpose registries may be used to store any type of 32-bit wide value and can be accessed using instructions. The special-purpose registries include, inter alia, a zero registry, a status registry, a configuration registry, and a width registry. The registries 408 are illustrated in FIG. 4. The zero registry is a read-only registry which holds the value zero. The status registry includes various statuses about the operations of the RCNNP 400. Bits of the status registry are set automatically by various operations on the RCNNP 400. In some examples, the user can configure the status registry to set or reset the status to zero for a CNN implementation or operation. The width registry is used to configure the width of a feature image, which is used by the caching method described below.
The program counter (PC) 410 is configured to maintain addresses into the instruction list of the instruction set. In some examples, the program counter 410 maintains the address of the next instruction to be executed. In some examples, the program counter 410 automatically increases the address pointer by 4 bytes while an instruction is in execution in order to access the address of the next instruction. This automatic increment of the address may be overridden by flow control instructions. The flow control instructions may alter the value of the PC 410 according to a user-specified value in the instruction. The PC 410 communicates a newly calculated address to the instruction cache 412 to execute the instruction at the defined memory location.
The instruction cache 412 includes a Block Random Access Memory (BRAM). The BRAM is used to store instructions read from the external memory. The instruction cache 412 fetches a set of instructions starting from an address in the external memory provided by the PC 410 and stores them in the BRAM for fast access. According to the address given by the PC 410, the instruction cache 412 communicates the relevant instruction stored in the BRAM to the instruction handler 404. The instruction cache 412 maintains an address mapping between the BRAM and the external memory. When the PC 410 issues an address which is not stored in the BRAM, the instruction cache 412 fetches a set of instructions starting at that address and stores them in the BRAM. Due to this mechanism of the instruction cache 412, instruction execution is faster than fetching the instructions directly from the external memory 450: direct instruction access from the external memory 450 takes approximately 20-25 clock cycles, whereas accessing the same instruction from the instruction cache 412 takes two clock cycles.
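For illustration only, the Python sketch below models this miss-then-block-fetch behavior; the block size, the dictionary-backed external memory, and the class interface are assumptions for the sketch, not details of the disclosed hardware.

```python
# Sketch of the BRAM-backed instruction cache: on a miss, a block of
# instructions starting at the requested address is fetched and held
# for fast subsequent access.

class InstructionCache:
    def __init__(self, external_memory, block_words=64):
        self.mem = external_memory          # maps byte address -> 32-bit word
        self.block_bytes = block_words * 4
        self.base = None                    # external address of cached block
        self.bram = []                      # cached instruction words

    def fetch(self, pc: int) -> int:
        # Miss: fetch a whole block starting at the requested address.
        if self.base is None or not (self.base <= pc < self.base + self.block_bytes):
            self.base = pc
            self.bram = [self.mem[self.base + 4 * i]
                         for i in range(self.block_bytes // 4)]
        # Hit: ~2 cycles in hardware vs ~20-25 cycles for a direct access.
        return self.bram[(pc - self.base) // 4]
```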
Pooling unit 414 may be configured to calculate the output of pooling layers in the CNN. In some examples, there may be at least three types of pooling operations: max pooling, min pooling, and average pooling. In max pooling, pooling layers determine a maximum pixel value of a batch (or a group of pixels). Max pooling may be used in determining brighter pixels of the image. In min pooling, pooling layers determine a minimum pixel value of a batch. Min pooling may be used in determining sensitive or lighter pixels of the image. In average pooling, pooling layers determine an average pixel value of a batch. Average pooling may be used in smoothing the image. The pooling unit 414 is configured to communicate the determined values to the appropriate registry defined by the instruction.
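For illustration only, a minimal Python sketch of the three pooling modes follows; the function name and interface are illustrative.

```python
# Sketch of the three pooling modes applied to a batch (group) of pixels.

def pool(batch, mode="max"):
    if mode == "max":                # brighter pixels
        return max(batch)
    if mode == "min":                # lighter / more sensitive pixels
        return min(batch)
    if mode == "avg":                # smoothing
        return sum(batch) / len(batch)
    raise ValueError(mode)

assert pool([3, 7, 1, 5], "max") == 7
assert pool([3, 7, 1, 5], "avg") == 4.0
```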
Convolution unit 416 is configured to perform a convolution operation. In some examples, the convolution unit 416 multiplies floating-point values. The convolution unit 416 requires two sets of floating-point values to carry out the convolution operation. A first set of values is referred to as weights (also referred to as a filter), and a second set of values is pixels of the image. FIG. 5B shows an example convolution operation where weights 552 are multiplied with the pixels of the image 554 in a convolution operation 556 to generate an output 558. The output may be determined using the following exemplary equation, in which the sums run over the rows and columns of the convolution window: output = Σᵢ Σⱼ weight(i, j) × pixel(i, j).
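For illustration only, the multiply-accumulate of such a window is sketched below in Python; the filter and pixel values are arbitrary example numbers, not from the disclosure.

```python
# Sketch of the windowed multiply-accumulate the convolution unit performs:
# element-wise products of the filter weights and the image pixels under
# the window, summed into a single output value.

def convolve_window(weights, pixels):
    """weights and pixels are M x N nested lists of floats."""
    return sum(w * p
               for w_row, p_row in zip(weights, pixels)
               for w, p in zip(w_row, p_row))

weights = [[0.0,  1.0, 0.0],
           [1.0, -4.0, 1.0],
           [0.0,  1.0, 0.0]]
pixels = [[10.0, 10.0, 10.0],
          [10.0, 50.0, 10.0],
          [10.0, 10.0, 10.0]]
assert convolve_window(weights, pixels) == -160.0
```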
The convolution unit 416 communicates the output of the calculation to an appropriate registry defined by the instruction.
The data cache 418 is configured to store values for the convolution operation. The data cache 418, for example, uses BRAM memory to store rows of image pixel values obtained from the external memory. The data cache 418 accesses the external memory once through the memory handler 402 to fetch image pixel values and store them in the BRAM. As a result of having the data cache 418, the number of memory access cycles is reduced because the image pixel values do not have to be fetched from the external memory 450 every time. When fetching the image pixel values from the external memory, the data cache may use the special-purpose “width” registry to arrange the values in the BRAM. If the width registry contains the value n, the data cache 418 loads the first n image pixel values starting from a given address into the first row of the BRAM, the next n values into the second row, and so on. To elaborate, the data cache 418 is coded such that a first matrix of, for example, R X S pixels of a matrix of P X Q pixels (the original image matrix) is stored in R sequential rows of the parallel memory such that the S pixels of each of the R rows of the first matrix of R X S pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, where R is equal to or lesser than P, and S is equal to or lesser than Q. Consider an example where a matrix of 3 X 18 pixels of a matrix of 6 X 18 pixels is stored such that the three rows are stored sequentially, with the 18 pixels of each row of the matrix of 3 X 18 pixels stored in consecutive memory locations of the corresponding row to facilitate quicker extraction. When the pixels in the BRAM have been used for convolution and need to be replaced, the memory handler is configured to replace pixels of the first matrix with pixels of a second matrix of T X U pixels of the matrix of P X Q pixels in T sequential rows of the parallel memory such that the U pixels of each of the T rows of the second matrix of T X U pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, where T is equal to or lesser than R, and U is equal to or lesser than S. In other words, the pixels of a corresponding cached row are replaced with the pixels of a corresponding new image row. This operation continues until a defined number of pixels of the image are fetched from the external memory and stored or replaced in the data cache. In some examples, the whole image may be loaded from the external memory into the data cache instead of obtaining the pixels in matrices.
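For illustration only, the Python sketch below mimics this width-driven row arrangement and in-place row replacement; the class and method names are illustrative, not part of the disclosure.

```python
# Sketch of the row-arranged data cache: with the width registry set to n,
# pixel values are loaded n per BRAM row, and later image rows can
# overwrite cached rows in place.

class DataCache:
    def __init__(self, width_registry: int):
        self.width = width_registry        # pixels per BRAM row (registry "width")
        self.rows = []                     # list of BRAM rows

    def load_rows(self, flat_pixels):
        """Store R x S pixels as R sequential rows of S consecutive values."""
        n = self.width
        self.rows = [flat_pixels[i:i + n]
                     for i in range(0, len(flat_pixels), n)]

    def replace_row(self, row_index: int, new_row):
        """Overwrite one cached row with a newly fetched image row."""
        self.rows[row_index] = new_row
```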
In an example implementation, the data cache 418 accesses the external memory using a 64-bit wide data bus with a burst transaction length of 16. As a result, the RCNNP can access 32 image values within 16 clock cycles plus a 20-25 clock cycle waiting time. Therefore, the clock cycles for the whole image of 125 x 125 pixels to load into the BRAM may be calculated as shown in Table 3.
When the pixels of the image data are stored in the data cache, the instruction handler is configured to perform pixel extraction for a convolution operation based on a first set of instructions. The first set of instructions includes processing the image data in the data cache at a defined address to specify a convolutional window having M rows x N columns of pixels of the image data in each memory cycle, where M is equal to or less than P and N is less than or equal to Q. The RCNNP extracts one or more pixels under a defined N column of the M x N pixels in each clock cycle to obtain at least a subset of pixels from M rows X N columns of pixels in T clock cycles, where T is less than or equal to N. The number of pixels may be based on the extraction requirement. In one embodiment, the number of pixels to be extracted may be equal to the number of rows. In another embodiment, the number of pixels to be extracted may be based on a dilation rate. Other embodiments not described here are contemplated herein. Once the pixels are extracted, the convolution unit 416 performs the convolution operation on the extracted pixels. Further, the instruction handler 404 defines an address for the next convolutional window based on the first set of instructions for pixel extraction and performs the extraction step. This process continues according to the instructions defined in the first set of instructions. To elaborate, consider FIG. 7, where a convolution window is defined for a 3 x 3 matrix of pixels. To perform pixel extraction for a convolution operation based on a first set of instructions, the instruction handler 404 processes the image data in the data cache at a defined address, that is, ‘0, 0’, to specify a convolutional window having a matrix of 3 rows X 3 columns of pixels of the image data in each memory cycle. In one example, the RCNNP 400 extracts three pixels in a defined column of the 3 x 3 pixels in each clock cycle to obtain at least a subset of pixels from 3 rows x 3 columns of pixels in 3 clock cycles, as shown in FIG. 7. As illustrated in FIG. 7, pixels V1, V19, and V37, corresponding to the first column 702 of the matrix, are extracted in one clock cycle. In another example, the RCNNP extracts at least one pixel in a defined column of the 3 x 3 pixels in each clock cycle to obtain at least a subset of pixels from 3 rows x 3 columns of pixels in 3 clock cycles as per the dilation rate. Once the pixels corresponding to the first column are obtained, one or more pixels from a second column 704 are obtained in a similar manner. This process continues until the required pixels of the convolution window are extracted. For example, as illustrated in FIG. 7, pixels V2, V20, and V38, corresponding to the second column 704 of the matrix, are extracted in one clock cycle, and pixels V3, V21, and V39, corresponding to a third column 706 of the matrix, are extracted in one clock cycle. Thus, the pixels of the 3 X 3 matrix 708 are obtained in one memory cycle and three clock cycles.
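For illustration only, the following Python sketch models this column-per-clock readout from the row-arranged cache; the pixel labels follow the V1/V19/V37 numbering of FIG. 7, while the function interface is an assumption.

```python
# Sketch of column-per-clock window extraction: one column of the M x N
# window is read out per simulated clock cycle, since each of the M rows
# lives in its own parallel BRAM row.

def extract_window(rows, top, left, m, n):
    """Yield the window's columns, one per simulated clock cycle."""
    for col in range(left, left + n):                       # N columns -> N cycles
        yield [rows[r][col] for r in range(top, top + m)]   # M pixels at once

# Three cached rows of an 18-pixel-wide image; window at (0, 0):
rows = [[f"V{r * 18 + c + 1}" for c in range(18)] for r in range(3)]
cols = list(extract_window(rows, top=0, left=0, m=3, n=3))
assert cols[0] == ["V1", "V19", "V37"]   # first column in one clock cycle
```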
Once the required pixels of the convolution window are extracted, the instruction handler 404 defines the address for the next convolutional window based on the first set of instructions for pixel extraction. This process continues until the required pixels are extracted from the image data as per the CNN requirements. This process significantly saves time and data transfer, as explained below.
Using the clock cycle counts corresponding to direct memory access (DMA) and the data cache 418, the reduced clock cycle count for a single filter calculation using the data cache 418 can be calculated as shown in Table 4, which continues from the data in Table 3:
Table 4: Calculations of clock cycle count on the data cache memory fetching
907,740 - 108,378 = 799,362 clock cycles.
Since a single clock cycle takes 10 ns, the data cache 418 saves 7,993,620 ns on a single filter calculation on a 125 x 125 image. According to the above calculations, the data cache 418 achieves an 88.06% clock cycle reduction rate compared to DMA when loading image values into the convolution unit 416. The data caching as described herein reduces the clock cycles it takes to access convolution memory values by 88.06% while increasing the speed of the RCNNP 400. In other words, the RCNNP 400 can use its resources for performing other operations that otherwise would have been wasted on obtaining redundant pixels. When calculating the output of the convolution layer of the CNN, each filter is multiplied by the image value matrix defined by a convolution window that is fetched from the image.
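For illustration only, the quoted savings can be checked with a few lines of Python using the figures above (10 ns per cycle at the 100 MHz clock described below):

```python
# Verify the savings arithmetic quoted above.
dma_cycles, cached_cycles = 907_740, 108_378
saved = dma_cycles - cached_cycles
assert saved == 799_362                                 # clock cycles saved
assert saved * 10 == 7_993_620                          # ns saved at 10 ns/cycle
assert round(100 * saved / dma_cycles, 2) == 88.06      # percent reduction
```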
In one example, the RCNNP architecture as disclosed herein was implemented on a Xilinx Zynq 7010 Field Programmable Gate Array (FPGA) device. The clock signal was 100 MHz, and as a result a single clock cycle is 10 ns long. The RCNNP used the external RAM to store instructions and data values. In examples, the RCNNP 400 can also be implemented in other processing technologies not described here but contemplated herein.
FIG. 8 illustrates a conventional pixel extraction technique. The technique described in FIG. 8 performs memory fetching of an image having a 6 x 18 pixel matrix from an external memory to an internal memory. In this technique, the internal memory fetches all the pixels of three rows of image values from the external memory each time the position of the sliding window moves one row down. Therefore, the memory architecture fetches redundant rows from the external memory. In this example, if the convolution window is moved two rows down, the data cache fetches a total of nine rows from the external memory instead of fetching two new rows. To elaborate, in the first cycle of data extraction, pixels corresponding to the first three rows are obtained, as shown in 804. When the convolution window is moved to the address of the second row to extract the next three rows, the processor fetches pixels corresponding to row 2 through row 4, as shown in 806. Here, the pixels of row 2 and row 3 were already obtained in the previous cycle, making the pixel extraction corresponding to these rows redundant. In the next cycle, the window moves down again and row 3 and row 4 are obtained once more, as shown in 808, again making the pixel extraction corresponding to these rows redundant. As a result of these redundant steps, a total of nine rows are extracted from the external memory instead of fetching two new rows. Consequently, there are unwanted redundant memory cycles, clock cycles, and redundant pixels obtained, which makes the CNN process inefficient.
FIG. 9 illustrates an application of the disclosed method of memory extraction on an image having a 6 x 18 pixel matrix, from an external memory to the internal memory of the data cache 418. In the current disclosure, the data cache 418 fetches all the pixels of three rows of image values in the first cycle, as shown in 904, from the external memory. In the next cycle, after operating on the extracted pixels, the data cache 418 replaces the pixels of the first row in the data cache 418 with pixels of the fourth row from the external memory, as shown in 906. In subsequent cycles, the second row is similarly replaced with the fifth row, and the third row with the sixth row, as demonstrated in step 908. As a result, the time taken to fetch the redundant rows is eliminated, and there are significant time savings due to the method of the disclosure.
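For illustration only, this rolling replacement on the 6 x 18 example can be sketched in Python as follows; the modular row index is an assumption about how the oldest cached row is selected.

```python
# Sketch of rolling-row replacement: the cache keeps three rows and
# overwrites the oldest row in place of re-fetching the whole window,
# so each image row crosses the external bus exactly once.

image = [[f"r{r}c{c}" for c in range(18)] for r in range(6)]  # 6 x 18 pixels

cache = [image[0], image[1], image[2]]   # first cycle: rows 1-3 (step 904)
for new_row in range(3, 6):              # rows 4, 5, 6 from external memory
    cache[new_row % 3] = image[new_row]  # overwrite the oldest cached row

assert cache == [image[3], image[4], image[5]]  # rows 4-6, no redundant fetch
```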
In some embodiments, the data cache 418 can be configured to support any X x Y pixel matrix. Regardless of the configuration, the RCNNP 400 may take a single memory access time and X clock cycles to fetch the X x Y pixel matrix from the data cache. This is explained in FIG. 10, where columns of pixels are extracted in each clock cycle, totaling X clock cycles and one memory cycle to perform the data extraction.
The disclosure also supports strides of any size. The data cache is configured to save the starting memory address of the pixels in the buffered memory. After the memory fetching finishes, memory accesses in the data cache 418 use absolute memory addresses to fetch the memory matrices. As a result of having absolute memory addresses for the pixels, the data cache 418 can support strides of any size along the vertical and/or horizontal directions.
The RCNNP 400 and the method of the disclosure support different dilation rates. Dilation may refer to the expansion of the input by inserting holes between its consecutive pixel elements. Dilation is also part of convolution, but it involves pixel skipping to cover a larger area of the input. The dilation rate may refer to the spacing of gaps between pixel elements in a feature map on which a convolution filter is applied. The RCNNP 400 is configured to support any dilation rate through the introduction of a dilation rate parameter in the instructions. This parameter can be set and/or varied through the instructions. The RCNNP 400 automatically applies the dilation and outputs convolution matrices according to the dilation rate given by the parameter. FIG. 11 shows a 5 X 5 pixel convolution window 1102 to which a dilation rate of two is applied; the output of the extraction is shown as a 3 X 3 matrix 1104.
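For illustration only, the following Python sketch reproduces the FIG. 11 example, sampling a 5 x 5 window at a dilation rate of two to yield a 3 x 3 matrix; the function interface is an assumption.

```python
# Sketch of dilated window extraction: with dilation rate d, every d-th
# pixel of the window is sampled, so a 5 x 5 window at rate 2 yields a
# 3 x 3 output matrix.

def extract_dilated(window, rate):
    return [row[::rate] for row in window[::rate]]

window = [[r * 5 + c for c in range(5)] for r in range(5)]  # 5 x 5 pixels
out = extract_dilated(window, rate=2)
assert len(out) == 3 and len(out[0]) == 3                   # 3 x 3 result
assert out[0] == [0, 2, 4]                                  # every 2nd pixel
```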
The RCNNP 400 and the method of the disclosure support padding. Padding may refer to the addition of pixels to an image when it is being processed by a kernel. The data cache 418 can be configured to support padding. The RCNNP 400 is configured to support any padding size through the introduction of a padding parameter in the instructions and architecture. The padding can be set and/or varied through the instructions.
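For illustration only, a zero-padding sketch in Python is shown below; the pad size would be supplied by the padding parameter, and the function name and zero fill value are assumptions.

```python
# Sketch of zero padding a P x Q pixel matrix before convolution.

def pad(matrix, size, value=0.0):
    q = len(matrix[0])
    blank = [[value] * (q + 2 * size) for _ in range(size)]   # border rows
    body = [[value] * size + row + [value] * size for row in matrix]
    return blank + body + [r[:] for r in blank]

padded = pad([[1.0, 2.0], [3.0, 4.0]], size=1)
assert len(padded) == 4 and len(padded[0]) == 4   # 2x2 -> 4x4 with zero border
```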
To further improve speed and efficiency, an additional cache is introduced to the architecture of the RCNNP of FIG. 4. The additional cache functions as a general cache between the registries, the pooling unit, and the data cache. FIG. 12 shows an architecture of the modified RCNNP 1200 after the implementation of the additional general cache 1220. The architecture of the modified RCNNP 1200 includes a memory handler 1202, an instruction handler 1204, an arithmetic and logic unit (ALU) 1206, registries 1208, a program counter 1210, an instruction cache 1212, a pooling unit 1214, a convolution unit 1216, a data cache 1218, and the general cache 1220. The functions of the memory handler 1202, the instruction handler 1204, the arithmetic and logic unit (ALU) 1206, the registries 1208, the program counter 1210, the instruction cache 1212, the pooling unit 1214, the convolution unit 1216, and the data cache 1218 are substantially similar to the functions of the corresponding elements of FIG. 4. Thus, their functionalities are not repeated herein for the sake of brevity. The general cache (GC) 1220 is used for storing values from the registries 1208, the convolution unit 1216, and the pooling unit 1214. To support the general cache 1220, additional instructions were added to the ISA to handle this new cache memory. There are two new instructions: one to read a value from the registries 1208 and store it in the GC 1220, and one to read a value from the GC 1220 and store it in the registries 1208. Further instructions are added which can be used to read values from the GC 1220 into the data cache and to store the calculation values of the convolution unit 1216 and the pooling unit 1214 in the GC 1220. With the GC 1220, the time taken to access the external memory may be reduced by storing intermediate calculation values in the GC 1220, with final calculation results communicated to the external memory 1250.
FIG. 13 illustrates a process flow for pixel extraction for performing convolution in RCNNP, according to one or more embodiments.
In step 1302, executing, by the instruction handler 404 of the RCNNP 400, a first set of instructions based on an Instruction Set Architecture (ISA), corresponding to a first CNN architecture for operating the RCNNP.
In step 1304, receiving, by the memory handler 402 of the RCNNP 400, image data comprised of a matrix of P X Q pixels from an external memory, and storing the image data in parallel memories of the data cache 418.
In step 1306, processing, by the instruction handler 404, the image data in the data cache 418 at a defined address to specify a convolutional window having M rows x N columns of pixels of the image data in each memory cycle.
In step 1308, extracting, by the instruction handler 404, at least one pixel in a defined N column of the M x N pixels in each clock cycle to obtain at least a subset of pixels from M rows X N columns of pixels in T clock cycles, where T is less than or equal to N.
In step 1310, defining, by the instruction handler 404, the address for the next convolutional window based on the subset of instructions for pixel extraction, and performing extraction until the image data is obtained; and
In step 1312, performing the convolution operation, by the convolution module, on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
INDUSTRIAL APPLICABILITY
The present disclosure provides a RCNNP that is a robust and efficient processor. The RCNNP is configured to be used for, inter alia, image processing. The RCNNP 400 is supported by an ISA for handling various operations, including processor operating operations, arithmetic and logical calculations, CNN-related functions, and the like. The RCNNP 400 includes a data cache configured for fetching neural network-related memory values in a substantially shorter amount of time. The RCNNP includes an instruction caching technique that reduces the average clock cycles it takes to access instructions in memory, for example, from 23 clock cycles to 4 clock cycles. The reconfigurable hardware architecture of the RCNNP 400 enables implementation of different size CNNs using the same processor. The ISA-based instructions increase the efficiency of the RCNNP 400 by 19.45% for CNN calculations such as convolution, pooling, and fully-connected layers.
While aspects of the present disclosure have been particularly shown and described with reference to the embodiments above, it will be understood by those skilled in the art that various additional embodiments may be contemplated by the modification of the disclosed methods without departing from the spirit and scope of what is disclosed. Such embodiments should be understood to fall within the scope of the present disclosure as determined based upon the claims and any equivalents thereof.
Claims
1. A Reconfigurable Convolution Neural Network Processor (RCNNP) comprising: a data cache comprising parallel memories configured to store image data; a memory handler coupled to the data cache, configured to obtain image data comprised of a matrix of P X Q pixels from an external memory and store the image data in parallel memories of the data cache; an instruction handler, coupled to the memory handler and the data cache, to perform pixel extraction for a convolution operation based on a first set of instructions to: process the image data in the data cache at a defined address to specify a convolutional window having M rows X N columns of pixels of the image data in each memory cycle, wherein M is equal to or less than P and N is less than or equal to Q; extract at least one pixel in a defined N column of the M x N pixels in each clock cycle to obtain at least a subset of pixels from M rows x N columns of pixels in T clock cycles, wherein T is less than or equal to N; define the address for next convolutional window based on the first set of instructions for pixel extraction and perform extraction until the image data is obtained; and a convolution module, coupled to the data cache and the instruction handler, to perform a convolution operation on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
2. The RCNNP of claim 1, wherein the first set of instructions corresponds to a first CNN architecture operating the RCNNP, wherein the first set of instructions is based on an Instruction Set Architecture (ISA).
3. The RCNNP of claim 1, wherein the memory handler is configured to store a first matrix of R X S pixels of the matrix of P X Q pixels in R sequential rows of the parallel memory such that S number of pixels of each R row of the first matrix of R X S pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the
extraction, wherein the R is equal to or lesser than the P, and the S is equal to or lesser than the Q.
4. The RCNNP of claim 3, wherein the memory handler is configured to replace pixels of the first matrix with pixels of a second matrix of T X U pixels of the matrix of P X Q pixels in T sequential rows of the parallel memory such that U number of pixels of each T row of the second matrix of T X U pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, wherein the T is equal to or lesser than R, and the U is equal to or lesser than the S.
5. The RCNNP of claim 1, wherein the instruction handler is configured to reconfigure the RCNNP with a second CNN architecture using a second set of instructions, wherein the second set of instructions is based on the ISA.
6. The RCNNP of claim 1, further comprising an operating system based on the ISA to operate the RCNNP.
7. The RCNNP of claim 1, further comprising an Arithmetic and Logic Unit (ALU) to perform arithmetic and logic operations associated with the first set of instructions.
8. The RCNNP of claim 1, further comprising an instruction cache configured to: store program instructions read from the external memory; and provide relevant program instructions for the RCNNP to operate based on the first set of instructions.
9. The RCNNP of claim 1, further comprising: a program counter module configured to store address of next program instruction to be executed according to the first set of instructions; a pooling unit configured to compute an output of pooling layers according to the first set of instructions; and a registry unit comprising memory locations configured to store data and instructions associated with the first set of instructions.
10. A method of a Reconfigurable Convolution Neural Network Processor (RCNNP) comprising: executing, by an instruction handler of the RCNNP, a first set of instructions based on an Instruction Set Architecture (ISA), corresponding to a first CNN architecture for operating the RCNNP; receiving and storing, by a memory handler of the RCNNP, image data comprised of a matrix of P X Q pixels from an external memory and storing the image data in parallel memories of a data cache; executing, by the instruction handler of the RCNNP, a subset of instructions of the first set of instructions to perform a pixel extraction for a convolution operation, the pixel extraction comprising: processing the image data in the data cache at a defined address to specify a convolutional window having M rows X N columns of pixels of the image data in each memory cycle; extracting at least one pixel in a defined N column of the M x N pixels in each clock cycle to obtain at least a subset of pixels from M rows x N columns of pixels in T clock cycles, wherein T is less than or equal to N; defining the address for the next convolutional window based on the subset of instructions for pixel extraction and performing extraction until the image data is obtained; and performing the convolution operation, by a convolution module, on the extracted pixels associated with each of the convolution windows based on the first set of instructions.
11. The method of claim 10, wherein the storing comprises storing, by the memory handler, a first matrix of R X S pixels of the matrix of P X Q pixels in R sequential rows of the parallel memory such that S number of pixels of each R row of the first matrix of R X
S pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, wherein the R is equal to or lesser than the P, and the S is equal to or lesser than the Q.
12. The method of claim 11, wherein the storing further comprises replacing, by the memory handler, the pixels of the first matrix with pixels of a second matrix of T X U pixels of the matrix of P X Q pixels in T sequential rows of the parallel memory such that U number of pixels of each T row of the second matrix of T X U pixels are stored in consecutive memory locations of a corresponding memory row to facilitate the extraction, wherein the T is equal to or lesser than R, and the U is equal to or lesser than the S.
13. The method of claim 10, further comprising reconfiguring, by the instruction handler, the RCNNP with a second CNN architecture using a second set of instructions, wherein the second set of instructions is based on the ISA.
14. The method of claim 10, further comprising operating the RCNNP using an operating system based on the ISA.
15. The method of claim 10, further comprising: performing, by an Arithmetic and Logic Unit (ALU), arithmetic and logic operations associated with the first set of instructions; storing, by an instruction cache, program instructions read from the external memory; providing, by the instruction cache, relevant program instructions for the RCNNP to operate; storing, by a program counter, addresses of instructions to be executed consecutively according to the first set of instructions; computing, by a pooling unit, an output of pooling layers according to the first set of instructions; and storing, by a registry unit comprising memory locations, data and instructions associated with the first set of instructions.