US20200110635A1 - Data processing apparatus and method - Google Patents
Data processing apparatus and method
- Publication number
- US20200110635A1 (application number US16/698,996)
- Authority
- US
- United States
- Prior art keywords
- frequency
- processor
- neural network
- data
- layer
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/18—Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
- G06F11/186—Passive fault masking when reading multiple copies of the same data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/324—Power saving characterised by the action undertaken by lowering clock frequency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/3296—Power saving characterised by the action undertaken by lowering the supply or operating voltage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1008—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
- G06F11/1044—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices with specific ECC/EDC distribution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1451—Management of the data involved in backup or backup restore by selection of backup contents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1458—Management of the backup or restore process
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3024—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
- G06F11/3062—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations where the monitored property is the power consumption
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0875—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/012—Head tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1608—Error detection by comparing the output signals of redundant hardware
- G06F11/1612—Error detection by comparing the output signals of redundant hardware where the redundant component is persistent storage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1666—Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
- G06F11/167—Error detection by comparing the memory output
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/81—Threshold
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/011—Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/452—Instruction code
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the disclosure relates to a data processing apparatus and method.
- a central processing unit transmits a launching configuration instruction to an instruction memory of a dedicated processor core to launch the dedicated processor core to complete a task, and the whole task continues to be executed until an end instruction is executed.
- Such a task launching manner is called common launching.
- the common launching mode has the following problem: it is difficult to dynamically monitor the execution state of the present task and to schedule the present task.
- the disclosure aims to provide a dynamic voltage frequency scaling (DVFS) method and a DVFS co-processor to solve at least one of the above-mentioned problems.
- a DVFS method which includes: obtaining a processor load and a neural network configuration signal within a time period from $T-t$ to $T$; and predicting a frequency of a processor in a next time period from $T$ to $T+t$, where both $T$ and $t$ are real numbers greater than zero.
- the method may further include predicting a voltage of the processor in the next time period from $T$ to $T+t$ according to the predicted frequency.
- predicting the frequency of the processor in the next time period may include: predicting a frequency of a storage unit and/or a computation unit.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes presetting $m$ frequency scaling ranges for the computation unit, and generating $m+1$ frequency segmentation points $f_0, f_1, \dots, f_m$ in total, where $f_0 < f_1 < \dots < f_m$, $f_0, f_1, \dots, f_m$ are real numbers greater than 0, and $m$ is a positive integer greater than 0.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes presetting $m$ neural network scales, and generating $m+1$ scale division points $n_0, n_1, \dots, n_m$ in total, where $n_0 < n_1 < \dots < n_m$, $n_0, n_1, \dots, n_m$ are positive integers greater than 0, and $m$ is a positive integer greater than 0.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes determining a frequency scaling range of the computation unit according to a scale $n$ of a present processing layer, and if $n_{i-1} < n < n_i$, the frequency scaling range of the computation unit is $f_{i-1} < f < f_i$.
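The coarse range selection described above can be sketched as a lookup over the preset division points. This is a minimal illustration, not the disclosed implementation: the function name `select_frequency_range`, the example values, and the clamping of scales that fall on or outside the division points are all assumptions, since the text only covers scales strictly between two points.

```python
from bisect import bisect_left


def select_frequency_range(scale, scale_points, freq_points):
    """Return (f_{i-1}, f_i) for the index i with n_{i-1} < scale < n_i.

    scale_points holds n_0 < n_1 < ... < n_m and freq_points holds
    f_0 < f_1 < ... < f_m (both of length m + 1).
    """
    if len(scale_points) != len(freq_points) or len(scale_points) < 2:
        raise ValueError("need m + 1 scale points and m + 1 frequency points, m >= 1")
    # bisect_left returns the first index whose division point is >= scale.
    i = bisect_left(scale_points, scale)
    # Assumption: scales on or beyond the outermost points use the nearest range.
    i = min(max(i, 1), len(scale_points) - 1)
    return freq_points[i - 1], freq_points[i]


# Hypothetical division points (layer scale) and frequencies (MHz).
SCALE_POINTS = [1_000, 10_000, 100_000, 1_000_000]
FREQ_POINTS = [200.0, 400.0, 600.0, 800.0]
```

With these hypothetical tables, a layer of scale 50,000 lies between $n_1$ and $n_2$, so the computation unit is confined to the range between $f_1$ and $f_2$.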
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes further narrowing the frequency scaling range of the computation unit according to the type of the present processing layer, and dividing the layer into two types, that is, a compute-intensive layer and a memory access-intensive layer, where the compute-intensive layer may include a convolutional layer, and the memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer.
- if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is $(f_{i-1}+f_i)/2 < f < f_i$; if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is $f_{i-1}/2 < f < (f_{i-1}+f_i)/2$.
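The layer-type narrowing for the computation unit can be sketched as follows. The function name `narrow_compute_range` and the layer-type strings are hypothetical; the bounds follow the text as stated, including the lower bound of half of $f_{i-1}$ for memory access-intensive layers.

```python
COMPUTE_INTENSIVE = {"convolutional"}
MEMORY_ACCESS_INTENSIVE = {"fully_connected", "pooling", "activation"}


def narrow_compute_range(f_lo, f_hi, layer_type):
    """Narrow the computation-unit range (f_lo, f_hi) by layer type.

    f_lo and f_hi play the roles of f_{i-1} and f_i.
    """
    mid = (f_lo + f_hi) / 2
    if layer_type in COMPUTE_INTENSIVE:
        # Compute-bound layers get the upper half of the range.
        return mid, f_hi
    if layer_type in MEMORY_ACCESS_INTENSIVE:
        # Lower bound f_{i-1}/2 is taken from the text as stated.
        return f_lo / 2, mid
    raise ValueError(f"unknown layer type: {layer_type!r}")
```

The design intent appears to be that a layer limited by memory traffic gains little from a fast computation unit, so its clock can sit in the lower part of the range.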
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes: performing fine granularity regulation on the frequency of the computation unit according to the present accuracy of the neural network; when the present accuracy of the neural network is higher than the expected accuracy, decreasing the frequency of the computation unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the frequency of the computation unit.
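A minimal sketch of the fine granularity regulation, assuming a fixed adjustment step and assuming the result is clamped to the previously narrowed range. Neither the step size nor the clamping is specified in the text, and the name `fine_tune_frequency` is hypothetical.

```python
def fine_tune_frequency(freq, accuracy, expected_accuracy, bounds, step=10.0):
    """Nudge freq down when accuracy exceeds the target and up when it falls short.

    bounds is the (low, high) range chosen by the coarser steps; step is a
    hypothetical adjustment size in the same unit as freq.
    """
    low, high = bounds
    if accuracy > expected_accuracy:
        freq -= step  # accuracy to spare: save power
    elif accuracy < expected_accuracy:
        freq += step  # accuracy too low: spend more performance
    # Assumption: keep the result inside the narrowed range.
    return min(max(freq, low), high)
```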
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes: presetting $k$ frequency scaling ranges for the storage unit, and generating $k+1$ frequency segmentation points $F_0, F_1, \dots, F_k$ in total, where $F_0 < F_1 < \dots < F_k$, $F_0, F_1, \dots, F_k$ are real numbers greater than zero, and $k$ is a positive integer greater than zero.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes presetting $k$ neural network scales, and generating $k+1$ scale segmentation points $N_0, N_1, \dots, N_k$ in total, where $N_0 < N_1 < \dots < N_k$, $N_0, N_1, \dots, N_k$ are positive integers greater than zero, and $k$ is a positive integer greater than zero.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes predicting a frequency scaling range of the storage unit according to a scale $N$ of a present processing layer, and if $N_{i-1} < N < N_i$, the frequency scaling range of the storage unit is $F_{i-1} < F < F_i$.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes: further narrowing the frequency scaling range of the storage unit according to the type of the present processing layer, and dividing the layers into two types, that is, a compute-intensive layer and a memory access-intensive layer.
- the compute-intensive layer may include a convolutional layer.
- the memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer.
- if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is $F_{i-1} < F < (F_{i-1}+F_i)/2$, and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is $(F_{i-1}+F_i)/2 < F < F_i$.
- predicting the frequency of the processor in the next time period from $T$ to $T+t$ includes performing the fine granularity regulation on the frequency of the storage unit according to present accuracy of the neural network, when the present accuracy of the neural network is higher than an expected accuracy, decreasing the memory access frequency of the storage unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the memory access frequency of the storage unit.
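The three storage-unit steps (range by layer scale, narrowing by layer type, fine regulation by accuracy) can be combined into one sketch. The function name, layer-type strings, step size, and clamping behavior are assumptions made for illustration; note that the halves are assigned the opposite way round from the computation unit.

```python
from bisect import bisect_left


def predict_storage_frequency(scale, layer_type, accuracy, expected_accuracy,
                              scale_points, freq_points, current, step=5.0):
    """Sketch of the storage-unit prediction for the next time period."""
    # Step 1: coarse range F_{i-1}..F_i from the layer scale N.
    # Assumption: scales on or outside the division points use the nearest range.
    i = min(max(bisect_left(scale_points, scale), 1), len(scale_points) - 1)
    low, high = freq_points[i - 1], freq_points[i]

    # Step 2: narrow by layer type.
    mid = (low + high) / 2
    if layer_type == "convolutional":
        high = mid  # compute-intensive: lower half
    elif layer_type in ("fully_connected", "pooling", "activation"):
        low = mid   # memory access-intensive: upper half
    else:
        raise ValueError(f"unknown layer type: {layer_type!r}")

    # Step 3: fine regulation by accuracy, with a hypothetical fixed step.
    if accuracy > expected_accuracy:
        current -= step
    elif accuracy < expected_accuracy:
        current += step
    return min(max(current, low), high)
```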
- the method may further include: when the frequency is scaled from high to low, decreasing the frequency first, and then decreasing the voltage; when the frequency is scaled from low to high, increasing the voltage first, and then increasing the frequency.
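The ordering rule above keeps the supply voltage sufficient for whatever frequency is currently running. A minimal sketch, with the hypothetical function `transition_steps` returning the actions in the order they would be applied:

```python
def transition_steps(current_freq, target_freq, target_voltage):
    """Return the ordered (action, value) steps for one DVFS transition.

    Scaling down: lower the frequency first, then the voltage.
    Scaling up: raise the voltage first, then the frequency.
    """
    if target_freq < current_freq:
        return [("set_frequency", target_freq), ("set_voltage", target_voltage)]
    if target_freq > current_freq:
        return [("set_voltage", target_voltage), ("set_frequency", target_freq)]
    # Assumption: an unchanged frequency needs no transition.
    return []
```

In a real system the two actions would map onto the clock setting and the power management module of the chip; here they are only labels.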
- the method may further include regulating a clock setting of a chip to scale the frequency of the processor.
- the method may further include regulating a power management module of the chip to scale the voltage supplied to the processor.
- obtaining the neural network configuration signal includes obtaining a present layer type and present layer scale for processing of the neural network and real-time accuracy of the neural network.
- a DVFS co-processor which may include a signal acquisition unit and a performance prediction unit, where
- the signal acquisition unit is configured to acquire a workload of a processor and further configured to acquire a neural network configuration signal; and
- the performance prediction unit is configured to receive the neural network configuration signal and to predict a frequency and a voltage of the processor in a next time period according to a present load of the processor.
- predicting the voltage and the frequency of the processor in the next time period may include predicting a frequency of a storage unit and/or a computation unit.
- the co-processor may further include a frequency scaling unit configured to receive a frequency signal, predicted by the performance prediction unit, of the processor in the next time period and to scale the frequency of the storage unit and/or computation unit in the processor.
- the co-processor may further include a voltage scaling unit configured to receive a voltage signal, predicted by the performance prediction unit, of the processor in the next time period and to scale a voltage of the storage unit and/or computation unit in the processor.
- predicting the frequency of the computation unit in the next time period includes: presetting m frequency scaling ranges for the computation unit, and generating m+1 frequency segmentation points $f_0, f_1, \ldots, f_m$ in total, where $f_0 < f_1 < \ldots < f_m$, $f_0, f_1, \ldots, f_m$ are real numbers greater than zero, and m is a positive integer.
- predicting the frequency of the computation unit in the next time period includes presetting m segments of neural network scales, and generating m+1 scale division points $n_0, n_1, \ldots, n_m$ in total, where $n_0 < n_1 < \ldots < n_m$, $n_0, n_1, \ldots, n_m$ are positive integers, and m is a positive integer.
- predicting the frequency of the computation unit in the next time period may include determining a frequency scaling range of the computation unit according to a range of a scale $n$ of a present processing layer: if $n_{i-1} < n < n_i$, the frequency scaling range of the computation unit is $f_{i-1} < f < f_i$.
- predicting the frequency of the computation unit in the next time period may include further narrowing the frequency scaling range of the computation unit according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is $(f_{i-1}+f_i)/2 < f < f_i$; and if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is $f_{i-1}/2 < f < (f_{i-1}+f_i)/2$.
- predicting the frequency of the computation unit in the next time period may include performing fine granularity regulation on the frequency of the computation unit according to present accuracy of the neural network, when the present accuracy of the neural network is higher than expected accuracy, decreasing the frequency of the computation unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the frequency of the computation unit.
- predicting the frequency of the storage unit in the next time period may include presetting k segments of frequency scaling ranges for the storage unit, and generating k+1 frequency division points $F_0, F_1, \ldots, F_k$ in total, where $F_0 < F_1 < \ldots < F_k$, $F_0, F_1, \ldots, F_k$ are positive integers, and k is a positive integer.
- predicting the frequency of the storage unit in the next time period may include presetting k segments of neural network scales, and generating k+1 scale division points $N_0, N_1, \ldots, N_k$ in total, where $N_0 < N_1 < \ldots < N_k$, $N_0, N_1, \ldots, N_k$ are positive integers, and k is a positive integer.
- predicting the frequency of the storage unit in the next time period may include determining a frequency scaling range of the storage unit according to a range of a scale $N$ of a present processing layer: if $N_{i-1} < N < N_i$, the frequency scaling range of the storage unit is $F_{i-1} < F < F_i$.
- predicting the frequency of the storage unit in the next time period may include further narrowing the frequency scaling range of the storage unit according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is $F_{i-1} < F < (F_{i-1}+F_i)/2$; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is $(F_{i-1}+F_i)/2 < F < F_i$.
- determining the frequency of the storage unit of the processor in the next time period may include performing fine granularity regulation on the frequency of the storage unit according to the present precision of the neural network.
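- The three-stage prediction described above for the computation unit and the storage unit (coarse range by layer scale, narrowing by layer type, fine regulation by accuracy) can be sketched as follows. This is only an illustration under stated assumptions: the layer-type strings, the `step` size, the clamping of the result to the narrowed range, and the treatment of segment boundaries are not specified in the text and are chosen here for the example.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

# Hypothetical labels; the text names these layer kinds but not these strings.
COMPUTE_INTENSIVE = {"convolutional"}
MEMORY_INTENSIVE = {"fully_connected", "pooling", "activation"}


def narrowed_range(scale: float, scale_points: Sequence[float],
                   freq_points: Sequence[float], layer_type: str,
                   unit: str) -> Tuple[float, float]:
    """Return the (low, high) frequency range for unit "computation" or "storage"."""
    if layer_type not in COMPUTE_INTENSIVE | MEMORY_INTENSIVE:
        raise ValueError(f"unknown layer type: {layer_type}")
    # Stage 1: pick segment i with scale_points[i-1] < scale <= scale_points[i].
    # Boundary handling and clamping to the first/last segment are assumptions.
    i = min(max(bisect_left(scale_points, scale), 1), len(scale_points) - 1)
    low, high = freq_points[i - 1], freq_points[i]
    mid = (low + high) / 2
    # Stage 2: narrow the range according to the layer type.
    compute_intensive = layer_type in COMPUTE_INTENSIVE
    if unit == "computation":
        # Lower bound low / 2 for memory access-intensive layers follows the text as written.
        return (mid, high) if compute_intensive else (low / 2, mid)
    if unit == "storage":
        return (low, mid) if compute_intensive else (mid, high)
    raise ValueError(f"unknown unit: {unit}")


def fine_tune(freq: float, freq_range: Tuple[float, float],
              accuracy: float, expected_accuracy: float, step: float) -> float:
    """Stage 3: lower the frequency when accuracy exceeds expectation, raise it otherwise."""
    low, high = freq_range
    if accuracy > expected_accuracy:
        freq -= step
    elif accuracy < expected_accuracy:
        freq += step
    return min(max(freq, low), high)  # clamping is an assumption
```

- For the computation unit, the points $n_i$ and $f_i$ would be passed as `scale_points` and `freq_points`; for the storage unit, the points $N_i$ and $F_i$.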
- obtaining the neural network configuration signal may include obtaining a present layer type and a present layer scale for processing of the neural network and real-time accuracy of the neural network.
- the performance prediction unit may include at least one of: a preceding value method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting a preceding value method; a moving average load method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting a moving average load method; and an exponentially weighted average method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting an exponentially weighted average method.
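- The three prediction methods named above can be illustrated on a history of load samples. This sketch assumes the predictors estimate the load of the next time period, from which a voltage and frequency would then be chosen; the text does not spell out that mapping, and the function names, `window`, and `alpha` are illustrative.

```python
from typing import Sequence


def preceding_value(loads: Sequence[float]) -> float:
    """Preceding value method: the next period is assumed to equal the last one."""
    return loads[-1]


def moving_average(loads: Sequence[float], window: int) -> float:
    """Moving average load method: mean of the most recent `window` samples."""
    recent = loads[-window:]
    return sum(recent) / len(recent)


def exponentially_weighted_average(loads: Sequence[float], alpha: float) -> float:
    """Exponentially weighted average: recent samples weigh more, with factor alpha."""
    estimate = loads[0]
    for sample in loads[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate
```

- With a rising load history, the preceding value method reacts fastest, while the two averaging methods smooth out short spikes at the cost of lagging behind sustained changes.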
- the present disclosure provides a DVFS method and a DVFS co-processor for neural networks.
- the DVFS method acquires a real-time load and power consumption of a processor and simultaneously acquires a topological structure of the neural network, the scale of the neural network, and a precision requirement of the neural network. Then, a voltage prediction and frequency prediction method is adopted to scale the working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced.
- the DVFS method for the neural network is integrated in the DVFS co-processor, and thus the characteristics of topological structure, network scale, precision requirement, and the like of the neural network may be fully mined.
- the signal acquisition unit acquires a system load signal of the neural network processor, a topological structure signal of the neural network, a neural network scale signal, and a neural network precision signal in real time; the performance prediction unit predicts the voltage and the frequency required by the system; the frequency scaling unit scales the working frequency of the neural network processor; and the voltage scaling unit scales the working voltage of the neural network processor. Therefore, the performance of the neural network processor may be reasonably utilized, and the power consumption of the neural network processor may be effectively reduced.
- FIG. 1 is a structure diagram of a data processing device according to the disclosure.
- FIG. 2 is a flowchart of a data processing method according to the disclosure.
- FIG. 3 is a structure diagram of a data processing device according to an example of the disclosure.
- FIG. 4 is a structure diagram of a task configuration information storage unit according to an example of the disclosure.
- FIG. 5 is a flowchart of a data processing method according to an example of the disclosure.
- FIG. 6 and FIG. 7 are structure diagrams of a data processing device according to another example of the disclosure.
- FIG. 8 is a structure diagram of a data processing device according to another example of the disclosure.
- FIG. 9 is a structure diagram of a data cache of a data processing device according to an example of the disclosure.
- FIG. 10 is a flowchart of a data processing method according to another example of the disclosure.
- FIG. 11 is a structure diagram of a neural network operation unit according to an example of the disclosure.
- FIG. 12 is a structure diagram of a neural network operation unit according to another example of the disclosure.
- FIG. 13 is a structure diagram of a neural network operation unit according to another example of the disclosure.
- FIG. 14 is a flowchart of a data redundancy method according to an example of the disclosure.
- FIG. 15 is a structure block diagram of a data redundancy device according to another example of the disclosure.
- FIG. 16 is a neural network processor according to an example of the disclosure.
- FIG. 17 is a flowchart of a DVFS method according to an example of the disclosure.
- FIG. 18 is a flowchart of a DVFS method according to another example of the disclosure.
- FIG. 19 is a schematic block diagram of a DVFS method according to an example of the disclosure.
- FIG. 20 is a schematic diagram of a DVFS co-processor according to an example of the disclosure.
- FIG. 21 is a schematic diagram of a DVFS co-processor according to another example of the disclosure.
- FIG. 22 is a functional module diagram of an information processing device according to an example of the disclosure.
- FIG. 23 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure.
- FIG. 24 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure.
- FIG. 25 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure.
- FIG. 26 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure.
- FIG. 27 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure.
- FIG. 28 is a functional module diagram of a short-bit floating point data conversion unit according to an example of the disclosure.
- FIG. 29 is a functional module diagram of a computation device according to an example of the disclosure.
- FIG. 30 is a functional module diagram of a computation device according to an example of the disclosure.
- FIG. 31 is a functional module diagram of a computation device according to an example of the disclosure.
- FIG. 32 is a functional module diagram of a computation device according to an example of the disclosure.
- FIG. 33 is a schematic diagram of an operation module according to an example of the disclosure.
- the disclosure provides a data processing device which, once task information is configured and input into it, interacts with an external device, for example, a processor core, to automatically execute the tasks of the processor core, thereby implementing self-launching.
- the data processing device may include a self-launching task queue device.
- the self-launching task queue device may include a task configuration information storage unit and a task queue configuration unit.
- the task configuration information storage unit is configured to store configuration information of tasks.
- the configuration information may include, but is not limited to, a start tag and an end tag of a task, a priority of the task, a launching manner for the task, and the like.
- the task queue configuration unit is configured to configure a task queue according to the configuration information of the task configuration information storage unit and complete dynamic task configuration and external communication.
- the self-launching task queue device is configured to cooperate with the external device, receive the configuration information sent by the external device and configure the task queue.
- the external device executes each task according to the configured task queue. Meanwhile, the self-launching task queue device interacts and communicates with the external device.
- a workflow of the self-launching task queue device is illustrated and, as a data processing method of the disclosure, may include the following steps.
- the external device sends a launching parameter to the self-launching task queue device.
- the launching parameter is stored in the task configuration information storage unit of the self-launching task queue device.
- the launching parameter may include launching information and a launching command.
- the launching information may be the abovementioned configuration information.
- the task queue configuration unit configures a task queue according to the launching parameter and sends a configured task queue to the external device.
- the external device executes tasks in the task queue and, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit.
- each time the task configuration information storage unit receives a first end signal, it sends an interrupt signal to the external device, and the external device processes the interrupt signal and sends a second end signal to the task configuration information storage unit.
- S203 and S204 are executed for each task in the task queue until all the tasks in the task queue are completed.
- after receiving the first end signal, the task configuration information storage unit may modify the task configuration information stored therein to implement task scheduling.
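- The signal exchange in the workflow above can be sketched as a small simulation. The class, method, and event names are hypothetical and the tasks are reduced to names; the sketch only shows the order of the first end signal, the interrupt signal, and the second end signal for each task.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SelfLaunchingTaskQueue:
    """Toy model of the task configuration information storage unit plus queue configuration."""
    config: List[Dict[str, str]] = field(default_factory=list)
    events: List[str] = field(default_factory=list)

    def receive_launching_parameter(self, tasks: List[Dict[str, str]]) -> None:
        # The launching parameter is stored in the task configuration information storage unit.
        self.config = list(tasks)

    def configure_queue(self) -> List[str]:
        # The task queue configuration unit builds the queue sent to the external device.
        return [task["name"] for task in self.config]

    def on_first_end_signal(self, task: str) -> str:
        self.events.append(f"first_end:{task}")
        # The stored configuration may be modified here to implement task scheduling.
        self.config = [t for t in self.config if t["name"] != task]
        self.events.append(f"interrupt:{task}")
        return "interrupt"

    def on_second_end_signal(self, task: str) -> None:
        self.events.append(f"second_end:{task}")


def run(tasks: List[Dict[str, str]]) -> List[str]:
    """Play the external device: execute each task and exchange the end signals."""
    queue = SelfLaunchingTaskQueue()
    queue.receive_launching_parameter(tasks)
    for task in queue.configure_queue():
        # (Task execution by the external device would happen here.)
        if queue.on_first_end_signal(task) == "interrupt":
            queue.on_second_end_signal(task)  # sent after the interrupt is processed
    return queue.events
```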
- the data processing device may further include a processor core.
- the processor core cooperates with the self-launching task queue device as the external device.
- the configuration information input into the self-launching task queue device may include a launching mode, priority, and the like of the task.
- the processor core may execute various types of tasks.
- the tasks may be divided into different task queues, for example, a high-priority queue and a low-priority queue, according to properties of the tasks and an application scenario.
- the launching mode of the task may include self-launching and common launching.
- the task configuration information storage unit may include a first storage unit and a second storage unit.
- the tasks are allocated to the first storage unit or the second storage unit according to the configuration information respectively.
- the first storage unit stores a high-priority task queue
- the second storage unit stores a low-priority task queue. For the tasks in the task queues, launching and execution of the tasks may be completed according to the respective launching modes.
- the task queues may also be configured according to task parameters other than priority. The number of task queues is not limited to two, and there may correspondingly be more than two storage units.
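- The allocation of tasks to storage units described above can be sketched as follows. This is a simplified illustration: each task is a hypothetical `(name, key)` pair, where the key is the priority (or any other task parameter), and each distinct key gets its own queue, standing in for one storage unit.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Tuple


def allocate_tasks(tasks: Iterable[Tuple[str, str]]) -> Dict[str, Deque[str]]:
    """Group tasks into one queue per key value, preserving arrival order."""
    storage_units: Dict[str, Deque[str]] = {}
    for name, key in tasks:
        storage_units.setdefault(key, deque()).append(name)
    return storage_units
```

- With the keys "high" and "low", the two resulting queues correspond to the first and second storage units; additional key values yield additional queues.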
- a workflow of the self-launching task queue device of the abovementioned example is illustrated and, as a data processing method of another example of the disclosure, may include the following steps.
- the processor core sends configuration information of a task queue to the self-launching task queue device.
- the task queue configuration unit configures the task queue according to the configuration information and sends a configured task queue to the processor core.
- the first storage unit sends a stored high-priority task queue to the processor core and the second storage unit sends a stored low-priority task queue to the processor core.
- the processor core executes tasks in the task queue and, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit, and task queue configuration is completed.
- each time the task configuration information storage unit receives a first end signal, it sends an interrupt signal to the processor core, and the processor core processes the interrupt signal and sends a second end signal to the task configuration information storage unit to complete self-launching of the task queue.
- multiple external devices may be provided, for example, multiple processor cores.
- the processor cores may be various operation modules, control modules, and the like.
- the processor core of the data processing device may include a control module and a neural network operation module.
- the neural network operation module may include a control unit, a neural network operation unit, and a storage unit.
- the storage unit is configured to store data and instruction for neural network operation.
- the data may include an input neuron, an output neuron, a weight, a score, an error mode judgment result, and the like.
- the instruction may include various operation instructions for addition, multiplication, activation, and the like in the neural network operation.
- the control unit is configured to control operations of the storage unit and the neural network operation unit.
- the neural network operation unit is controlled by the control unit to execute the neural network operation on the data according to an instruction stored in the storage unit.
- the control module is configured to provide configuration information of tasks.
- Each of the control module and the neural network operation module is equivalent to a processor core.
- the control module sends configuration information of task queues to the self-launching task queue device.
- the task queue configuration unit configures the task queues according to the configuration information, stores each task queue in each corresponding storage unit and sends each task queue to the control unit of the neural network operation module.
- the control unit may monitor a configuration of the self-launching task queue device and configure a neural network operation instruction of the storage unit to a correct position, namely inputting an instruction of an external storage module in the storage unit into an instruction storage module. As illustrated in FIG. 7 , the control unit controls the neural network operation unit and the storage unit to execute each task according to the configuration information.
- the neural network operation unit and the storage unit are required to cooperate to complete a task execution process.
- each time the control unit finishes executing a task, it sends a first end signal to the task configuration information storage unit, and task queue configuration is completed.
- each time the self-launching task queue device receives a first end signal from the control unit, it modifies the configuration information and sends an interrupt signal to the control unit.
- the control unit processes the interrupt signal and then sends a second end signal to the self-launching task queue device.
- the control unit is usually required to, after being started, send an instruction fetching instruction to complete the operation of configuring the neural network operation instruction of the storage unit to the correct position. That is, the control unit usually may include an instruction fetching instruction cache module. In the disclosure, the control unit is not required to send any instruction fetching instruction. That is, the instruction fetching instruction cache module of the control unit may be eliminated. Therefore, a structure of the device is simplified, cost is reduced and resources are saved.
- the data processing device of the example may further include a data cache, an instruction cache, and a DMA.
- the storage unit is connected with the instruction cache and the data cache through the DMA.
- the instruction cache is connected with the control unit.
- the data cache is connected with the operation unit.
- the storage unit receives input data and transmits neural network operational data and instruction in the input data to the data cache and the instruction cache through the DMA respectively.
- the data cache is configured to cache the neural network operational data. More specifically, as illustrated in FIG. 9 , the data cache may include an input neuron cache, a weight cache, and an output neuron cache configured to cache input neurons, weights, and output neurons sent by the DMA respectively.
- the data cache may further include a score cache, error mode judgment result cache, and the like configured to cache scores and error mode judgment results and send the data to the operation unit.
- the instruction cache is configured to cache the neural network operation instruction.
- the instructions for addition, multiplication, activation, and the like of the neural network operation are stored in the instruction cache through the DMA.
- the control unit is configured to read the neural network operation instruction from the instruction cache, decode it into an instruction executable for the operation unit and send an executable instruction to the operation unit.
- the neural network operation unit is configured to execute corresponding neural network operation on the neural network operational data according to the executable instruction.
- An intermediate result in a computation process and a final result may be cached in the data cache and are stored in the storage unit through the DMA as output data.
- a workflow of the self-launching task queue device of the abovementioned example is illustrated and, as a data processing method of another example of the disclosure, may include the following steps.
- the control module sends configuration information of task queues to the self-launching task queue device.
- the task queue configuration unit configures the task queues according to the configuration information and sends configured task queues to the neural network operation module.
- the task queue configuration unit configures the task queues according to the configuration information, stores each task queue in each corresponding storage unit thereof and sends each task queue to the control unit of the neural network operation module.
- the control unit monitors a configuration of the self-launching task queue device and controls the neural network operation unit and the storage unit to execute tasks in the task queues according to the configuration information and, each time a task is completed, sends a first end signal to the task configuration information storage unit, and task queue configuration is completed.
- each time the self-launching task queue device receives a first end signal from the control unit, it sends an interrupt signal to the control unit, and the control unit processes the interrupt signal and then sends a second end signal to the self-launching task queue device.
- the operation in S 903 that the control unit controls the neural network operation unit and the storage unit to execute each task according to the configuration information may include the following steps.
- the control unit reads a neural network operation instruction from the storage unit according to the configuration information, and the neural network operation instruction is stored in the instruction cache through the DMA.
- the control unit reads the neural network operation instruction from the instruction cache, decodes it into an instruction executable for the operation unit and sends an executable instruction to the operation unit.
- the neural network operation unit reads neural network operational data from the data cache, executes corresponding neural network operation on the neural network operational data according to the executable instruction and stores a computational result in the data cache and/or the storage unit.
- the neural network operation may include multiplying an input neuron and a weight vector, adding an offset and performing activation to obtain an output neuron.
- the neural network operation unit may include one or more computational components.
- the computational components include, but are not limited to, one or more multipliers, one or more adders, and one or more activation function units.
- the neural network operation unit may include multiple adders and the multiple adders form an adder tree.
- the activation function (active) is, for example but not limited to, sigmoid, tanh, ReLU, or softmax.
- the activation function unit may further implement other nonlinear computation and may execute computation (f) on the input data (in) to obtain the output data (out), and the process is: out = f(in).
- the neural network operation may include, but is not limited to, multiplication computation, addition computation, and activation function computation.
- the multiplication computation refers to multiplying input data 1 and input data 2 to obtain multiplied data.
- the addition computation adds the input data (in 1) step by step through the adder tree, or accumulates the input data (in 1) and then adds the input data (in 2), or adds the input data (in 1) and the input data (in 2), to obtain output data.
- the activation function computation refers to executing computation on the input data through the activation function (active) to obtain the output data.
- the pooling computation may be expressed as out = pool(in), where pool is the pooling operation; the pooling operation may include, but is not limited to, average pooling, maximum pooling, and median pooling; and the input data in is data in the pooling core related to the output data out.
- the one or more of the abovementioned computations may be freely selected for combination in different sequences, thereby implementing computation of various functions.
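- The computations listed above (multiplication, addition through an adder tree, activation, and pooling) can be combined as in the following sketch. It is an illustration only: the function names are chosen for the example, scalar Python floats stand in for hardware data paths, and only ReLU and sigmoid are shown as activation functions.

```python
import math
from statistics import median
from typing import Callable, List, Sequence


def adder_tree(values: Sequence[float]) -> float:
    """Sum values pairwise, level by level, as an adder tree would."""
    level: List[float] = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def output_neuron(inputs: Sequence[float], weights: Sequence[float], offset: float,
                  active: Callable[[float], float] = relu) -> float:
    """Multiply inputs by weights, add through the adder tree, add the offset, activate."""
    products = [x * w for x, w in zip(inputs, weights)]
    return active(adder_tree(products) + offset)


def pool(window: Sequence[float], mode: str = "max") -> float:
    """out = pool(in) over the data in one pooling core."""
    if mode == "max":
        return max(window)
    if mode == "average":
        return sum(window) / len(window)
    if mode == "median":
        return median(window)
    raise ValueError(f"unsupported pooling mode: {mode}")
```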
- the neural network operation unit may include, but is not limited to, multiple PEs and one or more ALUs.
- Each PE may include a multiplier, an adder, a comparator, and a register/register set.
- Each PE may receive data from the PEs in each direction, for example, from PEs in a horizontal direction (for example, the right) and/or a vertical direction (for example, below), and may also transmit data to the PEs in the opposite horizontal direction (for example, the left) and/or the opposite vertical direction (for example, above). Additionally or alternatively, each PE may receive data from the PEs in a diagonal direction and may also transmit data to the diagonal PEs in the opposite horizontal direction.
- Each ALU may complete basic computation such as an activation operation, multiplication, addition, and other nonlinear computation.
- the computation executed by the neural network operation unit may include computation executed by the PEs and computation executed by the ALU.
- the PE multiplies the input data 1 and the input data 2, adds the product to data stored in the register or to data transmitted by the other PEs, writes the result back into the register or a storage part, and simultaneously transmits certain input data or a computational result to the other PEs. Additionally or alternatively, the PE accumulates or compares the input data 1 and the input data 2 or the data stored in the register.
- the ALU completes activation computation or nonlinear computation.
- Out 2 may be written back into the register/register set or the storage part.
- certain input data (in 1 /in 2 ) may be transmitted in the horizontal direction or the vertical direction.
- the computational result (out 2 ) may be transmitted in the horizontal direction or the vertical direction.
- pool is the pooling operation, and the pooling operation may include, but is not limited to, average pooling, maximum pooling, and median pooling.
- the input data in is data in the pooling core related to an output out, and intermediate temporary data may be stored in the register.
- Each ALU is configured to complete basic computation such as an activation operation, multiplication and addition, or nonlinear computation.
- the activation function may be sigmoid, tanh, ReLU, softmax, and the like.
- the neural network operation unit may include a primary processing circuit and multiple secondary processing circuits.
- the operation unit may include a tree module.
- the tree module may include a root port and multiple branch ports.
- the root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively.
- the tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- the tree module may be configured as an n-ary tree structure, where n may be an integer greater than or equal to two. The structure illustrated in FIG. 11 is a binary tree structure; a ternary tree structure may also be used. A specific value of n is not limited in a specific implementation mode of the application.
- the layer number may also be two.
- the secondary processing circuits may be connected to nodes of a layer other than the second-to-last layer and, for example, may be connected to nodes of the last layer illustrated in FIG. 11.
- the neural network operation unit may include a primary processing circuit, multiple secondary processing circuits, and a branch processing circuit.
- the primary processing circuit is specifically configured to allocate a task in the task queue into multiple data blocks and send at least one data block of the multiple data blocks, the weight, and at least one operation instruction of multiple operation instructions to the branch processing circuit.
- the branch processing circuit is configured to forward the data block, the weight, and the operation instructions between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on the received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction, and to send the result of the operation instruction to the control unit.
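- The division of work among the primary, branch, and secondary processing circuits described above can be sketched as follows. Several points are assumptions made for the example and are not stated in the text: the operation is a dot product, the weights are split the same way as the data, the branch processing circuit is modeled as plain forwarding, and the "subsequent processing" in the primary circuit is a sum of the intermediate results.

```python
from typing import List, Sequence


def split_into_blocks(data: Sequence[float], num_blocks: int) -> List[List[float]]:
    """Primary circuit: divide the data into at most num_blocks contiguous blocks."""
    if not data:
        return []
    size = -(-len(data) // num_blocks)  # ceiling division
    return [list(data[i:i + size]) for i in range(0, len(data), size)]


def secondary_compute(block: Sequence[float], weights: Sequence[float]) -> float:
    """Secondary circuit: compute an intermediate result on its data block and weights."""
    return sum(x * w for x, w in zip(block, weights))


def primary_execute(inputs: Sequence[float], weights: Sequence[float],
                    num_secondary: int) -> float:
    """Split, forward through the branch circuit, compute, then combine."""
    data_blocks = split_into_blocks(inputs, num_secondary)
    weight_blocks = split_into_blocks(weights, num_secondary)
    # The branch circuit only forwards blocks down and intermediate results back up.
    intermediates = [secondary_compute(d, w) for d, w in zip(data_blocks, weight_blocks)]
    # Subsequent processing in the primary circuit (assumed here to be a sum).
    return sum(intermediates)
```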
- the neural network operation unit may include a primary processing circuit and multiple secondary processing circuits.
- the multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits.
- the primary processing circuit is connected with k secondary processing circuits of the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an m-th row, and m secondary processing circuits in a first column.
- the k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
- the primary processing circuit is configured to allocate a piece of input data into multiple data blocks and to send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- the k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit.
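- The selection of the k secondary processing circuits in the array can be sketched as follows. This is only an illustration: the function name `border_circuits`, the 0-based (row, column) indexing, the reading of the array as m rows by n columns, and the choice to count a corner circuit once are all assumptions that the text does not fix.

```python
def border_circuits(m: int, n: int) -> set:
    """Return (row, col) positions of the secondary circuits wired to the primary circuit.

    Assumes an m-row by n-column array with 0-based indices. A corner circuit
    that lies in both a row and the first column is counted once (assumption).
    """
    first_row = {(0, c) for c in range(n)}      # n circuits in the first row
    last_row = {(m - 1, c) for c in range(n)}   # n circuits in the m-th row
    first_col = {(r, 0) for r in range(m)}      # m circuits in the first column
    return first_row | last_row | first_col
```

- Under these assumptions, every other secondary processing circuit would receive its data block and operation instruction through one of the returned positions.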
- the neural network operation unit may be replaced with a non-neural network operation unit.
- the non-neural network operation unit is, for example, a universal operation unit.
- Universal computation may include a corresponding universal operation instruction and data, and its computation process is similar to the neural network computation.
- the universal computation may be, for example, scalar arithmetic computation and scalar logical computation.
- the universal operation unit may include, but is not limited to, for example, one or more multipliers and one or more adders, and executes basic computation, for example, addition and multiplication.
- the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit.
- the conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- the activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- the addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- the secondary processing circuit may include a multiplication processing circuit.
- the multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- the secondary processing circuit may further include an accumulation processing circuit.
- the accumulation processing circuit is configured to execute accumulation computation on the product result to obtain the intermediate result.
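- The division of work between the primary and secondary processing circuits can be sketched as follows. This is a minimal software illustration, not the hardware design: the function names, the splitting of a single output neuron's inputs into equal blocks, and the use of `math.tanh` as the activation are assumptions.

```python
import math

def secondary_circuit(data_block, weight_block):
    """Multiply element-wise, then accumulate, giving one intermediate result."""
    products = [x * w for x, w in zip(data_block, weight_block)]  # multiplication circuit
    return sum(products)                                          # accumulation circuit

def primary_circuit(inputs, weights, num_secondary, activation=math.tanh):
    """Split the input into data blocks, dispatch them, then add and activate."""
    size = math.ceil(len(inputs) / num_secondary)  # block size per secondary circuit
    intermediates = [
        secondary_circuit(inputs[i:i + size], weights[i:i + size])
        for i in range(0, len(inputs), size)
    ]
    return activation(sum(intermediates))  # addition circuit, then activation circuit
```

- Here each call to `secondary_circuit` stands in for one secondary processing circuit; in the apparatus these computations would run in parallel rather than in a loop.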
- the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two.
- Another example of the disclosure provides a chip, which may include the data processing device of the abovementioned example.
- Another example of the disclosure provides a chip package structure, which may include the chip of the abovementioned example.
- Another example of the disclosure provides a board card, which may include the chip package structure of the abovementioned example.
- the electronic device may include the board card of the abovementioned example.
- the electronic device may include a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a drive recorder, a navigator, a sensor, a webcam, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a transportation means, a household electrical appliance, and/or a medical device.
- the transportation means may include an airplane, a ship, and/or a vehicle.
- the household electrical appliance may include a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and a range hood.
- the medical device may include a nuclear magnetic resonance spectrometer, a B-ultrasonic scanner, and/or an electrocardiograph.
- All of the units and modules in the disclosure may be hardware structures. Physical implementation of the hardware structures may include, but is not limited to, physical devices, and the physical devices include, but are not limited to, transistors, memristors, and deoxyribonucleic acid (DNA) computers.
- the disclosure provides a data redundancy method.
- Data is divided into multiple importance ranks, and different data redundancy processing is performed for data of different importance ranks. Therefore, a storage capacity overhead and a memory access power consumption overhead are reduced on the basis of ensuring security and reliability of stored data.
- FIG. 14 is a flowchart of a data redundancy method. As illustrated in FIG. 14 , the data redundancy method specifically may include the following steps.
- the importance ranks of the data may be set by comprehensively considering factors such as a size of the data, a magnitude of an absolute value of the data, a type (floating point type and fixed point type) of the data, a read operation frequency of the data, and a write operation frequency of the data.
- bits in the data are divided into important bits and unimportant bits. If the data has x bits in total, of which y bits are important bits and (x−y) bits are unimportant bits, both x and y being positive integers with 0 < y ≤ x, only the y important bits of the data are subsequently processed. Positions of the y important bits may be continuous or discontinuous.
- data redundancy processing may include replica redundancy processing and/or ECC processing. Different processing may be performed according to different importance. For example, when all bits in a piece of data are all important bits, ECC processing may be performed on all the bits of the data. When part of bits in a piece of data are important bits, replica redundancy processing is performed on the important bits of the data.
- Replica redundancy may include implementing redundancy backup in the same storage medium and may also implement redundancy backup in different storage media.
- N data replicas may simultaneously be backed up, where N is a positive integer greater than zero.
- An ECC manner may include CRC and ECC.
- redundancy storage is performed on a control unit, and redundancy storage is not performed on an operation unit.
- redundancy storage is performed on the neural network instruction; redundancy storage is not performed on the parameter; the neural network instruction is configured as the first importance rank; and the neural network parameter is configured as a second importance rank.
- the neural network parameter may include topological structure information, neuron data and weight data. Redundancy storage is performed on data of the first importance rank and redundancy storage is not performed on data of the second importance rank.
- a neural network processor may include a storage unit, a control unit, and an operation unit.
- the storage unit is configured to receive external input data, to store a neuron, weight, and an instruction of a neural network, to send the instruction to the control unit, and to send the neuron and the weight to the operation unit.
- the control unit is configured to receive the instruction sent by the storage unit and decode the instruction to generate control information to control the operation unit.
- the operation unit is configured to receive the weight and the neuron sent by the storage unit, to complete neural network training computation, and to retransmit an output neuron to the storage unit for storage.
- the neural network processor may further include an instruction redundancy processing unit.
- the instruction redundancy processing unit is embedded in the storage unit and the instruction control unit respectively to perform data redundancy processing on the instruction.
- a topological structure of the operation unit is illustrated in FIG. 11 .
- the operation unit may include a primary processing circuit and multiple secondary processing circuits.
- the topological structure illustrated in FIG. 11 is a tree module.
- the tree module may include a root port and multiple branch ports.
- the root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively.
- the tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- the tree module may include a multilayer node structure, a node is a structure with a forwarding function, and the node may have no computation function.
- the operation unit may include a primary processing circuit, multiple secondary processing circuits and a branch processing circuit.
- the primary processing circuit is specifically configured to allocate an input neuron into multiple data blocks and send at least one data block of the multiple data blocks, the weight and at least one operation instruction of multiple operation instructions to the branch processing circuit.
- the branch processing circuit is configured to forward the data block, the weight, and the operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on the received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction and to send the result of the operation instruction to the control unit.
- the operation unit may include a primary processing circuit and multiple secondary processing circuits.
- the multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits.
- the primary processing circuit is connected with k secondary processing circuits of the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an m-th row, and m secondary processing circuits in a first column.
- the k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
- the primary processing circuit is configured to allocate a piece of input data into multiple data blocks and send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- the k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit.
- the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit and an addition processing circuit.
- the conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- the activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- the addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- the secondary processing circuit may include a multiplication processing circuit.
- the multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- the secondary processing circuit may further include an accumulation processing circuit.
- the accumulation processing circuit is configured to execute accumulation computation on the product result to obtain an intermediate result.
- the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two.
- data redundancy is performed on a neural network parameter.
- M importance ranks are determined for the neural network parameter according to a magnitude of an absolute value of the parameter, and the parameter is correspondingly divided into a corresponding importance rank.
- M+1 threshold values are set and are recorded as T_0, T_1, T_2, ..., T_M respectively after being sequenced from large to small.
- when the absolute value D of the neural network parameter meets T_0 > D > T_1, the neural network parameter is assigned to the first importance rank; when the absolute value meets T_1 > D > T_2, the neural network parameter is assigned to the second importance rank, and so on.
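- The threshold comparison can be sketched as follows. The function name `importance_rank` and the handling of values that equal a threshold or fall outside the outermost thresholds (returned as `None`) are assumptions, since the text only states the strict inequalities.

```python
from typing import List, Optional

def importance_rank(value: float, thresholds: List[float]) -> Optional[int]:
    """Return the 1-based rank i such that T_{i-1} > |value| > T_i.

    `thresholds` holds T_0 ... T_M sorted from large to small. Values equal
    to a threshold or outside (T_M, T_0) return None, because the text
    leaves those cases open (assumption).
    """
    d = abs(value)
    for i in range(1, len(thresholds)):
        if thresholds[i - 1] > d > thresholds[i]:
            return i
    return None
```

- For instance, with the made-up thresholds 8.0, 1.0, 0.1, and 0.0 (M = 3), a parameter of −2.5 would fall in the first importance rank and a parameter of 0.05 in the third.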
- a floating point type parameter in parameters of the i-th importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where both x_i and y_i are positive integers and 0 < y_i ≤ x_i.
- a fixed point type parameter in parameters of the i-th importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where both x_i and z_i are positive integers and 0 < z_i ≤ x_i.
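- Extraction of the important bits of a floating point parameter can be sketched as follows. The sketch assumes a 32-bit IEEE 754 value (x_i = 32), treats the exponential part and the base part as one contiguous field after the sign bit, and limits y to 31 so that the sign bit plus y bits fit in the word; the function name is hypothetical.

```python
import struct

def important_bits_float32(value: float, y: int) -> int:
    """Return the sign bit plus the first y exponent/mantissa bits, right-aligned."""
    if not 0 < y <= 31:
        raise ValueError("y must satisfy 0 < y <= 31 for a 32-bit float")
    # Reinterpret the float as its 32-bit pattern (big-endian, IEEE 754 single).
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    keep = 1 + y  # sign bit + y leading bits of the exponent/mantissa field
    return raw >> (32 - keep)
```

- With y = 8 the retained bits are exactly the sign and the 8-bit exponent, so only the low-order mantissa bits are left unprotected.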
- a data backup manner is adopted for data redundancy of important bits in a parameter of the i th importance rank, two replicas are backed up and redundancy storage is not performed on unimportant bits.
- when a read operation is executed on the parameter of the i-th importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the copies are inconsistent, the two copies that agree are taken as the finally read data, and the third, inconsistent copy is corrected at the same time.
- when a write operation is executed on the parameter of the i-th importance rank, the important bits are simultaneously written back to the two backup addresses, so that the raw data and the two backed-up data replicas remain consistent.
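- The read and write behavior with two backup replicas can be sketched as a majority vote over three copies. The class name `ReplicatedBits` and the decision to raise an error when all three copies disagree are assumptions; the text does not describe that case.

```python
from collections import Counter

class ReplicatedBits:
    """Holds the important bits three times: the raw copy plus two backups."""

    def __init__(self, bits: int):
        self.copies = [bits, bits, bits]

    def write(self, bits: int) -> None:
        # Update the raw copy and both backup addresses together.
        self.copies = [bits, bits, bits]

    def read(self) -> int:
        value, votes = Counter(self.copies).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("all three copies disagree; no majority")
        # Overwrite whichever copy disagrees with the majority.
        self.copies = [value, value, value]
        return value
```

- A read therefore both returns the agreed value and repairs a single corrupted copy, which matches the simultaneous modification of the inconsistent replica.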
- data redundancy is performed on a sparse neural network parameter.
- the sparse neural network parameter is divided into two parts, i.e., a nonzero parameter and a nonzero parameter position respectively.
- the nonzero parameter position is configured as the first importance rank; all of its data bits are marked as important bits and a CRC code manner is adopted for redundancy storage.
- when a read operation is executed, the stored CRC code is read and a CRC code of the raw data is calculated; if the two CRC codes are inconsistent, the data is corrected according to the stored CRC code.
- when a write operation is executed, both the raw data and the CRC code are stored.
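- The CRC-based read and write can be sketched as follows. The text does not name a CRC polynomial or a correction procedure, so the use of CRC-32 from Python's `zlib` module and the brute-force repair of a single flipped bit are assumptions made for illustration; the function names are hypothetical.

```python
import zlib

def write_with_crc(data: bytes):
    """Return the raw data together with its CRC-32 code for storage."""
    return data, zlib.crc32(data)

def read_with_crc(data: bytes, stored_crc: int) -> bytes:
    """Return the data, repairing a single flipped bit if the CRC codes differ."""
    if zlib.crc32(data) == stored_crc:
        return data
    for i in range(len(data) * 8):  # try flipping each bit once
        candidate = bytearray(data)
        candidate[i // 8] ^= 1 << (i % 8)
        if zlib.crc32(bytes(candidate)) == stored_crc:
            return bytes(candidate)
    raise ValueError("error not correctable by a single bit flip")
```

- This sketch only recovers from one flipped bit in the data; errors in the stored CRC code itself or multi-bit errors would need a different scheme.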
- An importance rank is set for the nonzero parameter of the neural network according to a magnitude of an absolute value of the parameter, and M−1 importance ranks are sequentially set from the second importance rank.
- M threshold values are set and are recorded as T_1, T_2, ..., T_M respectively after being sequenced from large to small.
- when the absolute value D of the nonzero parameter meets T_1 > D > T_2, the nonzero parameter is assigned to the second importance rank; when the absolute value meets T_2 > D > T_3, the nonzero parameter is assigned to the third importance rank, and so on.
- a floating point type parameter in parameters of the i-th importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where both x_i and y_i are positive integers and 0 < y_i ≤ x_i.
- a fixed point type parameter in parameters of the i-th importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where both x_i and z_i are positive integers and 0 < z_i ≤ x_i.
- a data backup manner is adopted for data redundancy of important bits in a parameter of the i th importance rank. Two replicas are backed up and redundancy storage is not performed on unimportant bits.
- when a read operation is executed on the parameter of the i-th importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the copies are inconsistent, the two copies that agree are taken as the finally read data, and the third, inconsistent copy is corrected at the same time.
- the important bits are simultaneously written back to two backup addresses, and meanwhile, the data in the raw data and the two backed-up data replicas are ensured to be consistent.
- redundancy is performed on data in a diagram computation application (that is, a graph computation application).
- the data in the diagram computation application is divided into two parts, including vertex data and side data (that is, edge data).
- the vertex data in the diagram computation application is configured as the first importance rank. All data bits are marked as important bits and a CRC code manner is adopted for redundancy storage.
- when a read operation is executed, the stored CRC code is read and a CRC code of the raw data is calculated; if the two CRC codes are inconsistent, the data is corrected according to the stored CRC code.
- when a write operation is executed, both the raw data and the CRC code are stored.
- An importance rank is set for the side data in the diagram computation application according to an access frequency of the side data, and M−1 importance ranks are sequentially set from the second importance rank. M threshold values are set and are recorded as T_1, T_2, ..., T_M respectively after being sequenced from large to small.
- when the access frequency F of the side data meets T_{i-1} > F > T_i, the side data is assigned to the i-th importance rank; for example, when T_1 > F > T_2, the side data is assigned to the second importance rank, when T_2 > F > T_3, the side data is assigned to the third importance rank, and so on.
- Floating point type side data in the i-th importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where both x_i and y_i are positive integers and 0 < y_i ≤ x_i.
- Fixed point type side data in the i-th importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where both x_i and z_i are positive integers and 0 < z_i ≤ x_i.
- a data backup manner is adopted for data redundancy of important bits in the side data of the i th importance rank. Two replicas are backed up and redundancy storage is not performed on unimportant bits.
- when a read operation is executed on the side data of the i-th importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the copies are inconsistent, the two copies that agree are taken as the finally read data, and the third, inconsistent copy is corrected at the same time.
- when a write operation is executed on the side data of the i-th importance rank, the important bits are simultaneously written back to the two backup addresses, so that the raw data and the two backed-up data replicas remain consistent.
- FIG. 15 is a structure block diagram of a data redundancy device. As illustrated in FIG. 15 , the data redundancy device 100 may include an importance rank dividing unit 10 , an important bit extraction unit 20 , and a data redundancy processing unit 30 .
- the importance rank dividing unit 10 is configured to divide data into M importance ranks according to importance, M being a positive integer.
- the importance ranks of the data may be set by comprehensively considering factors such as a size of the data, a magnitude of an absolute value of the data, a type (floating point type and fixed point type) of the data, a read operation frequency of the data, and a write operation frequency of the data.
- the important bit extraction unit 20 is configured to extract important bits of each piece of data in each importance rank.
- the important bit extraction unit 20 may recognize data of different importance ranks, divide data bits into important data bits and unimportant data bits and extract important bits of each piece of data of each importance rank.
- the data redundancy processing unit 30 is configured to perform data redundancy processing on the important bits.
- the data redundancy processing unit 30 may include a redundancy storage unit 31 and a read/write control unit 32 .
- the redundancy storage unit 31 may store raw data and perform data redundancy storage on the important bits in the data.
- Data redundancy may be replica backup or ECC.
- N replicas may simultaneously be backed up, where N is a positive integer greater than zero.
- An ECC manner may include, but is not limited to, CRC and ECC.
- the redundancy storage unit 31 may be a hard disk, a dynamic random access memory (DRAM), a static random access memory (SRAM), an ECC-DRAM, an ECC-SRAM, or a nonvolatile memory.
- the read/write control unit 32 may execute a read/write operation on redundant data to ensure data read/write consistency.
- the disclosure further provides a DVFS method for a neural network, which may include that: a real-time load and power consumption of a processor are acquired, and a topological structure of the neural network, a scale of the neural network, and a precision requirement of the neural network are acquired; and then, a voltage prediction and frequency prediction method is adopted to scale a working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced.
- FIG. 17 is a flowchart of a DVFS method according to an example of the disclosure.
- FIG. 19 is a schematic block diagram of a DVFS method according to an example of the disclosure.
- the DVFS method provided by the example of the disclosure may include the following steps.
- a processor load signal and a neural network configuration signal in a present time period T−t~T are acquired.
- a voltage and frequency of a processor in a next time period T~T+t are predicted according to the processor load and the neural network configuration signal in the present time period T−t~T, where T and t are real numbers greater than zero.
- the operation that the processor load signal in the present time period T−t~T is acquired refers to acquiring a workload of the processor in real time.
- the processor may be a dedicated processor for neural network operation.
- the processor may include a storage unit and a computation unit and may also include other functional units.
- the disclosure is not limited thereto.
- the workload of the processor may include a memory access load of the storage unit and a computation load of the computation unit.
- Power consumption of the processor may include memory access power consumption of the storage unit and computation power consumption of the computation unit.
- a topological structure of the computation unit is illustrated in FIG. 11 .
- the computation unit may include a primary processing circuit and multiple secondary processing circuits.
- the topological structure illustrated in FIG. 11 is a tree module.
- the tree module may include a root port and multiple branch ports.
- the root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively.
- the tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- the tree module may include a multilayer node structure, a node is a structure with a forwarding function, and the node may have no computation function.
- the topological structure of the computation unit is illustrated in FIG. 12 .
- the computation unit may include a primary processing circuit, multiple secondary processing circuits, and a branch processing circuit.
- the primary processing circuit is specifically configured to allocate an input neuron into multiple data blocks and send at least one data block of the multiple data blocks, the weight and at least one operation instruction of multiple operation instructions to the branch processing circuit.
- the branch processing circuit is configured to forward the data block, the weight, and the operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on a received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction and to send the result of the operation instruction to the control unit.
- the topological structure of the computation unit is illustrated in FIG. 13 .
- the computation unit may include a primary processing circuit and multiple secondary processing circuits.
- the multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits.
- the primary processing circuit is connected with k secondary processing circuits of the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an m-th row, and m secondary processing circuits in a first column.
- the k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
- the primary processing circuit is configured to allocate a piece of input data into multiple data blocks and to send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- the k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- the multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- the primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit.
- the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit.
- the conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- the activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- the addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- the secondary processing circuit may include a multiplication processing circuit.
- the multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- the secondary processing circuit may further include an accumulation processing circuit.
- the accumulation processing circuit is configured to execute accumulation computation on the product result to obtain the intermediate result.
- the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two.
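The division of work between the secondary and primary processing circuits described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed circuit: the function names, the even split into k data blocks, the use of summation as the combining step, and the ReLU activation are all hypothetical choices made for the example.

```python
from typing import Callable, List, Sequence


def secondary_circuit(data_block: Sequence[float], weight_block: Sequence[float]) -> float:
    """Multiplication circuit followed by accumulation circuit."""
    products = [x * w for x, w in zip(data_block, weight_block)]  # product results
    return sum(products)  # accumulated intermediate result


def relu(value: float) -> float:
    """Example activation (assumption; the text does not fix the function)."""
    return max(0.0, value)


def primary_circuit(intermediates: List[float],
                    activation: Callable[[float], float] = relu) -> float:
    """Addition circuit followed by activation circuit."""
    return activation(sum(intermediates))


def run_operation(data: Sequence[float], weights: Sequence[float], k: int) -> float:
    """Split non-empty inputs into at most k blocks, one per secondary circuit."""
    size = -(-len(data) // k)  # ceiling division
    blocks = [(data[i:i + size], weights[i:i + size])
              for i in range(0, len(data), size)]
    intermediates = [secondary_circuit(d, w) for d, w in blocks]
    return primary_circuit(intermediates)
```

Calling `run_operation([1, 2, 3, 4], [1, 1, 1, 1], 2)` sends two blocks to two secondary circuits and combines their intermediate results in the primary circuit.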
- the neural network configuration signal may include the type of a neural network layer presently processed by the processor, a scale of a parameter of the present layer, and real-time accuracy of the neural network.
- the frequency in the operation in S1702 that the voltage and the frequency of the processor in the next time period T~T+t are predicted may include: a frequency of the storage unit and/or the computation unit.
- estimation, computation, induction, or a similar manner may be adopted for the prediction.
- the operation that the frequency of the computation unit is predicted may include that: m segments of frequency scaling ranges are preset for the computation unit, generating m+1 frequency division points $f_0, f_1, \dots, f_m$ in total, where $f_0 < f_1 < \dots < f_m$, $f_0, f_1, \dots, f_m$ are real numbers greater than zero, and m is a positive integer.
- the operation that the frequency of the computation unit is predicted may include that: m segments of neural network scales are preset, generating m+1 scale division points $n_0, n_1, \dots, n_m$ in total, where $n_0 < n_1 < \dots < n_m$, $n_0, n_1, \dots, n_m$ are positive integers, and m is a positive integer.
- the operation that the frequency of the computation unit is predicted may include that: a frequency scaling range of the computation unit is determined according to a range of a scale n of a present processing layer, and if $n_{i-1} < n < n_i$, the frequency scaling range of the computation unit is $f_{i-1} < f < f_i$.
- the operation that the frequency of the computation unit is predicted may include the following steps.
- the frequency scaling range of the computation unit is further narrowed according to the type of the present processing layer, where layers are divided into two types, that is, a compute-intensive layer and a memory access-intensive layer.
- the compute-intensive layer may include a convolutional layer, and the memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer.
- if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is $(f_{i-1}+f_i)/2 < f < f_i$.
- if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is $f_{i-1}/2 < f < (f_{i-1}+f_i)/2$.
- the operation that the frequency of the computation unit is predicted may include that: fine granularity regulation is performed on the frequency f of the computation unit according to the present accuracy of the neural network.
- the operation that the frequency of the computation unit is determined may include that: when the present accuracy of the neural network is higher than expected accuracy, the frequency of the computation unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the frequency of the computation unit is increased.
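The three stages described above for the computation unit (coarse range from the layer scale, narrowing by layer type, fine regulation by accuracy) can be sketched as follows. This is only an illustrative reading: the function names, the layer-type labels, the fixed adjustment `step`, and the clamping of the result to the narrowed range are assumptions that do not appear in the description.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

# Layer-type groups follow the description; the string labels are hypothetical.
COMPUTE_INTENSIVE = {"conv"}
MEMORY_INTENSIVE = {"fc", "pool", "activation"}


def coarse_range(n: int, scale_points: Sequence[int],
                 freq_points: Sequence[float]) -> Tuple[float, float]:
    """Return (f_{i-1}, f_i) for the index i with n_{i-1} < n < n_i."""
    i = bisect_left(scale_points, n)
    if i == 0 or i == len(scale_points) or scale_points[i] == n:
        raise ValueError("scale n must lie strictly between two division points")
    return freq_points[i - 1], freq_points[i]


def narrow_range(low: float, high: float, layer_type: str) -> Tuple[float, float]:
    """Narrow the coarse range by layer type, using the bounds given in the description."""
    middle = (low + high) / 2
    if layer_type in COMPUTE_INTENSIVE:
        return middle, high
    if layer_type in MEMORY_INTENSIVE:
        return low / 2, middle
    raise ValueError(f"unknown layer type: {layer_type}")


def fine_adjust(freq: float, low: float, high: float,
                accuracy: float, expected: float, step: float) -> float:
    """Lower the frequency when accuracy exceeds the target, raise it otherwise."""
    if accuracy > expected:
        freq -= step
    elif accuracy < expected:
        freq += step
    return min(max(freq, low), high)  # clamping is an assumption of this sketch
```

For example, with scale division points `[0, 100, 1000]` and frequency division points `[200.0, 400.0, 800.0]`, a convolutional layer of scale 500 is first assigned the range 400 to 800 and then the upper half of that range.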
- the operation that the frequency of the storage unit is determined may include that: k segments of frequency scaling ranges are preset for the storage unit, generating k+1 frequency division points $F_0, F_1, \dots, F_k$ in total, where $F_0 < F_1 < \dots < F_k$, $F_0, F_1, \dots, F_k$ are positive integers, and k is a positive integer; and
- k segments of neural network scales are preset, generating k+1 scale division points $N_0, N_1, \dots, N_k$ in total, where $N_0 < N_1 < \dots < N_k$, $N_0, N_1, \dots, N_k$ are positive integers, and k is a positive integer.
- the operation that the frequency of the storage unit is determined may include that: a frequency scaling range of the storage unit is determined according to a range of a scale N of a present processing layer, and if $N_{i-1} < N < N_i$, the frequency scaling range of the storage unit is $F_{i-1} < F < F_i$.
- the operation that the frequency of the storage unit is predicted may include that: the frequency scaling range of the storage unit is further narrowed according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is $F_{i-1} < F < (F_{i-1}+F_i)/2$; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is $(F_{i-1}+F_i)/2 < F < F_i$.
- the operation that the frequency of the storage unit is predicted may include that: fine granularity regulation is performed on the frequency of the storage unit according to the present accuracy of the neural network, and the frequency of the storage unit in the next time period is predicted.
- the operation that the frequency of the storage unit is determined may include that: when the present accuracy of the neural network is higher than expected accuracy, the memory access frequency of the storage unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the memory access frequency of the storage unit is increased.
- the operation in S1702 that the voltage and the frequency of the processor in the next time period T~T+t are predicted may further include that: a prediction method is adopted to predict the voltage and the frequency of the processor in the next time period.
- the prediction method may include a preceding value method, a moving average load method, an exponentially weighted average method, and/or a minimum average method.
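Three of the prediction methods named above can be sketched on a history of observed loads. The description names the methods without defining them, so the formulas below are common textbook readings and should be treated as assumptions; `window` and `alpha` are hypothetical parameters, and the minimum average method is omitted because the description gives no detail from which to reconstruct it.

```python
from typing import Sequence


def preceding_value(history: Sequence[float]) -> float:
    """Predict that the next period repeats the most recent observation."""
    return history[-1]


def moving_average(history: Sequence[float], window: int) -> float:
    """Predict the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exponentially_weighted_average(history: Sequence[float], alpha: float) -> float:
    """Blend each new observation into a running estimate with weight `alpha`."""
    estimate = history[0]
    for load in history[1:]:
        estimate = alpha * load + (1 - alpha) * estimate
    return estimate
```

The predicted load would then be mapped to a voltage and frequency setting; that mapping is outside the scope of this sketch.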
- FIG. 18 is a flowchart of a DVFS method according to another example of the disclosure.
- S1801-S1802 are the same as S1701-S1702.
- the difference is that the method may further include S1803 and S1804.
- the method may further include S1803: a clock setting of a chip is regulated according to the predicted frequency in the next time period to scale the frequency of the processor.
- the method may further include S1804: a power management module of the chip is regulated according to the predicted frequency in the next time period, to scale the voltage supplied to the processor.
- FIG. 20 is a schematic diagram of a DVFS co-processor according to an example of the disclosure.
- a DVFS co-processor is provided, which may include a signal acquisition unit and a performance prediction unit.
- the signal acquisition unit is configured to acquire a workload of a processor, and is further configured to acquire a neural network configuration signal.
- the performance prediction unit is configured to receive the neural network configuration signal and predict a frequency and voltage of the processor in a next time period according to a present load and power consumption of the processor.
- the signal acquisition unit may acquire a signal related to the load and the power consumption of the processor and the neural network configuration signal, and transmit these signals to the performance prediction unit.
- the signal acquisition unit may acquire workloads of the computation unit and the storage unit in the neural network processor, and acquire a present layer type and a present layer scale for processing of a neural network and real-time accuracy of the neural network, and transmit these signals to the performance prediction unit.
- the performance prediction unit may receive the signals acquired by the signal acquisition unit, predict performance required by the processor in the next time period according to a present system load condition and the neural network configuration signal and output a signal for scaling the frequency and the voltage.
- the frequency in the operation that the voltage and the frequency of the processor in the next time period are predicted in the performance prediction unit may include: a frequency of the storage unit and/or the computation unit.
- the DVFS co-processor may further include a frequency scaling unit configured to receive a frequency signal, determined by the performance prediction unit, of the processor in the next time period and scale the frequency of the storage unit and/or computation unit in the processor.
- the DVFS co-processor may further include a voltage scaling unit configured to receive a voltage signal, predicted by the performance prediction unit, of the processor in the next time period and scale a voltage of the storage unit and/or computation unit in the processor.
- the performance prediction unit is connected with the signal acquisition unit, the voltage scaling unit, and the frequency scaling unit.
- the performance prediction unit receives the type of the layer presently processed by the processor and the scale of the present layer, performs coarse granularity prediction on a frequency range, then finely predicts the voltage and the frequency of the processor according to the present load and power consumption of the processor and the real-time accuracy of the neural network and finally outputs the signal for scaling the frequency and scaling the voltage.
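The connection between the units of the DVFS co-processor can be sketched as a small pipeline. All class and field names here are hypothetical, and the prediction rule (frequency and voltage both interpolated linearly from the present load) is a placeholder assumption that stands in for the coarse and fine prediction described above; only the wiring between the units is taken from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signals:
    load: float        # present workload of the processor, assumed in [0, 1]
    layer_type: str    # neural network configuration signal: layer type
    layer_scale: int   # neural network configuration signal: layer scale
    accuracy: float    # real-time accuracy of the neural network


@dataclass
class Processor:
    frequency: float
    voltage: float


class PerformancePredictionUnit:
    def __init__(self, f_range: Tuple[float, float], v_range: Tuple[float, float]):
        self.f_min, self.f_max = f_range
        self.v_min, self.v_max = v_range

    def predict(self, s: Signals) -> Tuple[float, float]:
        # Placeholder rule (assumption): demand follows the present load.
        fraction = min(max(s.load, 0.0), 1.0)
        frequency = self.f_min + fraction * (self.f_max - self.f_min)
        voltage = self.v_min + fraction * (self.v_max - self.v_min)
        return frequency, voltage


class FrequencyScalingUnit:
    def apply(self, processor: Processor, frequency: float) -> None:
        processor.frequency = frequency


class VoltageScalingUnit:
    def apply(self, processor: Processor, voltage: float) -> None:
        processor.voltage = voltage


def dvfs_step(processor: Processor, signals: Signals,
              predictor: PerformancePredictionUnit) -> None:
    """Acquired signals -> prediction -> frequency and voltage scaling."""
    frequency, voltage = predictor.predict(signals)
    FrequencyScalingUnit().apply(processor, frequency)
    VoltageScalingUnit().apply(processor, voltage)
```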
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: m segments of frequency scaling ranges are preset for the computation unit, generating m+1 frequency division points $f_0, f_1, \dots, f_m$ in total, where $f_0 < f_1 < \dots < f_m$, $f_0, f_1, \dots, f_m$ are real numbers greater than zero, and m is a positive integer.
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: m segments of neural network scales are preset, generating m+1 scale division points $n_0, n_1, \dots, n_m$ in total, where $n_0 < n_1 < \dots < n_m$, $n_0, n_1, \dots, n_m$ are positive integers, and m is a positive integer.
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: a frequency scaling range of the computation unit is determined according to a range of a scale n of a present processing layer, and if $n_{i-1} < n < n_i$, the frequency scaling range of the computation unit is $f_{i-1} < f < f_i$.
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: the frequency scaling range of the computation unit is further narrowed according to the type of the present processing layer, where layers are divided into two types, including a compute-intensive layer, i.e., a convolutional layer, and a memory access-intensive layer, i.e., a fully connected layer and/or a pooling layer; if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is $(f_{i-1}+f_i)/2 < f < f_i$; and if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is $f_{i-1}/2 < f < (f_{i-1}+f_i)/2$.
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: fine granularity regulation is performed on the frequency of the computation unit according to the present accuracy of the neural network.
- the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: when the present accuracy of the neural network is higher than expected accuracy, the frequency of the computation unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the frequency of the computation unit is increased.
- the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: k segments of frequency scaling ranges are preset for the storage unit, generating k+1 frequency division points $F_0, F_1, \dots, F_k$ in total, where $F_0 < F_1 < \dots < F_k$, $F_0, F_1, \dots, F_k$ are positive integers, and k is a positive integer.
- the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: k segments of neural network scales are preset, generating k+1 scale division points $N_0, N_1, \dots, N_k$ in total, where $N_0 < N_1 < \dots < N_k$, $N_0, N_1, \dots, N_k$ are positive integers, and k is a positive integer.
- the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: a frequency scaling range of the storage unit is determined according to a range of a scale N of the present processing layer, and if $N_{i-1} < N < N_i$, the frequency scaling range of the storage unit is $F_{i-1} < F < F_i$.
- the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: the frequency scaling range of the storage unit is further narrowed according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is $F_{i-1} < F < (F_{i-1}+F_i)/2$; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is $(F_{i-1}+F_i)/2 < F < F_i$.
- the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that:
- fine granularity regulation is performed on the frequency of the storage unit according to a present utilization rate and power consumption of the processor and the present accuracy of the neural network.
- the operation that the neural network configuration signal is acquired in the signal acquisition unit may include that: the present layer type and the present layer scale for processing of the neural network and the real-time accuracy of the neural network are acquired.
- the performance prediction unit may include at least one of: a preceding value method-based prediction unit, adopting a preceding value method to predict the voltage and the frequency of the processor in the next time period; a moving average load method-based prediction unit, adopting a moving average load method to predict the voltage and the frequency of the processor in the next time period; an exponentially weighted average method-based prediction unit, adopting an exponentially weighted average method to predict the voltage and the frequency of the processor in the next time period; and a minimum average method-based prediction unit, adopting a minimum average method to predict the voltage and the frequency of the processor in the next time period.
- the disclosure provides the DVFS method and DVFS co-processor for the neural network.
- in the DVFS method, the real-time load and power consumption of the processor are acquired, and the topological structure of the neural network, the scale of the neural network, and the precision requirement of the neural network are acquired; and then, a voltage prediction and frequency prediction method is adopted to scale the working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced.
- a DVFS algorithm for the neural network is integrated in the DVFS co-processor, and thus the characteristics of topological structure, network scale, precision requirement, and the like of the neural network may be fully mined.
- the signal acquisition unit acquires a system load signal and a topological structure signal of the neural network, a neural network scale signal, and a neural network precision signal in real time; the performance prediction unit predicts the voltage and the frequency required by the system; the frequency scaling unit scales the working frequency of the neural network processor; and the voltage scaling unit scales the working voltage of the neural network processor. Therefore, the performance of the neural network processor is reasonably utilized, and the power consumption of the neural network processor is effectively reduced.
- FIG. 22 is a functional module diagram of an information processing device according to an example of the disclosure.
- the information processing device may include a storage unit and a data processing unit.
- the storage unit is configured to receive and store input data, an instruction, and output data.
- the input data may include one or more images.
- the data processing unit performs extraction and computational processing on a key feature included in the input data and generates a multidimensional vector for each image according to a computational processing result.
- the key feature may include a facial action and expression, a key point position, and the like in the image.
- a specific form is a feature map (FM) in a neural network.
- the image may include a static picture, pictures forming a video, a video, or the like.
- the static picture, the pictures forming the video, or the video may include images of one or more parts of a face.
- the one or more parts of the face include facial muscles, lips, eyes, eyebrows, nose, forehead, ears, and combinations thereof.
- Each element of the vector represents an emotion on the face, for example, anger, delight, pain, depression, sleepiness, and doubt.
- the storage unit is further configured to, after tagging an n-dimensional vector, output the n-dimensional vector, namely outputting the n-dimensional vector obtained by computation.
- the information processing device may further include a conversion module configured to convert the n-dimensional vector into a corresponding output.
- the output may be a control instruction, data (0, 1 output), a tag (happiness, depression, and the like), or picture output.
- the control instruction may be single click, double click, and dragging of a mouse, single touch, multi-touch, and sliding of a touch screen, turning on and turning off of a switch, and a shortcut key.
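The conversion of the n-dimensional vector into a tag or a 0/1 data output can be sketched as follows. The emotion labels are the examples given in the description; the selection of the largest element for the tag and the `threshold` value for the binary output are assumptions of this sketch, since the description does not specify how the conversion module chooses its output.

```python
from typing import List, Sequence

# Example emotions taken from the description; their order is hypothetical.
EMOTIONS = ["anger", "delight", "pain", "depression", "sleepiness", "doubt"]


def to_tag(vector: Sequence[float], labels: Sequence[str] = EMOTIONS) -> str:
    """Return the label of the strongest element (assumed conversion rule)."""
    index = max(range(len(vector)), key=vector.__getitem__)
    return labels[index]


def to_binary(vector: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return a 0/1 output per element using a hypothetical threshold."""
    return [1 if value >= threshold else 0 for value in vector]
```

A control instruction could be produced in the same way by mapping each tag to an action such as a single click or a switch toggle.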
- the information processing device is configured for adaptive training.
- the storage unit is configured to input n images, each image including a tag, each image corresponding to a vector (real emotion vector) and n being a positive integer greater than or equal to one.
- the data processing unit takes calibrated data as an input, calculates an output emotion vector, i.e., a predicted emotion vector, in a format the same as the input, compares the output emotion vector with the real emotion vector and updates a parameter of the device according to a comparison result.
- the emotion vector may include n elements.
- a value of each element of the emotion vector may include the following conditions.
- each element of the emotion vector may be a number between zero and one (representing a probability of appearance of a certain emotion).
- each element of the emotion vector may also be any number greater than or equal to zero (representing an intensity of a certain emotion).
- a value of only one element of the emotion vector is one and values of the other elements are zero. Under this condition, the emotion vector may only represent a strongest emotion.
- the predicted emotion vector may be compared with the real emotion vector in manners such as calculating a Euclidean distance and calculating an absolute value of the dot product of the predicted emotion vector and the real emotion vector.
- for example, when n is three, the predicted emotion vector is $[a_1, a_2, a_3]$ and the real emotion vector is $[b_1, b_2, b_3]$; the Euclidean distance of the two is $[(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2]^{1/2}$, and the absolute value of the dot product of the two is $|a_1 b_1 + a_2 b_2 + a_3 b_3|$.
- the comparison manners are not limited to calculating the Euclidean distance and calculating the absolute value of the dot product, and other methods may also be adopted.
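The two comparison manners named above can be written directly from their definitions. The function names are hypothetical; the vectors are plain Python sequences of equal length.

```python
import math
from typing import Sequence


def euclidean_distance(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Square root of the summed squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, real)))


def abs_dot_product(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Absolute value of the dot product of the two vectors."""
    return abs(sum(a * b for a, b in zip(predicted, real)))
```

A small Euclidean distance indicates that the predicted emotion vector is close to the real one, whereas the dot product grows when the two vectors point in the same direction, so the two measures are read in opposite senses.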
- the information processing device is an artificial neural network chip.
- the operation that the parameter of the device is updated may include that: a parameter (weight, offset, and the like) of the neural network is adaptively updated.
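One way to picture the adaptive update of weights and offsets is a single gradient step on the squared Euclidean distance between the predicted and real emotion vectors. The description does not state the update rule, so the single linear layer, the loss, and the learning rate `lr` below are all assumptions chosen to keep the sketch short.

```python
from typing import List, Sequence


def predict(weights: List[List[float]], offsets: List[float],
            features: Sequence[float]) -> List[float]:
    """Single linear layer: one output element per emotion."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, offsets)]


def update(weights: List[List[float]], offsets: List[float],
           features: Sequence[float], real: Sequence[float], lr: float) -> float:
    """Compare prediction with the real vector, update in place, return the loss."""
    predicted = predict(weights, offsets, features)
    errors = [p - r for p, r in zip(predicted, real)]  # comparison result
    for i, err in enumerate(errors):
        for j, x in enumerate(features):
            weights[i][j] -= lr * 2 * err * x  # gradient of err^2 w.r.t. weight
        offsets[i] -= lr * 2 * err             # gradient of err^2 w.r.t. offset
    return sum(e * e for e in errors)          # squared Euclidean distance before the step
```

Repeating `update` on the same tagged sample with a small enough `lr` makes the returned loss shrink from one call to the next.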
- the storage unit of the artificial neural network chip is configured to store the data and the instruction.
- the data may include an input neuron, an output neuron, a weight, the image, the vector, and the like.
- the data processing unit of the artificial neural network chip may include an operation unit configured to execute corresponding computation on the data according to an instruction stored in the storage unit.
- the operation unit may be a scalar computation unit configured to complete a scalar multiplication, a scalar addition, or a scalar multiplication and addition operation, or a vector computation unit configured to complete a vector multiplication, vector addition or vector dot product operation, or a hybrid computation unit configured to complete a matrix multiplication and addition operation, a vector dot product computation, and nonlinear computation, or convolutional computation.
- the computation executed by the operation unit may include neural network operation.
- a structure of the operation unit is illustrated in FIGS. 11 to 13.
- a specific connecting relationship refers to the descriptions mentioned above and will not be elaborated herein.
- the operation unit may include, but is not limited to: a first part including a multiplier, a second part including one or more adders (more specifically, the adders of the second part form an adder tree), a third part including an activation function unit, and/or a fourth part including a vector processing unit. More specifically, the vector processing unit may process vector computation and/or pooling computation.
- the second part adds the input data in1 through the adders to obtain output data (out).
- pool is the pooling operation, which may include, but is not limited to, average pooling, maximum pooling, and median pooling; the input data in is the data in the pooling core related to the output out.
- the computation of one or more parts of the abovementioned parts may be freely selected for combination in different sequences, thereby implementing computation of various functions.
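The parts of the operation unit and their free combination can be sketched as small functions. The names are hypothetical, the activation shown is ReLU only as an example, and the pairwise reduction in `adder_tree` is one possible reading of "the adders of the second part form an adder tree".

```python
import statistics
from typing import List, Sequence


def multiply(in1: Sequence[float], in2: Sequence[float]) -> List[float]:
    """First part: element-wise multiplier."""
    return [a * b for a, b in zip(in1, in2)]


def adder_tree(values: Sequence[float]) -> float:
    """Second part: add a non-empty input pairwise, level by level."""
    level = list(values)
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired element to the next level
            paired.append(level[-1])
        level = paired
    return level[0]


def activate(value: float) -> float:
    """Third part: activation function unit (ReLU chosen as an example)."""
    return max(0.0, value)


def pool(window: Sequence[float], mode: str) -> float:
    """Fourth part: pooling over the data in one pooling core."""
    if mode == "average":
        return statistics.mean(window)
    if mode == "max":
        return max(window)
    if mode == "median":
        return statistics.median(window)
    raise ValueError(f"unsupported pooling mode: {mode}")


def neuron(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """One combination of the parts: multiply, then add, then activate."""
    return activate(adder_tree(multiply(inputs, weights)))
```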
- the artificial neural network chip may further include a control unit, an instruction cache unit, a weight cache unit, an input neuron cache unit, an output neuron cache unit, and a DMA.
- the control unit is configured to read an instruction from the instruction cache, decode the instruction into an operation unit instruction and input the operation unit instruction to the operation unit.
- the instruction cache unit is configured to store the instruction.
- the weight cache unit is configured to cache weight data.
- the input neuron cache unit is configured to cache an input neuron input to the operation unit.
- the output neuron cache unit is configured to cache an output neuron output by the operation unit.
- the DMA is configured to read/write data or instructions in the storage unit, the instruction cache, the weight cache, the input neuron cache, and the output neuron cache.
- the artificial neural network chip may further include a conversion unit, connected with the storage unit and configured to receive first output data (data of a final output neuron) and convert the first output data into second output data.
- the neural network has a requirement on a format of an input picture, for example, a length, a width, and a color channel.
- the artificial neural network chip may further include a preprocessing unit configured to preprocess original input data, i.e., one or more images, to obtain image data consistent with an input layer scale of a bottom layer of an artificial neural network adopted by the chip to meet the requirement of a preset parameter and data format of the neural network.
- Preprocessing may include segmentation, Gaussian filtering, binarization, regularization, normalization, and the like.
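Two of the preprocessing steps listed above, normalization and binarization, can be sketched on a grayscale image stored as a list of rows. The min-max form of normalization and the `threshold` parameter are assumptions; the description names the steps without fixing their formulas.

```python
from typing import List

Image = List[List[float]]


def normalize(image: Image) -> Image:
    """Min-max normalization of pixel values to [0, 1] (assumed variant)."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1  # avoid division by zero on a constant image
    return [[(pixel - low) / span for pixel in row] for row in image]


def binarize(image: Image, threshold: float) -> List[List[int]]:
    """Map each pixel to 1 if it reaches the threshold, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]
```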
- the preprocessing unit may exist independently of the chip. That is, an information processing device may be configured to include a preprocessing unit and a chip, where the preprocessing unit and the chip are configured as described above.
- the operation unit of the chip may adopt a short-bit floating point data module for forward computation, including a floating point data statistical module, a short-bit floating point data conversion unit, and a short-bit floating point data operation module.
- the floating point data statistical module is configured to perform statistical analysis on data of each type required by artificial neural network forward computation to obtain an EL.
- the short-bit floating point data conversion unit is configured to implement conversion from a long-bit floating point data type to a short-bit floating point data type according to the EL obtained by the floating point data statistical module.
- the short-bit floating point data operation module is configured to, after the short-bit floating point data conversion unit adopts the short-bit floating point data type to represent all inputs, weights, and/or offset data required by the artificial neural network forward computation, execute the artificial neural network forward computation on short-bit floating point data.
- the floating point data statistical module is further configured to perform statistical analysis on the data of each type required by the artificial neural network forward computation to obtain exponential offset.
- the short-bit floating point data conversion unit is configured to implement conversion from the long-bit floating point data type to the short-bit floating point data type according to the exponential offset and the EL obtained by the floating point data statistical module.
- the exponential offset and the EL are set, so that a representable data range may be extended as much as possible. Therefore, all data of the input neuron and the weight may be included.
- the short-bit floating point data conversion unit may include an operation cache unit 31 , a data conversion unit 32 , and a rounding unit 33 .
- the operation cache unit adopts a data type with relatively high accuracy to store an intermediate result of the forward computation. This is because addition or multiplication computation may extend the data range during the forward computation. After computation is completed, a rounding operation is executed on data beyond a short-bit floating point accuracy range. Then, the data in a cache region is converted into the short-bit floating point data through the data conversion unit 32 .
- the rounding unit 33 may complete the rounding operation over the data beyond the short-bit floating point accuracy range.
- the unit may be a random rounding unit, a rounding-off unit, a rounding-up unit, a rounding-down unit, a truncation rounding unit and the like. Different rounding units may implement different rounding operations over the data beyond the short-bit floating point accuracy range.
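- the random rounding unit executes the following operation: $y = \lfloor x \rfloor$ w.p. $1 - \frac{x - \lfloor x \rfloor}{\varepsilon}$, and $y = \lfloor x \rfloor + \varepsilon$ w.p. $\frac{x - \lfloor x \rfloor}{\varepsilon}$.
- y represents the short-bit floating point data obtained by random rounding
- x represents 32-bit floating point data before random rounding
- $\varepsilon$ is the minimum positive number which may be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$
- $\lfloor x \rfloor$ represents a number obtained by directly truncating the short-bit floating point data from raw data x (similar to a rounding-down operation over decimals)
- w.p. represents a probability, that is, the probability that the data y obtained by random rounding is $\lfloor x \rfloor$ is $1 - \frac{x - \lfloor x \rfloor}{\varepsilon}$, and the probability that it is $\lfloor x \rfloor + \varepsilon$ is $\frac{x - \lfloor x \rfloor}{\varepsilon}$.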
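- the rounding-off unit executes the following operation: $y = \lfloor x \rfloor$ if $\lfloor x \rfloor \le x < \lfloor x \rfloor + \frac{\varepsilon}{2}$, and $y = \lfloor x \rfloor + \varepsilon$ if $\lfloor x \rfloor + \frac{\varepsilon}{2} \le x < \lfloor x \rfloor + \varepsilon$.
- y represents the short-bit floating point data obtained by rounding-off
- x represents the long-bit floating point data before rounding-off
- $\varepsilon$ is the minimum positive number which may be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$
- $\lfloor x \rfloor$ is an integral multiple of $\varepsilon$
- a value of $\lfloor x \rfloor$ is a maximum number less than or equal to x.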
- the rounding-up unit executes the following operation: $y = \lceil x \rceil$.
- y represents the short-bit floating point data obtained by rounding-up
- x represents the long-bit floating point data before rounding-up
- $\lceil x \rceil$ is the integral multiple of $\varepsilon$
- a value of $\lceil x \rceil$ is a minimum number greater than or equal to x
- $\varepsilon$ is the minimum positive number which may be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$.
- the rounding-down unit executes the following operation: $y = \lfloor x \rfloor$.
- y represents the short-bit floating point data obtained by rounding-down
- x represents the long-bit floating point data before rounding-down
- $\lfloor x \rfloor$ is the integral multiple of $\varepsilon$
- a value of $\lfloor x \rfloor$ is a maximum number less than or equal to x
- $\varepsilon$ is the minimum positive number which may be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$.
- the truncation rounding unit executes the following operation: $y = [x]$.
- y represents the short-bit floating point data after truncation rounding
- x represents the long-bit floating point data before truncation rounding
- $[x]$ represents the number obtained by directly truncating the short-bit floating point data from the raw data x.
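The five rounding units can be sketched as functions that round x to a multiple of the smallest representable step. The parameter `eps` stands for that step and is passed in directly rather than derived from the offset and EL. Two interpretations are assumptions of this sketch: ties in rounding-off go upward, and truncation discards the remainder toward zero, which differs from rounding-down only for negative x.

```python
import math
import random


def round_down(x: float, eps: float) -> float:
    """Largest multiple of eps that is less than or equal to x."""
    return math.floor(x / eps) * eps


def round_up(x: float, eps: float) -> float:
    """Smallest multiple of eps that is greater than or equal to x."""
    return math.ceil(x / eps) * eps


def round_off(x: float, eps: float) -> float:
    """Nearest multiple of eps; a tie at eps/2 goes upward (assumption)."""
    low = round_down(x, eps)
    return low if x - low < eps / 2 else low + eps


def truncate(x: float, eps: float) -> float:
    """Drop the part of x below eps, rounding toward zero (assumption)."""
    return math.trunc(x / eps) * eps


def round_random(x: float, eps: float, rng: random.Random) -> float:
    """Round up with probability (x - floor) / eps, otherwise round down."""
    low = round_down(x, eps)
    p_up = (x - low) / eps
    return low + eps if rng.random() < p_up else low
```

With `eps = 0.25`, the value 1.3 rounds down to 1.25 and up to 1.5, and random rounding returns 1.5 with probability 0.2, so its expected value equals x.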
- the artificial neural network chip may be applied to a terminal.
- the terminal may further include an image acquisition device, besides the artificial neural network chip.
- the image acquisition device may be a webcam or a camera.
- the terminal may be a desktop computer, smart home, a transportation means, or a portable electronic device.
- the portable electronic device may be a webcam, a mobile phone, a notebook computer, a tablet computer, a wearable device, and the like.
- the wearable device may include a smart watch, a smart band, smart clothes, and the like.
- the artificial neural network chip may also be applied to a cloud (server). Then only one application (APP) is required on a device of a user.
- the device uploads an acquired image
- the information processing device of the disclosure calculates an output
- a user terminal makes a response.
- the disclosure further provides an information processing method, which may include the following steps.
- a storage unit receives input data, the input data including one or more images.
- a data processing unit extracts and processes a key feature included in the input data and generates a multidimensional vector for each image according to a processing result.
- the key feature may include a facial action and expression, a key point position, and the like in the image.
- a specific form is an FM in a neural network.
- the image may include a static picture, pictures forming a video, a video, or the like.
- the static picture, the pictures forming the video, or the video may include images of one or more parts of a face.
- the one or more parts of the face include facial muscles, lips, eyes, eyebrows, nose, forehead, ears, and combination thereof of the face.
- Each element of the multidimensional vector represents an emotion on the face, for example, anger, delight, pain, depression, sleepiness, and doubt.
- the information processing method may further include that: tagged data (existing image corresponding to the multidimensional vector) is learned; the multidimensional vector is output after the tagged data is learned; and a parameter of the data processing unit is updated.
- the information processing method may further include that: the multidimensional vector is converted into a corresponding output.
- the output may be a control instruction, data (0, 1 output), a tag (happiness, depression, and the like), and picture output.
- the control instruction may be single click, double click, and dragging of a mouse, single touch, multi-touch, and sliding of a touch screen, turning on and turning off of a switch, a shortcut key, and the like.
- the information processing method may further include that: adaptive training is performed.
- a specific flow is as follows.
- n images are input into the storage unit, each image including a tag, each image corresponding to a vector (real emotion vector) and n being a positive integer greater than or equal to one.
- the data processing unit takes calibrated data as an input, calculates an output emotion vector, i.e., a predicted emotion vector, in a format the same as the input, compares the output emotion vector with the real emotion vector and updates a parameter of the device according to a comparison result.
- the emotion vector may include n elements.
- a value of each element of the emotion vector may include the following conditions.
- each element of the emotion vector may be a number between zero and one (representing a probability of appearance of a certain emotion).
- each element of the emotion vector may also be any number greater than or equal to zero (representing an intensity of a certain emotion).
- a preset expression is [delight, sadness, fear]
- a vector corresponding to a reluctant smiling face may be [0.5, 0.2, 0].
- a value of only one element of the emotion vector is one and values of the other elements are zero.
- the emotion vector may only represent a strongest emotion.
- a preset expression is [delight, sadness, fear]
- a vector corresponding to an obvious smiling face may be [1, 0, 0].
- the predicted emotion vector may be compared with the real emotion vector by calculating a Euclidean distance, calculating an absolute value of the dot product of the predicted emotion vector and the real emotion vector, and the like.
- n is three; the predicted emotion vector is $[a_1, a_2, a_3]$; the real emotion vector is $[b_1, b_2, b_3]$; the Euclidean distance of the two is $[(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2]^{1/2}$; and the absolute value of the dot product of the two is $|a_1 b_1+a_2 b_2+a_3 b_3|$.
- the comparison manners are not limited to calculating the Euclidean distance and calculating the absolute value of the dot product, and other methods may also be adopted.
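The two comparison manners named above can be sketched in a few lines. The function names are hypothetical, and the vectors are plain Python lists of equal length.

```python
import math

def euclidean_distance(predicted, real):
    """Square root of the summed squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, real)))

def abs_dot_product(predicted, real):
    """Absolute value of the dot product of the two vectors."""
    return abs(sum(a * b for a, b in zip(predicted, real)))
```

For example, with the preset expression [delight, sadness, fear], comparing the reluctant-smile vector [0.5, 0.2, 0] against the obvious-smile vector [1, 0, 0] gives an absolute dot product of 0.5.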
- the information processing device is an artificial neural network chip.
- the value of each element of the emotion vector may be a number between zero and one (representing the probability of appearance of a certain emotion). Since emotions of a person may be overlaid, there may be multiple nonzero numbers in the emotion vector to express a complicated emotion.
- a method by which the artificial neural network chip obtains the emotion vector may include that: each neuron of a final output layer of the neural network corresponds to an element of the emotion vector, and an output neuron value is a number between zero and one and is determined as a probability of appearance of the corresponding emotion.
- the whole process for calculating the emotion vector is as follows.
- a DMA transmits the input data in batches to corresponding on-chip caches (i.e., an instruction cache, an input neuron cache, and a weight cache).
- a control unit reads an instruction from the instruction cache and decodes and transmits the instruction into an operation unit.
- the operation unit executes corresponding computation according to the instruction.
- computation is implemented mainly in three substeps.
- corresponding input neurons and weights are multiplied.
- adder tree computation is executed, that is, the results obtained in S41 are added step by step through an adder tree to obtain a weighted sum, and an offset is added to the weighted sum or not, as required.
- activation function computation is executed on the result obtained in S42 to obtain output neurons, and the output neurons are transmitted into an output neuron cache.
- S5: S2 to S4 are repeated until computation for all the data is completed, thereby obtaining the final result required by the function.
- the final result is obtained by output neurons of the last layer of the neural network.
- Each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, and an output neuron value is a number between zero and one and is determined as the probability of appearance of the corresponding emotion.
- the final result is output into the output neuron cache from the operation unit, and then is returned to the storage unit through the DMA.
- a size of the emotion vector (i.e., the number of expression types, which is also the number of the neurons of the final output layer of the artificial neural network), a comparison form (the Euclidean distance, the dot product, and the like), and a network parameter updating manner (stochastic gradient descent, Adam algorithm, and the like) are required to be preset in an adaptive training stage.
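The three computation substeps (multiplication, adder tree, activation) can be sketched for a single output neuron as follows. This is an illustration under stated assumptions: the function names are hypothetical, and a sigmoid is assumed as the activation function only because it yields a value between zero and one; the disclosure does not name the activation used.

```python
import math

def adder_tree(values):
    """Sum values pairwise, level by level, as an adder tree would."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def sigmoid(z):
    """Assumed activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def output_neuron(inputs, weights, offset=None):
    products = [x * w for x, w in zip(inputs, weights)]  # substep 1: multiply
    weighted_sum = adder_tree(products)                  # substep 2: adder tree
    if offset is not None:                               # offset added only if required
        weighted_sum += offset
    return sigmoid(weighted_sum)                         # substep 3: activation
```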
- a value of only one element of the emotion vector is one and values of the other elements are zero. Under this condition, the emotion vector may only represent the strongest emotion.
- the method by which the artificial neural network chip obtains the emotion vector may include that: each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, but only one output neuron is one and the other output neurons are zero.
- the whole process for calculating the emotion vector is as follows.
- the DMA transmits the input data in batches to the instruction cache, the input neuron cache, and the weight cache.
- the control unit reads the instruction from the instruction cache and decodes and transmits the instruction into the operation unit.
- the operation unit executes corresponding computation according to the instruction.
- computation is implemented mainly in three substeps.
- corresponding input neurons and weights are multiplied.
- adder tree computation is executed, that is, the results obtained in S41 are added step by step through an adder tree to obtain a weighted sum, and an offset is added to the weighted sum or not, as required.
- activation function computation is executed on the result obtained in S42 to obtain output neurons, and the output neurons are transmitted into an output neuron cache.
- S5: S2 to S4 are repeated until computation for all the data is completed, thereby obtaining the final result required by the function.
- the final result is obtained by the output neurons of the last layer of the neural network.
- Each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, but only one output neuron is one and the other output neurons are zero.
- the final result is output into the output neuron cache from the operation unit, and then is returned to the storage unit through the DMA.
- a size of the emotion vector (i.e., the number of expression types, which is also the number of the neurons of the final output layer of the artificial neural network), a comparison form (the Euclidean distance, the dot product, and the like), and a network parameter updating manner are required to be preset in the adaptive training stage.
- the real emotion vector used for training in this example is different from example 1 and should also be an “indication” vector like [1, 0, 0, 0, . . . ].
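An "indication" vector of this kind can be produced from a vector of per-emotion scores as sketched below. The disclosure does not say how the single nonzero element is selected; picking the largest score (with the first one winning ties) is an assumption, and the function name is hypothetical.

```python
def indication_vector(scores):
    """Return a vector with a one at the strongest emotion and zeros elsewhere."""
    strongest = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == strongest else 0 for i in range(len(scores))]
```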
- Each functional unit/module in the disclosure may be hardware.
- the hardware may be a circuit, including a digital circuit, an analogue circuit, and the like.
- Physical implementation of a hardware structure may include, but is not limited to, a physical device, and the physical device may include, but is not limited to, a transistor, a memristor, and the like.
- the computation module in the computation device may be any proper hardware processor, for example, a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), a digital signal processor (DSP), and an application specific integrated circuit (ASIC).
- the storage unit may be any proper magnetic storage medium or magneto-optical storage medium, for example, a resistance random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an embedded DRAM (EDRAM), a high bandwidth memory (HBM), and a hybrid memory cube (HMC).
- the neural network of the disclosure may be a convolutional neural network and may also be a fully connected neural network, a restricted Boltzmann machine (RBM) neural network, a recurrent neural network (RNN), and the like.
- computation may be other operations in the neural network besides convolutional computation, for example, fully connected computation.
- FIG. 29 is a functional module diagram of a computation device according to an example of the disclosure.
- the neural network operation device of the disclosure may include a control module and an operation module, and may further include a storage module.
- the operation module may include multiple operation units. Each operation unit may include at least one multiplier and at least one adder. In some examples, each operation unit may further include at least one memory.
- the memory may include a storage space and/or a temporary cache.
- the storage space is, for example, an SRAM.
- the temporary cache is, for example, a register.
- the control module is configured to send an instruction to the multiple operation units and to control data transmission between the operation units.
- the instruction may be configured for each operation unit to transmit data to be computed or an intermediate result value to one or more other operation units in one or more directions.
- the transmission directions include transmission to the left/right adjacent or nonadjacent operation units, transmission to the upper/lower adjacent or nonadjacent operation units, transmission to the diagonally adjacent or nonadjacent operation units, and transmission to multiple adjacent or nonadjacent operation units in multiple directions.
- the direction of transmission to the diagonally adjacent or nonadjacent operation units may include a direction of transmission to the left upper diagonally, left lower diagonally, right upper diagonally, and right lower diagonally adjacent or nonadjacent operation units.
- Each operation unit is provided with multiple input ports.
- the multiple input ports include a port connected with the storage module and configured to receive data transmitted by the storage module and a port connected with the other operation units and configured to receive data transmitted by the operation units.
- Each operation unit is also provided with an output port configured to transmit the data back to the storage module or to a specified operation unit.
- the storage module may include a data storage unit and/or a temporary cache. According to a requirement, one or more data storage units and/or temporary caches may be provided. That is, the data to be computed may be stored in the same region and may also be stored separately. An intermediate result may be stored in the same region and may also be stored separately.
- the data storage unit is, for example, an SRAM.
- the temporary cache is, for example, a register.
- the control module may include a storage control unit and a computational control unit.
- the storage control unit is configured to control the storage module to store or read required data.
- the computational control unit is configured to control the operation module according to the type of computation to be executed and a computational requirement, including controlling the specific computation manners in the operation units and controlling data transmission between the operation units.
- the disclosure further provides a computation method, which may include the following steps.
- a control module sends an instruction.
- Multiple operation units of an operation module receive the instruction and perform data transmission according to the instruction.
- Each operation unit receives the instruction and transmits data to be computed or an intermediate result to one or more other operation units in one or more directions according to the instruction.
- the direction may include a direction of transmission to the left/right adjacent or nonadjacent operation units, a direction of transmission to the upper/lower adjacent or nonadjacent operation units, and a direction of transmission to diagonally adjacent or nonadjacent operation units.
- the direction of transmission to the diagonally adjacent or nonadjacent operation units may include a direction of transmission to the left upper diagonally, left lower diagonally, right upper diagonally, and right lower diagonally adjacent or nonadjacent operation units.
- the operation module may include N*N (N is a positive integer) operation units and an ALU.
- the data may be transmitted sequentially in an S-shaped direction, as illustrated in FIG. 31 .
- the ALU is a lightweight ALU.
- Each operation unit may include a multiplier, an adder, a storage space, and a temporary cache. The intermediate results obtained by every computation executed by the operation units are transmitted between the operation units.
- a main computation flow of a processor of the example is as follows.
- a storage control unit sends a read control signal to a storage module to read the neuron data and synaptic weight data to be computed, and to store them respectively in the storage spaces of the operation units for transmission.
- a computational control unit sends a signal for the computation to be executed to each operation unit and initializes each operation unit, for example, by clearing caches.
- the storage control unit sends an instruction and transmits a neuron to be computed to each operation unit.
- the computational control unit sends an instruction and each operation unit receives neuron data for multiplication with the corresponding synaptic weight data in its own storage space.
- a left upper operation unit transmits a computational result rightwards to a second operation unit, and the second operation unit adds the computational result received and a computational product obtained by itself to obtain a partial sum and transmits the partial sum rightwards, and so on.
- the partial sum is transmitted according to an S-shaped path and is continuously accumulated. If accumulation is completed, the partial sum is transmitted into the ALU for computation such as activation and then a result is written into the storage module. If not, the result is temporarily stored back into the storage module for subsequent scheduling and computation is continued.
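The S-shaped accumulation flow above can be sketched as a software model. The exact path of FIG. 31 is not reproduced here: the sketch assumes the path starts at the left upper operation unit, runs left to right along even rows and right to left along odd rows, and that each unit holds one neuron value and one synaptic weight. All names are hypothetical.

```python
def s_shaped_order(n):
    """Yield (row, col) positions of an n*n grid along an assumed S-shaped path."""
    for r in range(n):
        cols = range(n) if r % 2 == 0 else range(n - 1, -1, -1)
        for c in cols:
            yield r, c

def accumulate_s_shaped(neurons, weights):
    """Each unit multiplies its neuron by its weight, adds the incoming
    partial sum, and passes the new partial sum to the next unit on the path."""
    n = len(neurons)
    partial_sum = 0.0
    for r, c in s_shaped_order(n):
        product = neurons[r][c] * weights[r][c]  # multiplication in the unit
        partial_sum += product                   # accumulate and forward
    return partial_sum  # would be handed to the ALU, e.g. for activation
```

Because only the running partial sum moves between neighbouring units, the model makes no intermediate round trip to the storage module until accumulation is complete.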
- the operation module may include N*N (N is a positive integer) operation units and M−1 ALUs (M is a positive integer).
- Different operation units may transmit computational data in different directions. That is, the operation units in the same operation module are not required to share a unified transmission direction.
- Each operation unit may include a multiplier, an adder, and a temporary cache. The intermediate results obtained by every computation executed by the operation units are transmitted between the operation units.
- the ALUs are lightweight ALUs.
- An output value of an LRN layer is $(1+(\alpha/n)\sum_i x_i^2)^{\beta}$.
- accumulation of a square of input data may be completed through the operation units and then a subsequent exponential operation is completed through the ALU.
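The division of work described above (squares accumulated in the operation units, the exponential step left to the ALU) can be sketched as follows, assuming the LRN output value has the form (1 + (alpha/n) * sum of x_i squared) raised to the power beta, and assuming n is the number of accumulated inputs. The function name and parameter names are hypothetical.

```python
def lrn_term(xs, alpha, beta):
    """Compute (1 + (alpha / n) * sum(x_i^2)) ** beta in two stages."""
    n = len(xs)
    sum_sq = 0.0
    for x in xs:              # operation units: square each input and accumulate
        sum_sq += x * x
    return (1.0 + (alpha / n) * sum_sq) ** beta  # ALU: scaling and exponent
```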
- operations and data transmission directions of the operation units are configured as follows.
- the operation units in the leftmost column are configured to receive data to be computed from the storage module, to complete square operations, and to transmit square values to the right and right lower adjacent operation units.
- the operation units in the uppermost row are configured to receive the data from the storage module, to complete square operations, and to transmit square values to the right lower adjacent operation units.
- the operation units in the rightmost column are configured to receive the data from the left upper and left operation units, to complete accumulation and, if all accumulation is completed, to transmit the data rightwards to the ALU for subsequent exponential operations according to the instruction.
- the other operation units are configured to receive the data from the left upper operation units, to transmit the data to the right lower operation units, and to accumulate the data and data transmitted by the left operation units and transmit an accumulated sum rightwards. The rest may be done in the same manner until all computation is completed.
- N=3 is taken as an example.
- data on the horizontal lines is specific data to be transmitted and data in boxes represent computational results obtained in each operation unit.
- related operations may be completed in a pipeline manner.
- data which has been read on the chip may be effectively utilized; the number of memory access times is effectively reduced; a power consumption overhead is reduced; a delay brought by data reading is reduced; and a computational speed is increased.
- Each functional unit/module in the disclosure may be hardware.
- the hardware may be a circuit, including a digital circuit, an analogue circuit, and the like.
- Physical implementation of a hardware structure may include, but is not limited to, a physical device, and the physical device may include, but is not limited to, a transistor, a memristor, and the like.
- the computation module in the computation device may be any proper hardware processor, for example, a CPU, a GPU, an FPGA, a DSP, and an ASIC.
- the storage unit may be any proper magnetic storage medium or magneto-optical storage medium, for example, an RRAM, a DRAM, an SRAM, an EDRAM, an HBM, and an HMC.
- the neural network of the disclosure may be a convolutional neural network and may also be a fully connected neural network, an RBM neural network, an RNN, and the like.
- computation may be other operations in the neural network besides convolutional computation, for example, fully connected computation.
Abstract
Description
- The disclosure relates to a data processing apparatus and method.
- On a conventional processor chip, a central processing unit (CPU) transmits a launching configuration instruction to an instruction memory of a dedicated processor core to launch the dedicated processor core to complete a task, and the whole task continues to be executed until an end instruction is executed. Such a task launching manner is called common launching. However, such a common launching mode has the following problems. It is difficult to dynamically monitor an execution state of a present task and to schedule the present task.
- In view of this, the disclosure aims to provide a dynamic voltage frequency scaling (DVFS) method and a DVFS co-processor to solve at least one of the above-mentioned problems.
- According to an aspect of the present disclosure, a DVFS method is provided, which includes: obtaining a processor load and a neural network configuration signal within a time period of T−t˜T; and predicting a frequency of a processor in a next time period of T˜T+t, where both of T and t are real numbers greater than zero.
- In some examples, the method may further include predicting a voltage of the processor in the next time period of T˜T+t according to the frequency predicted.
- In some examples, predicting the frequency of the processor in the next time period may include: predicting a frequency of a storage unit and/or a computation unit.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes presetting m frequency scaling ranges for the computation unit, and generating m+1 frequency segmentation points f0, f1, . . . , and fm in total, where f0<f1< . . . <fm, f0, f1, . . . , and fm are real numbers greater than 0, and m is a positive integer greater than 0.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes presetting m neural network scales, and generating m+1 scale division points n0, n1, . . . , nm in total, where n0<n1< . . . <nm, n0, n1, . . . , and nm are positive integers greater than 0, and m is a positive integer greater than 0.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes determining a frequency scaling range of the computation unit according to a scale n of a present processing layer, and if ni-1<n<ni, the frequency scaling range of the computation unit is fi-1<f<fi.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes further narrowing the frequency scaling range of the computation unit according to the type of the present processing layer, the layers being divided into two types, that is, a compute-intensive layer and a memory access-intensive layer, where the compute-intensive layer may include a convolutional layer, and the memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer. If the layer is a compute-intensive layer, the frequency scaling range of the computation unit is (fi-1+fi)/2<f<fi; if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is fi-1/2<f<(fi-1+fi)/2.
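The two-stage range selection for the computation unit (first by layer scale, then by layer type) can be sketched as below. The function names and the layer-type strings are hypothetical. Behaviour when the scale falls exactly on a division point is not specified in the text, so the sketch rejects it. The lower bound for memory access-intensive layers follows the text literally (f[i-1]/2).

```python
from bisect import bisect_left

COMPUTE_INTENSIVE = {"convolutional"}
MEMORY_INTENSIVE = {"fully_connected", "pooling", "activation"}

def coarse_range(scale, scale_points, freq_points):
    """Map a layer scale n with n[i-1] < n < n[i] to the range (f[i-1], f[i])."""
    i = bisect_left(scale_points, scale)
    if i == 0 or i == len(scale_points) or scale_points[i] == scale:
        raise ValueError("scale must lie strictly between two division points")
    return freq_points[i - 1], freq_points[i]

def compute_unit_range(scale, layer_type, scale_points, freq_points):
    """Narrow the coarse range according to the type of the present layer."""
    lo, hi = coarse_range(scale, scale_points, freq_points)
    mid = (lo + hi) / 2
    if layer_type in COMPUTE_INTENSIVE:
        return mid, hi        # upper half: (f[i-1] + f[i]) / 2 < f < f[i]
    if layer_type in MEMORY_INTENSIVE:
        return lo / 2, mid    # lower bound as written in the text: f[i-1] / 2
    raise ValueError(f"unknown layer type: {layer_type}")
```

For example, with scale division points [10, 100, 1000] and frequency segmentation points [200.0, 400.0, 800.0], a convolutional layer of scale 500 maps to the range (600.0, 800.0).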
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes: performing fine granularity regulation on the frequency of the computation unit according to the present accuracy of the neural network, when the present accuracy of the neural network is higher than an expected accuracy, decreasing the frequency of the computation unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the frequency of the computation unit.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes: presetting k frequency scaling ranges for the storage unit, and generating k+1 frequency segmentation points F0, F1, . . . , and Fk in total, where F0<F1< . . . <Fk, F0, F1, . . . , and Fk are real numbers greater than zero, and k is a positive integer greater than zero.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes presetting k neural network scales, and generating k+1 scale segmentation points N0, N1, . . . , and Nk in total, where N0<N1< . . . <Nk, N0, N1, . . . , Nk are positive integers greater than zero, and k is a positive integer greater than zero.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes predicting a frequency scaling range of the storage unit according to a scale N of a present processing layer, and if Ni-1<N<Ni, the frequency scaling range of the storage unit is Fi-1<F<Fi.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes: further narrowing the frequency scaling range of the storage unit according to the type of the present processing layer, the layers being divided into two types, that is, a compute-intensive layer and a memory access-intensive layer. The compute-intensive layer may include a convolutional layer. The memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer. If the layer is a compute-intensive layer, the frequency scaling range of the storage unit is Fi-1<F<(Fi-1+Fi)/2, and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is (Fi-1+Fi)/2<F<Fi.
- In some examples, predicting the frequency of the processor in the next time period of T˜T+t includes performing the fine granularity regulation on the frequency of the storage unit according to present accuracy of the neural network, when the present accuracy of the neural network is higher than an expected accuracy, decreasing the memory access frequency of the storage unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the memory access frequency of the storage unit.
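For the storage unit, the halves are assigned the other way round (compute-intensive layers get the lower half), and the frequency is then adjusted according to accuracy. A minimal sketch follows. The step size, the clamping of the result to the range bounds, and all function names are assumptions; the text states only the direction of the adjustment.

```python
def storage_unit_range(f_lo, f_hi, compute_intensive):
    """Pick the lower half for compute-intensive layers, the upper half otherwise."""
    mid = (f_lo + f_hi) / 2
    return (f_lo, mid) if compute_intensive else (mid, f_hi)

def fine_regulate(freq, freq_range, accuracy, expected_accuracy, step):
    """Decrease the frequency when accuracy exceeds the expectation,
    increase it when accuracy falls short, and keep it within the range."""
    lo, hi = freq_range
    if accuracy > expected_accuracy:
        freq -= step
    elif accuracy < expected_accuracy:
        freq += step
    return min(max(freq, lo), hi)  # clamping is an assumption of this sketch
```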
- In some examples, the method may further include: when the frequency is scaled from high to low, decreasing the frequency at first, and then decreasing the voltage; when the frequency is scaled from low to high, increasing the voltage at first, and then increasing the frequency.
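The ordering rule above can be sketched as a function that returns the steps in the order they would be applied. The function name and the tuple representation of a step are hypothetical; the point of the ordering is that the voltage is never lower than what the current frequency requires.

```python
def scaling_sequence(current_freq, target_freq, target_voltage):
    """Return the ordered list of (setting, value) steps for one scaling action."""
    if target_freq < current_freq:
        # scaling down: lower the frequency first, then the voltage
        return [("frequency", target_freq), ("voltage", target_voltage)]
    if target_freq > current_freq:
        # scaling up: raise the voltage first, then the frequency
        return [("voltage", target_voltage), ("frequency", target_freq)]
    return []  # no change requested
```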
- In some examples, after the frequency of the processor in the next time period of T˜T+t is predicted, the method may further include regulating a clock setting of a chip to scale the frequency of the processor.
- In some examples, the method may further include regulating a power management module of the chip to scale the voltage supplied to the processor.
- In some examples, obtaining the neural network configuration signal includes obtaining a present layer type and present layer scale for processing of the neural network and real-time accuracy of the neural network.
- According to another aspect of the disclosure, a DVFS co-processor is provided, which may include a signal acquisition unit and a performance prediction unit, where
- the signal acquisition unit is configured to acquire a workload of a processor and further configured to acquire a neural network configuration signal; and
- the performance prediction unit is configured to receive the neural network configuration signal and to predict a frequency and a voltage of the processor in a next time period according to a present load of the processor.
- In some examples, in the performance prediction unit, predicting the voltage and the frequency of the processor in the next time period may include predicting a frequency of a storage unit and/or a computation unit.
- In some examples, the co-processor may further include a frequency scaling unit configured to receive a frequency signal, predicted by the performance prediction unit, of the processor in the next time period and to scale the frequency of the storage unit and/or computation unit in the processor.
- In some examples, the co-processor may further include a voltage scaling unit configured to receive a voltage signal, predicted by the performance prediction unit, of the processor in the next time period and to scale a voltage of the storage unit and/or computation unit in the processor.
- In some examples, in the performance prediction unit, predicting the frequency of the computation unit in the next time period includes: presetting m frequency scaling ranges for the computation unit, and generating m+1 frequency segmentation points f0, f1, . . . , and fm in total, where f0<f1< . . . <fm, f0, f1, . . . , and fm are real numbers greater than zero, and m is a positive integer greater than zero.
- In some examples, in the performance prediction unit, predicting the frequency of the computation unit in the next time period includes presetting m segments of neural network scales, and generating m+1 scale division points n0, n1, . . . , nm in total, where n0<n1< . . . <nm, n0, n1, . . . , nm are positive integers greater than zero and m is a positive integer greater than zero.
- In some examples, in the performance prediction unit, predicting the frequency of the computation unit in the next time period may include determining a frequency scaling range of the computation unit according to a range of a scale n of a present processing layer, and if ni-1<n<ni, the frequency scaling range of the computation unit is fi-1<f<fi.
- In some examples, in the performance prediction unit, predicting the frequency of the computation unit in the next time period may include further narrowing the frequency scaling range of the computation unit according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is (fi-1+fi)/2<f<fi; and if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is fi-1/2<f<(fi-1+fi)/2.
- In some examples, in the performance prediction unit, predicting the frequency of the computation unit in the next time period may include performing fine granularity regulation on the frequency of the computation unit according to present accuracy of the neural network, when the present accuracy of the neural network is higher than expected accuracy, decreasing the frequency of the computation unit, and when the present accuracy of the neural network is lower than the expected accuracy, increasing the frequency of the computation unit.
- In some examples, in the performance prediction unit, predicting the frequency of the storage unit in the next time period may include presetting k segments of frequency scaling ranges for the storage unit, and generating k+1 frequency division points F0, F1, . . . , Fk in total, where F0<F1< . . . <Fk, F0, F1, . . . , Fk are positive integers greater than zero and k is a positive integer greater than zero.
- In some examples, in the performance prediction unit, predicting the frequency of the storage unit in the next time period may include presetting k segments of neural network scales, and generating k+1 scale division points N0, N1, . . . , Nk in total, where N0<N1< . . . <Nk, N0, N1, . . . , Nk are positive integers greater than zero and k is a positive integer greater than zero.
- In some examples, in the performance prediction unit, predicting the frequency of the storage unit in the next time period may include determining a frequency scaling range of the storage unit according to a range of a scale N of a present processing layer, and if Ni-1<N<Ni, the frequency scaling range of the storage unit is Fi-1<F<Fi.
- In some examples, in the performance prediction unit, predicting the frequency of the storage unit in the next time period may include further narrowing the frequency scaling range of the storage unit according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is Fi-1<F<(Fi-1+Fi)/2; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is (Fi-1+Fi)/2<F<Fi.
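- As a minimal, non-limiting sketch, the range selection described above for the storage unit may be expressed as follows. The function name, the layer-type labels "compute" and "memory", and the rejection of scales that fall exactly on a division point are assumptions made for illustration; the disclosure does not specify them.

```python
from bisect import bisect_left


def storage_frequency_range(scale, scale_points, freq_points, layer_type):
    """Return (low, high) bounds for the storage-unit frequency.

    scale_points holds N0 < N1 < ... < Nk and freq_points holds
    F0 < F1 < ... < Fk. layer_type is "compute" or "memory"
    (hypothetical labels chosen for this sketch).
    """
    if len(scale_points) != len(freq_points) or len(scale_points) < 2:
        raise ValueError("need k+1 matching division points with k >= 1")
    # Find i such that N[i-1] < scale < N[i]. Boundary values are not
    # covered by the text, so they are rejected here (an assumption).
    i = bisect_left(scale_points, scale)
    if i == 0 or i == len(scale_points) or scale_points[i] == scale:
        raise ValueError("scale must lie strictly between two division points")
    low, high = freq_points[i - 1], freq_points[i]
    mid = (low + high) / 2
    if layer_type == "compute":
        return (low, mid)    # compute-intensive: lower half of the range
    if layer_type == "memory":
        return (mid, high)   # memory access-intensive: upper half
    return (low, high)       # unknown layer type: no further narrowing
```

- For example, with scale division points 10, 100, 1000 and frequency division points 200, 400, 800, a compute-intensive layer of scale 50 would be limited to the range between 200 and 300 under this sketch.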
- In some examples, in the performance prediction unit, determining the frequency of the storage unit of the processor in the next time period may include performing fine granularity regulation on the frequency of the storage unit according to the present precision of the neural network.
- In some examples, in the signal acquisition unit, obtaining the neural network configuration signal may include obtaining a present layer type and a present layer scale for processing of the neural network and real-time accuracy of the neural network.
- In some examples, the performance prediction unit may include at least one of: a preceding value method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting a preceding value method; a moving average load method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting a moving average load method; and an exponentially weighted average method-based prediction unit configured to predict the voltage and the frequency of the processor in the next time period by adopting an exponentially weighted average method.
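- The three prediction methods named above are standard estimators. The following sketch applies them to a history of observed loads, under the assumption that the predicted load for the next time period is subsequently mapped to a voltage and a frequency; the function names, the `window` parameter, and the `alpha` parameter are illustrative and are not taken from the disclosure.

```python
def preceding_value(loads):
    """Preceding value method: the next period repeats the last observation."""
    return loads[-1]


def moving_average(loads, window=3):
    """Moving average load method: mean of the most recent `window` samples."""
    recent = loads[-window:]
    return sum(recent) / len(recent)


def exponentially_weighted_average(loads, alpha=0.5):
    """Exponentially weighted average: recent samples weigh more than old ones."""
    estimate = loads[0]
    for load in loads[1:]:
        estimate = alpha * load + (1 - alpha) * estimate
    return estimate
```

- A larger `alpha` or a smaller `window` makes the prediction follow recent load changes more quickly, at the cost of reacting to short-lived spikes.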
- According to the above solutions, the present disclosure provides a DVFS method and a DVFS co-processor for neural networks. The DVFS method acquires a real-time load and power consumption of a processor, and simultaneously acquires a topological structure of the neural network, the scale of the neural network, and a precision requirement of the neural network. Then, a voltage prediction and frequency prediction method is adopted to scale the working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced. The DVFS method for the neural network is integrated in the DVFS co-processor, and thus the characteristics of topological structure, network scale, precision requirement, and the like of the neural network may be fully exploited. The signal acquisition unit acquires a system load signal of the neural network processor, a topological structure signal of the neural network, a neural network scale signal, and a neural network precision signal in real time; the performance prediction unit predicts the voltage and the frequency required by the system; the frequency scaling unit scales the working frequency of the neural network processor; and the voltage scaling unit scales the working voltage of the neural network processor. Therefore, the performance of the neural network processor may be reasonably utilized, and the power consumption of the neural network processor may be effectively reduced.
- FIG. 1 is a structure diagram of a data processing device according to the disclosure;
- FIG. 2 is a flowchart of a data processing method according to the disclosure;
- FIG. 3 is a structure diagram of a data processing device according to an example of the disclosure;
- FIG. 4 is a structure diagram of a task configuration information storage unit according to an example of the disclosure;
- FIG. 5 is a flowchart of a data processing method according to an example of the disclosure;
- FIG. 6 and FIG. 7 are structure diagrams of a data processing device according to another example of the disclosure;
- FIG. 8 is a structure diagram of a data processing device according to another example of the disclosure;
- FIG. 9 is a structure diagram of a data cache of a data processing device according to an example of the disclosure;
- FIG. 10 is a flowchart of a data processing method according to another example of the disclosure;
- FIG. 11 is a structure diagram of a neural network operation unit according to an example of the disclosure;
- FIG. 12 is a structure diagram of a neural network operation unit according to another example of the disclosure;
- FIG. 13 is a structure diagram of a neural network operation unit according to another example of the disclosure;
- FIG. 14 is a flowchart of a data redundancy method according to an example of the disclosure;
- FIG. 15 is a structure block diagram of a data redundancy device according to another example of the disclosure;
- FIG. 16 is a neural network processor according to an example of the disclosure;
- FIG. 17 is a flowchart of a DVFS method according to an example of the disclosure;
- FIG. 18 is a flowchart of a DVFS method according to another example of the disclosure;
- FIG. 19 is a schematic block diagram of a DVFS method according to an example of the disclosure;
- FIG. 20 is a schematic diagram of a DVFS co-processor according to an example of the disclosure;
- FIG. 21 is a schematic diagram of a DVFS co-processor according to another example of the disclosure;
- FIG. 22 is a functional module diagram of an information processing device according to an example of the disclosure;
- FIG. 23 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure;
- FIG. 24 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure;
- FIG. 25 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure;
- FIG. 26 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure;
- FIG. 27 is a schematic diagram of an artificial neural network chip configured as an information processing device according to an example of the disclosure;
- FIG. 28 is a functional module diagram of a short-bit floating point data conversion unit according to an example of the disclosure;
- FIG. 29 is a functional module diagram of a computation device according to an example of the disclosure;
- FIG. 30 is a functional module diagram of a computation device according to an example of the disclosure;
- FIG. 31 is a functional module diagram of a computation device according to an example of the disclosure;
- FIG. 32 is a functional module diagram of a computation device according to an example of the disclosure; and
- FIG. 33 is a schematic diagram of an operation module according to an example of the disclosure.
- The disclosure provides a data processing device which, when task information is configured and input therein, completes interaction with an external device, for example, a processor core, to automatically implement execution of a task of the processor core to implement self-launching.
- As illustrated in FIG. 1, the data processing device may include a self-launching task queue device. As illustrated in FIG. 1, the self-launching task queue device may include a task configuration information storage unit and a task queue configuration unit.
- The task configuration information storage unit is configured to store configuration information of tasks. The configuration information may include, but is not limited to, a start tag and an end tag of a task, a priority of the task, a launching manner for the task, and the like.
- The task queue configuration unit is configured to configure a task queue according to the configuration information of the task configuration information storage unit and complete dynamic task configuration and external communication.
- The self-launching task queue device is configured to cooperate with the external device, receive the configuration information sent by the external device and configure the task queue. The external device executes each task according to the configured task queue. Meanwhile, the self-launching task queue device interacts and communicates with the external device.
- Referring to FIG. 2, a workflow of the self-launching task queue device is illustrated and, as a data processing method of the disclosure, may include the following steps.
- In S201, the external device sends a launching parameter to the self-launching task queue device.
- The launching parameter is stored in the task configuration information storage unit of the self-launching task queue device. The launching parameter may include launching information and a launching command. The launching information may be the abovementioned configuration information.
- In S202, the task queue configuration unit configures a task queue according to the launching parameter and sends a configured task queue to the external device.
- In S203, the external device executes tasks in the task queue and, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit.
- In S204, the task configuration information storage unit, each time it receives a first end signal, sends an interrupt signal to the external device, and the external device processes the interrupt signal and sends a second end signal to the task configuration information storage unit.
- S203 and S204 are executed for each task in the task queue until all the tasks in the task queue are completed.
- The task configuration information storage unit, after receiving the first end signal, may modify the task configuration information stored therein to implement task scheduling.
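- The handshake of S201 to S204 may be sketched in software as follows. This is only an illustrative model under stated assumptions: the class and function names, the `"tasks"` key of the launching parameter, and the string `"interrupt"` are hypothetical, and the device of the disclosure is hardware rather than a Python object.

```python
from collections import deque


class SelfLaunchingTaskQueue:
    """Toy model of the self-launching task queue device."""

    def __init__(self):
        self.config = {}   # stands in for the task configuration information storage unit
        self.log = []      # records the end signals received, for inspection

    def configure(self, launching_parameter):
        # S201/S202: store the launching parameter and build the task queue.
        self.config = dict(launching_parameter)
        return deque(self.config["tasks"])

    def on_first_end_signal(self, task):
        # S204: each first end signal is answered with an interrupt signal.
        self.log.append(("first_end", task))
        return "interrupt"

    def on_second_end_signal(self, task):
        # Received after the external device has processed the interrupt.
        self.log.append(("second_end", task))


def run_external_device(queue_device, launching_parameter):
    """Model of the external device executing the configured queue."""
    queue = queue_device.configure(launching_parameter)
    executed = []
    while queue:
        task = queue.popleft()
        executed.append(task)                           # S203: execute the task
        signal = queue_device.on_first_end_signal(task)
        if signal == "interrupt":                       # S204: process the interrupt
            queue_device.on_second_end_signal(task)
    return executed
```

- In this model, a queue of two tasks produces the signal sequence first end, second end, first end, second end, matching the per-task repetition of S203 and S204.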
- In order to make the purpose, technical solutions and advantages of the disclosure clearer, the disclosure will further be described below in combination with specific examples and with reference to the drawings in detail.
- An example of the disclosure provides a data processing device. As illustrated in FIG. 3, the data processing device may further include a processor core. The processor core cooperates with the self-launching task queue device as the external device.
- The configuration information input into the self-launching task queue device may include a launching mode, priority, and the like of the task. The processor core may execute various types of tasks. The tasks may be divided into different task queues, for example, a high-priority queue and a low-priority queue, according to properties of the tasks and an application scenario. The launching mode of the task may include self-launching and common launching.
- In the example, referring to FIG. 4, the task configuration information storage unit may include a first storage unit and a second storage unit. The tasks are allocated to the first storage unit or the second storage unit according to the configuration information respectively. The first storage unit stores a high-priority task queue, and the second storage unit stores a low-priority task queue. For the tasks in the task queues, launching and execution of the tasks may be completed according to the respective launching modes.
- The above is only exemplary description and not intended to limit the disclosure. In other examples, the task queue may also be configured not according to the priorities but according to other parameters of the tasks, the number of the task queues is also not limited to two and may also be multiple, and correspondingly, there may also be multiple storage units.
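- The allocation of tasks to the first and second storage units may be sketched as below. The dictionary keys `name` and `priority` and the value `"high"` are assumptions made for illustration; the disclosure only states that allocation follows the configuration information.

```python
def allocate_tasks(tasks):
    """Split tasks into a high-priority queue and a low-priority queue."""
    first_storage_unit = []    # high-priority task queue
    second_storage_unit = []   # low-priority task queue
    for task in tasks:
        if task["priority"] == "high":
            first_storage_unit.append(task["name"])
        else:
            second_storage_unit.append(task["name"])
    return first_storage_unit, second_storage_unit
```

- Extending the sketch to more than two queues, or to a criterion other than priority, only requires replacing the condition with a lookup keyed on the chosen task parameter.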
- Referring to FIG. 5, a workflow of the self-launching task queue device of the abovementioned example is illustrated and, as a data processing method of another example of the disclosure, may include the following steps.
- In S501, the processor core sends configuration information of a task queue to the self-launching task queue device.
- In S502, the task queue configuration unit configures the task queue according to the configuration information and sends a configured task queue to the processor core.
- In S502, the first storage unit sends a stored high-priority task queue to the processor core and the second storage unit sends a stored low-priority task queue to the processor core.
- In S503, the processor core executes tasks in the task queue and, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit and task queue configuration is completed.
- In S504, the task configuration information storage unit, each time it receives a first end signal, sends an interrupt signal to the processor core, and the processor core processes the interrupt signal and sends a second end signal to the task configuration information storage unit to complete self-launching of the task queue.
- In the disclosure, multiple external devices may be provided, for example, multiple processor cores. The processor cores may be various operation modules, control modules, and the like.
- In order to achieve a purpose of brief description, descriptions about technical features, which may be applied in a same manner, of a data processing device according to another example of the disclosure refer to those made in the abovementioned example and the same descriptions are not required to be repeated.
- Referring to FIG. 6 and FIG. 7, the processor core of the data processing device may include a control module and a neural network operation module. The neural network operation module may include a control unit, a neural network operation unit, and a storage unit.
- The storage unit is configured to store data and instructions for neural network operation. The data may include an input neuron, an output neuron, a weight, a score, an error mode judgment result, and the like. The instructions may include various operation instructions for addition, multiplication, activation, and the like in the neural network operation.
- The control unit is configured to control operations of the storage unit and the neural network operation unit.
- The neural network operation unit is controlled by the control unit to execute the neural network operation on the data according to an instruction stored in the storage unit.
- The control module is configured to provide configuration information of tasks.
- Each of the control module and the neural network operation module is equivalent to a processor core. The control module sends configuration information of task queues to the self-launching task queue device. After the self-launching task queue device receives the configuration information, the task queue configuration unit configures the task queues according to the configuration information, stores each task queue in each corresponding storage unit and sends each task queue to the control unit of the neural network operation module. The control unit may monitor a configuration of the self-launching task queue device and configure a neural network operation instruction of the storage unit to a correct position, namely inputting an instruction of an external storage module in the storage unit into an instruction storage module. As illustrated in FIG. 7, the control unit controls the neural network operation unit and the storage unit to execute each task according to the configuration information. The neural network operation unit and the storage unit are required to cooperate to complete a task execution process.
- The control unit, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit and task queue configuration is completed. The self-launching task queue device, each time it receives a first end signal of the control unit, modifies the configuration information and sends an interrupt signal to the control unit. The control unit processes the interrupt signal and then sends a second end signal to the self-launching task queue device.
- The control unit is usually required to, after being started, send an instruction fetching instruction to complete the operation of configuring the neural network operation instruction of the storage unit to the correct position. That is, the control unit usually may include an instruction fetching instruction cache module. In the disclosure, the control unit is not required to send any instruction fetching instruction. That is, the instruction fetching instruction cache module of the control unit may be eliminated. Therefore, a structure of the device is simplified, cost is reduced and resources are saved.
- Referring to FIG. 8, the data processing device of the example may further include a data cache, an instruction cache, and a DMA. The storage unit is connected with the instruction cache and the data cache through the DMA. The instruction cache is connected with the control unit. The data cache is connected with the operation unit.
- The storage unit receives input data and transmits the neural network operational data and the instructions in the input data to the data cache and the instruction cache through the DMA respectively.
- The data cache is configured to cache the neural network operational data. More specifically, as illustrated in FIG. 9, the data cache may include an input neuron cache, a weight cache, and an output neuron cache configured to cache input neurons, weights, and output neurons sent by the DMA respectively. The data cache may further include a score cache, error mode judgment result cache, and the like configured to cache scores and error mode judgment results and send the data to the operation unit.
- The instruction cache is configured to cache the neural network operation instruction. The instructions for addition, multiplication, activation, and the like of the neural network operation are stored in the instruction cache through the DMA.
- The control unit is configured to read the neural network operation instruction from the instruction cache, decode it into an instruction executable for the operation unit and send an executable instruction to the operation unit.
- The neural network operation unit is configured to execute corresponding neural network operation on the neural network operational data according to the executable instruction. An intermediate result in a computation process and a final result may be cached in the data cache and are stored in the storage unit through the DMA as output data.
- Referring to FIG. 10, a workflow of the self-launching task queue device of the abovementioned example is illustrated and, as a data processing method of another example of the disclosure, may include the following steps.
- In S901, the control module sends configuration information of task queues to the self-launching task queue device.
- In S902, the task queue configuration unit configures the task queues according to the configuration information and sends configured task queues to the neural network operation module.
- After the self-launching task queue device receives the configuration information, the task queue configuration unit configures the task queues according to the configuration information, stores each task queue in each corresponding storage unit thereof and sends each task queue to the control unit of the neural network operation module.
- In S903, the control unit monitors a configuration of the self-launching task queue device and controls the neural network operation unit and the storage unit to execute tasks in the task queues according to the configuration information and, each time it finishes executing a task, sends a first end signal to the task configuration information storage unit and task queue configuration is completed.
- In S904, the self-launching task queue device, each time it receives a first end signal of the control unit, sends an interrupt signal to the control unit, and the control unit processes the interrupt signal and then sends a second end signal to the self-launching task queue device.
- The operation in S903 that the control unit controls the neural network operation unit and the storage unit to execute each task according to the configuration information may include the following steps.
- The control unit reads a neural network operation instruction from the storage unit according to the configuration information, and the neural network operation instruction is stored in the instruction cache through the DMA.
- The control unit reads the neural network operation instruction from the instruction cache, decodes it into an instruction executable for the operation unit and sends an executable instruction to the operation unit.
- The neural network operation unit reads neural network operational data from the data cache, executes corresponding neural network operation on the neural network operational data according to the executable instruction and stores a computational result in the data cache and/or the storage unit.
- In the example, the neural network operation may include multiplying an input neuron and a weight vector, adding an offset and performing activation to obtain an output neuron. In an implementation, the neural network operation unit may include one or more computational components. The computational components include, but are not limited to, for example, one or more multipliers, one or more adders and one or more activation function units.
- The multiplier multiplies input data 1 (in1) and input data 2 (in2) to obtain output (out), and a process is: out=in1*in2.
- As one alternative implementation, the neural network operation unit may include multiple adders and the multiple adders form an adder tree. The adder tree adds the input data (in1) step by step to obtain output data (out), in1 being a vector with a length N and N being greater than one, and a process is: out=in1[1]+in1[2]+ . . . +in1[N]; and/or the input data (in1) is accumulated and then added with the input data (in2) to obtain the output data (out), and a process is: out=in1[1]+in1[2]+ . . . +in1[N]+in2; or the input data (in1) and the input data (in2) are added to obtain the output data (out), and a process is: out=in1+in2.
- The activation function unit executes computation on input data (in) through an activation function (active) to obtain activation output data (out), and a process is: out=active(in). The activation function (active) may be, but is not limited to, for example, sigmoid, tanh, RELU, and softmax. Besides an activation operation, the activation function unit may further implement other nonlinear computation and may execute computation (f) on the input data (in) to obtain the output data (out), and a process is: out=f(in). The operation unit may also be a pooling unit, and the pooling unit executes pooling computation on the input data (in) to obtain the output data (out) after a pooling operation, and a process is: out=pool(in). Pool is the pooling operation; the pooling operation may include, but is not limited to: average pooling, maximum pooling, and median pooling; and the input data in is data in a pooling core related to the output data out.
- Correspondingly, the neural network operation may include, but is not limited to, multiplication computation, addition computation, and activation function computation. The multiplication computation refers to multiplying input data 1 and input data 2 to obtain multiplied data. And/or the addition computation is executed to add the input data 1 through the adder tree step by step or accumulate the input data (in1) and then add the input data (in2) or add the input data 1 and the input data 2 to obtain output data. And/or the activation function computation refers to executing computation on the input data through the activation function (active) to obtain the output data. And/or the pooling computation is out=pool(in). Pool is the pooling operation; the pooling operation may include, but is not limited to: average pooling, maximum pooling, and median pooling; and the input data in is data in the pooling core related to the output data out. The one or more of the abovementioned computations may be freely selected for combination in different sequences, thereby implementing computation of various functions.
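- The computational components described above may be modelled in software as follows. This is a sketch under stated assumptions rather than the circuit of the disclosure: the function names and the `kind` parameter are illustrative, the adder tree is modelled as pairwise reduction, and only scalar activations are shown (softmax, which operates on a vector, is omitted).

```python
import math
from statistics import median


def multiply(in1, in2):
    """Multiplier: out = in1 * in2."""
    return in1 * in2


def adder_tree(in1, in2=None):
    """Add the elements of in1 pairwise, level by level, then optionally add in2."""
    values = list(in1)
    if not values:
        raise ValueError("in1 must contain at least one element")
    while len(values) > 1:
        # One level of the tree: add neighbouring pairs.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])   # an odd element passes to the next level
        values = paired
    total = values[0]
    return total if in2 is None else total + in2


def activate(x, kind="relu"):
    """Activation function unit: out = active(in)."""
    if kind == "relu":
        return max(0.0, x)
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if kind == "tanh":
        return math.tanh(x)
    raise ValueError(f"unsupported activation: {kind}")


def pool(window, kind="max"):
    """Pooling unit: out = pool(in), where window is the data in the pooling core."""
    if kind == "max":
        return max(window)
    if kind == "average":
        return sum(window) / len(window)
    if kind == "median":
        return median(window)
    raise ValueError(f"unsupported pooling: {kind}")
```

- Combining the components in sequence yields an output neuron: for input neurons 1.0 and 2.0, weights 3.0 and 4.0, and an offset of 1.0, `activate(adder_tree([multiply(1.0, 3.0), multiply(2.0, 4.0)], 1.0))` evaluates to 12.0.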
- In another implementation, the neural network operation unit may include, but is not limited to, multiple PEs and one or more ALUs. Each PE may include a multiplier, an adder, a comparator, and a register/register set. Each PE may receive data from the PEs in each direction, for example, receive data from PEs in a horizontal direction (for example, the right) and/or a vertical direction (for example, the lower), and may also transmit data to the PEs in an opposite horizontal direction (for example, the left) and/or an opposite vertical direction (for example, the upper). And/or each PE may receive data from the PEs in a diagonal direction and may also transmit data to the diagonal PEs in the opposite horizontal direction. Each ALU may complete basic computation such as an activation operation, multiplication, addition, and other nonlinear computation.
- Correspondingly, the computation executed by the neural network operation unit may include computation executed by the PEs and computation executed by the ALU. The PE multiplies the input data 1 and the input data 2, adds a product and data stored in the register or the data transmitted by the other PEs, writes a result back into the register or a storage part and simultaneously transmits certain input data or a computational result to the other PEs. And/or the PE accumulates or compares the input data 1 and the input data 2 or the data stored in the register. The ALU completes activation computation or nonlinear computation.
- When the neural network operation unit executes convolutional computation, fully connected computation, and the like, for each PE, the input data 1 (in1) and the input data 2 (in2) may be multiplied to obtain the multiplied output (out1), and a process is: out1=in1*in2. The data in the register is extracted and accumulated with a multiplication result (out1) to obtain a result (out2): out2=out1+data. Out2 may be written back into the register/register set or the storage part. In addition, certain input data (in1/in2) may be transmitted in the horizontal direction or the vertical direction.
- When the neural network operation unit processes a vector dot product, for each PE, the input data 1 (in1) and the input data 2 (in2) may be multiplied to obtain a multiplied output (out1), and the process is: out1=in1*in2. The data transmitted by the other PEs is accumulated with the multiplication result (out1) to obtain the result (out2): out2=out1+data. Then, the computational result (out2) may be transmitted in the horizontal direction or the vertical direction.
- When the neural network operation unit executes pooling computation, for each PE, a multiplication part may not be executed, and the adder or the comparator is directly adopted to complete the pooling computation: out=pool(in). Pool is the pooling operation, and the pooling operation may include, but is not limited to, average pooling, maximum pooling, and median pooling. The input data in is data in the pooling core related to an output out, and intermediate temporary data may be stored in the register.
- Each ALU is configured to complete basic computation such as an activation operation, multiplication and addition, or nonlinear computation. The activation operation refers to executing computation on the input data (in) through the activation function (active) to obtain the activation output data (out), and the process is: out=active(in). The activation function may be sigmoid, tanh, RELU, softmax, and the like. The other nonlinear computation refers to executing computation (f) on the input data (in) to obtain the output data (out), and the process is: out=f(in).
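- The two PE behaviours described above, accumulation into the local register for convolutional or fully connected computation and accumulation of a partial result passed between PEs for a vector dot product, may be sketched as follows. The class and method names are hypothetical, and the inter-PE transfer is modelled simply as a value handed from one call to the next.

```python
class ProcessingElement:
    """Toy model of a PE with a multiplier, an adder, and one register."""

    def __init__(self):
        self.register = 0.0

    def multiply_accumulate(self, in1, in2):
        # out1 = in1 * in2; out2 = out1 + data, where data comes from the
        # register; out2 is then written back into the register.
        out1 = in1 * in2
        out2 = out1 + self.register
        self.register = out2
        return out2

    def dot_product_step(self, in1, in2, partial_from_neighbor):
        # out1 = in1 * in2; out2 = out1 + data, where data was transmitted
        # by another PE; out2 is passed on rather than stored locally.
        out1 = in1 * in2
        return out1 + partial_from_neighbor
```

- For the vectors (1, 2, 3) and (4, 5, 6), three calls to `multiply_accumulate` on one PE leave 32 in its register, and chaining `dot_product_step` across three PEs, starting from a partial result of zero, also yields 32.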
- In some examples, the neural network operation unit, as illustrated in FIG. 11, may include a primary processing circuit and multiple secondary processing circuits. The operation unit may include a tree module. The tree module may include a root port and multiple branch ports. The root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively. The tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- The tree module may be configured as an n-ary tree structure, where n may be an integer greater than or equal to two. The structure illustrated in FIG. 11 is a binary tree structure; a ternary tree structure may also be used. A specific value of n is not limited in a specific implementation mode of the application. The layer number may also be two. The secondary processing circuits may be connected to nodes of a layer other than the second-to-last layer and, for example, may be connected to nodes of the last layer illustrated in FIG. 11.
- In some examples, the neural network operation unit, as illustrated in FIG. 12, may include a primary processing circuit, multiple secondary processing circuits, and a branch processing circuit. The primary processing circuit is specifically configured to allocate a task in the task queue into multiple data blocks and send at least one data block of the multiple data blocks, the weight, and at least one operation instruction of multiple operation instructions to the branch processing circuit.
- The multiple secondary processing circuits are configured to execute computation on the received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction, and to send the result of the operation instruction to the control unit.
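- The division of work among the primary, branch, and secondary processing circuits may be illustrated with the sketch below. It assumes, purely for illustration, that the operation instruction is a dot product, that the weight is split in the same way as the data, and that the subsequent processing in the primary processing circuit is a sum of the intermediate results; the branch processing circuit only forwards data, so it is modelled as direct function calls. All names are hypothetical.

```python
def split_into_blocks(data, num_blocks):
    """Primary circuit: allocate the data into at most num_blocks contiguous blocks."""
    if not data or num_blocks < 1:
        raise ValueError("need non-empty data and at least one block")
    size = -(-len(data) // num_blocks)   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def secondary_circuit(block, weight_block):
    """Secondary circuit: compute an intermediate result for one data block."""
    return sum(x * w for x, w in zip(block, weight_block))


def primary_circuit(data, weights, num_secondary):
    """Split the task, collect the intermediate results, and combine them."""
    data_blocks = split_into_blocks(data, num_secondary)
    weight_blocks = split_into_blocks(weights, num_secondary)
    # The branch processing circuit would forward blocks and results here.
    intermediate = [secondary_circuit(d, w) for d, w in zip(data_blocks, weight_blocks)]
    return sum(intermediate)             # subsequent processing (assumed: summation)
```

- With four inputs, unit weights, and two secondary processing circuits, each secondary circuit handles two inputs and the primary circuit combines two intermediate results.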
- In some examples, the neural network operation unit, as illustrated in FIG. 13, may include a primary processing circuit and multiple secondary processing circuits. The multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits. The primary processing circuit is connected with k secondary processing circuits in the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an m-th row, and m secondary processing circuits in a first column.
- The k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
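- The set of secondary processing circuits connected directly with the primary processing circuit may be enumerated as in the following sketch, which assumes an array of m rows and n columns indexed from zero. The function name is hypothetical. Whether the two corner circuits shared between the first column and the first and m-th rows are counted once or twice in k is not stated in the text; the sketch counts each circuit once.

```python
def directly_connected_positions(m, n):
    """Return (row, column) positions of the circuits wired to the primary circuit."""
    positions = set()
    for col in range(n):
        positions.add((0, col))        # first row
        positions.add((m - 1, col))    # m-th row
    for row in range(m):
        positions.add((row, 0))        # first column
    return positions
```

- For m = 3 and n = 4, the sketch yields nine distinct positions; all remaining circuits exchange data with the primary processing circuit only through these.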
- The primary processing circuit is configured to allocate a piece of input data into multiple data blocks and to send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- The k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- The multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit. The above is only an exemplary description and is not intended to limit the disclosure. The neural network operation unit may be replaced with a non-neural network operation unit, for example, a universal operation unit. Universal computation has a corresponding universal operation instruction and data, and its computation process is similar to that of the neural network computation. The universal computation may be, for example, scalar arithmetic computation or scalar logical computation. The universal operation unit may include, but is not limited to, one or more multipliers and one or more adders, and executes basic computation such as addition and multiplication.
- In some examples, the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit.
- The conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- The activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- The addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- In some examples, the secondary processing circuit may include a multiplication processing circuit.
- The multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- In some examples, the secondary processing circuit may further include an accumulation processing circuit. The accumulation processing circuit is configured to execute accumulation computation on the product result to obtain the intermediate result.
- In some examples, the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two.
- Another example of the disclosure provides a chip, which may include the data processing device of the abovementioned example.
- Another example of the disclosure provides a chip package structure, which may include the chip of the abovementioned example.
- Another example of the disclosure provides a board card, which may include the chip package structure of the abovementioned example.
- Another example of the disclosure provides an electronic device, which may include the board card of the abovementioned example. The electronic device may include a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a drive recorder, a navigator, a sensor, a webcam, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a transportation means, a household electrical appliance, and/or a medical device.
- The transportation means may include an airplane, a ship, and/or a vehicle. The household electrical appliance may include a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and/or a range hood. The medical device may include a nuclear magnetic resonance spectrometer, a B-ultrasonic scanner, and/or an electrocardiograph.
- All of the units and modules in the disclosure may be hardware structures. Physical implementations of the hardware structures may include, but are not limited to, physical devices, and the physical devices include, but are not limited to, transistors, memristors, and deoxyribonucleic acid (DNA) computers.
- The disclosure provides a data redundancy method. Data is divided into multiple importance ranks, and different data redundancy processing is performed for data of different importance ranks. Therefore, a storage capacity overhead and a memory access power consumption overhead are reduced on the basis of ensuring security and reliability of stored data.
- Specifically,
FIG. 14 is a flowchart of a data redundancy method. As illustrated in FIG. 14, the data redundancy method specifically may include the following steps. - In S101, data is divided into M importance ranks, M being a positive integer.
- Specifically, the importance ranks of the data may be set by comprehensively considering factors such as a size of the data, a magnitude of an absolute value of the data, a type (floating point type and fixed point type) of the data, a read operation frequency of the data, and a write operation frequency of the data.
- In S102, important bits of each piece of data in each importance rank are extracted. Specifically, the bits of the data are divided into important bits and unimportant bits. If the data has x bits in total, of which y bits are important bits and (x−y) bits are unimportant bits, where x and y are both positive integers and 0<y≤x, only the y important bits of the data are subsequently processed. The positions of the y important bits may be continuous or discontinuous.
- In S103, data redundancy processing is performed on the important bits.
- Specifically, data redundancy processing may include replica redundancy processing and/or ECC processing, and different processing may be performed according to importance. For example, when all bits in a piece of data are important bits, ECC processing may be performed on all the bits of the data. When only part of the bits in a piece of data are important bits, replica redundancy processing is performed on the important bits of the data.
- Replica redundancy may keep the redundant backups in the same storage medium as the data or in different storage media. N data replicas may be backed up simultaneously, where N is a positive integer. An ECC manner may include CRC and ECC.
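- Steps S102 and S103 can be sketched in Python as follows. This is only an illustration under stated assumptions: the important bits are taken to be the y most significant bits of an x-bit word (the text also allows discontinuous positions), and the function names are hypothetical.

```python
def important_bit_mask(x: int, y: int) -> int:
    """Mask selecting the y most significant bits of an x-bit word (assumed layout)."""
    if not 0 < y <= x:
        raise ValueError("require 0 < y <= x")
    return ((1 << y) - 1) << (x - y)


def extract_important_bits(word: int, x: int, y: int) -> int:
    """S102: keep only the important bits, shifted down to the low end."""
    return (word & important_bit_mask(x, y)) >> (x - y)


def make_replicas(important: int, n: int) -> list:
    """S103 (replica redundancy): n backup copies of the important bits."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [important] * n
```

For an 8-bit word with three important bits, `extract_important_bits(0b10110101, 8, 3)` keeps the leading `101`, and `make_replicas` then produces the backups of only those bits, which is where the storage saving comes from.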
- The data redundancy method in the example will be specifically introduced below with some examples.
- In some examples of the disclosure, redundancy storage is performed on a control unit, and redundancy storage is not performed on an operation unit. For example, redundancy storage is performed on the neural network instruction and is not performed on the neural network parameter; the neural network instruction is configured as the first importance rank, and the neural network parameter is configured as a second importance rank. The neural network parameter may include topological structure information, neuron data, and weight data. Redundancy storage is performed on data of the first importance rank and is not performed on data of the second importance rank. When a read operation is executed on the data of the first importance rank, the raw data and the two backed-up data replicas are read. If the three copies are inconsistent, the value held by the two copies that agree is taken as the finally read data, and the third, inconsistent copy is corrected at the same time. When a write operation is executed on the data of the first importance rank, the data is also written back to the two backup addresses at the same time, so that the raw data and the two backed-up data replicas remain consistent.
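- The read and write behavior described above amounts to a two-out-of-three vote with repair. A minimal Python sketch follows; the `TripleStore` class is a hypothetical in-memory model, and raising an error when all three copies disagree is an assumption, since the text does not cover that case.

```python
from collections import Counter


class TripleStore:
    """Raw copy plus two backups of each value (hypothetical in-memory model)."""

    def __init__(self):
        self.copies = [{}, {}, {}]  # raw data, backup 1, backup 2

    def write(self, addr, value):
        # Write the raw address and both backup addresses together.
        for copy in self.copies:
            copy[addr] = value

    def read(self, addr):
        values = [copy[addr] for copy in self.copies]
        winner, votes = Counter(values).most_common(1)[0]
        if votes < 2:
            # Assumption: with no majority the value cannot be recovered.
            raise ValueError("all three copies disagree; no majority")
        # Repair the inconsistent copy, if any, with the majority value.
        for copy in self.copies:
            copy[addr] = winner
        return winner
```

After a single copy is corrupted, `read` still returns the value shared by the other two copies and rewrites the damaged one.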
- As illustrated in
FIG. 16 , a neural network processor may include a storage unit, a control unit, and an operation unit. - The storage unit is configured to receive external input data, to store a neuron, weight, and an instruction of a neural network, to send the instruction to the control unit, and to send the neuron and the weight to the operation unit.
- The control unit is configured to receive the instruction sent by the storage unit and decode the instruction to generate control information to control the operation unit.
- The operation unit is configured to receive the weight and the neuron sent by the storage unit, to complete neural network training computation, and to retransmit an output neuron to the storage unit for storage.
- The neural network processor may further include an instruction redundancy processing unit. The instruction redundancy processing unit is embedded in the storage unit and the instruction control unit respectively to perform data redundancy processing on the instruction.
- In some examples of the disclosure, a topological structure of the operation unit is illustrated in
FIG. 11. The operation unit may include a primary processing circuit and multiple secondary processing circuits. The topological structure illustrated in FIG. 11 is a tree module. The tree module may include a root port and multiple branch ports. The root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively. The tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits. As illustrated in FIG. 11, the tree module may include a multilayer node structure, a node is a structure with a forwarding function, and the node may have no computation function. - In some examples of the disclosure, the topological structure of the operation unit is illustrated in
FIG. 12 . The operation unit may include a primary processing circuit, multiple secondary processing circuits and a branch processing circuit. The primary processing circuit is specifically configured to allocate an input neuron into multiple data blocks and send at least one data block of the multiple data blocks, the weight and at least one operation instruction of multiple operation instructions to the branch processing circuit. - The branch processing circuit is configured to forward the data block, the weight, and the operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- The multiple secondary processing circuits are configured to execute computation on the received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction and to send the result of the operation instruction to the control unit.
- In some examples of the disclosure, the topological structure of the operation unit is illustrated in
FIG. 13. The operation unit may include a primary processing circuit and multiple secondary processing circuits. The multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits. The primary processing circuit is connected with k secondary processing circuits of the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an mth row, and m secondary processing circuits in a first column. - The k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
- The primary processing circuit is configured to allocate a piece of input data into multiple data blocks and send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- The k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- The multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit. In some examples, the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit and an addition processing circuit.
- The conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- The activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- The addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- In some examples, the secondary processing circuit may include a multiplication processing circuit.
- The multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- In some examples, the secondary processing circuit may further include an accumulation processing circuit. The accumulation processing circuit is configured to execute accumulation computation on the product result to obtain an intermediate result.
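- The division of work between the primary and secondary processing circuits can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the blocks are assumed to be contiguous slices, and ReLU stands in for the unspecified activation computation.

```python
import math


def split_blocks(values, num_blocks):
    """Primary circuit: split the input into contiguous data blocks."""
    size = math.ceil(len(values) / num_blocks)
    return [values[i:i + size] for i in range(0, len(values), size)]


def secondary_compute(data_block, weight_block):
    """Secondary circuit: product computation, then accumulation."""
    products = [d * w for d, w in zip(data_block, weight_block)]
    return sum(products)  # the intermediate result


def primary_postprocess(intermediate_results, activation=lambda v: max(0.0, v)):
    """Primary circuit: addition of intermediate results, then activation."""
    return activation(sum(intermediate_results))


def run_operation(inputs, weights, num_secondary):
    data_blocks = split_blocks(inputs, num_secondary)
    weight_blocks = split_blocks(weights, num_secondary)
    partials = [secondary_compute(d, w) for d, w in zip(data_blocks, weight_blocks)]
    return primary_postprocess(partials)
```

Each call to `secondary_compute` models one secondary processing circuit working on its own block, so the per-block loop is the part that would run in parallel in hardware.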
- In some examples, the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two.
- In some examples, data redundancy is performed on a neural network parameter.
- At first, M importance ranks, namely the first, second, ..., Mth importance ranks, are determined for the neural network parameter according to the magnitude of the absolute value of the parameter, and each parameter is assigned to the corresponding importance rank.
- Specifically, M+1 threshold values are set and recorded as T_0, T_1, T_2, ..., T_M in descending order. When the absolute value D of the neural network parameter meets T_{i-1} > D > T_i, the parameter is assigned to the ith importance rank, where i = 1, 2, ..., M, the thresholds T_0, T_1, T_2, ..., T_M are all real numbers, and T_0 > T_1 > T_2 > ... > T_M ≥ 0. That is, when the absolute value of the neural network parameter meets T_0 > D > T_1, the neural network parameter is assigned to the first importance rank; when the absolute value meets T_1 > D > T_2, the neural network parameter is assigned to the second importance rank; and so on.
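- The threshold rule above can be sketched in Python as follows. The function name is hypothetical, and returning `None` for a value that falls exactly on a threshold or outside the range is an assumption, because the strict inequalities in the text leave those cases open.

```python
def importance_rank(value, thresholds):
    """Return the 1-based rank i with T[i-1] > |value| > T[i], else None.

    thresholds is [T0, T1, ..., TM], strictly decreasing with TM >= 0.
    """
    decreasing = all(a > b for a, b in zip(thresholds, thresholds[1:]))
    if not decreasing or thresholds[-1] < 0:
        raise ValueError("thresholds must be strictly decreasing and end at >= 0")
    d = abs(value)
    for i in range(1, len(thresholds)):
        if thresholds[i - 1] > d > thresholds[i]:
            return i
    return None  # on a boundary or out of range (assumed handling)
```

With thresholds `[1.0, 0.5, 0.1, 0.0]`, a weight of -0.7 lands in the first rank and a weight of 0.05 in the third, so larger-magnitude parameters receive the stronger protection.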
- A floating point type parameter among the parameters of the ith importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where x_i and y_i are both positive integers and 0 < y_i ≤ x_i.
- A fixed point type parameter among the parameters of the ith importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where x_i and z_i are both positive integers and 0 < z_i ≤ x_i.
- A data backup manner is adopted for data redundancy of the important bits in a parameter of the ith importance rank: two replicas are backed up, and redundancy storage is not performed on the unimportant bits. When a read operation is executed on the parameter of the ith importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the three copies are inconsistent, the value held by the two copies that agree is taken as the finally read data, and the third, inconsistent copy is corrected at the same time. When a write operation is executed on the parameter of the ith importance rank, the important bits are also written back to the two backup addresses at the same time, so that the raw data and the two backed-up data replicas remain consistent.
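- For a floating point parameter, the split into important and unimportant bits can be sketched in Python as follows. The sketch assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 base bits) and reads "the first y bits" as the y bits that immediately follow the sign bit; both are assumptions, and the function names are hypothetical.

```python
import struct


def float32_bits(value: float) -> int:
    """Raw IEEE 754 single-precision bit pattern of value."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def float32_important_mask(y: int) -> int:
    """Sign bit plus the next y bits (exponent first, then base part)."""
    if not 0 < y <= 31:
        raise ValueError("require 0 < y <= 31 for a 32-bit float")
    keep = 1 + y
    return ((1 << keep) - 1) << (32 - keep)


def split_important(value: float, y: int):
    """Return (important_bits, unimportant_bits) as masked 32-bit integers."""
    bits = float32_bits(value)
    mask = float32_important_mask(y)
    return bits & mask, bits & ~mask & 0xFFFFFFFF
```

With y = 8 the mask covers the sign and the whole exponent, so only those nine bits would be replicated while the 23 base bits are stored once.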
- In some examples, data redundancy is performed on a sparse neural network parameter.
- In the example, the sparse neural network parameter is divided into two parts, i.e., a nonzero parameter and a nonzero parameter position respectively.
- The nonzero parameter position is configured as the first importance rank, all of its data bits are marked as important bits, and a CRC code manner is adopted for redundancy storage. When a read operation is executed, the stored CRC code is read and a CRC code of the raw data is calculated; if the two CRC codes are inconsistent, the data is corrected according to the stored CRC code. When a write operation is executed, both the raw data and the CRC code are stored.
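- The CRC-based write and read check can be sketched in Python with the standard `zlib.crc32` function. The `CrcProtectedStore` class is a hypothetical in-memory model, CRC-32 is an assumed choice of code, and the sketch covers only the detection of an inconsistency; the correction step mentioned above is not modeled.

```python
import zlib


class CrcProtectedStore:
    """Stores each record with its CRC-32 (hypothetical in-memory model)."""

    def __init__(self):
        self._records = {}

    def write(self, key, payload: bytes):
        # Store both the raw data and its CRC code.
        self._records[key] = (payload, zlib.crc32(payload))

    def read(self, key) -> bytes:
        payload, stored_crc = self._records[key]
        if zlib.crc32(payload) != stored_crc:
            # Detection only; recovery is left out of this sketch.
            raise IOError(f"CRC mismatch for record {key!r}")
        return payload
```

A read of an intact record returns the payload unchanged, while a record whose payload was altered after the write fails the comparison against the stored code.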
- An importance rank is set for the nonzero parameter of the neural network according to the magnitude of the absolute value of the parameter, and M−1 importance ranks are sequentially set starting from the second importance rank. M threshold values are set and recorded as T_1, T_2, ..., T_M in descending order. When the absolute value D of the nonzero parameter meets T_{i-1} > D > T_i, the parameter is assigned to the ith importance rank, where i = 2, 3, ..., M, the thresholds T_1, T_2, ..., T_M are all real numbers, and T_1 > T_2 > ... > T_M ≥ 0. That is, when the absolute value of the nonzero parameter meets T_1 > D > T_2, the nonzero parameter is assigned to the second importance rank; when the absolute value meets T_2 > D > T_3, the nonzero parameter is assigned to the third importance rank; and so on.
- A floating point type parameter among the parameters of the ith importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where x_i and y_i are both positive integers and 0 < y_i ≤ x_i.
- A fixed point type parameter among the parameters of the ith importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where x_i and z_i are both positive integers and 0 < z_i ≤ x_i.
- A data backup manner is adopted for data redundancy of the important bits in a parameter of the ith importance rank: two replicas are backed up, and redundancy storage is not performed on the unimportant bits. When a read operation is executed on the parameter of the ith importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the three copies are inconsistent, the value held by the two copies that agree is taken as the finally read data, and the third, inconsistent copy is corrected at the same time. When a write operation is executed on the parameter of the ith importance rank, the important bits are also written back to the two backup addresses at the same time, so that the raw data and the two backed-up data replicas remain consistent.
- In some examples, redundancy is performed on data in a diagram computation application.
- In the example, the data in the diagram computation application is divided into two parts, including vertex data and side data.
- The vertex data in the diagram computation application is configured as the first importance rank. All data bits are marked as important bits and a CRC code manner is adopted for redundancy storage. When a read operation is executed, a stored CRC code is read, and a CRC code of raw data is calculated, and if the two CRC codes are inconsistent, the data is corrected according to the stored CRC code. When a write operation is executed, both of the raw data and the CRC code are stored.
- An importance rank is set for the side data in the diagram computation application according to an access frequency of the side data, and M−1 importance ranks are sequentially set starting from the second importance rank. M threshold values are set and recorded as T_1, T_2, ..., T_M in descending order. When the access frequency F of the side data meets T_{i-1} > F > T_i, the data is assigned to the ith importance rank, where i = 2, 3, ..., M, the thresholds T_1, T_2, ..., T_M are all real numbers, and T_1 > T_2 > ... > T_M ≥ 0. That is, when the access frequency of the side data meets T_1 > F > T_2, the side data is assigned to the second importance rank; when the access frequency meets T_2 > F > T_3, the side data is assigned to the third importance rank; and so on.
- Floating point type side data in the ith importance rank has x_i bits in total, and the sign bit and the first y_i bits of the exponential part and the base part are specified as important bits, where x_i and y_i are both positive integers and 0 < y_i ≤ x_i.
- Fixed point type side data in the ith importance rank has x_i bits in total, and the sign bit and the first z_i bits of the numerical part are specified as important bits, where x_i and z_i are both positive integers and 0 < z_i ≤ x_i.
- A data backup manner is adopted for data redundancy of the important bits in the side data of the ith importance rank: two replicas are backed up, and redundancy storage is not performed on the unimportant bits. When a read operation is executed on the side data of the ith importance rank, the raw data and the two backed-up data replicas are read simultaneously for the important bits. If the three copies are inconsistent, the value held by the two copies that agree is taken as the finally read data, and the third, inconsistent copy is corrected at the same time. When a write operation is executed on the side data of the ith importance rank, the important bits are also written back to the two backup addresses at the same time, so that the raw data and the two backed-up data replicas remain consistent.
- In some examples, a data redundancy device 100 is provided. FIG. 15 is a structure block diagram of a data redundancy device. As illustrated in FIG. 15, the data redundancy device 100 may include an importance rank dividing unit 10, an important bit extraction unit 20, and a data redundancy processing unit 30.
- The importance rank dividing unit 10 is configured to divide data into M importance ranks according to importance, M being a positive integer. Specifically, the importance ranks of the data may be set by comprehensively considering factors such as a size of the data, a magnitude of an absolute value of the data, a type (floating point type and fixed point type) of the data, a read operation frequency of the data, and a write operation frequency of the data.
- The important bit extraction unit 20 is configured to extract important bits of each piece of data in each importance rank. The important bit extraction unit 20 may recognize data of different importance ranks, divide data bits into important data bits and unimportant data bits, and extract the important bits of each piece of data of each importance rank.
- The data redundancy processing unit 30 is configured to perform data redundancy processing on the important bits.
- As illustrated in FIG. 2, the data redundancy processing unit 30 may include a redundancy storage unit 31 and a read/write control unit 32.
- The redundancy storage unit 31 may store raw data and perform data redundancy storage on the important bits in the data. Data redundancy may be replica backup or ECC. N replicas may be backed up simultaneously, where N is a positive integer. An ECC manner may include, but is not limited to, CRC and ECC. The redundancy storage unit 31 may be a hard disk, a dynamic random access memory (DRAM), a static random access memory (SRAM), an ECC-DRAM, an ECC-SRAM, or a nonvolatile memory.
- The read/write control unit 32 may execute a read/write operation on redundant data to ensure data read/write consistency.
- The disclosure further provides a DVFS method for a neural network, which may include that: a real-time load and power consumption of a processor are acquired, and a topological structure of the neural network, a scale of the neural network, and a precision requirement of the neural network are acquired; and then, a voltage prediction and frequency prediction method is adopted to scale a working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced.
- FIG. 17 is a flowchart of a DVFS method according to an example of the disclosure. FIG. 19 is a schematic block diagram of a DVFS method according to an example of the disclosure.
- Referring to FIG. 17 and FIG. 19, the DVFS method provided by the example of the disclosure may include the following steps.
- In S1701, a processor load signal and a neural network configuration signal in a present time period T−t˜T are acquired.
- In S1702, a voltage and frequency of a processor in a next time period T˜T+t are predicted according to the processor load signal and the neural network configuration signal in the present time period T−t˜T, where T and t are real numbers greater than zero.
- In S1701, the operation that the processor load signal in the present time period T−t˜T is acquired refers to acquiring a workload of the processor in real time. The processor may be a dedicated processor for neural network operation.
- In some examples, the processor may include a storage unit and a computation unit and may also include other functional units. The disclosure is not limited thereto. The workload of the processor may include a memory access load of the storage unit and a computation load of the computation unit. Power consumption of the processor may include memory access power consumption of the storage unit and computation power consumption of the computation unit.
- In some examples of the disclosure, a topological structure of the computation unit is illustrated in
FIG. 11. The computation unit may include a primary processing circuit and multiple secondary processing circuits. The topological structure illustrated in FIG. 11 is a tree module. The tree module may include a root port and multiple branch ports. The root port of the tree module is connected with the primary processing circuit, and each of the multiple branch ports of the tree module is connected with a secondary processing circuit of the multiple secondary processing circuits respectively. The tree module is configured to forward a data block, a weight, and an operation instruction between the primary processing circuit and the multiple secondary processing circuits. As illustrated in FIG. 11, the tree module may include a multilayer node structure, a node is a structure with a forwarding function, and the node may have no computation function. - In some examples of the disclosure, the topological structure of the computation unit is illustrated in
FIG. 12 . The computation unit may include a primary processing circuit, multiple secondary processing circuits, and a branch processing circuit. The primary processing circuit is specifically configured to allocate an input neuron into multiple data blocks and send at least one data block of the multiple data blocks, the weight and at least one operation instruction of multiple operation instructions to the branch processing circuit. - The branch processing circuit is configured to forward the data block, the weight, and the operation instruction between the primary processing circuit and the multiple secondary processing circuits.
- The multiple secondary processing circuits are configured to execute computation on a received data block and the weight according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the branch processing circuit.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain a result of the operation instruction and to send the result of the operation instruction to the control unit.
- In some examples of the disclosure, the topological structure of the computation unit is illustrated in
FIG. 13. The computation unit may include a primary processing circuit and multiple secondary processing circuits. The multiple secondary processing circuits are distributed in an array. Each secondary processing circuit is connected with the other adjacent secondary processing circuits. The primary processing circuit is connected with k secondary processing circuits of the multiple secondary processing circuits, and the k secondary processing circuits include n secondary processing circuits in a first row, n secondary processing circuits in an mth row, and m secondary processing circuits in a first column. - The k secondary processing circuits are configured to forward data and instructions between the primary processing circuit and the multiple secondary processing circuits.
- The primary processing circuit is configured to allocate a piece of input data into multiple data blocks and to send at least one data block of the multiple data blocks and at least one operation instruction of multiple operation instructions to the k secondary processing circuits.
- The k secondary processing circuits are configured to convert the data between the primary processing circuit and the multiple secondary processing circuits.
- The multiple secondary processing circuits are configured to execute computation on the received data block according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the k secondary processing circuits.
- The primary processing circuit is configured to perform subsequent processing on the intermediate results sent by the k secondary processing circuits to obtain a result of the operation instruction and send the result of the operation instruction to the control unit.
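- The array wiring described above, in which the primary processing circuit connects only to the secondary processing circuits in the first row, the mth row, and the first column, can be sketched in Python as follows. The function name and the 1-based (row, column) coordinates are hypothetical, and counting each corner circuit once is an assumption about how k is defined.

```python
def connected_secondary_circuits(m: int, n: int) -> set:
    """Positions (row, col) of the circuits wired to the primary circuit.

    Assumes an m-row by n-column array with 1-based indices.
    """
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    first_row = {(1, c) for c in range(1, n + 1)}
    last_row = {(m, c) for c in range(1, n + 1)}
    first_col = {(r, 1) for r in range(1, m + 1)}
    return first_row | last_row | first_col
```

For a 3 by 4 array the set holds nine positions; interior circuits such as (2, 2) are reached only through their neighbors, which is why the k circuits also act as forwarders.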
- In some examples, the primary processing circuit is specifically configured to combine and sequence the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit is specifically configured to combine, sequence, and activate the intermediate results sent by the multiple secondary processing circuits to obtain the result of the operation instruction.
- In some examples, the primary processing circuit may include one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit.
- The conversion processing circuit is configured to execute preprocessing on the data, specifically to execute exchange between a first data structure and a second data structure on data or intermediate results received by the primary processing circuit, or to execute exchange between a first data type and a second data type on the data or the intermediate results received by the primary processing circuit.
- The activation processing circuit is configured to execute subsequent processing, specifically to execute activation computation on data in the primary processing circuit.
- The addition processing circuit is configured to execute subsequent processing, specifically to execute addition computation or accumulation computation.
- In some examples, the secondary processing circuit may include a multiplication processing circuit.
- The multiplication processing circuit is configured to execute product computation on the received data block to obtain a product result.
- In some examples, the secondary processing circuit may further include an accumulation processing circuit. The accumulation processing circuit is configured to execute accumulation computation on the product result to obtain the intermediate result.
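- The division of work between the primary processing circuit and the secondary processing circuits described above may be sketched as follows. This is a minimal illustration only: the contiguous block split, the use of a dot product as the operation, the ReLU activation, and all function names are assumptions made for the example and are not taken from the disclosure.

```python
from typing import Callable, List, Sequence


def secondary_circuit(block: Sequence[float], weights: Sequence[float]) -> float:
    """Multiplication circuit followed by accumulation circuit."""
    products = [x * w for x, w in zip(block, weights)]  # product computation
    return sum(products)                                # accumulation -> intermediate result


def primary_circuit(
    input_data: Sequence[float],
    weights: Sequence[float],
    num_secondary: int,
    activation: Callable[[float], float] = lambda v: max(0.0, v),  # assumed ReLU
) -> float:
    """Allocate data blocks, collect intermediate results, then post-process."""
    size = -(-len(input_data) // num_secondary)  # ceiling division: block length
    intermediates: List[float] = []
    for start in range(0, len(input_data), size):
        block = input_data[start:start + size]
        block_weights = weights[start:start + size]
        intermediates.append(secondary_circuit(block, block_weights))
    # Subsequent processing: addition circuit, then activation circuit.
    return activation(sum(intermediates))
```

- In hardware the secondary processing circuits would operate in parallel; the sequential loop here only models the data flow.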
- In some examples, the tree module is configured as an n-ary tree structure, n being an integer greater than or equal to two. In some examples, the neural network configuration signal may include the type of a neural network layer presently processed by the processor, a scale of a parameter of the present layer, and real-time accuracy of the neural network.
- In some examples, the frequency predicted in the operation in S1702, in which the voltage and the frequency of the processor in the next time period T˜T+t are predicted, may include a frequency of the storage unit and/or the computation unit. Here, the prediction may be made by estimation, computation, induction, or a similar manner.
- In some examples, the operation that the frequency of the computation unit is predicted may include that: m segments of frequency scaling ranges are preset for the computation unit, generating m+1 frequency division points f0, f1, . . . , fm in total, where f0<f1< . . . <fm, f0, f1, . . . , fm are real numbers greater than zero and m is a positive integer greater than zero.
- In some examples, the operation that the frequency of the computation unit is predicted may include that: m segments of neural network scales are preset, generating m+1 scale division points n0, n1, . . . , nm in total, where n0<n1< . . . <nm, n0, n1, . . . , nm are positive integers greater than zero and m is a positive integer greater than zero.
- In some examples, the operation that the frequency of the computation unit is predicted may include that: a frequency scaling range of the computation unit is determined according to a range of a scale n of a present processing layer, and if ni-1<n<ni, the frequency scaling range of the computation unit is fi-1<f<fi.
- In some examples, the operation that the frequency of the computation unit is predicted may include the following steps. The frequency scaling range of the computation unit is further narrowed according to the type of the present processing layer, where layers are divided into two types, that is, a compute-intensive layer and a memory access-intensive layer.
- The compute-intensive layer may include a convolutional layer, and the memory access-intensive layer may include a fully connected layer, a pooling layer, and an activation layer.
- If the layer is a compute-intensive layer, the frequency scaling range of the computation unit is (fi-1+fi)/2<f<fi.
- If the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is fi-1/2<f<(fi-1+fi)/2.
- In some examples, the operation that the frequency of the computation unit is predicted may include that: fine granularity regulation is performed on the frequency f of the computation unit according to the present accuracy of the neural network.
- In some examples, the operation that the frequency of the computation unit is determined may include that: when the present accuracy of the neural network is higher than expected accuracy, the frequency of the computation unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the frequency of the computation unit is increased.
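- The coarse and fine steps for the computation unit described above may be sketched as follows. The sketch assumes that a layer scale falling exactly on a division point belongs to the lower segment, that the fine regulation uses a fixed step, and that the result is clamped to the narrowed range; none of these details is specified in the text, and all names are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

# Layer categories as listed in the text; the string labels are illustrative.
COMPUTE_INTENSIVE = {"convolutional"}
MEMORY_INTENSIVE = {"fully_connected", "pooling", "activation"}


def computation_freq_range(
    scale: int,
    layer_type: str,
    scale_points: Sequence[int],   # n0 < n1 < ... < nm
    freq_points: Sequence[float],  # f0 < f1 < ... < fm
) -> Tuple[float, float]:
    """Return the narrowed (low, high) frequency range for the present layer."""
    # Coarse step: find i such that n_{i-1} < scale <= n_i, clamped to 1..m.
    i = min(max(bisect_left(scale_points, scale), 1), len(scale_points) - 1)
    low, high = freq_points[i - 1], freq_points[i]
    middle = (low + high) / 2
    if layer_type in COMPUTE_INTENSIVE:
        return middle, high
    if layer_type in MEMORY_INTENSIVE:
        return low / 2, middle  # lower bound f_{i-1}/2, as stated in the text
    return low, high            # unknown layer type: keep the coarse range


def regulate_frequency(
    freq: float,
    freq_range: Tuple[float, float],
    accuracy: float,
    expected_accuracy: float,
    step: float,  # assumed fixed adjustment step
) -> float:
    """Fine step: lower the frequency when accuracy exceeds expectation, else raise it."""
    low, high = freq_range
    if accuracy > expected_accuracy:
        freq -= step
    elif accuracy < expected_accuracy:
        freq += step
    return min(max(freq, low), high)  # assumed clamping to the narrowed range
```

- For example, with scale division points 0, 1000, 10000, 100000 and frequency division points 200, 400, 600, 800, a convolutional layer of scale 5000 is assigned the range from 500 to 600.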
- In some examples, the operation that the frequency of the storage unit is determined may include that: k segments of frequency scaling ranges, totally k+1 frequency division points F0, F1, . . . , Fk, are preset for the storage unit, where F0<F1< . . . <Fk, F0, F1, . . . , Fk are positive integers greater than zero and k is a positive integer greater than zero; and
- k segments of neural network scales, totally k+1 scale division points N0, N1, . . . , Nk, are preset, where N0<N1< . . . <Nk, N0, N1, . . . , Nk are positive integers greater than zero and k is a positive integer greater than zero.
- In some examples, the operation that the frequency of the storage unit is determined may include that: a frequency scaling range of the storage unit is determined according to a range of a scale N of a present processing layer, and if Ni-1<N<Ni, the frequency scaling range of the storage unit is Fi-1<F<Fi.
- In some examples, the operation that the frequency of the storage unit is predicted may include that: the frequency scaling range of the storage unit is further narrowed according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is Fi-1<F<(Fi-1+Fi)/2; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is (Fi-1+Fi)/2<F<Fi.
- In some examples, the operation that the frequency of the storage unit is predicted may include that: fine granularity regulation is performed on the frequency of the storage unit according to the present accuracy of the neural network, and the frequency of the storage unit in the next time period is predicted.
- In some examples, the operation that the frequency of the storage unit is determined may include that: when the present accuracy of the neural network is higher than expected accuracy, the memory access frequency of the storage unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the memory access frequency of the storage unit is increased.
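- For the storage unit the narrowing is mirrored: a compute-intensive layer is assigned the lower half of the segment and a memory access-intensive layer the upper half. A minimal sketch follows, under the same assumptions as would apply to any such illustration (hypothetical names, a fixed regulation step, clamping to the narrowed range, and a scale on a division point assigned to the lower segment).

```python
from typing import Sequence, Tuple


def storage_freq_range(
    scale: int,
    is_compute_intensive: bool,
    scale_points: Sequence[int],   # N0 < N1 < ... < Nk
    freq_points: Sequence[float],  # F0 < F1 < ... < Fk
) -> Tuple[float, float]:
    """Pick the segment for the layer scale, then halve it by layer type."""
    i = 1
    while i < len(scale_points) - 1 and scale > scale_points[i]:
        i += 1
    low, high = freq_points[i - 1], freq_points[i]
    middle = (low + high) / 2
    return (low, middle) if is_compute_intensive else (middle, high)


def next_storage_freq(
    current: float,
    freq_range: Tuple[float, float],
    accuracy: float,
    expected_accuracy: float,
    step: float,  # assumed fixed adjustment step
) -> float:
    """Decrease the memory access frequency when accuracy is above expectation, else increase it."""
    low, high = freq_range
    if accuracy > expected_accuracy:
        current -= step
    elif accuracy < expected_accuracy:
        current += step
    return min(max(current, low), high)
```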
- The operation in S1702 that the voltage and the frequency of the processor in the next time period T˜T+t are predicted may further include that: a prediction method is adopted to predict the voltage and the frequency of the processor in the next time period. The prediction method may include a preceding value method, a moving average load method, an exponentially weighted average method, and/or a minimum average method.
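- Three of the named prediction methods may be illustrated on a history of sampled values (for example, processor load or frequency in preceding time periods). The sketch below uses the common textbook definitions of these estimators, which is an assumption; the window length and the smoothing factor alpha are hypothetical parameters. The minimum average method is omitted because the text does not define it.

```python
from typing import Sequence


def preceding_value(history: Sequence[float]) -> float:
    """Predict that the next period equals the most recent observation."""
    return history[-1]


def moving_average(history: Sequence[float], window: int) -> float:
    """Predict the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exponentially_weighted_average(history: Sequence[float], alpha: float) -> float:
    """Predict with exponentially decaying weights; alpha weights the newest sample."""
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate
```

- A larger alpha or a shorter window makes the prediction follow recent changes more closely, at the cost of reacting to short load spikes.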
-
FIG. 18 is a flowchart of a DVFS method according to another example of the disclosure. In the scaling method of the example, S1801-S1802 are the same as S1701-S1702. The difference is that the method may further include S1803 and S1804. - After the operation that the voltage and the frequency of the processor in the next time period are predicted, the method may further include S1803: a clock setting of a chip is regulated according to the predicted frequency in the next time period to scale the frequency of the processor.
- After the operation that the voltage and the frequency of the processor in the next time period are predicted, the method may further include S1804: a power management module of the chip is regulated according to the predicted frequency in the next time period, to scale the voltage supplied to the processor.
-
FIG. 20 is a schematic diagram of a DVFS co-processor according to an example of the disclosure. According to another aspect of the disclosure, a DVFS co-processor is provided, which may include a signal acquisition unit and a performance prediction unit. - The signal acquisition unit is configured to acquire a workload of a processor, and is further configured to acquire a neural network configuration signal.
- The performance prediction unit is configured to receive the neural network configuration signal and predict a frequency and voltage of the processor in a next time period according to a present load and power consumption of the processor.
- The signal acquisition unit may acquire a signal related to the load and the power consumption of the processor and the neural network configuration signal, and transmit these signals to the performance prediction unit. The signal acquisition unit may acquire workloads of the computation unit and the storage unit in the neural network processor, and acquire a present layer type and a present layer scale for processing of a neural network and real-time accuracy of the neural network, and transmit these signals to the performance prediction unit.
- The performance prediction unit may receive the signals acquired by the signal acquisition unit, predict performance required by the processor in the next time period according to a present system load condition and the neural network configuration signal and output a signal for scaling the frequency and the voltage.
- In some examples, the frequency in the operation that the voltage and the frequency of the processor in the next time period are predicted in the performance prediction unit may include: a frequency of the storage unit and/or the computation unit.
- As illustrated in FIG. 5, in some examples, the DVFS co-processor may further include a frequency scaling unit configured to receive a frequency signal, determined by the performance prediction unit, of the processor in the next time period and scale the frequency of the storage unit and/or computation unit in the processor.
- As illustrated in FIG. 5, in some examples, the DVFS co-processor may further include a voltage scaling unit configured to receive a voltage signal, predicted by the performance prediction unit, of the processor in the next time period and scale a voltage of the storage unit and/or computation unit in the processor.
- As illustrated in FIG. 5, the performance prediction unit is connected with the signal acquisition unit, the voltage scaling unit, and the frequency scaling unit. The performance prediction unit receives the type of the layer presently processed by the processor and the scale of the present layer, performs coarse granularity prediction on a frequency range, then finely predicts the voltage and the frequency of the processor according to the present load and power consumption of the processor and the real-time accuracy of the neural network and finally outputs the signal for scaling the frequency and scaling the voltage.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: m segments of frequency scaling ranges are preset for the computation unit, generating m+1 frequency division points f0, f1, . . . , fm in total, where f0<f1< . . . <fm, f0, f1, . . . , fm are real numbers greater than zero and m is a positive integer greater than zero.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: m segments of neural network scales are preset, generating m+1 scale division points n0, n1, . . . , nm in total, where n0<n1< . . . <nm, n0, n1, . . . , nm are positive integers greater than zero and m is a positive integer greater than zero.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: a frequency scaling range of the computation unit is determined according to a range of a scale n of a present processing layer, and if ni-1<n<ni, the frequency scaling range of the computation unit is fi-1<f<fi.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: the frequency scaling range of the computation unit is further narrowed according to the type of the present processing layer, where layers are divided into two types, including a compute-intensive layer, i.e., a convolutional layer, and a memory access-intensive layer, i.e., a fully connected layer and/or a pooling layer; if the layer is a compute-intensive layer, the frequency scaling range of the computation unit is (fi-1+fi)/2<f<fi; and if the layer is a memory access-intensive layer, the frequency scaling range of the computation unit is fi-1/2<f<(fi-1+fi)/2.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that: fine granularity regulation is performed on the frequency of the computation unit according to the present accuracy of the neural network.
- In some examples, the operation that the frequency of the computation unit of the processor in the next time period is predicted in the performance prediction unit may include that:
- when the present accuracy of the neural network is higher than expected accuracy, the frequency of the computation unit is decreased, and when the present accuracy of the neural network is lower than the expected accuracy, the frequency of the computation unit is increased.
- In some examples, the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: k segments of frequency scaling ranges are preset for the storage unit, generating k+1 frequency division points F0, F1, . . . , Fk in total, where F0<F1< . . . <Fk, F0, F1, . . . , Fk are positive integers greater than zero and k is a positive integer greater than zero.
- In some examples, the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that: k segments of neural network scales are preset, generating k+1 scale division points N0, N1, . . . , Nk in total, where N0<N1< . . . <Nk, N0, N1, . . . , Nk are positive integers greater than zero and k is a positive integer greater than zero.
- In some examples, the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that:
- a frequency scaling range of the storage unit is determined according to a range of a scale N of the present processing layer, and if
- Ni-1<N<Ni, the frequency scaling range of the storage unit is Fi-1<F<Fi.
- In some examples, the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that:
- the frequency scaling range of the storage unit is further narrowed according to the type of the present processing layer; if the layer is a compute-intensive layer, the frequency scaling range of the storage unit is Fi-1<F<(Fi-1+Fi)/2; and if the layer is a memory access-intensive layer, the frequency scaling range of the storage unit is (Fi-1+Fi)/2<F<Fi.
- In some examples, the operation that the frequency of the storage unit of the processor in the next time period is predicted in the performance prediction unit may include that:
- fine granularity regulation is performed on the frequency of the storage unit according to a present utilization rate and power consumption of the processor and the present accuracy of the neural network.
- In some examples, the operation that the neural network configuration signal is acquired in the signal acquisition unit may include that: the present layer type and the present layer scale for processing of the neural network and the real-time accuracy of the neural network are acquired.
- As illustrated in
FIG. 21 , in some examples, the performance prediction unit may include at least one of: a preceding value method-based prediction unit, adopting a preceding value method to predict the voltage and the frequency of the processor in the next time period; a moving average load method-based prediction unit, adopting a moving average load method to predict the voltage and the frequency of the processor in the next time period; an exponentially weighted average method-based prediction unit, adopting an exponentially weighted average method to predict the voltage and the frequency of the processor in the next time period; and a minimum average method-based prediction unit, adopting a minimum average method to predict the voltage and the frequency of the processor in the next time period. - The disclosure provides the DVFS method and DVFS co-processor for the neural network. According to the DVFS method, the real-time load and power consumption of the processor are acquired, and the topological structure of the neural network, the scale of the neural network, and the precision requirement of the neural network are acquired; and then, a voltage prediction and frequency prediction method is adopted to scale the working voltage and frequency of the processor. Therefore, performance of the processor is reasonably utilized, and power consumption of the processor is reduced. A DVFS algorithm for the neural network is integrated in the DVFS co-processor, and thus the characteristics of topological structure, network scale, precision requirement, and the like of the neural network may be fully mined. 
The signal acquisition unit acquires a system load signal and a topological structure signal of the neural network, a neural network scale signal, and a neural network precision signal in real time; the performance prediction unit predicts the voltage and the frequency required by the system; the frequency scaling unit scales the working frequency of the neural network processor; and the voltage scaling unit scales the working voltage of the neural network processor. Therefore, the performance of the neural network processor is reasonably utilized, and the power consumption of the neural network processor is effectively reduced.
- The disclosure provides an information processing device.
FIG. 22 is a functional module diagram of an information processing device according to an example of the disclosure. As illustrated in FIG. 22, the information processing device may include a storage unit and a data processing unit. The storage unit is configured to receive and store input data, an instruction, and output data. The input data may include one or more images. The data processing unit performs extraction and computational processing on a key feature included in the input data and generates a multidimensional vector for each image according to a computational processing result.
- The key feature may include a facial action and expression, a key point position, and the like in the image. A specific form is a feature map (FM) in a neural network. The image may include a static picture, pictures forming a video, a video, or the like. The static picture, the pictures forming the video, or the video may include images of one or more parts of a face. The one or more parts of the face include facial muscles, lips, eyes, eyebrows, nose, forehead, ears, and combinations thereof.
- Each element of the vector represents an emotion on the face, for example, anger, delight, pain, depression, sleepiness, and doubt. The storage unit is further configured to, after tagging an n-dimensional vector, output the n-dimensional vector, namely outputting the n-dimensional vector obtained by computation.
- In some examples, the information processing device may further include a conversion module configured to convert the n-dimensional vector into a corresponding output. The output may be a control instruction, data (0, 1 output), a tag (happiness, depression, and the like), or picture output.
- The control instruction may be single click, double click, and dragging of a mouse, single touch, multi-touch, and sliding of a touch screen, turning on and turning off of a switch, and a shortcut key.
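- The conversion of the n-dimensional vector into a tag or into 0/1 data may be sketched as follows. The label order, the choice of the largest element as the tag, and the threshold of 0.5 are assumptions made for this illustration; the text does not fix how the conversion module derives its output.

```python
from typing import List, Sequence

# Example emotions named in the text; their order here is an assumption.
EMOTIONS = ("anger", "delight", "pain", "depression", "sleepiness", "doubt")


def to_tag(vector: Sequence[float], labels: Sequence[str] = EMOTIONS) -> str:
    """Return the label of the strongest element of the emotion vector."""
    if len(vector) != len(labels):
        raise ValueError("vector and labels must have the same length")
    strongest = max(range(len(vector)), key=lambda i: vector[i])
    return labels[strongest]


def to_binary(vector: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return 0/1 data: 1 where an element reaches the assumed threshold."""
    return [1 if value >= threshold else 0 for value in vector]
```

- A control instruction could be produced in the same manner by mapping each tag to an instruction such as a single click or a switch operation.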
- In some examples, the information processing device is configured for adaptive training.
- Correspondingly, the storage unit is configured to input n images, each image including a tag, each image corresponding to a vector (real emotion vector) and n being a positive integer greater than or equal to one.
- The data processing unit takes calibrated data as an input, calculates an output emotion vector, i.e., a predicted emotion vector, in a format the same as the input, compares the output emotion vector with the real emotion vector and updates a parameter of the device according to a comparison result.
- The emotion vector may include n elements. A value of each element of the emotion vector may include the following conditions.
- (1) The value of each element of the emotion vector may be a number between zero and one (representing a probability of appearance of a certain emotion).
- (2) The value of each element of the emotion vector may also be any number greater than or equal to zero (representing an intensity of a certain emotion).
- (3) A value of only one element of the emotion vector is one and values of the other elements are zero. Under this condition, the emotion vector may only represent a strongest emotion.
- Specifically, the predicted emotion vector may be compared with the real emotion vector by calculating a Euclidean distance or by calculating an absolute value of the dot product of the predicted emotion vector and the real emotion vector. For example, n is three; the predicted emotion vector is [a1, a2, a3]; the real emotion vector is [b1, b2, b3]; the Euclidean distance of the two is [(a1−b1)^2+(a2−b2)^2+(a3−b3)^2]^(1/2); and the absolute value of the dot product of the two is |a1*b1+a2*b2+a3*b3|. Those skilled in the art may understand that the comparison manners are not limited to calculating the Euclidean distance and calculating the absolute value of the dot product, and other methods may also be adopted.
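- The two comparison manners may be written directly from the expressions above. The function names are hypothetical; the computations follow the Euclidean distance and the absolute value of the dot product as defined in the preceding paragraph.

```python
import math
from typing import Sequence


def euclidean_distance(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, real)))


def abs_dot_product(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Absolute value of the sum of element-wise products."""
    return abs(sum(a * b for a, b in zip(predicted, real)))
```

- A smaller Euclidean distance indicates that the predicted emotion vector is closer to the real emotion vector, so the distance may serve as the quantity to be reduced when the parameter of the device is updated.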
- In a specific example of the disclosure, as illustrated in
FIG. 23 , the information processing device is an artificial neural network chip. The operation that the parameter of the device is updated may include that: a parameter (weight, offset, and the like) of the neural network is adaptively updated. - The storage unit of the artificial neural network chip is configured to store the data and the instruction. The data may include an input neuron, an output neuron, a weight, the image, the vector, and the like. The data processing unit of the artificial neural network chip may include an operation unit configured to execute corresponding computation on the data according to an instruction stored in the storage unit. The operation unit may be a scalar computation unit configured to complete a scalar multiplication, a scalar addition, or a scalar multiplication and addition operation, or a vector computation unit configured to complete a vector multiplication, vector addition or vector dot product operation, or a hybrid computation unit configured to complete a matrix multiplication and addition operation, a vector dot product computation, and nonlinear computation, or convolutional computation. The computation executed by the operation unit may include neural network operation.
- In some examples, a structure of the operation unit is illustrated in FIGS. 11 to 13. For the specific connecting relationship, refer to the descriptions mentioned above, which will not be elaborated herein.
- In some examples, the operation unit may include, but is not limited to: a first part including a multiplier, a second part including one or more adders (more specifically, the adders of the second part form an adder tree), a third part including an activation function unit, and/or a fourth part including a vector processing unit. More specifically, the vector processing unit may process vector computation and/or pooling computation. The first part multiplies input data 1 (in1) and input data 2 (in2) to obtain a multiplied output (out), and a process is: out=in1×in2. The second part adds the input data in1 through the adders to obtain output data (out). More specifically, when the second part is the adder tree, the input data in1 is added step by step through the adder tree to obtain the output data (out), in1 being a vector with a length N and N being greater than one, and a process is: out=in1[1]+in1[2]+ . . . +in1[N]; and/or the input data (in1) is accumulated through the adder tree and then is added with the input data (in2) to obtain the output data (out), and a process is: out=in1[1]+in1[2]+ . . . +in1[N]+in2; or the input data (in1) and the input data (in2) are added to obtain the output data (out), and a process is: out=in1+in2. The third part executes computation on the input data (in) through an activation function (active) to obtain activation output data (out), and a process is: out=active(in). The activation function may be sigmoid, tanh, RELU, softmax, and the like. Besides an activation operation, the third part may implement another nonlinear function and may execute computation (f) on the input data (in) to obtain the output data (out), and a process is: out=f(in). 
The vector processing unit executes pooling computation on the input data (in) to obtain output data (out) after a pooling operation, and a process is out=pool(in), where pool is the pooling operation. The pooling operation may include, but is not limited to: average pooling, maximum pooling, and median pooling, and the input data in is data in a pooling core related to an output out.
- The operation that the operation unit executes computation may include that: the first part multiplies the input data 1 and the input data 2 to obtain multiplied data; and/or the second part executes addition computation (more specifically, adder tree computation configured to add the input data 1 through the adder tree step by step) or adds the input data 1 and the input data 2 to obtain output data; and/or the third part executes the activation function computation, that is, executes computation on the input data through the activation function (active) to obtain the output data; and/or the fourth part executes the pooling computation, out=pool(in). Pool is the pooling operation, the pooling operation may include, but is not limited to: average pooling, maximum pooling, and median pooling, and the input data in is data in the pooling core related to the output out. The computation of one or more parts of the abovementioned parts may be freely selected for combination in different sequences, thereby implementing computation of various functions.
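- The four parts of the operation unit and their free combination may be sketched in software as follows. This is only an illustration of the data flow: the function names are hypothetical, only the RELU and sigmoid activations are shown, and the pairwise loop stands in for the step-by-step addition of an adder tree.

```python
import math
import statistics
from typing import List, Sequence


def multiply(in1: Sequence[float], in2: Sequence[float]) -> List[float]:
    """First part: out = in1 x in2, element by element."""
    return [a * b for a, b in zip(in1, in2)]


def adder_tree(in1: Sequence[float], in2: float = 0.0) -> float:
    """Second part: out = in1[1] + ... + in1[N] (+ in2), added level by level."""
    level = list(in1)
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd element passes through to the next level
            paired.append(level[-1])
        level = paired
    return (level[0] if level else 0.0) + in2


def activate(values: Sequence[float], kind: str = "relu") -> List[float]:
    """Third part: out = active(in)."""
    if kind == "relu":
        return [max(0.0, v) for v in values]
    if kind == "sigmoid":
        return [1.0 / (1.0 + math.exp(-v)) for v in values]
    raise ValueError(f"unsupported activation: {kind}")


def pool(window: Sequence[float], mode: str = "max") -> float:
    """Fourth part: out = pool(in) over the data in one pooling core."""
    if mode == "max":
        return max(window)
    if mode == "average":
        return sum(window) / len(window)
    if mode == "median":
        return statistics.median(window)
    raise ValueError(f"unsupported pooling mode: {mode}")
```

- Combining the parts in sequence, for example activate([adder_tree(multiply(x, w), bias)]), gives the output of one neuron with inputs x, weights w, and offset bias.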
- In a specific example of the disclosure, further referring to
FIG. 23 , the artificial neural network chip may further include a control unit, an instruction cache unit, a weight cache unit, an input neuron cache unit, an output neuron cache unit, and a DMA. - The control unit is configured to read an instruction from the instruction cache, decode the instruction into an operation unit instruction and input the operation unit instruction to the operation unit.
- The instruction cache unit is configured to store the instruction.
- The weight cache unit is configured to cache weight data.
- The input neuron cache unit is configured to cache an input neuron input to the operation unit.
- The output neuron cache unit is configured to cache an output neuron output by the operation unit.
- The DMA is configured to read/write data or instructions in the storage unit, the instruction cache, the weight cache, the input neuron cache, and the output neuron cache.
- In a specific example of the disclosure, as illustrated in
FIG. 24 , the artificial neural network chip may further include a conversion unit, connected with the storage unit and configured to receive first output data (data of a final output neuron) and convert the first output data into second output data. - The neural network has a requirement on a format of an input picture, for example, a length, a width, and a color channel. As one alternative implementation, in a specific example of the disclosure, as illustrated in
FIG. 25 and FIG. 26, the artificial neural network chip may further include a preprocessing unit configured to preprocess original input data, i.e., one or more images, to obtain image data consistent with an input layer scale of a bottom layer of an artificial neural network adopted by the chip to meet the requirement of a preset parameter and data format of the neural network. Preprocessing may include segmentation, Gaussian filtering, binarization, regularization, normalization, and the like.
- The preprocessing unit may exist independently of the chip. That is, an information processing device may be configured to include a preprocessing unit and a chip. The preprocessing unit and the chip are configured as described above.
- In a specific example of the disclosure, as illustrated in
FIG. 27 , the operation unit of the chip may adopt a short-bit floating point data module for forward computation, including a floating point data statistical module, a short-bit floating point data conversion unit, and a short-bit floating point data operation module. - The floating point data statistical module is configured to perform statistical analysis on data of each type required by artificial neural network forward computation to obtain an EL.
- The short-bit floating point data conversion unit is configured to implement conversion from a long-bit floating point data type to a short-bit floating point data type according to the EL obtained by the floating point data statistical module.
- The short-bit floating point data operation module is configured to, after the floating point data conversion unit adopts the short-bit floating point data type to represent all inputs, weights, and/or offset data required by the artificial neural network forward computation, execute the artificial neural network forward computation on short-bit floating point data.
- As one alternative implementation, the floating point data statistical module is further configured to perform statistical analysis on the data of each type required by the artificial neural network forward computation to obtain exponential offset. The short-bit floating point data conversion unit is configured to implement conversion from the long-bit floating point data type to the short-bit floating point data type according to the exponential offset and the EL obtained by the floating point data statistical module. The exponential offset and the EL are set, so that a representable data range may be extended as much as possible. Therefore, all data of the input neuron and the weight may be included.
- More specifically, as illustrated in
FIG. 28, the short-bit floating point data conversion unit may include an operation cache unit 31, a data conversion unit 32, and a rounding unit 33.
- The operation cache unit adopts a data type with relatively high accuracy to store an intermediate result of the forward computation. This is because addition or multiplication computation may extend the data range during the forward computation. After computation is completed, a rounding operation is executed on data beyond a short-bit floating point accuracy range. Then, the data in a cache region is converted into the short-bit floating point data through the data conversion unit 32.
- The rounding unit 33 may complete the rounding operation over the data beyond the short-bit floating point accuracy range. The unit may be a random rounding unit, a rounding-off unit, a rounding-up unit, a rounding-down unit, a truncation rounding unit and the like. Different rounding units may implement different rounding operations over the data beyond the short-bit floating point accuracy range.
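- The random rounding unit executes the following operation:
$$y = \begin{cases} \lfloor x \rfloor & \text{w.p. } 1 - \dfrac{x - \lfloor x \rfloor}{\varepsilon} \\ \lfloor x \rfloor + \varepsilon & \text{w.p. } \dfrac{x - \lfloor x \rfloor}{\varepsilon} \end{cases}$$
- where y represents the short-bit floating point data obtained by random rounding; x represents the 32-bit floating point data before random rounding; ε is the smallest positive number that can be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$; $\lfloor x \rfloor$ represents the number obtained by directly truncating the raw data x to short-bit floating point data (similar to a rounding-down operation over decimals); and w.p. represents a probability, that is, the probability that the data y obtained by random rounding is $\lfloor x \rfloor$ is $1 - \frac{x - \lfloor x \rfloor}{\varepsilon}$, and the probability that the data y obtained by random rounding is $\lfloor x \rfloor + \varepsilon$ is $\frac{x - \lfloor x \rfloor}{\varepsilon}$.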
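- The random rounding operation can be sketched in Python as follows. This is a minimal illustration, not the circuit of the disclosure: it assumes the usual stochastic rounding rule, in which the probability of rounding up equals the distance from x to the next lower representable value divided by ε, and it assumes that X denotes the total bit width of the short-bit format. The function names `epsilon`, `floor_to_grid`, and `random_round` are hypothetical.

```python
import math
import random


def epsilon(offset: int, total_bits: int, exponent_bits: int) -> float:
    """Step size 2^(offset - (X - 1 - EL)); X is assumed to be the total bit width."""
    return 2.0 ** (offset - (total_bits - 1 - exponent_bits))


def floor_to_grid(x: float, eps: float) -> float:
    """Largest integral multiple of eps that is less than or equal to x."""
    return math.floor(x / eps) * eps


def random_round(x: float, eps: float, rng: random.Random) -> float:
    """Round x down or up to a multiple of eps at random.

    The probability of rounding up is (x - floor) / eps (assumed rule),
    so the expected value of the result equals x.
    """
    low = floor_to_grid(x, eps)
    p_up = (x - low) / eps
    return low + eps if rng.random() < p_up else low


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [random_round(5.25, 0.5, rng) for _ in range(10000)]
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Under this rule the rounding error averages out over many operations, which is the usual motivation for choosing random rounding over a deterministic mode in low-precision computation.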
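- The rounding-off unit executes the following operation:
$$y = \begin{cases} \lfloor x \rfloor & \text{if } \lfloor x \rfloor \le x \le \lfloor x \rfloor + \dfrac{\varepsilon}{2} \\ \lfloor x \rfloor + \varepsilon & \text{if } \lfloor x \rfloor + \dfrac{\varepsilon}{2} \le x \le \lfloor x \rfloor + \varepsilon \end{cases}$$
- where y represents the short-bit floating point data obtained by rounding-off; x represents the long-bit floating point data before rounding-off; ε is the smallest positive number that can be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$; and $\lfloor x \rfloor$ is an integral multiple of ε whose value is the maximum number less than or equal to x.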
- The rounding-up unit executes the following operation:
$$y = \lceil x \rceil$$
- where y represents the short-bit floating point data obtained by rounding-up; x represents the long-bit floating point data before rounding-up; $\lceil x \rceil$ is an integral multiple of ε whose value is the minimum number greater than or equal to x; and ε is the smallest positive number that can be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$.
- The rounding-down unit executes the following operation:
$$y = \lfloor x \rfloor$$
- where y represents the short-bit floating point data obtained by rounding-down; x represents the long-bit floating point data before rounding-down; $\lfloor x \rfloor$ is an integral multiple of ε whose value is the maximum number less than or equal to x; and ε is the smallest positive number that can be represented by the present short-bit floating point data representation format, i.e., $2^{\text{offset}-(X-1-EL)}$.
- The truncation rounding unit executes the following operation:
$$y = [x]$$
- where y represents the short-bit floating point data after truncation rounding; x represents the long-bit floating point data before truncation rounding; and [x] represents the number obtained by directly truncating the raw data x to short-bit floating point data.
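- The deterministic rounding modes above can be compared side by side in a short Python sketch. It treats every representable value as an integral multiple of ε. Two points are assumptions rather than statements of the disclosure: a value exactly halfway between two representable values is rounded up by `round_off`, and `truncate` discards the excess toward zero. All function names are hypothetical.

```python
import math


def round_down(x: float, eps: float) -> float:
    """Largest multiple of eps that is less than or equal to x."""
    return math.floor(x / eps) * eps


def round_up(x: float, eps: float) -> float:
    """Smallest multiple of eps that is greater than or equal to x."""
    return math.ceil(x / eps) * eps


def round_off(x: float, eps: float) -> float:
    """Nearest multiple of eps; an exact tie is rounded up (assumption)."""
    low = round_down(x, eps)
    return low if x - low < eps / 2 else low + eps


def truncate(x: float, eps: float) -> float:
    """Drop the part of x finer than eps, toward zero (assumption)."""
    return math.trunc(x / eps) * eps


if __name__ == "__main__":
    for value in (1.3, -1.3):
        print(value, round_down(value, 0.25), round_up(value, 0.25),
              round_off(value, 0.25), truncate(value, 0.25))
```

For positive inputs `truncate` and `round_down` agree; they differ only for negative inputs, where truncation toward zero yields the larger of the two neighbouring values.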
- In addition, the artificial neural network chip may be applied to a terminal. Besides the artificial neural network chip, the terminal may further include an image acquisition device, which may be a webcam or a camera. The terminal may be a desktop computer, a smart home device, a transportation means, or a portable electronic device. The portable electronic device may be a webcam, a mobile phone, a notebook computer, a tablet computer, a wearable device, and the like. The wearable device may include a smart watch, a smart band, smart clothes, and the like. The artificial neural network chip may also be applied to a cloud (server), in which case only one application (APP) is required on the device of a user: the device uploads an acquired image, the information processing device of the disclosure calculates an output, and the user terminal makes a response.
- In addition, the disclosure further provides an information processing method, which may include the following steps.
- A storage unit receives input data, the input data including one or more images.
- A data processing unit extracts and processes a key feature included in the input data and generates a multidimensional vector for each image according to a processing result.
- The key feature may include a facial action and expression, a key point position, and the like in the image. In its specific form, the key feature is an FM in a neural network. The image may include a static picture, pictures forming a video, a video, or the like. The static picture, the pictures forming the video, or the video may include images of one or more parts of a face. The one or more parts of the face include the facial muscles, lips, eyes, eyebrows, nose, forehead, ears, and combinations thereof.
- Each element of the multidimensional vector represents an emotion on the face, for example, anger, delight, pain, depression, sleepiness, and doubt. Furthermore, the information processing method may further include that: tagged data (existing image corresponding to the multidimensional vector) is learned; the multidimensional vector is output after the tagged data is learned; and a parameter of the data processing unit is updated.
- Furthermore, the information processing method may further include that: the multidimensional vector is converted into a corresponding output. The output may be a control instruction, data (0, 1 output), a tag (happiness, depression, and the like), or a picture output.
- The control instruction may be single click, double click, and dragging of a mouse, single touch, multi-touch, and sliding of a touch screen, turning on and turning off of a switch, a shortcut key, and the like.
- As one alternative implementation, the information processing method may further include that: adaptive training is performed. A specific flow is as follows.
- n images are input into the storage unit, each image including a tag, each image corresponding to a vector (real emotion vector) and n being a positive integer greater than or equal to one.
- The data processing unit takes calibrated data as an input, calculates an output emotion vector, i.e., a predicted emotion vector, in a format the same as the input, compares the output emotion vector with the real emotion vector and updates a parameter of the device according to a comparison result.
- The emotion vector may include n elements. A value of each element of the emotion vector may include the following conditions.
- (1) The value of each element of the emotion vector may be a number between zero and one (representing a probability of appearance of a certain emotion).
- (2) The value of each element of the emotion vector may also be any number greater than or equal to zero (representing an intensity of a certain emotion). For example, a preset expression is [delight, sadness, fear], and a vector corresponding to a reluctant smiling face may be [0.5, 0.2, 0].
- (3) A value of only one element of the emotion vector is one and values of the other elements are zero. Under this condition, the emotion vector may only represent a strongest emotion. For example, a preset expression is [delight, sadness, fear], and a vector corresponding to an obvious smiling face may be [1, 0, 0].
- The predicted emotion vector may be compared with the real emotion vector in manners of calculating a Euclidean distance, calculating an absolute value of a dot product of the predicted emotion vector and the real emotion vector, and the like. For example, n is three; the predicted emotion vector is $[a_1, a_2, a_3]$; the real emotion vector is $[b_1, b_2, b_3]$; the Euclidean distance of the two is $[(a_1-b_1)^2+(a_2-b_2)^2+(a_3-b_3)^2]^{1/2}$; and the absolute value of the dot product of the two is $|a_1 b_1+a_2 b_2+a_3 b_3|$. Those skilled in the art may understand that the comparison manners are not limited to calculating the Euclidean distance and calculating the absolute value of the dot product, and other methods may also be adopted.
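- The two comparison manners named above can be written out in Python as a small worked example. The function names are hypothetical, and the sample vectors reuse the [delight, sadness, fear] examples given earlier in this section.

```python
import math
from typing import Sequence


def euclidean_distance(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Square root of the summed squared element-wise differences."""
    if len(predicted) != len(real):
        raise ValueError("emotion vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, real)))


def abs_dot_product(predicted: Sequence[float], real: Sequence[float]) -> float:
    """Absolute value of the dot product of the two vectors."""
    if len(predicted) != len(real):
        raise ValueError("emotion vectors must have the same length")
    return abs(sum(a * b for a, b in zip(predicted, real)))


if __name__ == "__main__":
    predicted = [0.5, 0.2, 0.0]   # e.g. a reluctant smiling face
    real = [1.0, 0.0, 0.0]        # e.g. an obvious smiling face
    print(euclidean_distance(predicted, real), abs_dot_product(predicted, real))
```

The two measures point in opposite directions: a smaller Euclidean distance means a closer match, whereas for non-negative vectors a larger dot product does, so a training loop has to minimise one and maximise the other.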
- As one alternative implementation, the information processing device is an artificial neural network chip. The value of each element of the emotion vector may be a number between zero and one (representing the probability of appearance of a certain emotion). Since emotions of a person may be overlaid, there may be multiple nonzero numbers in the emotion vector to express a complicated emotion.
- In a specific example of the disclosure, a method by which the artificial neural network chip obtains the emotion vector may include that: each neuron of a final output layer of the neural network corresponds to an element of the emotion vector, and an output neuron value is a number between zero and one and is determined as a probability of appearance of the corresponding emotion. The whole process for calculating the emotion vector is as follows.
- In S1, input data is transmitted into the storage unit through the preprocessing unit or is directly transmitted into the storage unit.
- In S2, a DMA transmits the input data in batches to corresponding on-chip caches (i.e., an instruction cache, an input neuron cache, and a weight cache).
- In S3, a control unit reads an instruction from the instruction cache and decodes and transmits the instruction into an operation unit.
- In S4, the operation unit executes corresponding computation according to the instruction. In each layer of a neural network, computation is implemented mainly in three substeps. In S41, corresponding input neurons and weights are multiplied. In S42, adder tree computation is executed, that is, the results obtained in S41 are added step by step through an adder tree to obtain a weighted sum, and an offset is added to the weighted sum, or the weighted sum is left unprocessed, as required. In S43, activation function computation is executed on the result obtained in S42 to obtain output neurons, and the output neurons are transmitted into an output neuron cache.
- In S5, S2 to S4 are repeated until computation for all the data is completed, namely obtaining a final result required by a function. The final result is obtained by output neurons of the last layer of the neural network. Each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, and an output neuron value is a number between zero and one and is determined as the probability of appearance of the corresponding emotion. The final result is output into the output neuron cache from the operation unit, and then is returned to the storage unit through the DMA.
- According to the requirement of the function, the following are required to be preset in an adaptive training stage: the size of the emotion vector (i.e., the number of expression types, which is also the number of neurons of the final output layer of the artificial neural network), the form of comparison with the real emotion vector of the training data (the Euclidean distance, the dot product, and the like), and the network parameter updating manner (stochastic gradient descent, Adam algorithm, and the like).
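- The three substeps of S4 can be sketched for a single fully connected layer as follows. This is a software illustration only. The sigmoid activation is an assumption, chosen because it yields output neuron values between zero and one as the text requires; the disclosure does not name the activation function. The names `adder_tree`, `sigmoid`, and `layer_forward` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def adder_tree(values: Sequence[float]) -> float:
    """Sum values pairwise, level by level, as an adder tree would."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


def sigmoid(z: float) -> float:
    """Assumed activation; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  offsets: Optional[Sequence[float]] = None) -> List[float]:
    """Compute one layer; weights[j] holds the weights of output neuron j."""
    outputs = []
    for j, row in enumerate(weights):
        products = [x * w for x, w in zip(inputs, row)]   # S41: multiply
        total = adder_tree(products)                      # S42: adder tree
        if offsets is not None:                           # S42: optional offset
            total += offsets[j]
        outputs.append(sigmoid(total))                    # S43: activation
    return outputs


if __name__ == "__main__":
    print(layer_forward([1.0, 2.0], [[0.0, 0.0], [1.0, -0.5]]))
```

Applied to the final output layer, each returned value can be read as the probability of appearance of the corresponding emotion.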
- In some examples, a value of only one element of the emotion vector is one and values of the other elements are zero. Under this condition, the emotion vector may only represent the strongest emotion.
- In a specific example of the disclosure, the method by which the artificial neural network chip obtains the emotion vector may include that: each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, but only one output neuron is one and the other output neurons are zero. The whole process for calculating the emotion vector is as follows.
- In S1, input data is transmitted into the storage unit through the preprocessing unit or is directly transmitted into the storage unit.
- In S2, the DMA transmits the input data in batches to the instruction cache, the input neuron cache, and the weight cache.
- In S3, the control unit reads the instruction from the instruction cache and decodes and transmits the instruction into the operation unit.
- In S4, the operation unit executes corresponding computation according to the instruction. In each layer of the neural network, computation is implemented mainly in three substeps. In S41, corresponding input neurons and weights are multiplied. In S42, adder tree computation is executed, that is, the results obtained in S41 are added step by step through an adder tree to obtain a weighted sum, and an offset is added to the weighted sum, or the weighted sum is left unprocessed, as required. In S43, activation function computation is executed on the result obtained in S42 to obtain output neurons, and the output neurons are transmitted into an output neuron cache.
- In S5, S2 to S4 are repeated until computation for all the data is completed, namely obtaining a final result required by a function. The final result is obtained by the output neurons of the last layer of the neural network. Each neuron of the final output layer of the neural network corresponds to an element of the emotion vector, but only one output neuron is one and the other output neurons are zero. The final result is output into the output neuron cache from the operation unit, and then is returned to the storage unit through the DMA.
- According to the requirement of the function, the following are required to be preset in an adaptive training stage: the size of the emotion vector (i.e., the number of expression types, which is also the number of neurons of the final output layer of the artificial neural network), the form of comparison with the real emotion vector of the training data (the Euclidean distance, the dot product, and the like), and the network parameter updating manner (stochastic gradient descent, Adam algorithm, and the like). In addition, the real emotion vector used for training in this example differs from that of the preceding example and should also be an "indication" vector like [1, 0, 0, 0, . . . ].
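- An "indication" vector of this kind can be illustrated with a short Python sketch. The disclosure does not state how the single nonzero output is selected; picking the element with the largest score, and the first such element on a tie, is an assumption made here for illustration. The function name is hypothetical.

```python
from typing import List, Sequence


def to_indication_vector(scores: Sequence[float]) -> List[int]:
    """Return a vector with a single one at the strongest emotion.

    Selecting the maximum score is an assumption; on a tie the first
    maximum wins. An empty input raises ValueError.
    """
    strongest = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == strongest else 0 for i in range(len(scores))]


if __name__ == "__main__":
    # Preset expression order: [delight, sadness, fear]
    print(to_indication_vector([0.5, 0.2, 0.0]))
```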
- Each functional unit/module in the disclosure may be hardware. For example, the hardware may be a circuit, including a digital circuit, an analogue circuit, and the like. Physical implementation of a hardware structure may include, but is not limited to, a physical device, and the physical device may include, but is not limited to, a transistor, a memristor, and the like. The computation module in the computation device may be any proper hardware processor, for example, a CPU, a graphics processing unit (GPU), a field-programmable gate array (FPGA), a digital signal processor (DSP), and an application specific integrated circuit (ASIC). The storage unit may be any proper magnetic storage medium or magneto-optical storage medium, for example, a resistance random access memory (RRAM), a DRAM, an SRAM, an embedded DRAM (EDRAM), a high bandwidth memory (HBM), and a hybrid memory cube (HMC).
- In addition, the neural network of the disclosure may be a convolutional neural network and may also be a fully connected neural network, a restricted Boltzmann machine (RBM) neural network, a recurrent neural network (RNN), and the like. In some other examples, computation may be other operations in the neural network besides convolutional computation, for example, fully connected computation.
- Direction terms, for example, "upper", "lower", "front", "back", "left" and "right", mentioned in the following examples are only directions with reference to the drawings. Therefore, the direction terms are used to describe, not to limit, the disclosure.
- An example of the disclosure provides an online configurable neural network operation device, i.e., an online configurable neural network hardware processor.
FIG. 29 is a functional module diagram of a computation device according to an example of the disclosure. As illustrated in FIG. 29, the neural network operation device of the disclosure may include a control module and an operation module, and may further include a storage module.
- The operation module may include multiple operation units. Each operation unit may include at least one multiplier and at least one adder. In some examples, each operation unit may further include at least one memory. The memory may include a storage space and/or a temporary cache. The storage space is, for example, an SRAM. The temporary cache is, for example, a register.
- The control module is configured to send an instruction to the multiple operation units and to control data transmission between the operation units.
- The instruction may be configured for each operation unit to transmit data to be computed or an intermediate result value to one or more other operation units in one or more directions. The transmission directions include transmission to the left/right adjacent or nonadjacent operation units, transmission to the upper/lower adjacent or nonadjacent operation units, transmission to the diagonally adjacent or nonadjacent operation units, and transmission to multiple adjacent or nonadjacent operation units in multiple directions. The direction of transmission to the diagonally adjacent or nonadjacent operation units may include transmission to the upper-left, lower-left, upper-right, and lower-right diagonally adjacent or nonadjacent operation units.
- Each operation unit is provided with multiple input ports. The multiple input ports include a port connected with the storage module and configured to receive data transmitted by the storage module and a port connected with the other operation units and configured to receive data transmitted by the operation units. Each operation unit is also provided with an output port configured to transmit the data back to the storage module or to a specified operation unit.
- The storage module may include a data storage unit and/or a temporary cache. According to a requirement, one or more data storage units and/or temporary caches may be provided. That is, the data to be computed may be stored in the same region and may also be stored separately. An intermediate result may be stored in the same region and may also be stored separately. The data storage unit is, for example, an SRAM. The temporary cache is, for example, a register.
- As one alternative example, as illustrated in FIG. 30, the control module may include a storage control unit and a computational control unit. The storage control unit is configured to control the storage module to store or read required data. The computational control unit is configured to control the operation module according to the type of computation to be executed and a computational requirement, including controlling the specific computation manners in the operation units and controlling data transmission between the operation units.
- The disclosure further provides a computation method, which may include the following steps.
- A control module sends an instruction.
- Multiple operation units of an operation module receive the instruction and perform data transmission according to the instruction.
- Each operation unit receives the instruction and transmits data to be computed or an intermediate result to one or more operation units other than itself in one or more directions according to the instruction.
- The direction may include a direction of transmission to the left/right adjacent or nonadjacent operation units, a direction of transmission to the upper/lower adjacent or nonadjacent operation units, and a direction of transmission to diagonally adjacent or nonadjacent operation units.
- The direction of transmission to the diagonally adjacent or nonadjacent operation units may include transmission to the upper-left, lower-left, upper-right, and lower-right diagonally adjacent or nonadjacent operation units.
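- The transmission directions listed above can be modelled in software as row and column offsets on a square grid of operation units. The following Python sketch is only an illustration under assumptions not stated in the disclosure: the units are indexed by (row, column) with (0, 0) at the upper left, and a nonadjacent target is reached by taking more than one step in the same direction. The names `Direction`, `target_unit`, and `hops` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple


class Direction(Enum):
    """(row offset, column offset) for each transmission direction."""
    LEFT = (0, -1)
    RIGHT = (0, 1)
    UP = (-1, 0)
    DOWN = (1, 0)
    UPPER_LEFT = (-1, -1)
    LOWER_LEFT = (1, -1)
    UPPER_RIGHT = (-1, 1)
    LOWER_RIGHT = (1, 1)


def target_unit(row: int, col: int, direction: Direction,
                n: int, hops: int = 1) -> Optional[Tuple[int, int]]:
    """Return the receiving unit in an n*n grid, or None if it falls outside.

    hops == 1 addresses an adjacent unit; hops > 1 a nonadjacent one.
    """
    d_row, d_col = direction.value
    r, c = row + d_row * hops, col + d_col * hops
    return (r, c) if 0 <= r < n and 0 <= c < n else None


if __name__ == "__main__":
    print(target_unit(1, 1, Direction.UPPER_LEFT, n=3))
    print(target_unit(0, 0, Direction.LOWER_RIGHT, n=3, hops=2))
```

Transmission in multiple directions then amounts to calling `target_unit` once per direction named in the instruction.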
- In a specific example of the disclosure, the operation module may include N*N (N is a positive integer) operation units and an ALU. The data may be transmitted sequentially in an S-shaped direction, as illustrated in FIG. 31. As one alternative implementation, the ALU is a lightweight ALU. Each operation unit may include a multiplier, an adder, a storage space, and a temporary cache. The intermediate results obtained by every computation executed by the operation units are transmitted between the operation units.
- A main computation flow of a processor of the example is as follows. A storage control unit sends a read control signal to a storage module to read the neuron data and synaptic weight data to be computed, and stores them respectively in the storage spaces of the operation units for transmission. A computational control unit sends a computational signal to each operation unit and initializes each operation unit, for example, by clearing caches. The storage control unit sends an instruction and transmits a neuron to be computed to each operation unit. The computational control unit sends an instruction, and each operation unit receives the neuron data and multiplies it by the corresponding synaptic weight data in its own storage space. The upper-left operation unit transmits its computational result rightwards to a second operation unit, and the second operation unit adds the received result to the product it has computed itself to obtain a partial sum and transmits the partial sum rightwards, and so on. As illustrated in FIG. 31, the partial sum is transmitted along an S-shaped path and is continuously accumulated. If accumulation is completed, the partial sum is transmitted into the ALU for computation such as activation, and the result is then written into the storage module. If not, the result is temporarily stored back into the storage module for subsequent scheduling, and computation is continued. With such a structure, the weight-sharing characteristic of the neural network is fully utilized every time computation is executed, and the weight data is only required to be loaded once at the very beginning, so that the number of memory accesses is greatly reduced and the power consumption overhead is reduced.
- In a specific example of the disclosure, as illustrated in FIG. 32, the operation module may include N*N (N is a positive integer) operation units and M−1 ALUs (M is a positive integer). As one alternative implementation, MN. Different operation units may transmit computational data in different directions. That is, there is no requirement that all the operation units in the same operation module keep a unified transmission direction. Each operation unit may include a multiplier, an adder, and a temporary cache. The intermediate results obtained by every computation executed by the operation units are transmitted between the operation units. As one alternative implementation, the ALUs are lightweight ALUs.
- An output value of an LRN layer is $(1+(\alpha/n)\sum_i x_i^2)^{\beta}$. When the LRN layer is calculated, accumulation of the squares of the input data may be completed through the operation units, and the subsequent exponential operation is then completed through the ALU. Here, the operations and data transmission directions of the operation units are configured as follows. The operation units in the leftmost column are configured to receive data to be computed from the storage module, to complete square operations, and to transmit the square values to the right and lower-right adjacent operation units. The operation units in the uppermost row are configured to receive the data from the storage module, to complete square operations, and to transmit the square values to the lower-right adjacent operation units. The operation units in the rightmost column are configured to receive the data from the upper-left and left operation units, to complete accumulation and, if all accumulation is completed, to transmit the data rightwards to the ALU for the subsequent exponential operation according to the instruction. The other operation units are configured to receive the data from the upper-left operation units, to transmit the data to the lower-right operation units, and to accumulate the data with the data transmitted by the left operation units and transmit the accumulated sum rightwards. The rest may be done in the same manner until all computation is completed.
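- The two examples above can be illustrated numerically with a short Python sketch. It models only the arithmetic, not the hardware timing or the exact per-unit data routing of FIG. 31 and FIG. 32. The first part accumulates a partial sum along an S-shaped path over an N*N grid; the second part splits the LRN output value into an accumulation of squares (the work assigned to the operation units) and a final power operation (the work assigned to the ALU). Taking n as the number of accumulated inputs is an assumption, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def s_shaped_path(n: int) -> List[Tuple[int, int]]:
    """Visit an n*n grid row by row, reversing direction on every other row."""
    path = []
    for row in range(n):
        cols = range(n) if row % 2 == 0 else range(n - 1, -1, -1)
        path.extend((row, col) for col in cols)
    return path


def accumulate_along_path(neurons: Sequence[Sequence[float]],
                          weights: Sequence[Sequence[float]]) -> float:
    """Each unit multiplies its neuron by its weight and adds the product
    to the partial sum handed over by the previous unit on the path."""
    partial_sum = 0.0
    for row, col in s_shaped_path(len(weights)):
        partial_sum += neurons[row][col] * weights[row][col]
    return partial_sum


def lrn_output(xs: Sequence[float], alpha: float, beta: float) -> float:
    """(1 + (alpha / n) * sum of squares) ** beta, with n = len(xs) assumed."""
    sum_of_squares = sum(x * x for x in xs)                 # operation units
    return (1 + (alpha / len(xs)) * sum_of_squares) ** beta  # ALU


if __name__ == "__main__":
    print(s_shaped_path(3))
    ones = [[1.0] * 3 for _ in range(3)]
    grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    print(accumulate_along_path(ones, grid))
    print(lrn_output([1.0, 2.0, 3.0], alpha=3.0, beta=2.0))
```

Because the weights stay in place while only the partial sum moves from unit to unit, the sketch reflects why the weight data needs to be loaded only once.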
- Specific descriptions will be made below with N=3 as an example. As illustrated in FIG. 33, data on the horizontal lines is the specific data to be transmitted, and data in boxes represents the computational results obtained in each operation unit. In this process, related operations may be completed in a pipeline manner. By adopting the processor of the disclosure, data which has been read on the chip may be effectively utilized; the number of memory accesses is effectively reduced; the power consumption overhead is reduced; the delay brought by data reading is reduced; and the computational speed is increased.
- Each functional unit/module in the disclosure may be hardware. For example, the hardware may be a circuit, including a digital circuit, an analogue circuit, and the like. Physical implementation of a hardware structure may include, but is not limited to, a physical device, and the physical device may include, but is not limited to, a transistor, a memristor, and the like. The computation module in the computation device may be any proper hardware processor, for example, a CPU, a GPU, an FPGA, a DSP, and an ASIC. The storage unit may be any proper magnetic storage medium or magneto-optical storage medium, for example, an RRAM, a DRAM, an SRAM, an EDRAM, an HBM, and an HMC.
- In addition, the neural network of the disclosure may be a convolutional neural network and may also be a fully connected neural network, an RBM neural network, an RNN, and the like. In some other examples, computation may be other operations in the neural network besides convolutional computation, for example, fully connected computation.
- The processes or methods described in the abovementioned drawings may be executed by processing logic including hardware (for example, a circuit and a dedicated logic), firmware, software (for example, software carried on a non-transitory computer-readable medium), or a combination of these. Although the processes or methods have been described above according to some sequential operations, it should be understood that some described operations may be executed in different sequences. In addition, some operations may be executed not sequentially but concurrently.
- So far, the examples of the disclosure have been described in detail in combination with the drawings. According to the above descriptions, those skilled in the art should have a clear understanding of the disclosure.
- It should be noted that implementations which are not illustrated or described in the drawings or the specification are all forms known to those of ordinary skill in the art and are not described in detail. In addition, the above descriptions about each component are not limited to various specific structures and shapes mentioned in the examples. Those of ordinary skill in the art may make simple modifications or replacements. The disclosure may provide examples of parameters including specific values. However, these parameters are not required to be exactly equal to the corresponding values and, instead, may be approximate to the corresponding values within acceptable error tolerance or design constraints. The examples may be mixed and matched for use or mixed and matched with other examples for use on the basis of considerations about design and reliability. That is, technical features in different examples may be freely combined into more examples.
- The purposes, technical solutions, and beneficial effects of the disclosure are further described above in detail with the specific examples. It should be understood that the above are only specific examples of the disclosure and are not intended to limit the disclosure. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principle of the disclosure shall fall within the scope of protection of the disclosure.
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/698,996 US20200110635A1 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
Applications Claiming Priority (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710541899 | 2017-07-05 | ||
CN201710578784.XA CN109254867B (en) | 2017-07-14 | 2017-07-14 | Data redundancy method and device |
CN201710677922.X | 2017-08-09 | ||
CN201710677922.XA CN109376845B (en) | 2017-08-09 | 2017-08-09 | Dynamic adjustment method and dynamic adjustment coprocessor |
CN201710793531.4A CN107578014B (en) | 2017-09-06 | 2017-09-06 | Information processing apparatus and method |
CN201710910124.7A CN109583577B (en) | 2017-09-29 | 2017-09-29 | Arithmetic device and method |
CN201810616466.2A CN109213581B (en) | 2017-07-05 | 2018-06-14 | Data processing device and method |
PCT/CN2018/094710 WO2019007406A1 (en) | 2017-07-05 | 2018-07-05 | Data processing apparatus and method |
US16/698,996 US20200110635A1 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
US16/698,992 US11307864B2 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/698,992 Continuation US11307864B2 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200110635A1 (en) | 2020-04-09 |
Family
ID=70613525
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/698,996 Abandoned US20200110635A1 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
US16/698,993 Active 2038-08-09 US11086634B2 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/698,993 Active 2038-08-09 US11086634B2 (en) | 2017-07-05 | 2019-11-28 | Data processing apparatus and method |
Country Status (1)
Country | Link |
---|---|
US (2) | US20200110635A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102273153B1 (en) * | 2019-04-24 | 2021-07-05 | 경희대학교 산학협력단 | Memory controller storing data in approximate momory device based on priority-based ecc, non-transitory computer-readable medium storing program code, and electronic device comprising approximate momory device and memory controller |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6311306B1 (en) * | 1999-04-26 | 2001-10-30 | Motorola, Inc. | System for error control by subdividing coded information units into subsets reordering and interlacing the subsets, to produce a set of interleaved coded information units |
US7131050B2 (en) * | 2002-02-28 | 2006-10-31 | Lsi Logic Corporation | Optimized read performance method using metadata to protect against drive anomaly errors in a storage array |
JP4845755B2 (en) | 2007-01-30 | 2011-12-28 | キヤノン株式会社 | Image processing apparatus, image processing method, program, and storage medium |
US8656145B2 (en) | 2008-09-19 | 2014-02-18 | Qualcomm Incorporated | Methods and systems for allocating interrupts in a multithreaded processor |
CN101547144B (en) | 2008-12-29 | 2011-11-23 | 华为技术有限公司 | Method and device for improving data transmission quality |
US8255774B2 (en) | 2009-02-17 | 2012-08-28 | Seagate Technology | Data storage system with non-volatile memory for error correction |
CN102314213B (en) | 2010-07-09 | 2016-03-30 | 精英电脑股份有限公司 | The computer system of dynamic conditioning frequency of operation |
US20130023387A1 (en) * | 2011-07-21 | 2013-01-24 | Rodney Webb | Door Mounted Support Apparatus for a punching bag |
US20140089699A1 (en) | 2012-09-27 | 2014-03-27 | Advanced Micro Devices | Power management system and method for a processor |
US9098401B2 (en) * | 2012-11-21 | 2015-08-04 | Apple Inc. | Fast secure erasure schemes for non-volatile memory |
US8861270B2 (en) | 2013-03-11 | 2014-10-14 | Microsoft Corporation | Approximate multi-level cell memory operations |
US9122588B1 (en) * | 2013-03-15 | 2015-09-01 | Virident Systems Inc. | Managing asymmetric memory system as a cache device |
US9684559B1 (en) | 2014-04-25 | 2017-06-20 | Altera Corporation | Methods and apparatus for storing error correction information on a memory controller circuit |
GB2529670A (en) | 2014-08-28 | 2016-03-02 | Ibm | Storage system |
US9552510B2 (en) | 2015-03-18 | 2017-01-24 | Adobe Systems Incorporated | Facial expression capture for character animation |
US9600715B2 (en) | 2015-06-26 | 2017-03-21 | Intel Corporation | Emotion detection system |
CN108427990B (en) | 2016-01-20 | 2020-05-22 | 中科寒武纪科技股份有限公司 | Neural network computing system and method |
CN110135581B (en) | 2016-01-20 | 2020-11-06 | 中科寒武纪科技股份有限公司 | Apparatus and method for performing artificial neural network inverse operation |
CN107704433A (en) | 2016-01-20 | 2018-02-16 | 南京艾溪信息科技有限公司 | Matrix operation instruction and method thereof |
CN106201651A (en) | 2016-06-27 | 2016-12-07 | 鄞州浙江清华长三角研究院创新中心 | The simulator of neuromorphic chip |
CN106372622A (en) | 2016-09-30 | 2017-02-01 | 北京奇虎科技有限公司 | Facial expression classification method and device |
CN106775977B (en) | 2016-12-09 | 2020-06-02 | 北京小米移动软件有限公司 | Task scheduling method, device and system |
WO2018132219A1 (en) * | 2017-01-13 | 2018-07-19 | Everspin Technologies, Inc. | Preprogrammed data recovery |
2019
- 2019-11-28 | US application US16/698,996 | publication US20200110635A1 | status: Abandoned
- 2019-11-28 | US application US16/698,993 | publication US11086634B2 | status: Active
Also Published As
Publication number | Publication date |
---|---|
US20200104207A1 (en) | 2020-04-02 |
US11086634B2 (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11307864B2 (en) | | Data processing apparatus and method |
US11307865B2 (en) | | Data processing apparatus and method |
US11593658B2 (en) | | Processing method and device |
CN109003132B (en) | | Advertisement recommendation method and related product |
CN109284823B (en) | | Arithmetic device and related product |
US10901815B2 (en) | | Data sharing system and data sharing method therefor |
JP6880160B2 (en) | | Arithmetic logic unit and calculation method |
WO2020073211A1 (en) | | Operation accelerator, processing method, and related device |
US12094456B2 (en) | | Information processing method and system |
CN110163350B (en) | | Computing device and method |
CN111353591A (en) | | Computing device and related product |
CN111626413A (en) | | Computing device and method |
US11086634B2 (en) | | Data processing apparatus and method |
US11307866B2 (en) | | Data processing apparatus and method |
CN116362301A (en) | | Model quantization method and related equipment |
CN111198714B (en) | | Retraining method and related product |
CN117709497A (en) | | Object information prediction method, device, computer equipment and storage medium |
CN111382835B (en) | | Neural network compression method, electronic equipment and computer readable medium |
CN115063647A (en) | | Deep learning-based distributed heterogeneous data processing method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: SHANGHAI CAMBRICON INFORMATION TECHNOLOGY CO., LTD., CHINA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HU, SHUAI;ZHOU, XUDA;CHEN, TIANSHI;SIGNING DATES FROM 20190920 TO 20190924;REEL/FRAME:051134/0775 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |