WO2022179599A1 - Perception network and data processing method - Google Patents

Perception network and data processing method

Info

Publication number
WO2022179599A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
block
target
feature maps
output
Prior art date
Application number
PCT/CN2022/077881
Other languages
English (en)
French (fr)
Inventor
郭健元
韩凯
王云鹤
许春景
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP22758963.7A priority Critical patent/EP4296896A4/en
Publication of WO2022179599A1 publication Critical patent/WO2022179599A1/zh
Priority to US18/456,312 priority patent/US20230401826A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the field of artificial intelligence, and in particular, to a perception network and a data processing method.
  • Artificial intelligence is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. What is needed is knowledge of the data and information about the subject being photographed. Figuratively speaking, eyes (cameras/camcorders) and a brain (algorithms) are installed on the computer so that it can identify, track, and measure targets in place of the human eye, and thereby perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that input information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • Inference models based on convolutional neural networks are widely used in various terminal tasks based on computer vision, such as image recognition, object detection, and instance segmentation.
  • Traditional backbone neural networks often cannot run in real time because of their large parameter counts and computational complexity.
  • Existing lightweight inference networks (such as MobileNet, EfficientNet, and ShuffleNet) are designed for mobile devices such as central processing unit (CPU) and ARM (advanced RISC machine) devices. On graphics processing unit (GPU) devices, tensor processing unit (TPU) devices, neural network processing unit (NPU) devices, and other processing units designed for high throughput, their performance is unsatisfactory, and inference can be even slower than with traditional convolutional neural networks.
  • The present application provides a perception network. The perception network includes a feature extraction network; the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation; the first block and the M second blocks are blocks in the same stage of the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks;
  • The target operation can also be called a cheap operation. This is a general term for a family of operations with a small parameter amount, used to distinguish them from traditional convolution operations. The parameter amount (Parameters) describes the number of parameters a neural network contains and is used to evaluate the size of the model.
  • The concatenation (splicing) operation merges feature maps without changing the data of the feature maps.
  • For example, the result of concatenating feature map 1 and feature map 2 is (feature map 1, feature map 2), where the order of feature map 1 and feature map 2 is not limited. More specifically, the result of splicing a feature map with three semantic channels and a feature map with five semantic channels is a feature map with eight semantic channels.
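  • The three-plus-five channel example above can be sketched in plain Python. This is only an illustrative sketch: the list-of-channels representation and the helper names `make_feature_map` and `concat_channels` are assumptions made here, not part of the application.

```python
# Assumed representation: a feature map is a list of channels,
# and each channel is a 2-D grid stored as a list of rows.
def make_feature_map(num_channels, height, width, fill):
    """Hypothetical helper: build a constant-valued feature map."""
    return [[[fill] * width for _ in range(height)] for _ in range(num_channels)]


def concat_channels(*feature_maps):
    """Merge feature maps along the channel axis without changing their data."""
    sizes = {(len(fm[0]), len(fm[0][0])) for fm in feature_maps}
    if len(sizes) != 1:
        raise ValueError("all feature maps must share the same height and width")
    merged = []
    for fm in feature_maps:
        merged.extend(fm)  # channels are appended unchanged
    return merged


fm1 = make_feature_map(3, 2, 2, fill=1.0)  # three channels
fm2 = make_feature_map(5, 2, 2, fill=2.0)  # five channels
spliced = concat_channels(fm1, fm2)
print(len(spliced))  # 8 channels
```

  • Swapping the arguments gives (feature map 2, feature map 1), which is equally valid because the order is not limited.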
  • the first block is used to perform convolution processing on the input data to obtain M target feature maps, and each target feature map corresponds to a channel;
  • the at least one second block is used to perform convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, where M1 is smaller than the M;
  • the target operation is used to process M2 target feature maps in the M target feature maps to obtain M2 second feature maps, where M2 is smaller than the M;
  • the splicing operation is used for splicing the M1 first feature maps and the M2 second feature maps to obtain a spliced feature map.
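  • The data flow through one stage (first block, channel split, second blocks, target operation, splicing) can be sketched as follows. This is a minimal sketch under stated assumptions: every block is stood in for by a 1x1 (pointwise) convolution, the target operation is a per-channel scaling, and the first M1 channels are the ones sent to the second blocks. None of these choices, nor the names `pointwise_conv`, `random_weights`, and `stage_forward`, come from the application.

```python
import random


def pointwise_conv(feature_map, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(w * ch[y][x] for w, ch in zip(row_w, feature_map)) for x in range(width)]
         for y in range(height)]
        for row_w in weights
    ]


def random_weights(out_ch, in_ch, rng):
    """Hypothetical helper: an out_ch x in_ch weight matrix."""
    return [[rng.uniform(-1, 1) for _ in range(in_ch)] for _ in range(out_ch)]


def stage_forward(x, first_w, second_ws, cheap_scales, m1):
    target = pointwise_conv(x, first_w)        # first block: M target feature maps
    part1, part2 = target[:m1], target[m1:]    # disjoint split, M1 + M2 = M
    if len(cheap_scales) != len(part2):
        raise ValueError("need one scale per channel handled by the target operation")
    for w in second_ws:                        # second blocks connected in series
        part1 = pointwise_conv(part1, w)
    # Assumed cheap target operation: one scale per channel (M2 parameters in total).
    part2 = [[[s * v for v in row] for row in ch] for s, ch in zip(cheap_scales, part2)]
    return part1 + part2                       # splicing along the channel axis


rng = random.Random(0)
M, M1 = 8, 4
x = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(3)]
first_w = random_weights(M, 3, rng)
second_ws = [random_weights(M1, M1, rng) for _ in range(2)]
cheap_scales = [rng.uniform(-1, 1) for _ in range(M - M1)]
out = stage_forward(x, first_w, second_ws, cheap_scales, M1)
print(len(out))  # the spliced feature map has M channels
```

  • Setting every entry of `cheap_scales` to 1.0 turns the target operation into an identity mapping, which passes the M2 target feature maps through unchanged.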
  • The embodiment of the present application uses a cross-layer target operation within the same stage to let the perception network generate the features that are highly similar to key features, which reduces the parameter amount of the model and thereby improves the running speed of the model on GPU devices, TPU devices, and NPU devices.
  • An embodiment of the present application provides a perception network. The perception network includes a feature extraction network; the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation; the first block and the M second blocks are blocks in the same stage of the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks. The first block is used to perform convolution processing on the input data to obtain M target feature maps, each target feature map corresponding to a channel. The at least one second block is used to perform convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, where M1 is smaller than M. The target operation is used to process M2 target feature maps in the M target feature maps to obtain M2 second feature maps, where M2 is smaller than M. The splicing operation is used to splice the M1 first feature maps and the M2 second feature maps to obtain a spliced feature map.
  • In this way, a cross-layer target operation within the same stage lets the perception network generate the features that are highly similar to key features, which reduces the parameter amount of the model and thereby improves the running speed of the model on GPU devices, TPU devices, and NPU devices.
  • the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the spliced feature map is M.
  • The number of channels of the feature maps output by the at least one second block is only part of the number of channels of the target feature maps output by the first block, and the target feature maps of the remaining channels are processed by the target operation. Because the parameter amount of the target operation is smaller than the parameter amount of the at least one second block, the overall parameter amount of the perception network is reduced, which in turn can improve the running speed of the perception network on GPU devices, TPU devices, and NPU devices.
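  • The parameter saving can be illustrated with a short calculation. The sketch below assumes that each second block is a single bias-free k x k convolution with equal input and output channel counts and that the target operation is a bias-free 1x1 convolution on the M2 channels. These layer shapes and the channel numbers are illustrative assumptions, not values from the application.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weight count of a plain convolution without bias."""
    return in_channels * out_channels * kernel_size * kernel_size


def stage_params(m, m1, num_second_blocks, kernel_size=3, cheap_kernel_size=1):
    """Return (baseline, proposed) parameter counts for one stage.

    baseline: every second block processes all M channels.
    proposed: second blocks process M1 channels, the target operation handles M2.
    """
    m2 = m - m1
    baseline = num_second_blocks * conv_params(m, m, kernel_size)
    second = num_second_blocks * conv_params(m1, m1, kernel_size)
    cheap = conv_params(m2, m2, cheap_kernel_size)
    return baseline, second + cheap


baseline, proposed = stage_params(m=64, m1=32, num_second_blocks=3)
print(baseline, proposed, round(proposed / baseline, 3))  # 110592 28672 0.259
```

  • Under these assumptions the stage keeps about a quarter of the baseline weights. The actual saving depends on how the blocks and the target operation are really built.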
  • the target operation is a convolution operation with a parameter amount smaller than that of the at least one second block; or,
  • the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
  • In this case, the M2 second feature maps are equivalent to the M2 target feature maps; that is, the M2 target feature maps are directly used as the M2 second feature maps.
  • The output feature maps of the first block can also be split into multiple groups of feature maps, and the groups can be processed by multiple target operations, as long as the sum of the numbers of channels of the feature maps output by each target operation and of the feature maps output by the at least one second block is the same as the number of channels of the feature maps output by the first block. The numbers of channels of the feature maps output by different target operations may be different.
  • the at least one second block is configured to perform convolution processing on M1 target feature maps in the M target feature maps, so as to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block in the at least one second block is the M1 first feature maps;
  • the feature extraction network also includes:
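  • the fusion operation is used to fuse the feature maps output by each second block to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps; an addition operation is performed on the fused feature map and the M2 second feature maps to obtain the processed M2 second feature maps;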
  • the splicing operation is used for splicing the M1 first feature maps and the processed M2 second feature maps to obtain a spliced feature map.
  • The feature maps output by each second block can be spliced to obtain a spliced feature map (whose number of channels is the sum of the numbers of channels of the feature maps output by each second block). Because the number of channels of this spliced feature map does not match M2, a dimension-reduction operation can be performed so that its number of channels equals M2, after which a matrix addition operation can be performed on it and the output of the target operation.
  • the fusion operation is used to perform splicing and dimension reduction operations on the output of each second block, so as to obtain the fused feature map with the same size as the M2 second feature maps.
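  • The fusion path described above (splice the outputs of every second block, reduce the channel count to M2, add the result to the output of the target operation, then splice with the M1 first feature maps) can be sketched as follows. Using a 1x1 convolution for the dimension reduction is an assumption made for illustration, as are the names `pointwise_conv`, `add_maps`, and `fuse_and_splice`.

```python
def pointwise_conv(feature_map, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [[sum(w * ch[y][x] for w, ch in zip(row_w, feature_map)) for x in range(width)]
         for y in range(height)]
        for row_w in weights
    ]


def add_maps(a, b):
    """Element-wise (matrix) addition of two feature maps of identical size."""
    return [
        [[va + vb for va, vb in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_and_splice(second_block_outputs, second_maps, reduce_w):
    # Splice the outputs of all second blocks along the channel axis.
    spliced = [ch for out in second_block_outputs for ch in out]
    # Assumed dimension reduction: a 1x1 convolution down to M2 channels.
    fused = pointwise_conv(spliced, reduce_w)
    # Add the fused map to the M2 second feature maps from the target operation.
    processed = add_maps(fused, second_maps)
    # The last second block (farthest from the first block) yields the M1 first maps.
    first_maps = second_block_outputs[-1]
    return first_maps + processed


def const_map(channels, value, size=2):
    """Hypothetical helper: constant-valued feature map of size x size."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]


outputs = [const_map(2, 1.0), const_map(2, 2.0)]  # two second blocks, M1 = 2
second_maps = const_map(3, 10.0)                  # target operation output, M2 = 3
reduce_w = [[0.5] * 4 for _ in range(3)]          # 4 spliced channels -> 3 channels
result = fuse_and_splice(outputs, second_maps, reduce_w)
print(len(result), result[2][0][0])  # 5 13.0
```

  • With these constant inputs each fused value is 0.5 * (1 + 1 + 2 + 2) = 3.0, so every processed second feature map holds 13.0 and the spliced result has M1 + M2 = 5 channels.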
  • the first block and the M second blocks are blocks in a target stage of the feature extraction network, and the spliced feature map is used as the output feature map of the target stage in the feature extraction network; or,
  • the target stage further includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain the output feature map of the target stage.
  • The spliced feature map can be used directly as the output feature map of the target stage, or it can be further processed by other blocks (third blocks) included in the target stage, that is, by at least one third block, to obtain the output feature map of the target stage.
  • The first block may be the first block in the feature extraction network or a block in a middle layer. At least one third block may also be connected before the first block, in which case the first block is used to perform convolution processing on the data output by the at least one third block.
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the perception network also includes:
  • the task network is used to process the corresponding task according to the feature map of the input image to obtain the processing result.
  • the tasks include object detection, image segmentation or image classification.
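  • As a rough illustration of how a task network might consume the feature map produced by the feature extraction network, the sketch below implements a very small image classification head: global average pooling followed by a linear scoring layer. The head design and the names `global_average_pool` and `classify` are assumptions chosen for brevity. The application does not specify the structure of the task network.

```python
def global_average_pool(feature_map):
    """Reduce each channel (a 2-D grid) to its mean value."""
    return [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]


def classify(feature_map, class_weights):
    """Hypothetical task network: pooled features -> linear scores -> class index."""
    pooled = global_average_pool(feature_map)
    scores = [sum(w * p for w, p in zip(ws, pooled)) for ws in class_weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Feature map with two channels of size 2x2, standing in for the backbone output.
feature_map = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # one weight row per class
print(classify(feature_map, class_weights))  # 1
```

  • Object detection or image segmentation would use different heads on the same feature map; only the classification case is sketched here.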
  • the present application provides a data processing method, the method comprising:
  • obtaining a feature extraction network, where the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation, the first block and the M second blocks are blocks in the same stage of the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks;
  • through the first block, the input data is subjected to convolution processing to obtain M target feature maps, each target feature map corresponding to a channel;
  • through the at least one second block, M1 target feature maps in the M target feature maps are subjected to convolution processing to obtain M1 first feature maps, where M1 is smaller than M;
  • through the target operation, M2 target feature maps in the M target feature maps are processed to obtain M2 second feature maps, where M2 is smaller than M;
  • through the splicing operation, the M1 first feature maps and the M2 second feature maps are spliced to obtain a spliced feature map.
  • the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the spliced feature map is M.
  • the target operation is a convolution operation with a parameter amount smaller than that of the at least one second block; or,
  • the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
  • the at least one second block is configured to perform convolution processing on M1 target feature maps in the M target feature maps, so as to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block in the at least one second block is the M1 first feature maps;
  • the method also includes:
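  • through a fusion operation, the feature maps output by each second block are fused to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps; an addition operation is performed on the fused feature map and the M2 second feature maps to obtain the processed M2 second feature maps;
  • splicing the M1 first feature maps and the M2 second feature maps through the splicing operation includes: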
  • the M1 first feature maps and the processed M2 second feature maps are spliced to obtain a spliced feature map.
  • the feature map output by each second block is fused through the fusion operation, including:
  • the first block and the M second blocks are blocks in a target stage of the feature extraction network, and the spliced feature map is used as the output feature map of the target stage in the feature extraction network; or,
  • the target stage further includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain the output feature map of the target stage.
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the method also includes:
  • the task network is used to process the corresponding task according to the feature map of the input image to obtain the processing result.
  • the tasks include object detection, image segmentation or image classification.
  • the present application provides a data processing device, the device comprising:
  • the acquisition module is used to acquire a feature extraction network, where the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation, the first block and the M second blocks are blocks in the same stage of the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks;
  • the convolution processing module is used to perform, through the first block, convolution processing on the input data to obtain M target feature maps, each target feature map corresponding to a channel; to perform, through the at least one second block, convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, where M1 is smaller than M; and to process, through the target operation, M2 target feature maps in the M target feature maps to obtain M2 second feature maps, where M2 is smaller than M;
  • the splicing module is used for splicing the M1 first feature maps and the M2 second feature maps through the splicing operation to obtain a spliced feature map.
  • the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the spliced feature map is M.
  • the target operation is a convolution operation with a parameter amount smaller than that of the at least one second block; or,
  • the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
  • the at least one second block is configured to perform convolution processing on M1 target feature maps in the M target feature maps, so as to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block in the at least one second block is the M1 first feature maps;
  • the device also includes:
  • the fusion module is used to fuse the feature maps output by each second block through a fusion operation to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps, and to perform an addition operation on the fused feature map and the M2 second feature maps to obtain the processed M2 second feature maps;
  • the splicing module is used for splicing the M1 first feature maps and the processed M2 second feature maps to obtain a spliced feature map.
  • the fusion module is configured to perform splicing and dimension reduction operations on the output of each second block through the fusion operation, so as to obtain the fused feature map with the same size as the M2 second feature maps.
  • the first block and the M second blocks are blocks in a target stage of the feature extraction network, and the spliced feature map is used as the output feature map of the target stage in the feature extraction network; or,
  • the target stage further includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain the output feature map of the target stage.
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the device also includes:
  • the task processing module is used to process the corresponding task according to the feature map of the input image through the task network to obtain the processing result.
  • the tasks include object detection, image segmentation or image classification.
  • An embodiment of the present application provides a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used for storing a program, and the processor is used for executing the program in the memory, so as to run the perception network of the above-mentioned first aspect and any optional implementation thereof.
  • An embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program runs on a computer, it causes the computer to run the perception network of the above-mentioned first aspect and any optional implementation thereof.
  • An embodiment of the present application provides a computer program, including code, which, when executed, is used to run the perception network of the above-mentioned first aspect and any optional implementation thereof.
  • an embodiment of the present application provides a data processing apparatus, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory, so as to execute the above-mentioned first aspect and any of its optional methods.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it runs on a computer, causes the computer to execute the above-mentioned first aspect and any optional method thereof.
  • an embodiment of the present application provides a computer program, including code, for implementing the first aspect and any optional method thereof when the code is executed.
  • The present application provides a chip system. The chip system includes a processor for supporting an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods.
  • The chip system further includes a memory for storing the program instructions and data necessary for the execution device or the training device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence.
  • FIG. 2a is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 2b is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 2c is a schematic diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 2d is a schematic diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an embodiment of a data processing method provided by an embodiment of the present application.
  • FIG. 5a is a schematic diagram of a perception network provided by an embodiment of the present application.
  • FIG. 5b is a schematic diagram of a perception network provided by an embodiment of the present application.
  • FIG. 5c is a schematic diagram of a perception network provided by an embodiment of the present application.
  • FIG. 5d is a schematic diagram of a perception network provided by an embodiment of the present application.
  • FIGS. 6 to 14 are schematic diagrams of a perception network provided by an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 16 is a schematic structural diagram of an execution device provided by an embodiment of the application.
  • FIG. 17 is a schematic structural diagram of a training device provided by an embodiment of the application.
  • FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • FIG. 1 shows a schematic structural diagram of the main framework of artificial intelligence.
  • The artificial intelligence framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data goes through the refinement process of "data-information-knowledge-wisdom".
  • The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provision and processing technology implementation) to the industrial ecosystem of the system.
  • The infrastructure provides computing power support for artificial intelligence systems, communicates with the outside world, and is supported by the basic platform. Communication with the outside world takes place through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes a distributed computing framework and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution and turn intelligent information decision-making into products that are deployed in practice. Application areas mainly include: intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart city, etc.
  • the embodiments of the present application are mainly applied in fields that need to complete various sensing tasks, such as driving assistance, automatic driving, and mobile phone terminals.
  • a single picture is obtained from the video after frame extraction, and the picture is sent to the perception network in the present invention to obtain 2D, 3D, Mask (mask), key points and other information of the object of interest in the picture.
  • These detection results are output to the post-processing module for processing. For example, in an automatic driving system they are sent to the planning control unit for decision-making, and in a mobile phone terminal they are sent to the beauty algorithm for processing to obtain beautified pictures.
  • the following is a brief introduction to the two application scenarios of ADAS/ADS visual perception system and mobile phone beauty.
  • Application Scenario 1: ADAS/ADS Visual Perception System
  • In ADAS and ADS, multiple types of 2D object detection need to be performed in real time, including: dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motorcycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign).
  • The mask and key points of the human body are detected through the perception network provided in the embodiment of the present application, and the corresponding parts of the human body can be enlarged or reduced, for example in waist and hip beautification operations, so as to output a beautified picture.
  • After acquiring the to-be-classified image, the object recognition device uses the object recognition method of the present application to acquire the class of the object in the image, and can then classify the image according to that class.
  • Photographers take many photos every day, including photos of animals, people, and plants. Using the method of the present application, photos can be quickly classified according to their content, for example into photos containing animals, photos containing people, and photos containing plants.
  • the object recognition method of the present application is used to acquire the category of the commodity in the image of the commodity, and then the commodity is classified according to the category of the commodity.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit can be:
  • $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • where $s = 1, 2, \ldots, n$, and $n$ is a natural number greater than 1;
  • $W_s$ is the weight of $x_s$;
  • $b$ is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
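The neural unit described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid activation as the text suggests; the function names `sigmoid` and `neural_unit` and the numeric values are hypothetical and not taken from the source.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, ws, b, f=sigmoid):
    """Output of one neural unit: f(sum_s W_s * x_s + b)."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same length")
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return f(z)

# Illustrative values only (not from the source): z = 0.5 - 0.5 + 0 = 0.
output = neural_unit(xs=[1.0, 2.0], ws=[0.5, -0.25], b=0.0)
```

The activation is passed in as a parameter so that another nonlinearity can be substituted without changing the weighted-sum step.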
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special metric for "many" here. The multi-layer neural network and the deep neural network that we often speak of are essentially the same thing. Dividing a DNN according to the positions of its layers, the layers inside a DNN fall into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each individual layer is not complicated.
  • The coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as $W^{L}_{jk}$. Note that the input layer has no W parameter.
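As a rough illustration of the per-layer work in a fully connected network, the sketch below computes one layer as an activation applied to a weighted sum plus bias. It assumes the index convention stated in the text (row j, column k of `W` is the coefficient from neuron k of the previous layer to neuron j of this layer); the function name `dense_layer`, the choice of tanh, and all numbers are hypothetical.

```python
import math

def dense_layer(x, W, b, activation=math.tanh):
    """One fully connected layer: y[j] = activation(sum_k W[j][k] * x[k] + b[j]).

    W[j][k] is the coefficient from neuron k of the previous layer
    to neuron j of this layer, matching the index order in the text.
    """
    return [
        activation(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]

# Hypothetical 3-2-1 network; the input layer itself has no W.
x = [1.0, 0.0, -1.0]
W1 = [[0.5, 0.1, 0.5], [0.0, 1.0, 0.0]]   # hidden layer: 2 neurons, 3 inputs each
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]                          # output layer: 1 neuron, 2 inputs
b2 = [0.0]
hidden = dense_layer(x, W1, b1)
y = dense_layer(hidden, W2, b2)
```

Stacking calls to `dense_layer` gives the input, hidden, and output structure described above, with every neuron of one layer feeding every neuron of the next.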
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, we can use the same learned image information.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • CNN is a very common neural network
  • a convolutional neural network is a deep neural network with a convolutional structure and a deep learning architecture.
  • A deep learning architecture refers to learning at multiple levels of abstraction by means of machine learning algorithms.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • A convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
  • The input layer 110 can obtain the image to be processed and pass it to the convolutional layer/pooling layer 120 and the subsequent neural network layer 130 for processing, so that the processing result of the image can be obtained.
  • the internal layer structure in the CNN 100 in Figure 2c is described in detail below.
  • The convolutional/pooling layer 120 may include layers 121-126. For example, in one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the following will take the convolutional layer 121 as an example to introduce the inner working principle of a convolutional layer.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved across the input image along the horizontal direction one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • The multiple weight matrices have the same size (rows × columns), so the convolution feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through extensive training. Each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 100 can make correct predictions.
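The convolution behaviour described above (a weight matrix whose depth matches the input, moved across the image with a given stride, with several weight matrices stacked to form the output depth) can be sketched as follows. This is a plain-Python illustration without padding; the function name `conv2d` and the two example kernels are hypothetical, and a trained network would learn its kernel values rather than use fixed ones.

```python
def conv2d(image, kernels, stride=1):
    """Valid (no padding) convolution, computed as cross-correlation.

    image:   list of C channels, each an H x W list of lists.
    kernels: list of K weight matrices, each with C channels of size kh x kw
             (kernel depth equals input depth).
    Returns K output feature maps, so the output depth equals len(kernels).
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    outputs = []
    for kernel in kernels:
        if len(kernel) != C:
            raise ValueError("kernel depth must equal input depth")
        kh, kw = len(kernel[0]), len(kernel[0][0])
        fmap = []
        for i in range(0, H - kh + 1, stride):
            row = []
            for j in range(0, W - kw + 1, stride):
                acc = 0.0
                for c in range(C):
                    for u in range(kh):
                        for v in range(kw):
                            acc += image[c][i + u][j + v] * kernel[c][u][v]
                row.append(acc)
            fmap.append(row)
        outputs.append(fmap)
    return outputs

# Toy 1-channel 4x4 image and two hypothetical 2x2 kernels.
image = [[[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]]
sum_kernel = [[[1, 1], [1, 1]]]        # sums each 2x2 window
edge_kernel = [[[1, -1], [1, -1]]]     # responds to horizontal intensity change
feature_maps = conv2d(image, [sum_kernel, edge_kernel], stride=2)
```

Because both kernels have the same size, the two resulting feature maps have the same size and can be stacked as two output channels.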
  • the initial convolutional layer for example, 121
  • the features extracted by the later convolutional layers become more and more complex, such as features such as high-level semantics.
  • A convolutional layer can be followed by one pooling layer.
  • Alternatively, multiple convolutional layers can be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
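The two pooling operators can be illustrated with a short sketch. It assumes non-overlapping square windows (window size equal to stride), which the text does not specify; the function name `pool2d` and the sample feature map are hypothetical.

```python
def pool2d(fmap, size=2, mode="max"):
    """Non-overlapping pooling over size x size windows of one feature map."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    H, W = len(fmap), len(fmap[0])
    out = []
    for i in range(0, H - size + 1, size):
        row = []
        for j in range(0, W - size + 1, size):
            window = [fmap[i + u][j + v] for u in range(size) for v in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
```

Each output value summarizes one 2x2 sub-region of the input, so a 4x4 input becomes a 2x2 output.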
  • After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2c) and the output layer 140. The parameters contained in the multiple hidden layers may be pre-trained on training data relevant to a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross entropy, which is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 100 (in FIG. 2c, propagation in the direction from 120 to 140) is completed, back propagation (in FIG. 2c, propagation in the direction from 140 to 120) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output through the output layer and the ideal result.
  • The convolutional neural network 100 shown in FIG. 2c is only an example of a convolutional neural network; in a specific application, the convolutional neural network may also exist in the form of other network models.
  • a convolutional neural network (CNN) 100 may include an input layer 110, a convolutional/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130.
  • the multiple convolution/pooling layers in the convolutional layer/pooling layer 120 in FIG. 2d are parallel, and the separately extracted features are input to the full neural network layer 130 for processing.
  • the convolutional neural network shown in FIG. 2c and FIG. 2d is only an example, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • the convolutional neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forwarding the input signal until the output will generate an error loss, and updating the parameters in the initial super-resolution model by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation motion dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
  • the neural network can use the error back propagation (BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller.
  • the input signal is passed forward until the output will generate error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation movement dominated by error loss, aiming to obtain the parameters of the optimal neural network model, such as the weight matrix.
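The forward pass, error loss, and parameter update described above can be illustrated on the smallest possible model. The sketch below fits a single linear unit by gradient descent on mean squared error; in a multi-layer network the same gradients would be obtained layer by layer via the chain rule. The function name `train_linear_unit`, the learning rate, the epoch count, and the sample data are all assumptions made for illustration.

```python
def train_linear_unit(samples, lr=0.1, epochs=200):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    losses = []
    n = len(samples)
    for _ in range(epochs):
        # Forward pass: predictions and error loss.
        errors = [(w * x + b) - y for x, y in samples]
        losses.append(sum(e * e for e in errors) / n)
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = sum(2 * e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = sum(2 * e for e in errors) / n
        # Update step: move the parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # generated by y = 2x + 1
w, b, losses = train_linear_unit(samples)
```

The recorded `losses` list makes the convergence of the error loss visible: it shrinks as the parameters approach the values that generated the data.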
  • an embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • The training data includes an image or image block of an object and the category of the object. The training data is stored in the database 130, and the training device 120 trains a CNN feature extraction network based on the training data maintained in the database 130 (note: the feature extraction network here is the model trained in the training phase described above, which may be a neural network for feature extraction, etc.).
  • the first embodiment will be used to describe how the training device 120 obtains a CNN feature extraction network based on the training data in more detail. Input the CNN feature extraction network after relevant preprocessing.
  • the CNN feature extraction network in the embodiment of the present application may specifically be a CNN convolutional neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 does not necessarily train the CNN feature extraction network entirely on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
  • The target model/rule trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3. The execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or an in-vehicle terminal, or it can be a server, the cloud, etc.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and the user can input data to the I/O interface 112 through the client device 140 .
  • the input data may include: an image or an image block.
  • The execution device 110 may call data, code, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the obtained image or image block or the 2D, 3D, Mask, key points and other information of the object of interest in the picture to the client device 140, thereby providing it to the user.
  • the client device 140 may be a planning control unit in an automatic driving system or a beauty algorithm module in a mobile phone terminal.
  • The training device 120 can generate corresponding target models/rules based on different training data for different goals or tasks, and the corresponding target models/rules can be used to achieve the above-mentioned goals or complete the above-mentioned tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • The I/O interface 112 may also directly store the input data input into the I/O interface 112 and the output result output from the I/O interface 112, as shown in the figure, in the database 130 as new sample data.
  • FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 3 the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the perception network provided by the embodiment of the present application is introduced by taking the model inference stage as an example.
  • FIG. 6 is a schematic structural diagram of a perception network provided by an embodiment of the present application.
  • The perception network provided by an embodiment of the present application includes a feature extraction network. The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation. The first block and the M second blocks are blocks in the same stage of the feature extraction network (shown in FIG. 6 as the target stage).
  • a feature extraction network usually includes multiple stages, where each stage may include multiple blocks, and each block may be composed of at least one convolution operation.
  • FIG. 5a shows a schematic diagram of the structure of a perception network in the prior art, in which the feature extraction network includes four stages (stage1, stage2, stage3, stage4), and the input feature map and the output feature map of each block included in a stage are of the same size.
  • Consistent size here means that the number of channels of the feature map and the size of the feature map of each channel are the same.
  • For example, the input and output feature maps of each block are 56*56 with 24 channels in stage1, 28*28 with 40 channels in stage2, 14*14 with 80 channels in stage3, and 7*7 with 160 channels in stage4. Each block can include at least one convolution operation, such as the three convolution operations shown in FIG. 5a (1*1 convolution, 3*3 convolution, and 1*1 convolution), together with a residual connection operation that connects the input to the output.
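To give a feel for how block size scales across the four stages, the sketch below counts the weights of a 1*1, 3*3, 1*1 block for each listed channel width. Several things are assumed here and not stated in the source: the third and fourth size entries are read as stage3 and stage4, the intermediate width `mid_channels` is taken as half the block width, and biases and normalization layers are ignored. The function name `bottleneck_params` is hypothetical.

```python
def bottleneck_params(channels, mid_channels):
    """Weights in a 1x1 -> 3x3 -> 1x1 block (biases and norm layers ignored).

    The residual connection adds the input to the output and has no weights,
    which is why a block's input and output must have the same shape.
    """
    reduce_1x1 = 1 * 1 * channels * mid_channels
    conv_3x3 = 3 * 3 * mid_channels * mid_channels
    expand_1x1 = 1 * 1 * mid_channels * channels
    return reduce_1x1 + conv_3x3 + expand_1x1

# (spatial size, channels) per stage as listed in the text; the third and
# fourth entries are read here as stage3 and stage4.
stages = {"stage1": (56, 24), "stage2": (28, 40), "stage3": (14, 80), "stage4": (7, 160)}
# mid_channels = channels // 2 is a hypothetical choice for illustration.
params = {name: bottleneck_params(c, c // 2) for name, (_, c) in stages.items()}
```

Under these assumptions the per-block weight count grows roughly with the square of the channel width, so the later stages dominate the parameter budget even though their feature maps are spatially smaller.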
  • Each convolution module in FIG. 5b can be regarded as a stage: convolution module 1 corresponds to stage1 in FIG. 5a, convolution module 2 to stage2, convolution module 3 to stage3, and convolution module 4 to stage4. Feature map C1 is the feature map output by stage1, feature map C2 is the feature map output by stage2, feature map C3 is the feature map output by stage3, and feature map C4 is the feature map output by stage4.
  • the feature extraction network can perform feature extraction on the input image to obtain the output feature map.
  • the output feature map can be input to the task network, and the task network can process the corresponding task to obtain the processing result.
  • For example, if the task is target detection, the processing result can be the detection frame in which the target is located in the image; if the task is image segmentation, the processing result can be the image segmentation area in which the target is located in the image.
  • The embodiment of the present application uses a cross-layer target operation within the same stage to let the perception network generate those features that are highly similar to key features. This reduces the number of parameters of the model and thereby improves the running speed of the model on GPU devices, TPU devices, and NPU devices.
  • The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation. The first block and the M second blocks are blocks in the same stage of the feature extraction network, and the parameter amount of the target operation is smaller than the parameter amount of the M second blocks.
  • The first block is used to perform convolution processing on the input data to obtain M target feature maps, each target feature map corresponding to one channel.
  • The at least one second block is used to perform convolution processing on M1 of the M target feature maps to obtain M1 first feature maps, where M1 is smaller than M. The target operation is used to process M2 of the M target feature maps to obtain M2 second feature maps, where M2 is smaller than M. The splicing operation is used to splice the M1 first feature maps and the M2 second feature maps to obtain a spliced feature map.
  • The intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the spliced feature map is M.
  • In other words, the number of channels of the feature map output by the at least one second block is only part of the number of channels of the target feature maps output by the first block, and the target feature maps of the remaining channels are processed by the target operation. Because the parameter amount of the target operation is smaller than the parameter amount of the at least one second block, the overall parameter amount of the perception network is reduced, which in turn can improve the running speed of the perception network on GPU devices, TPU devices, and NPU devices.
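The split, two-path processing, and splicing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the M1 maps are taken to be the first M1 channels (the text only requires the two groups to be disjoint), the second blocks are replaced by a stand-in that scales values instead of applying learned convolutions, and the target operation is the identity. The names `conv_params`, `scale_block`, and `stage_forward` are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def scale_block(factor):
    """Stand-in for a learned block: scales every value by `factor`."""
    def block(feature_maps):
        return [[[factor * v for v in row] for row in fmap] for fmap in feature_maps]
    return block

def stage_forward(target_maps, m1, second_blocks, target_op):
    """Split M target maps, run the two paths, and splice the results."""
    heavy, cheap = target_maps[:m1], target_maps[m1:]   # disjoint groups, M1 + M2 = M
    for block in second_blocks:                         # second blocks in series
        heavy = block(heavy)                            # -> M1 first feature maps
    cheap = target_op(cheap)                            # -> M2 second feature maps
    return heavy + cheap                                # splice along the channel axis

# M = 4 target feature maps of size 2x2 (illustrative values).
target_maps = [[[float(c)] * 2 for _ in range(2)] for c in range(1, 5)]
spliced = stage_forward(
    target_maps,
    m1=2,
    second_blocks=[scale_block(2.0), scale_block(2.0)],
    target_op=lambda maps: maps,   # identity target operation (assumption)
)

# Rough weight comparison for M = 4, M1 = M2 = 2 with two 3x3 second blocks
# and a hypothetical 1x1 convolution as the target operation.
full_width = 2 * conv_params(4, 4, 3)
split_width = 2 * conv_params(2, 2, 3) + conv_params(2, 2, 1)
```

The spliced result still has M channels, while the weight comparison at the end shows why routing only M1 channels through the series of second blocks lowers the parameter count.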
  • The target operation is a convolution operation whose parameter amount is smaller than that of the at least one second block; or, the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
  • If the target operation is the residual connection operation, the M2 second feature maps are equivalent to the M2 target feature maps; that is, the M2 target feature maps are directly used as the M2 second feature maps.
  • The output feature map of the first block can also be split into multiple groups of feature maps, and the groups can be processed by multiple target operations, as long as the sum of the numbers of channels of the feature maps output by each target operation and by the at least one second block equals the number of channels of the feature maps output by the first block. The numbers of channels of the feature maps output by different target operations may differ.
  • The first block and the M second blocks are blocks in the target stage of the feature extraction network. The spliced feature map is used as the output feature map of the target stage in the feature extraction network; or, the target stage also includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain the output feature map of the target stage.
  • That is, the spliced feature map can be used directly as the output feature map of the target stage, or it can be further processed by other blocks (third blocks) included in the target stage. For example, referring to FIG. 8, the spliced feature map can be processed by at least one third block to obtain the output feature map of the target stage.
  • The first block may be the first block in the feature extraction network, or a block in a middle layer.
  • At least one third block may be connected before the first block, in which case the first block is used to perform convolution processing on the data output by the at least one third block.
  • The at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain a feature map output by each second block, where the output of the second block farthest from the first block is the M1 first feature maps. The fusion operation is used to fuse the feature maps output by each second block to obtain a fused feature map whose size is the same as the size of the M2 second feature maps. The fused feature map and the M2 second feature maps are added to obtain the processed M2 second feature maps, and the splicing operation is used to splice the M1 first feature maps and the processed M2 second feature maps to obtain the spliced feature map.
  • Each second block can output a feature map, and the fusion operation takes as its objects the feature maps output by each second block and the output of the target operation (the M2 second feature maps). The fusion operation performs splicing and dimension reduction on the outputs of the second blocks, so as to obtain a fused feature map with the same size as the M2 second feature maps.
  • Specifically, the feature maps output by each second block can be spliced to obtain a spliced feature map whose number of channels is the sum of the numbers of channels of the feature maps output by each second block. Because this number is greater than the number of channels (M2) of the feature map output by the target operation, a dimension reduction operation is performed so that the number of channels of the spliced feature map becomes M2.
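The fusion step can be sketched as splicing the outputs of all second blocks, reducing the result to M2 channels, and adding it to the M2 second feature maps. The text does not say how the dimension reduction is performed; the sketch below assumes a 1x1 convolution (a per-pixel weighted sum over channels) purely for illustration. The names `conv1x1`, `add_maps`, and `fuse`, and all numeric values, are hypothetical.

```python
def conv1x1(feature_maps, weights):
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [[sum(w * fmap[i][j] for w, fmap in zip(row_w, feature_maps))
          for j in range(W)] for i in range(H)]
        for row_w in weights
    ]

def add_maps(a, b):
    """Element-wise sum of two equally shaped lists of feature maps."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(fa, fb)]
            for fa, fb in zip(a, b)]

def fuse(block_outputs, second_maps, reduce_weights):
    """Splice second-block outputs, reduce to M2 channels, add to the second maps."""
    spliced = [fmap for out in block_outputs for fmap in out]  # concatenate channels
    fused = conv1x1(spliced, reduce_weights)                   # assumed 1x1 reduction
    if len(fused) != len(second_maps):
        raise ValueError("fused feature map must have M2 channels")
    return add_maps(fused, second_maps)

# Two second blocks, each outputting M1 = 1 channel of size 2x2; M2 = 1.
out_block1 = [[[1.0, 1.0], [1.0, 1.0]]]
out_block2 = [[[2.0, 2.0], [2.0, 2.0]]]
second_maps = [[[10.0, 10.0], [10.0, 10.0]]]
reduce_weights = [[0.5, 0.5]]           # shape: M2 x (number of spliced channels)
processed = fuse([out_block1, out_block2], second_maps, reduce_weights)
```

The `processed` maps would then be spliced with the M1 first feature maps, so the cheap path also receives information from the intermediate outputs of the second blocks.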
  • the feature extraction network in the perception network may be formed by stacking the stages based on the above-mentioned embodiments.
  • As an example, the network structure may be as shown in Table 1, where output represents the size of the output feature map and #out represents the number of channels of the output feature map (the number of channels of Block and Cheap is actually 1/2 of #out).
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the perception network further includes a task network, which is used to process the corresponding task according to the feature map of the input image to obtain the processing result.
  • the tasks include image processing tasks such as object detection, image segmentation, or image classification.
  • the feature extraction network in the embodiment of the present application may be the backbone network shown in FIG. 11.
  • the feature pyramid network (FPN) can perform convolution processing on a plurality of generated feature maps of different resolutions to construct a feature pyramid.
  • FIG. 12 is a schematic structural diagram of an FPN, in which the topmost feature map C4 is processed by a convolution module 1.
  • the convolution module 1 may include at least one convolution layer.
  • the convolution module 1 may use atrous convolution and 1×1 convolution to reduce the number of channels of the topmost feature map C4 to 256, and the result serves as the topmost feature map P4 of the feature pyramid; the output of the topmost layer is laterally connected with the next-layer feature map C3 and use 1
  • the FPN includes multiple convolution modules, each convolution module includes multiple convolution layers, and each convolution module can perform convolution processing on the input feature map.
  • the second convolutional layer is one of the multiple convolutional layers included in the FPN.
  • the FPN shown in FIG. 12 is only one implementation and does not constitute a limitation on the present application.
  • the header is connected to the FPN, and the header can complete the detection of the 2D frame of a task according to the feature map provided by the FPN, and output the 2D frame of the object of the task.
  • the header includes three modules: a region proposal network (RPN), ROI-ALIGN, and RCNN.
  • the RPN module can be used to predict the region where the task object is located on one or more feature maps provided by the FPN, and output a candidate 2D frame matching the region; put differently, the RPN predicts, on one or more feature maps output by the FPN, the areas where the task object may exist and gives the frames of these areas, which are called proposals. For example, when the Header is responsible for detecting cars, its RPN layer predicts candidate frames where a car may be present; when the Header is responsible for detecting people, its RPN layer predicts candidate frames where a person may be present. Of course, these proposals are inaccurate: on the one hand, they do not necessarily contain the object of the task, and on the other hand, the boxes are not compact.
  • the 2D candidate area prediction process can be implemented by the RPN module of the Header, which predicts the areas where the task object may exist according to the feature map provided by the FPN, and gives the candidate frames (also called candidate areas, Proposal) of these areas.
  • for example, if the Header is responsible for detecting cars, its RPN layer predicts candidate frames where a car may be present.
  • the feature map RPN Hidden is generated from the feature map provided by the FPN through convolution module 1 (e.g., a 3*3 convolution).
  • the RPN layer of the following Header will predict the Proposal from the RPN Hidden. Specifically, the RPN layer of the Header predicts the coordinates and confidence of the proposal at each position of the RPN Hidden through the convolution module 2 and the convolution module 3 (for example, a 1*1 convolution respectively).
  • the higher the confidence, the greater the probability that the task object exists in this Proposal. For example, the larger the score of a Proposal in the Header, the greater the probability that it contains a car.
  • the Proposals predicted by each RPN layer pass through the Proposal merging module, which removes redundant Proposals according to the degree of overlap between them (this process may use, but is not limited to, the NMS algorithm) and selects, from the remaining K Proposals, the N (N < K) Proposals with the highest scores as candidate regions where objects may exist. It can be seen from Figure 14 that these Proposals are inaccurate: on the one hand, they do not necessarily contain the object of the task, and on the other hand, the boxes are not compact. Therefore, the RPN module performs only coarse detection, and the subsequent RCNN module is required for refinement. When the RPN module regresses the coordinates of a Proposal, it does not directly regress the absolute coordinates but regresses the coordinates relative to an Anchor. The better these Anchors match the actual objects, the higher the probability that the RPN can detect the objects.
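  • The Proposal merging step described above can be sketched as greedy non-maximum suppression followed by top-N selection. This is only an illustrative sketch: the text says NMS is one possible choice, and the box format `(x1, y1, x2, y2)`, the IoU threshold of 0.5, and the function names are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_proposals(boxes, scores, iou_threshold=0.5, top_n=2):
    """Drop proposals that overlap a higher-scoring one, then keep the top_n by score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a proposal only if it does not overlap too much with any kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept[:top_n]  # kept is already sorted by descending score


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
selected = merge_proposals(boxes, scores)
```

  • In this toy input the second box overlaps the first heavily and is suppressed, leaving K = 3 proposals, of which the N = 2 highest-scoring ones are returned as indices.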
  • the ROI-ALIGN module is used to extract, according to the region predicted by the RPN module, the features of the region where the candidate 2D frame is located from a feature map provided by the FPN; that is, based on the Proposals provided by the RPN module, the ROI-ALIGN module crops the features of the region where each Proposal is located from a feature map and resizes them to a fixed size to obtain the features of each Proposal.
  • the ROI-ALIGN module can use, but is not limited to, feature extraction methods such as ROI-POOLING (region of interest pooling), ROI-ALIGN (region of interest extraction), PS-ROIPOOLING (position-sensitive region of interest pooling), and PS-ROIALIGN (position-sensitive region of interest extraction).
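  • As a rough illustration of cropping a Proposal region and resizing it to a fixed size, the sketch below implements the simplest of the listed options, ROI max pooling, on a single-channel feature map. It is not ROI-ALIGN itself (which uses bilinear sampling); the integer half-open box format `(x1, y1, x2, y2)` and the function name are assumptions made for the example.

```python
def roi_max_pool(feature, box, out_size):
    """Crop `box` from a 2D feature map and max-pool it to out_size x out_size.

    `box` is (x1, y1, x2, y2) in integer feature-map coordinates, end-exclusive.
    """
    x1, y1, x2, y2 = box
    height, width = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        y_start = y1 + (i * height) // out_size        # floor of the bin start
        y_end = y1 + -(-(i + 1) * height // out_size)  # ceiling of the bin end
        row = []
        for j in range(out_size):
            x_start = x1 + (j * width) // out_size
            x_end = x1 + -(-(j + 1) * width // out_size)
            row.append(max(feature[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled


# 4x4 feature map with values 0..15 in row-major order.
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(feature, box=(0, 0, 4, 4), out_size=2)
```

  • Whatever the size of the Proposal, the output always has the same fixed shape, which is what allows the subsequent RCNN module to process all Proposals in one batch.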
  • the RCNN module is used to perform convolution processing, through a neural network, on the features of the area where the candidate 2D frame is located to obtain the confidence that the candidate 2D frame belongs to each object category, and to adjust the coordinates of the candidate 2D frame through a neural network, so that the adjusted 2D candidate frame matches the shape of the actual object more closely than the candidate 2D frame; the adjusted 2D candidate frame with a confidence greater than a preset threshold is selected as the 2D frame of the region.
  • the RCNN module mainly refines the features of each Proposal extracted by the ROI-ALIGN module, obtains the confidence that each Proposal belongs to each category (for example, for the car task, four scores are given: Background/Car/Truck/Bus), and adjusts the coordinates of the Proposal's 2D box to output a more compact 2D box. After these 2D boxes are merged by non-maximum suppression (NMS), they are output as the final 2D boxes.
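  • The per-category confidence and threshold selection described for the RCNN module can be sketched as below. The source does not say how raw scores are turned into confidences; the softmax normalization, the threshold value of 0.5, and the function names are assumptions made for the example, while the four labels come from the car-task example in the text.

```python
import math


def softmax(logits):
    """Convert raw scores to confidences that sum to 1 (assumed normalization)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_proposals(logits_per_proposal, labels, threshold=0.5):
    """Keep proposals whose best non-background category exceeds the threshold."""
    results = []
    for index, logits in enumerate(logits_per_proposal):
        confidences = softmax(logits)
        best = max(range(len(confidences)), key=confidences.__getitem__)
        if labels[best] != "Background" and confidences[best] > threshold:
            results.append((index, labels[best], confidences[best]))
    return results


labels = ["Background", "Car", "Truck", "Bus"]
logits = [
    [0.0, 5.0, 0.0, 0.0],  # confidently a car
    [5.0, 0.0, 0.0, 0.0],  # confidently background
    [1.0, 1.0, 1.0, 1.0],  # undecided: every category gets 0.25
]
detections = classify_proposals(logits, labels)
```

  • Only the first Proposal survives here: the second is classified as background and the third has no category above the threshold.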
  • the fine classification of 2D candidate regions is mainly implemented by the RCNN module of the Header in Figure 14. According to the features of each Proposal extracted by the ROI-ALIGN module, it further regresses more compact 2D frame coordinates and, at the same time, classifies the Proposal and outputs the confidence that it belongs to each category. There are many possible forms of RCNN, one of which is shown in Figure 13.
  • the feature size output by the ROI-ALIGN module can be N*14*14*256 (Feature of proposals). These features are first processed by the convolution module 4 of Resnet18 (Res18-Conv5) in the RCNN module, giving an output feature size of N*7*7*512, and are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features in each channel of the input features to obtain N*512 features, where each 1*512-dimensional feature vector represents the features of one Proposal.
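  • The global average pooling step (averaging the 7*7 positions of every channel so that N*7*7*512 becomes N*512) can be illustrated with a small pure-Python sketch. The nested-list layout N × H × W × C and the function name are assumptions made for the example, and the toy input uses a 2*2*2 feature instead of 7*7*512 to stay readable.

```python
def global_avg_pool(features):
    """Average over the spatial positions of each channel: N x H x W x C -> N x C."""
    pooled = []
    for sample in features:
        height, width, channels = len(sample), len(sample[0]), len(sample[0][0])
        pooled.append([
            sum(sample[y][x][c] for y in range(height) for x in range(width))
            / (height * width)
            for c in range(channels)
        ])
    return pooled


# One proposal (N = 1) with a 2x2 spatial grid and 2 channels.
features = [[[[1, 10], [2, 20]],
             [[3, 30], [4, 40]]]]
vectors = global_avg_pool(features)
```

  • Each proposal ends up with one vector whose length equals the channel count, matching the "one feature vector per Proposal" description above.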
  • the perception network may also include other headers, which can further perform 3D/Mask/Keypoint detection on the basis of detecting the 2D frame.
  • the ROI-ALIGN module extracts the features of the area where each 2D box is located on the feature map output by the FPN according to the accurate 2D box provided by the Header.
  • the feature size output by the ROI-ALIGN module is M*14*14*256. These features are first processed by the convolution module 5 of Resnet18 (for example, Res18-Conv5), giving an output feature size of M*7*7*512, and are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel in the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box.
  • the orientation angle of the object in the box (orientation, M*1 vector)
  • the coordinates of the centroid point (centroid, M*2 vector; these two values represent the x/y coordinates of the centroid)
  • the length, width and height (dimension).
  • the header includes at least one convolution module, each convolution module includes at least one convolution layer, and each convolution module can perform convolution processing on the input feature map.
  • the third convolutional layer is one of the multiple convolutional layers included in the header.
  • the headers shown in FIG. 13 and FIG. 14 are only one implementation and do not constitute a limitation on the present application.
  • the results of this embodiment of the present application on the image classification dataset CIFAR10 are shown in Table 2 and Table 3.
  • the perception network structure provided in the embodiment of the present application can achieve very high accuracy with the smallest amount of computation and the smallest number of parameters.
  • Table 2 shows the results on the image classification dataset CIFAR10
  • Table 3 shows the comparison with existing lightweight networks on the image classification dataset CIFAR10.
  • the results of the perception network provided in the embodiment of the present application on the image classification dataset ImageNet are shown in Table 4, Table 5 and Table 6.
  • the perception network provided in the embodiment of the present application can improve inference accuracy by 1.4% on this large classification dataset.
  • the perception network provided in the embodiment of the present application can achieve the fastest inference speed and higher inference accuracy at the same time.
  • Table 4 shows the comparison with the baseline network ResNet on the image classification dataset ImageNet.
  • Table 5 shows the comparison with another baseline network, RegNet, on the image classification dataset ImageNet.
  • Table 6 shows the comparison with other lightweight networks on the image classification dataset ImageNet.
  • the perception network provided in the embodiment of the present application has the fastest inference speed while reaching the highest mAP, reaching 25.9 frames per second.
  • An embodiment of the present application provides a perception network, the perception network includes: a feature extraction network, the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation, the first block and the M second blocks are blocks in the same stage in the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks; the first block is used to perform convolution processing on the input data to obtain M target feature maps, each target feature map corresponding to a channel; the at least one second block is used to perform convolution processing on M1 target feature maps of the M target feature maps to obtain M1 first feature maps, where the M1 is smaller than the M; the target operation is used to process M2 target feature maps in the M target feature maps to obtain M2 second feature maps, where the M2 is smaller than the M; the splicing operation is used to splice the M1 first feature maps and the M2 second feature maps to obtain the spliced feature map.
  • the cross-layer target operation within the same stage is used to let the perception network generate the features that are highly similar to the key features, which reduces the parameter amount of the model and thereby improves the running speed of the model on GPU devices, TPU devices and NPU devices.
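  • The data flow of one stage (first block, channel split, second blocks in series on one part, the cheap target operation on the other part, then splicing) can be sketched with plain Python callables. This is a structural sketch under assumptions: it takes the case where the two channel groups are disjoint with M1 + M2 = M, represents each feature map as a flat list of numbers, and uses toy stand-in blocks; none of the names come from the source.

```python
def stage_forward(x, first_block, second_blocks, target_op, m1):
    """Run one stage: split the channels, process both branches, splice the results."""
    targets = first_block(x)                 # M target feature maps, one per channel
    branch_a, branch_b = targets[:m1], targets[m1:]  # M1 maps and M2 = M - M1 maps
    for block in second_blocks:              # second blocks connected in series
        branch_a = block(branch_a)
    second_maps = target_op(branch_b)        # cheap operation on the remaining maps
    return branch_a + second_maps            # list concatenation acts as channel splicing


# Toy stand-ins for the real convolutional blocks.
def toy_first_block(x):
    return [[v * k for v in x] for k in range(1, 5)]  # produces M = 4 channels


def toy_second_block(maps):
    return [[v + 1 for v in fm] for fm in maps]


def identity(maps):
    return maps  # stands in for a parameter-free target operation


out = stage_forward([1, 2], toy_first_block,
                    [toy_second_block, toy_second_block], identity, m1=2)
```

  • Only half of the channels pass through the heavier second blocks in this sketch, while the spliced output still has M channels; that is the source of the parameter saving discussed above.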
  • the embodiments of the present application are mainly applied in fields based on computer vision, such as terminal mobile phones, cloud services, detection, segmentation, and underlying vision.
  • computer vision and its related tasks have continuously increased computing power requirements, which have put forward higher requirements for hardware computing speed and power consumption.
  • terminal devices such as mobile phone chips have begun to deploy GPU and NPU processing units in large numbers.
  • the lightweight network model proposed in the embodiments of the present application can greatly improve the inference speed on the GPU, and can be used as a basic network to directly replace the existing basic feature extraction network such as a target detector, and be deployed in scenarios such as automatic driving. In practical applications, it can be adapted to a wide range of application scenarios and devices (such as mobile terminals, cloud servers, etc.), and the fast inference network can be used to perform tasks such as data processing and image retrieval.
  • the embodiments of the present application can be deployed on a mobile phone terminal to bring efficient and accurate inference to the user.
  • the present invention can perform lightweight deployment on cloud services, provide users with efficient data processing services, and help deep learning speed up and increase efficiency. Users upload the data to be processed, and then the inference model on the cloud service can be used for fast data processing.
  • the present invention can directly replace the feature extraction module of an existing target detector, compressing the detector and speeding up its inference process.
  • an embodiment of the present application further provides a data processing method, which includes:
  • the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a splicing operation, the first block and the M second blocks are blocks in the same stage in the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks;
  • For the specific description of step 1401, reference may be made to the description of the feature extraction network in the foregoing embodiment, which is not repeated here.
  • For the specific description of step 1402, reference may be made to the description of the first block in the foregoing embodiment, which is not repeated here.
  • For the specific description of step 1403, reference may be made to the description of the at least one second block in the foregoing embodiment, which is not repeated here.
  • For the specific description of step 1404, reference may be made to the description of the target operation in the foregoing embodiment, which is not repeated here.
  • For the specific description of step 1405, reference may be made to the description of the splicing operation in the foregoing embodiment, which is not repeated here.
  • the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of the M1 and the M2 is the M, and the number of channels of the spliced feature map is the M.
  • the target operation is a convolution operation with a parameter amount smaller than that of the at least one second block; or,
  • the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
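  • To make the parameter comparison concrete, the sketch below counts the weights of a standard convolution and compares a heavier second block with a cheaper target operation. The kernel sizes (3×3 versus 1×1), the channel count of 64, and the function name are assumptions chosen for illustration; a residual connection would contribute no parameters at all.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=False):
    """Number of learnable parameters of a standard 2D convolution."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Hypothetical configuration: the second block uses one 3x3 convolution,
# the target operation uses one 1x1 convolution, both on 64 channels.
second_block_params = conv_params(3, 64, 64)
target_op_params = conv_params(1, 64, 64)
residual_params = 0  # an identity shortcut has nothing to learn
```

  • Under these assumed settings the target operation needs one ninth of the parameters of the second block, and stacking several second blocks in series widens the gap further.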
  • the at least one second block is configured to perform convolution processing on M1 target feature maps in the M target feature maps, so as to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block in the at least one second block is the M1 first feature maps;
  • the method also includes:
  • the feature maps output by each second block are fused to obtain a fused feature map, and the size of the fused feature map is the same as the size of the M2 second feature maps;
  • the splicing of the M1 first feature maps and the M2 second feature maps through the splicing operation includes:
  • the M1 first feature maps and the processed M2 second feature maps are spliced to obtain a spliced feature map.
  • the feature map output by each second block is fused through the fusion operation, including:
  • the first block and the M second blocks are blocks in a target stage of the feature extraction network, and the spliced feature map is used as the output feature map of the target stage in the feature extraction network; or,
  • the target stage further includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain an output feature map of the target stage.
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the method also includes:
  • the task network is used to process the corresponding task according to the feature map of the input image to obtain the processing result.
  • the tasks include object detection, image segmentation or image classification.
  • FIG. 15 is a schematic diagram of a data processing apparatus 1500 provided by an embodiment of the present application.
  • the data processing apparatus 1500 provided by the present application includes:
  • the acquisition module 1501 is used to acquire a feature extraction network, the feature extraction network includes a first block, at least one second block connected in series, a target operation and a splicing operation, the first block and the M second blocks are blocks in the same stage in the feature extraction network, and the parameter amount of the target operation is less than the parameter amount of the M second blocks;
  • the convolution processing module 1502 is configured to: perform, through the first block, convolution processing on the input data to obtain M target feature maps, each target feature map corresponding to a channel; perform, through the at least one second block, convolution processing on M1 target feature maps in the M target feature maps to obtain M1 first feature maps, where the M1 is smaller than the M; and process, through the target operation, M2 target feature maps in the M target feature maps to obtain M2 second feature maps, where the M2 is smaller than the M;
  • the splicing module 1503 is configured to splice the M1 first feature maps and the M2 second feature maps through the splicing operation to obtain a spliced feature map.
  • the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of the M1 and the M2 is the M, and the number of channels of the spliced feature map is the M.
  • the target operation is a convolution operation with a parameter amount smaller than that of the at least one second block; or,
  • the target operation is a residual connection operation from the output of the first block to the output of the splicing operation.
  • the at least one second block is configured to perform convolution processing on M1 target feature maps in the M target feature maps, so as to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block in the at least one second block is the M1 first feature maps;
  • the device also includes:
  • the fusion module is used to fuse the feature maps output by each second block through a fusion operation to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps, and to perform an addition operation on the fused feature map and the M2 second feature maps to obtain the processed M2 second feature maps;
  • the splicing module is used for splicing the M1 first feature maps and the processed M2 second feature maps to obtain a spliced feature map.
  • the fusion module is configured to perform splicing and dimension reduction operations on the output of each second block through a fusion operation, so as to obtain the fused feature map with the same size as the M2 second feature maps.
  • the first block and the M second blocks are blocks in a target stage of the feature extraction network, and the spliced feature map is used as the output feature map of the target stage in the feature extraction network; or,
  • the target stage further includes at least one third block, and the at least one third block is used to perform a convolution operation on the spliced feature map to obtain an output feature map of the target stage.
  • the feature extraction network is used to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
  • the device also includes:
  • the task processing module is used to process the corresponding task according to the feature map of the input image through the task network to obtain the processing result.
  • the tasks include object detection, image segmentation or image classification.
  • FIG. 16 is a schematic structural diagram of the execution device provided by the embodiment of the present application. The execution device 1600 may be, for example, a smart wearable device, a server, or the like, which is not limited here.
  • the data processing apparatus described in the embodiment corresponding to FIG. 15 may be deployed on the execution device 1600 to implement the data processing function in the embodiment corresponding to FIG. 15 .
  • the execution device 1600 includes: a receiver 1601, a transmitter 1602, a processor 1603 and a memory 1604 (the number of processors 1603 in the execution device 1600 may be one or more, and one processor is taken as an example in FIG. 16), wherein the processor 1603 may include an application processor 16031 and a communication processor 16032.
  • the receiver 1601, the transmitter 1602, the processor 1603, and the memory 1604 may be connected by a bus or otherwise.
  • Memory 1604 may include read-only memory and random access memory, and provides instructions and data to processor 1603 .
  • a portion of memory 1604 may also include non-volatile random access memory (NVRAM).
  • the memory 1604 stores processor operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1603 controls the operation of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1603 or implemented by the processor 1603 .
  • the processor 1603 may be an integrated circuit chip with signal processing capability.
  • each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1603 or an instruction in the form of software.
  • the above-mentioned processor 1603 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, a vision processing unit (VPU), a tensor processing unit (TPU) or another processor suitable for AI operations, and may further include application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • the processor 1603 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1604, and the processor 1603 reads the information in the memory 1604, and completes the steps of the above method in combination with its hardware.
  • the receiver 1601 can be used to receive input numerical or character information, and to generate signal input related to the relevant settings and function control of the execution device.
  • the transmitter 1602 can be used to output numerical or character information through the first interface; the transmitter 1602 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1602 can also include a display device such as a display screen.
  • the execution device may run the perception network described in FIG. 6 , or execute the data processing method in the corresponding embodiment of FIG. 14 .
  • FIG. 17 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 1700 is implemented by one or more servers.
  • the training device 1700 can vary widely by configuration or performance, and can include one or more central processing units (CPUs) 1717 (e.g., one or more processors), memory 1732, and one or more storage media 1730 (e.g., one or more mass storage devices) storing application programs 1742 or data 1744.
  • the memory 1732 and the storage medium 1730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1730 may include one or more modules (not shown in the figure), and each module may include a series of instructions to operate on the training device. Furthermore, the central processing unit 1717 may be configured to communicate with the storage medium 1730 to execute a series of instruction operations in the storage medium 1730 on the training device 1700 .
  • Training device 1700 may also include one or more power supplies 1726, one or more wired or wireless network interfaces 1750, one or more input and output interfaces 1758, and/or one or more operating systems 1741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ and so on.
  • the model training apparatus 1700 described in FIG. 17 may be a module in a training device, and the processor in the training device may execute the model training to obtain the perceptual network described in FIG. 6 .
  • Embodiments of the present application also provide a computer program product that, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • Embodiments of the present application further provide a computer-readable storage medium, where a program for performing signal processing is stored in the computer-readable storage medium, and when the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device, or terminal device provided in this embodiment of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits, etc.
  • the processing unit can execute the computer executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (RAM), etc.
  • FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the application.
  • the chip may be represented as a neural network processor NPU 1800. The NPU 1800 is mounted as a co-processor to the main CPU (Host CPU), and tasks are allocated by the Host CPU.
  • the core part of the NPU is the arithmetic circuit 1803, which is controlled by the controller 1804 to extract the matrix data in the memory and perform multiplication operations.
  • the NPU 1800 can implement the model training method provided in the embodiment described in FIG. 6 through the cooperation of its internal components, or perform inference with the model obtained by training.
  • the operation circuit 1803 in the NPU 1800 may perform the steps of acquiring the first neural network model and performing model training on the first neural network model.
  • the arithmetic circuit 1803 in the NPU 1800 includes multiple processing units (Process Engine, PE).
  • the arithmetic circuit 1803 is a two-dimensional systolic array.
  • the arithmetic circuit 1803 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition.
  • arithmetic circuit 1803 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 1802 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A from the input memory 1801, performs a matrix operation with matrix B, and stores the partial results or final results of the resulting matrix in the accumulator 1808.
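  • The way partial results build up in the accumulator can be illustrated in software. The sketch below is not a model of the hardware: it simply computes A × B in slices along the shared dimension and adds each slice's partial product into an accumulator, which mirrors the "partial result or final result" wording above. The tile size and the function name are assumptions made for the example.

```python
def matmul_accumulate(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), accumulating partial results tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]          # plays the role of the accumulator
    for k0 in range(0, k, tile):               # one slice of the shared dimension
        k1 = min(k0 + tile, k)
        for i in range(n):
            for j in range(m):
                # Partial dot product over this slice, added to the running sum.
                acc[i][j] += sum(a[i][t] * b[t][j] for t in range(k0, k1))
    return acc                                 # final result after the last slice


a = [[1, 2, 3],
     [4, 5, 6]]        # input matrix A
b = [[1, 0],
     [0, 1],
     [1, 1]]           # weight matrix B
c = matmul_accumulate(a, b)
```

  • After the first slice the accumulator holds only a partial result; it equals the full product once every slice of the shared dimension has been processed.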
  • Unified memory 1806 is used to store input data and output data.
  • the weight data is transferred directly to the weight memory 1802 through the direct memory access controller (DMAC) 1805.
  • Input data is also moved to unified memory 1806 via the DMAC.
  • the BIU is the bus interface unit 1810, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 1809.
  • the bus interface unit 1810 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1809 to obtain instructions from the external memory, and also for the storage unit access controller 1805 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1806 , the weight data to the weight memory 1802 , or the input data to the input memory 1801 .
  • the vector calculation unit 1807 includes a plurality of operation processing units, and further processes the output of the operation circuit 1803 if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network computation in neural networks, such as Batch Normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1807 can store the processed output vectors to the unified memory 1806 .
  • the vector calculation unit 1807 may apply a linear or nonlinear function to the output of the operation circuit 1803, for example performing linear interpolation on a feature plane extracted by a convolutional layer, or applying the function to a vector of accumulated values to generate activation values.
  • the vector computation unit 1807 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 1803, e.g., for use in subsequent layers of a neural network.
  • the instruction fetch buffer 1809, which is connected to the controller 1804, stores the instructions used by the controller 1804.
  • the unified memory 1806, the input memory 1801, the weight memory 1802 and the instruction fetch buffer 1809 are all on-chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
  • the device embodiments described above are only schematic: the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device, a data center, or the like that includes an integration of one or more available media.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)), and the like.
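The matrix computation described above for the arithmetic circuit 1803 (weights of matrix B held in place, data of matrix A streamed in, partial results kept in the accumulator 1808) can be sketched in software. This is only an illustration under stated assumptions: the function name `matmul_with_accumulator` and the `tile` parameter are hypothetical, and the loop order does not model the actual systolic-array timing.

```python
def matmul_with_accumulator(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), adding partial results tile by tile.

    `acc` plays the role of the accumulator: after each pass over a slice of
    the shared dimension it holds a partial result, and after the last pass
    it holds the final result.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0.0] * m for _ in range(n)]
    for start in range(0, k, tile):          # one slice of the shared dimension per pass
        stop = min(start + tile, k)
        for i in range(n):
            for j in range(m):
                acc[i][j] += sum(a[i][p] * b[p][j] for p in range(start, stop))
    return acc


matrix_a = [[1, 2, 3], [4, 5, 6]]            # input data (matrix A)
matrix_b = [[1, 0], [0, 1], [1, 1]]          # weights (matrix B)
result = matmul_with_accumulator(matrix_a, matrix_b)
```

Splitting the shared dimension into tiles is what makes the intermediate contents of `acc` partial results rather than final ones.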


Abstract

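This application discloses a perception network that can be applied in the field of artificial intelligence. The perception network includes a feature extraction network, in which: a first block performs convolution processing on input data to obtain M target feature maps; at least one second block performs convolution processing on M1 of the M target feature maps to obtain M1 first feature maps; a target operation processes M2 of the M target feature maps to obtain M2 second feature maps; and a concatenation operation concatenates the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map. By using a cross-layer target operation within the same stage to let the perception network generate the features that are highly similar to the key features, this application reduces the parameter count of the model and thereby improves the running speed of the model on GPU, TPU, and NPU devices.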

Description

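Perception network and data processing method
This application claims priority to Chinese Patent Application No. 202110221934.8, filed with the China Patent Office on February 27, 2021 and entitled "Perception network and data processing method", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a perception network and a data processing method.
Background
Artificial intelligence (AI) is the theory, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that seeks to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of perception, reasoning, and decision-making.
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It is the study of how to use cameras/video cameras and computers to obtain the data and information we need about a photographed subject. Figuratively speaking, it equips a computer with eyes (a camera/video camera) and a brain (algorithms) so that the computer can recognize, track, and measure targets in place of human eyes, and thereby perceive its environment. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses a computer in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
Inference models based on convolutional neural networks are widely used in computer-vision-based end tasks such as image recognition, object detection, and instance segmentation. Conventional backbone neural networks often cannot run these end tasks in real time because of their large parameter counts and computational cost. Existing lightweight inference networks (for example, mobilenet, efficientnet, and shufflenet) are designed for mobile devices such as central processing units (CPUs) and ARM (advanced RISC machine) processors. On processing units designed for high throughput, such as graphics processing unit (GPU) devices, tensor processing unit (TPU) devices, and neural network processing unit (NPU) devices, their performance is unsatisfactory, and their inference speed can even be slower than that of conventional convolutional neural networks.
Summary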
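In a first aspect, this application provides a perception network. The perception network includes a feature extraction network. The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation. The first block and the at least one second block are blocks in the same stage of the feature extraction network, and the parameter count of the target operation is less than the parameter count of the at least one second block;
The target operation may also be called a cheap operation. It is a collective term for a family of operations with a small parameter count, and the term is used to distinguish them from conventional convolution operations. The parameter count (parameters) describes the number of parameters contained in a neural network and is used to evaluate the size of the model.
The concatenation operation (concat) merges feature maps without changing their data. For example, concatenating feature map 1 and feature map 2 yields (feature map 1, feature map 2), where the order of feature map 1 and feature map 2 is not limited. More specifically, concatenating a feature map with three semantic channels and a feature map with five semantic channels yields a feature map with eight semantic channels.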
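The three-plus-five-channel example of the concatenation operation can be sketched as follows. This is a minimal illustration, assuming that a feature map is stored as a list of channels and each channel as a 2D list; the helper names `make_maps` and `concat_channels` are hypothetical.

```python
def make_maps(channels, height, width, fill):
    """Build `channels` feature maps of size height x width filled with `fill`."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]


def concat_channels(*groups):
    """Concatenate groups of feature maps along the channel axis.

    The data inside each channel is left untouched; only the channel lists
    are joined, in the order in which the groups are passed.
    """
    merged = []
    for group in groups:
        merged.extend(group)
    return merged


maps_a = make_maps(3, 2, 2, 1.0)   # three semantic channels
maps_b = make_maps(5, 2, 2, 2.0)   # five semantic channels
merged = concat_channels(maps_a, maps_b)
```

The spatial size of every channel must already match; concatenation changes only the channel count.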
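the first block is configured to perform convolution processing on input data to obtain M target feature maps, each target feature map corresponding to one channel;
the at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain M1 first feature maps, where M1 is less than M;
the target operation is configured to process M2 of the M target feature maps to obtain M2 second feature maps, where M2 is less than M; and
the concatenation operation is configured to concatenate the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map.
Within the same stage of a feature extraction network, the output features of different blocks are highly similar, whereas the output features of blocks in different stages are less similar. The embodiments of this application therefore use a cross-layer target operation within the same stage to let the perception network generate the features that are highly similar to the key features, which reduces the parameter count of the model and thereby improves its running speed on GPU, TPU, and NPU devices.
An embodiment of this application provides a perception network. The perception network includes a feature extraction network. The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation. The first block and the at least one second block are blocks in the same stage of the feature extraction network, and the parameter count of the target operation is less than the parameter count of the at least one second block. The first block is configured to perform convolution processing on input data to obtain M target feature maps, each target feature map corresponding to one channel. The at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain M1 first feature maps, where M1 is less than M. The target operation is configured to process M2 of the M target feature maps to obtain M2 second feature maps, where M2 is less than M. The concatenation operation is configured to concatenate the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map. In this way, a cross-layer target operation within the same stage lets the perception network generate the features that are highly similar to the key features, which reduces the parameter count of the model and thereby improves its running speed on GPU, TPU, and NPU devices.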
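In a possible implementation, the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.
In other words, the at least one second block takes as input, and produces as output, only some of the channels of the target feature maps output by the first block; the target feature maps of the remaining channels are processed by the target operation. Because the parameter count of the target operation is less than that of the at least one second block, the overall parameter count of the perception network is reduced, which can improve the running speed of the perception network on GPU, TPU, and NPU devices.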
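The parameter saving described above can be made concrete with a rough count. The sketch below is an illustration under simplifying assumptions that are not taken from the application: each second block is modeled as a single 3*3 convolution, the target operation as a single 1*1 convolution, biases are ignored, and the values M = 64, M1 = 32 and three second blocks are arbitrary.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def stage_params(m, m1, num_second_blocks, block_kernel=3, cheap_kernel=1):
    """Compare a stage whose second blocks see all M channels with the split design.

    baseline: every second block maps M channels to M channels.
    split:    second blocks map M1 channels to M1 channels, and one cheap
              convolution maps the remaining M2 = M - M1 channels.
    """
    m2 = m - m1
    baseline = num_second_blocks * conv_params(block_kernel, m, m)
    split = (num_second_blocks * conv_params(block_kernel, m1, m1)
             + conv_params(cheap_kernel, m2, m2))
    return baseline, split


baseline, split = stage_params(m=64, m1=32, num_second_blocks=3)
```

Under these assumptions the split stage needs about a quarter of the baseline weights, because the cost of a convolution grows with the product of its input and output channel counts.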
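In a possible implementation, the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or,
the target operation is a residual connection from the output of the first block to the output of the concatenation operation.
For example, the target operation is a 1*1 convolution. Alternatively, the target operation is a residual connection from the output of the first block to the output of the concatenation operation, in which case the M2 second feature maps are the M2 target feature maps themselves, that is, the M2 target feature maps are used directly as the M2 second feature maps.
It should be understood that the output feature maps of the first block may also be split into multiple groups of feature maps, with each of multiple target operations processing one group. It is only necessary that the sum of the channel counts of the feature maps output by the target operations and by the at least one second block equal the channel count of the feature maps output by the first block; the channel counts of the feature maps output by different target operations may differ.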
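In a possible implementation, the at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain a feature map output by each second block, where the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps;
the feature extraction network further includes:
a fusion operation, configured to fuse the feature maps output by each second block to obtain a fused feature map whose size is the same as that of the M2 second feature maps;
an addition operation is performed on the fused feature map and the M2 second feature maps to obtain M2 processed second feature maps; and
the concatenation operation is configured to concatenate the M1 first feature maps and the M2 processed second feature maps to obtain the concatenated feature map.
The feature maps output by each second block may be concatenated to obtain a concatenated feature map (whose channel count is the sum of the channel counts of the feature maps output by each second block). Because this channel count is greater than the channel count of the output of the target operation (M2), the concatenated feature map must undergo a dimension-reduction operation so that its channel count equals M2; the reduced feature map can then be added, as a matrix addition, to the output of the target operation (the M2 second feature maps).
In a possible implementation, the fusion operation is configured to concatenate the outputs of each second block and reduce their dimension, to obtain the fused feature map with the same size as the M2 second feature maps.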
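In a possible implementation, the first block and the at least one second block are blocks in a target stage of the feature extraction network, and the concatenated feature map serves as the output feature map of the target stage of the feature extraction network; or,
the target stage further includes at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
In the embodiments of this application, the concatenated feature map may serve as the output feature map of the target stage, or it may be further processed by other blocks included in the target stage (at least one third block) to obtain the output feature map of the target stage.
In a possible implementation, the first block may be the first block in the feature extraction network or a block in an intermediate layer. At least one third block may also be connected before the first block, in which case the first block performs convolution processing on the data output by the at least one third block.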
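In a possible implementation, the feature extraction network is configured to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
the perception network further includes:
a task network, configured to perform processing of a corresponding task according to the feature map of the input image, to obtain a processing result.
In a possible implementation, the task includes object detection, image segmentation, or image classification.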
第二方面,本申请提供了一种数据处理方法,所述方法包括:
获取特征提取网络,所述特征提取网络包括第一块block、串行连接的至少一个第二block、目标操作以及拼接操作,所述第一block以及所述M个第二block为所述特征提取网络中同一个阶段stage内的block,且所述目标操作的参数量小于所述M个第二block的参数量;
通过所述第一block,对输入数据进行卷积处理,以得到M个目标特征图,每个目标特征图对应于一个通道;
通过所述至少一个第二block,对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到M1个第一特征图,所述M1小于所述M;
通过所述目标操作,对所述M个目标特征图中的M2个目标特征图进行处理,以得到M2个第二特征图,所述M2小于所述M;
通过所述拼接操作,将所述M1个第一特征图和所述M2个第二特征图进行拼接,以得到拼接后的特征图。
在一种可能的实现中,所述M1个目标特征图与所述M2个目标特征图的交集为空,且所述M1与所述M2的加和为所述M,所述拼接后的特征图的通道数为所述M。
在一种可能的实现中,所述目标操作为参数量小于所述至少一个第二block的卷积操作;或,
所述目标操作为由所述第一block的输出到所述拼接操作的输出之间的残差连接操作。
在一种可能的实现中,所述至少一个第二block用于对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到每个第二block输出的特征图,其中,所述至少一个第二block中距离所述第一block最远的第二block的输出为所述M1个第一特征图;
所述方法还包括:
通过融合操作,对所述每个第二block输出的特征图进行融合,以得到融合后的特征图,所述融合后的特征图的尺寸大小与所述M2个第二特征图的尺寸大小相同;对所述融合后的 特征图和所述M2个第二特征图进行加法操作,以得到处理后的M2个第二特征图;
所述通过所述拼接操作,将所述M1个第一特征图和所述M2个第二特征图进行拼接,包括:
将所述M1个第一特征图和所述处理后的M2个第二特征图进行拼接,以得到拼接后的特征图。
在一种可能的实现中,所述通过融合操作,对所述每个第二block输出的特征图进行融合,包括:
通过融合操作,对所述每个第二block的输出进行拼接以及降维操作,以得到和所述M2个第二特征图的尺寸大小相同的所述融合后的特征图。
在一种可能的实现中,所述第一block以及所述M个第二block为所述特征提取网络中目标阶段stage内的block,所述拼接后的特征图用于作为所述特征提取网络中所述目标阶段stage的输出特征图;或,
所述目标阶段stage还包括至少一个第三block,所述至少一个第三block用于对所述拼接后的特征图进行卷积操作,以得到所述目标阶段stage的输出特征图。
在一种可能的实现中,所述特征提取网络用于获取输入的图像,并对所述输入的图像进行特征提取,输出所述输入的图像的特征图;
所述方法还包括:
通过任务网络,用于根据所述输入的图像的特征图,进行对应任务的处理,以得到处理结果。
在一种可能的实现中,所述任务包括目标检测、图像分割或图像分类。
第三方面,本申请提供了一种数据处理装置,所述装置包括:
获取模块,用于获取特征提取网络,所述特征提取网络包括第一块block、串行连接的至少一个第二block、目标操作以及拼接操作,所述第一block以及所述M个第二block为所述特征提取网络中同一个阶段stage内的block,且所述目标操作的参数量小于所述M个第二block的参数量;
卷积处理模块,通过所述第一block,对输入数据进行卷积处理,以得到M个目标特征图,每个目标特征图对应于一个通道;
通过所述至少一个第二block,对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到M1个第一特征图,所述M1小于所述M;
通过所述目标操作,对所述M个目标特征图中的M2个目标特征图进行处理,以得到M2个第二特征图,所述M2小于所述M;
拼接模块,用于通过所述拼接操作,将所述M1个第一特征图和所述M2个第二特征图进行拼接,以得到拼接后的特征图。
在一种可能的实现中,所述M1个目标特征图与所述M2个目标特征图的交集为空,且所述M1与所述M2的加和为所述M,所述拼接后的特征图的通道数为所述M。
在一种可能的实现中,所述目标操作为参数量小于所述至少一个第二block的卷积操作;或,
所述目标操作为由所述第一block的输出到所述拼接操作的输出之间的残差连接操作。
在一种可能的实现中,所述至少一个第二block用于对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到每个第二block输出的特征图,其中,所述至少一个第二block中距离所述第一block最远的第二block的输出为所述M1个第一特征图;
所述装置还包括:
融合模块,用于通过融合操作,对所述每个第二block输出的特征图进行融合,以得到融合后的特征图,所述融合后的特征图的尺寸大小与所述M2个第二特征图的尺寸大小相同;对所述融合后的特征图和所述M2个第二特征图进行加法操作,以得到处理后的M2个第二特征图;
所述拼接模块,用于将所述M1个第一特征图和所述处理后的M2个第二特征图进行拼接,以得到拼接后的特征图。
在一种可能的实现中,所述融合模块,用于通过融合操作,对所述每个第二block的输出进行拼接以及降维操作,以得到和所述M2个第二特征图的尺寸大小相同的所述融合后的特征图。
在一种可能的实现中,所述第一block以及所述M个第二block为所述特征提取网络中目标阶段stage内的block,所述拼接后的特征图用于作为所述特征提取网络中所述目标阶段stage的输出特征图;或,
所述目标阶段stage还包括至少一个第三block,所述至少一个第三block用于对所述拼接后的特征图进行卷积操作,以得到所述目标阶段stage的输出特征图。
在一种可能的实现中,所述特征提取网络用于获取输入的图像,并对所述输入的图像进行特征提取,输出所述输入的图像的特征图;
所述装置还包括:
任务处理模块,用于通过任务网络,用于根据所述输入的图像的特征图,进行对应任务的处理,以得到处理结果。
在一种可能的实现中,所述任务包括目标检测、图像分割或图像分类。
第四方面,本申请实施例提供了一种数据处理装置,可以包括存储器、处理器以及总线系统,其中,存储器用于存储程序,处理器用于执行存储器中的程序,以运行如上述第一方面及其任一可选的感知网络。
第五方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机运行如上述第一方面及其任一可选的感知网络。
第六方面,本申请实施例提供了一种计算机程序,包括代码,当代码被执行时,用于运行如上述第一方面及其任一可选的感知网络。
第七方面,本申请实施例提供了一种数据处理装置,可以包括存储器、处理器以及总线系统,其中,存储器用于存储程序,处理器用于执行存储器中的程序,以执行如上述第一方面及其任一可选的方法。
第八方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行如上述第一方面及其任一可选的方法。
第九方面,本申请实施例提供了一种计算机程序,包括代码,当代码被执行时,用于实现上述第一方面及其任一可选的方法。
第十方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持执行设备或训练设备实现上述方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据;或,信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存执行设备或训练设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
本申请实施例提供了一种感知网络,所述感知网络包括:特征提取网络,所述特征提取网络包括第一块block、串行连接的至少一个第二block、目标操作以及拼接操作,所述第一block以及所述M个第二block为所述特征提取网络中同一个阶段stage内的block,且所述目标操作的参数量小于所述M个第二block的参数量;所述第一block用于对输入数据进行卷积处理,以得到M个目标特征图,每个目标特征图对应于一个通道;所述至少一个第二block用于对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到M1个第一特征图,所述M1小于所述M;所述目标操作用于对所述M个目标特征图中的M2个目标特征图进行处理,以得到M2个第二特征图,所述M2小于所述M;所述拼接操作用于将所述M1个第一特征图和所述M2个第二特征图进行拼接,以得到拼接后的特征图。通过上述方式,利用相同stage之间跨层的目标操作来让感知网络生成这些与关键特征相似性高的特征,降低了模型的参数量,以此提高在GPU设备、TPU设备以及NPU设备上的模型的运行速度。
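Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence;
FIG. 2a is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 2b is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 2c is a schematic diagram of a convolutional neural network according to an embodiment of this application;
FIG. 2d is a schematic diagram of a convolutional neural network according to an embodiment of this application;
FIG. 3 is a schematic diagram of a system architecture according to an embodiment of this application;
FIG. 4 is a schematic diagram of an embodiment of a data processing method according to an embodiment of this application;
FIG. 5a is a schematic diagram of a perception network according to an embodiment of this application;
FIG. 5b is a schematic diagram of a perception network according to an embodiment of this application;
FIG. 5c is a schematic diagram of a perception network according to an embodiment of this application;
FIG. 5d is a schematic diagram of a perception network according to an embodiment of this application;
FIG. 6 to FIG. 14 are schematic diagrams of a perception network according to an embodiment of this application;
FIG. 15 is a schematic diagram of a data processing apparatus according to an embodiment of this application;
FIG. 16 is a schematic structural diagram of an execution device according to an embodiment of this application;
FIG. 17 is a schematic structural diagram of a training device according to an embodiment of this application;
FIG. 18 is a schematic structural diagram of a chip according to an embodiment of this application.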
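Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. The terms used in the implementation part of the present invention are intended only to explain specific embodiments of the present invention and are not intended to limit the present invention.
The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, the claims, and the accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to the process, method, product, or device.
First, the overall workflow of an artificial intelligence system is described. Refer to FIG. 1, which is a schematic structural diagram of the main framework of artificial intelligence. The framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to data processing. For example, it may be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data is refined from data to information, to knowledge, and then to wisdom. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.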
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据, 这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请实施例主要应用在驾驶辅助、自动驾驶、手机终端等需要完成多种感知任务的领域。视频经过抽帧得到单张图片,该图片送入到本发明中的感知网络,得到该图片中感兴趣物体的2D、3D、Mask(掩膜)、关键点等信息。这些检测结果输出到后处理模块进行处理,比如在自动驾驶系统中送入规划控制单元进行决策、在手机终端中送入美颜算法进行处理得到美颜后的图片。下面分别对ADAS/ADS视觉感知系统和手机美颜两种应用场景做简单的介绍。
应用场景1:ADAS/ADS视觉感知系统
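As shown in FIG. 2a, ADAS and ADS require real-time 2D detection of multiple types of targets, including: dynamic obstacles (pedestrians (Pedestrian), cyclists (Cyclist), tricycles (Tricycle), cars (Car), trucks (Truck), and buses (Bus)); static obstacles (traffic cones (TrafficCone), traffic sticks (TrafficStick), fire hydrants (FireHydrant), motorcycles (Motocycle), and bicycles (Bicycle)); and traffic signs (TrafficSign), guide signs (GuideSign), billboards (Billboard), red traffic lights (TrafficLight_Red), yellow traffic lights (TrafficLight_Yellow), green traffic lights (TrafficLight_Green), black traffic lights (TrafficLight_Black), and road signs (RoadSign). In addition, to accurately obtain the region that a dynamic obstacle occupies in 3D space, 3D estimation must be performed on the dynamic obstacle and a 3D box output. To fuse with lidar data, the Mask of a dynamic obstacle must be obtained so that the laser points that hit the dynamic obstacle can be selected from the point cloud. To park accurately in a parking space, the four key points of the parking space must be detected at the same time. For mapping and localization, the key points of static targets must be detected. With the technical solutions provided in the embodiments of this application, all or some of these functions can be completed in the perception network.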
应用场景2:手机美颜功能
如图2b所示,在手机中,通过本申请实施例提供的感知网络检测出人体的Mask和关键点,可以对人体相应的部位进行放大缩小,比如进行收腰和美臀操作,从而输出美颜的图片。
应用场景3:图像分类场景:
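After obtaining an image to be classified, the object recognition apparatus uses the object recognition method of this application to obtain the category of the object in the image, and can then classify the image according to that category. A photographer takes many photos every day: of animals, of people, and of plants. The method of this application can quickly sort the photos by content into photos containing animals, photos containing people, and photos containing plants.
When the number of images is large, manual classification is inefficient, and a person who handles the same task for a long time tires easily, so the classification results then contain large errors. The method of this application can classify the images quickly and avoids such errors.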
应用场景4商品分类:
物体识别装置获取商品的图像后,然后采用本申请的物体识别方法获取商品的图像中商品的类别,然后根据商品的类别对商品进行分类。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)物体识别,利用图像处理和机器学习、计算机图形学等相关方法,确定图像物体的类别。
(2)神经网络
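A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:
$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$
where s = 1, 2, ..., n, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.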
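The output of a single neural unit, a weighted sum of the inputs plus the bias passed through a sigmoid activation, can be written out directly. This is a minimal sketch; the function names and the example values are illustrative only.

```python
import math


def sigmoid(z):
    """Sigmoid activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(xs, ws, b):
    """Compute f(sum_s W_s * x_s + b) for one neural unit."""
    return sigmoid(sum(w * x for w, x in zip(ws, xs)) + b)


# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
out = neuron_output(xs=[1.0, 2.0], ws=[0.5, -0.25], b=0.0)
```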
(3)深度神经网络
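A deep neural network (Deep Neural Network, DNN) can be understood as a neural network with many hidden layers. There is no particular threshold for "many" here; what are commonly called multilayer neural networks and deep neural networks are essentially the same thing. Divided by the position of the layers, the layers inside a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. Adjacent layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is not; in short, it is the following linear relationship expression:
$$\vec{y} = \alpha\left(W\vec{x} + \vec{b}\right)$$
where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer simply applies this operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. How are these parameters defined in a DNN? Consider the coefficient $W$ first. Taking a three-layer DNN as an example, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 is the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^{L}_{jk}$. Note that the input layer has no $W$ parameters. In a deep neural network, more hidden layers make the network better able to describe complex situations in the real world. In theory, a model with more parameters has higher complexity and greater "capacity", which means that it can complete more complex learning tasks.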
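The per-layer relationship between the input vector, the weight matrix, the offset vector and the activation function can be sketched as one function. In this illustration, `weights[j][k]` is the coefficient from input neuron k to output neuron j, which matches the index order of the coefficient definition above; the choice of ReLU as the activation is an assumption made only for the example.

```python
def relu(v):
    """Example activation function (an assumption; any activation could be used)."""
    return max(0.0, v)


def layer_forward(weights, bias, x, activation=relu):
    """Compute y = activation(W x + b) for one fully connected layer.

    weights[j][k] is the coefficient from input neuron k to output neuron j.
    """
    return [
        activation(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
        for row, b_j in zip(weights, bias)
    ]


w = [[1.0, 2.0],
     [3.0, 4.0]]
y = layer_forward(w, bias=[0.0, -10.0], x=[1.0, 1.0])
```

Stacking several calls to `layer_forward`, each with its own weights and bias, gives the forward pass of a multilayer network.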
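(4) A convolutional neural network (Convolutional Neural Network, CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving a trainable filter with an input image or a convolutional feature plane (feature map). A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may consist of neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as extracting image information in a way that is independent of position. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part. The same learned image information can therefore be used at all positions in the image. In the same convolutional layer, multiple convolution kernels may be used to extract different image information; in general, the more convolution kernels there are, the richer the image information reflected by the convolution operation.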
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
由于CNN是一种非常常见的神经网络,下面结合图2c重点对CNN的结构进行详细的介绍。如上文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
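In FIG. 2c, the convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120 (where the pooling layer is optional), and a neural network layer 130. The input layer 110 may obtain an image to be processed and pass it to the convolutional layer/pooling layer 120 and the subsequent neural network layer 130 for processing, to obtain the processing result of the image. The internal layer structure of the CNN 100 in FIG. 2c is described in detail below.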
卷积层/池化层120:
卷积层:
如图2c所示卷积层/池化层120可以包括如示例121-126层,举例来说:在一种实现中, 121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
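The inner working of a convolutional layer is described below using convolutional layer 121 as an example.
Convolutional layer 121 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing as a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image in the horizontal direction one pixel at a time (or two pixels at a time, depending on the value of the stride), to extract a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution, the weight matrix extends through the entire depth of the input image. Convolving with a single weight matrix therefore produces a convolutional output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension is determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features of the image: for example, one weight matrix extracts edge information, another extracts a specific color, and yet another blurs unwanted noise in the image. Because the multiple weight matrices have the same size (rows × columns), the convolutional feature maps they extract also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.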
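The sliding of a weight matrix over an image with a given stride can be sketched for the single-channel case. This is a simplified illustration: it handles one input channel and one weight matrix, uses no padding, and the function name `conv2d` and the example kernel are chosen only for this example.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both 2D lists) and return the output map.

    At each position the overlapping window is multiplied element-wise with
    the weight matrix and summed; `stride` sets how many pixels the window
    moves between positions.
    """
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diagonal_difference = [[1, 0],
                       [0, -1]]
feature_map = conv2d(image, diagonal_difference)
```

Applying several weight matrices of the same size to the same image and stacking the resulting maps gives the depth dimension of the layer output.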
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
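Pooling layer:
Because the number of training parameters often needs to be reduced, pooling layers are often introduced periodically after convolutional layers. In layers 121 to 126 illustrated at 120 in FIG. 2c, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of a pooling layer is to reduce the spatial size of the image. A pooling layer may include an average pooling operator and/or a max pooling operator, which sample the input image to obtain a smaller image. The average pooling operator computes the average of the pixel values within a given range as the result of average pooling. The max pooling operator takes the pixel with the largest value within a given range as the result of max pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operator in a pooling layer should also be related to the image size. The image output by a pooling layer may be smaller than the image input to it, and each pixel in the output image represents the average or maximum value of the corresponding subregion of the input image.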
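The average and max pooling operators can be sketched with non-overlapping windows. This is a minimal illustration; the function name `pool2d` and the choice of a window that moves by its own size are assumptions made for the example.

```python
def pool2d(image, size, mode="max"):
    """Reduce `image` (a 2D list) with non-overlapping size x size windows.

    mode="max" keeps the largest pixel in each window;
    mode="avg" keeps the mean of the pixels in each window.
    """
    output = []
    for i in range(0, len(image) - size + 1, size):
        row = []
        for j in range(0, len(image[0]) - size + 1, size):
            window = [image[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        output.append(row)
    return output


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
max_pooled = pool2d(image, 2, mode="max")
avg_pooled = pool2d(image, 2, mode="avg")
```

Each output pixel summarizes one 2*2 subregion of the input, so a 4*4 image becomes a 2*2 image.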
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2c所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2c由120至140方向的传播为前向传播)完成,反向传播(如图2c由140至120方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失,及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
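Note that the convolutional neural network 100 shown in FIG. 2c is only one example of a convolutional neural network; in specific applications, a convolutional neural network may also exist in the form of other network models.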
本申请实施例的图像处理方法具体采用的神经网络的结构可以如图2d所示。在图2d中,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120(其中池化层为可选的),以及神经网络层130。与图2c相比,图2d中的卷积层/池化层120中的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
需要说明的是,图2c和图2d所示的卷积神经网络仅作为一种示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
(5)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(6)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量 预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
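(7) Back propagation algorithm
A neural network may use the error back propagation (BP) algorithm to correct the values of the parameters of the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, passing the input signal forward until the output produces an error loss, and the parameters of the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward pass driven by the error loss, and aims to obtain the optimal parameters of the neural network model, such as the weight matrices.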
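The loop of forward pass, loss computation, and parameter update driven by the error can be shown on the smallest possible model. The sketch below is an illustration only: it fits a single weight `w` in the model y = w * x with a mean squared error loss and plain gradient descent, and the learning rate, step count, and data are arbitrary choices.

```python
def mse_loss(w, xs, ys):
    """Loss function: mean squared difference between prediction and target."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.1, steps=100):
    """Repeatedly propagate the error back to the weight and update it."""
    w = 0.0                                   # initial parameter value
    for _ in range(steps):
        # Gradient of the loss with respect to w (the backward pass).
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad             # move against the gradient
    return w


xs, ys = [1.0, 2.0], [2.0, 4.0]               # targets follow y = 2 * x
w_trained = train(xs, ys)
```

The loss shrinks at every step here, which is the convergence of the error loss that the text describes; real networks apply the same idea to every weight matrix through the chain rule.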
下面介绍本申请实施例提供系统架构。
参见图3,本申请实施例提供了一种系统架构100。如所述系统架构100所示,数据采集设备160用于采集训练数据,本申请实施例中训练数据包括:物体的图像或者图像块及物体的类别;并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到CNN特征提取网络(解释说明:这里的特征提取网络就是前面介绍的经训练阶段训练得到的模型,可以是用于特征提取的神经网络等)。下面将以实施例一更详细地描述训练设备120如何基于训练数据得到CNN特征提取网络,该CNN特征提取网络能够用于实现本申请实施例提供的感知网络,即,将待识别图像或图像块通过相关预处理后输入该CNN特征提取网络。本申请实施例中的CNN特征提取网络具体可以为CNN卷积神经网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行CNN特征提取网络的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则可以应用于不同的系统或设备中,如应用于图3所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)AR/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图3中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:图像或者图像块。
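While the execution device 110 preprocesses the input data, or while the computation module 111 of the execution device 110 performs computation or other related processing (for example, implementing the functions of the neural network in this application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained from the processing in the data storage system 150.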
最后,I/O接口112将处理结果,如上述得到的图像或图像块或者图片中感兴趣物体的2D、3D、Mask、关键点等信息返回给客户设备140,从而提供给用户。
可选地,客户设备140,可以是自动驾驶系统中的规划控制单元、手机终端中的美颜算法模块。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则,该相应的目标模型/规则即可以用于实现上述目标或完成上述 任务,从而为用户提供所需的结果。
在图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图3仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图3中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
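The perception network provided in the embodiments of this application is first described using the model inference stage as an example.
Refer to FIG. 6, which is a schematic structural diagram of a perception network according to an embodiment of this application. As shown in FIG. 6, the perception network includes a feature extraction network. The feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation. The first block and the at least one second block are blocks in the same stage of the feature extraction network (shown as the target stage in FIG. 6).
A feature extraction network usually includes multiple stages, each stage may include multiple blocks, and each block may consist of at least one convolution operation. Refer to FIG. 5a, which is a schematic structural diagram of a perception network in the prior art. The feature extraction network includes four stages (stage1, stage2, stage3, and stage4). The input and output feature maps of the blocks in each stage have the same size, meaning that both the number of channels and the size of the feature map of each channel are the same. For example, the input and output feature maps of the blocks in stage1 are 56*56 with 24 channels; in stage2, 28*28 with 40 channels; in stage3, 14*14 with 80 channels; and in stage4, 7*7 with 160 channels. Each block may include at least one convolution operation, for example the three convolution operations shown in FIG. 5a (a 1*1 convolution, a 3*3 convolution, and a 1*1 convolution), together with a residual connection from the input to the output.
Specifically, refer to FIG. 5b, in which each convolution module represents a stage: convolution module 1 corresponds to stage1 in FIG. 5a, convolution module 2 to stage2, convolution module 3 to stage3, and convolution module 4 to stage4. Feature map C1 is the feature map output by stage1, feature map C2 is the feature map output by stage2, feature map C3 is the feature map output by stage3, and feature map C4 is the feature map output by stage4. Taking stage1 as an example, refer to FIG. 5c, in which convolution module 1 may include multiple convolutional layers (also called blocks).
The feature extraction network may extract features from an input image to obtain an output feature map. The output feature map may be input to a task network, which performs the corresponding task to obtain a processing result. For example, if the task is object detection, the processing result may be the detection box in which a target in the image is located; if the task is image segmentation, the processing result may be the segmented region in which a target in the image is located.
Refer to FIG. 5d. Within the same stage of a feature extraction network, the output features of different blocks are highly similar, whereas the output features of blocks in different stages are less similar. The embodiments of this application therefore use a cross-layer target operation within the same stage to let the perception network generate the features that are highly similar to the key features, which reduces the parameter count of the model and thereby improves its running speed on GPU, TPU, and NPU devices.
The perception network in the embodiments of this application is described in detail next.
In the embodiments of this application, the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation. The first block and the at least one second block are blocks in the same stage of the feature extraction network, and the parameter count of the target operation is less than the parameter count of the at least one second block.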
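Referring to FIG. 6, the first block is configured to perform convolution processing on input data to obtain M target feature maps, each target feature map corresponding to one channel.
The at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain M1 first feature maps, where M1 is less than M. The target operation is configured to process M2 of the M target feature maps to obtain M2 second feature maps, where M2 is less than M. The concatenation operation is configured to concatenate the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map.
In a possible implementation, the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.
In other words, the at least one second block takes as input, and produces as output, only some of the channels of the target feature maps output by the first block; the target feature maps of the remaining channels are processed by the target operation. Because the parameter count of the target operation is less than that of the at least one second block, the overall parameter count of the perception network is reduced, which can improve the running speed of the perception network on GPU, TPU, and NPU devices.
In a possible implementation, the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or, the target operation is a residual connection from the output of the first block to the output of the concatenation operation.
For example, the target operation is a 1*1 convolution. Alternatively, the target operation is a residual connection from the output of the first block to the output of the concatenation operation, in which case the M2 second feature maps are the M2 target feature maps themselves, that is, the M2 target feature maps are used directly as the M2 second feature maps.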
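The data flow of one such stage (split the M target feature maps, run the serial second blocks on M1 of them, run the target operation on the other M2, then concatenate) can be sketched as follows. This is an illustration under stated assumptions: the function `run_stage` is hypothetical, the first M1 channels are arbitrarily chosen as the ones sent to the second blocks, each feature map is reduced to a short list of numbers, and the second blocks are stand-in functions rather than real convolutions.

```python
def run_stage(target_maps, m1, second_blocks, target_op):
    """Process the output of the first block through one split stage.

    target_maps:   list of M feature maps (one per channel)
    m1:            number of maps sent through the serial second blocks
    second_blocks: functions applied one after another to the M1 maps
    target_op:     cheap operation applied to the remaining M2 = M - M1 maps
    """
    first_part = target_maps[:m1]            # M1 target feature maps
    second_part = target_maps[m1:]           # M2 target feature maps
    for block in second_blocks:              # second blocks are connected in series
        first_part = block(first_part)
    cheap_part = target_op(second_part)
    return first_part + cheap_part           # concatenation along the channel axis


def double_every_value(maps):
    """Stand-in for a second block (not a real convolution)."""
    return [[2.0 * v for v in channel] for channel in maps]


def identity(maps):
    """Target operation in the residual-connection variant."""
    return maps


target_maps = [[1.0], [2.0], [3.0], [4.0]]   # M = 4 channels
stage_output = run_stage(target_maps, m1=2,
                         second_blocks=[double_every_value, double_every_value],
                         target_op=identity)
```

With `identity` as the target operation, the last M2 channels of the output are exactly the corresponding target feature maps, which is the residual-connection case; replacing it with a small convolution gives the other variant.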
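It should be understood that, referring to FIG. 7, the output feature maps of the first block may also be split into multiple groups of feature maps, with each of multiple target operations processing one group. It is only necessary that the sum of the channel counts of the feature maps output by the target operations and by the at least one second block equal the channel count of the feature maps output by the first block; the channel counts of the feature maps output by different target operations may differ.
In a possible implementation, the first block and the at least one second block are blocks in a target stage of the feature extraction network, and the concatenated feature map serves as the output feature map of the target stage of the feature extraction network; or, the target stage further includes at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
In the embodiments of this application, the concatenated feature map may serve as the output feature map of the target stage, or it may be further processed by other blocks included in the target stage (third blocks). For example, referring to FIG. 8, the concatenated feature map may be further processed by at least one third block to obtain the output feature map of the target stage.
In a possible implementation, the first block may be the first block in the feature extraction network or a block in an intermediate layer. Referring to FIG. 9, at least one third block may also be connected before the first block, in which case the first block performs convolution processing on the data output by the at least one third block.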
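In a possible implementation, the at least one second block is configured to perform convolution processing on M1 of the M target feature maps to obtain a feature map output by each second block, where the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps. A fusion operation fuses the feature maps output by each second block to obtain a fused feature map whose size is the same as that of the M2 second feature maps, and the fused feature map is added to the M2 second feature maps to obtain M2 processed second feature maps. The concatenation operation then concatenates the M1 first feature maps and the M2 processed second feature maps to obtain the concatenated feature map.
Referring to FIG. 10, each second block outputs a feature map, and the fusion operation acts on the feature maps output by each second block and on the output of the target operation (the M2 second feature maps). The fusion operation concatenates the outputs of each second block and reduces their dimension, to obtain the fused feature map with the same size as the M2 second feature maps.
Specifically, referring to FIG. 11, the feature maps output by each second block may be concatenated to obtain a concatenated feature map (whose channel count is the sum of the channel counts of the feature maps output by each second block). Because this channel count is greater than the channel count of the output of the target operation (M2), the concatenated feature map must undergo a dimension-reduction operation so that its channel count equals M2; the reduced feature map can then be added, as a matrix addition, to the output of the target operation (the M2 second feature maps).
Taking five second blocks as an example, the outputs of all five second blocks are fused into the final output. The five features are first concatenated as Z = [Y1, Y2, Y3, Y4, Y5], and a dimension-reduction operation is then applied to Z so that its dimension matches M2, after which the addition is performed. The specific formulas may be as follows:
$$\tau(Z) = W_2\,\sigma\left(W_1\,\mathrm{pooling}(Z) + b_1\right) + b_2$$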
Figure PCTCN2022077881-appb-000011
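The fusion step (concatenate the outputs of the second blocks, pool, apply two linear maps with an activation in between, then add the result to the output of the target operation) can be sketched as follows. Several details are assumptions rather than statements of the application: pooling is taken to be global average pooling per channel, the activation between the two linear maps is taken to be ReLU, and the resulting length-M2 vector is added to each of the M2 second feature maps as a per-channel offset. All function names are hypothetical.

```python
def global_avg_pool(maps):
    """Assumed pooling: one average value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in maps]


def linear(weights, bias, vec):
    """Compute W * vec + b, with weights given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    """Assumed activation between the two linear maps."""
    return [max(0.0, v) for v in vec]


def fuse(block_outputs, w1, b1, w2, b2):
    """Concatenate the block outputs into Z, then reduce Z to M2 values."""
    z = [ch for maps in block_outputs for ch in maps]   # Z = [Y1, Y2, ...]
    return linear(w2, b2, relu(linear(w1, b1, global_avg_pool(z))))


def add_per_channel(maps, offsets):
    """Add one fused value to every pixel of the matching second feature map."""
    return [[[v + o for v in row] for row in ch] for ch, o in zip(maps, offsets)]


y1 = [[[1.0, 1.0], [1.0, 1.0]]]              # output of the first second block
y2 = [[[3.0, 3.0], [3.0, 3.0]]]              # output of the last second block
fused = fuse([y1, y2],
             w1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
             w2=[[0.5, 0.5]], b2=[0.0])      # reduces 2 channels to M2 = 1
second_maps = [[[10.0, 10.0], [10.0, 10.0]]]
processed = add_per_channel(second_maps, fused)
```

The second linear map is what brings the channel count of the concatenated features down to M2, so that the addition with the output of the target operation is well defined.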
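In the embodiments of this application, the feature extraction network in the perception network may be built by stacking the stages provided in the foregoing embodiments. As an example, one network structure may be as shown in Table 1, where output denotes the size of the output feature map and #out denotes the number of channels of the output feature map (the channel count of Block and Cheap is actually 1/2 of #out).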
表1
Figure PCTCN2022077881-appb-000012
在一种可能的实现中,所述特征提取网络用于获取输入的图像,并对所述输入的图像进行特征提取,输出所述输入的图像的特征图;所述感知网络还包括:任务网络,用于根据所述输入的图像的特征图,进行对应任务的处理,以得到处理结果。其中,所述任务包括目标检测、图像分割或图像分类等等图像处理任务。
接下来描述一种感知网络的结构示意:
参照图11,本申请实施例中的特征提取网络可以为图11中所示的主干网络,特征金字塔(feature pyramid network,FPN)与主干网络backbone连接,FPN可以对主干网络backbone生成的多个不同分辨率的特征图进行卷积处理,来构造特征金字塔。
参照图12,图12为一种FPN的结构示意,其中,使用卷积模块1对最顶层特征图C4进行处理,卷积模块1可以包括至少一个卷积层,示例性的,卷积模块1可以使用空洞卷积和1×1卷积将最顶层特征图C4的通道数下降为256,作为特征金字塔的最顶层特征图P4;横向链接最顶层下一层特征图C3的输出结果并使用1×1卷积(卷积模块2)降低通道数至256后,与特征图p4逐像素相加得到特征图p3;以此类推,从上到下,构建出特征金字塔Φp={特征图p4,特征图p3,特征图p2,特征图p1}。
本申请实施例中,FPN包括多个卷积模块,每个卷积模块包括多个卷积层,每个卷积模块可以对输入的特征图进行卷积处理,本申请实施例中FPN包括的第二卷积层为FPN包括的多个卷积层中的一个。
需要说明的是,图12中示出的FPN仅为一种实现方式,并不构成对本申请的限定。
本申请实施例中,以任务网络要实现的任务为目标检测为例,header与FPN连接,header可以根据FPN提供的特征图,完成一个任务的2D框的检测,输出这个任务的物体的2D框以及对应的置信度等等,接下来描述一种header的结构示意,图13a为一种header的示意,如图13a中示出的那样,Header包括候选区域生成网络(Region Proposal Network,RPN)、ROI-ALIGN和RCNN三个模块。
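The RPN module may be used to predict, on one or more feature maps provided by the FPN, the region in which the task object is located, and to output candidate 2D boxes matching that region. In other words, the RPN predicts, on one or more feature maps output by the FPN, regions in which the task object may exist, and gives boxes for these regions; these regions are called candidate regions (Proposals). For example, when the Header is responsible for detecting cars, its RPN layer predicts candidate boxes in which a car may exist; when the Header is responsible for detecting people, its RPN layer predicts candidate boxes in which a person may exist. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain an object of the task, and on the other hand the boxes are not tight.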
2D候选区域预测流程可以由Header的RPN模块实施,其根据FPN提供的特征图,预测出可能存在该任务物体的区域,并且给出这些区域的候选框(也可以叫候选区域,Proposal)。在本实施例中,若Header负责检测车,其RPN层就预测出可能存在车的候选框。
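Referring to FIG. 14, convolution module 1 (for example, a 3*3 convolution) is applied to a feature map provided by the FPN to generate the feature map RPN Hidden. The RPN layer of the Header then predicts Proposals from RPN Hidden. Specifically, the RPN layer of the Header uses convolution module 2 and convolution module 3 (for example, each a 1*1 convolution) to predict the coordinates and the confidence of the Proposal at each position of RPN Hidden. The higher the confidence, the greater the probability that the Proposal contains an object of the task. For example, the larger the score of a Proposal in the Header, the greater the probability that it contains a car. The Proposals predicted by each RPN layer pass through a Proposal merging module, which removes redundant Proposals according to the degree of overlap between Proposals (this process may use, but is not limited to, the NMS algorithm), and from the remaining K Proposals selects the N (N<K) Proposals with the largest scores as candidate regions in which an object may exist. As can be seen from FIG. 14, these Proposals are inaccurate: on the one hand they do not necessarily contain an object of the task, and on the other hand the boxes are not tight. The RPN module is therefore only a coarse detection process, and the subsequent RCNN module is needed for refinement. When the RPN module regresses the coordinates of a Proposal, it does not regress the absolute coordinate values directly but regresses coordinates relative to an Anchor. The better these Anchors match actual objects, the greater the probability that the RPN detects the objects.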
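The Proposal merging step, which removes redundant Proposals by overlap and then keeps the N highest-scoring ones, can be sketched with a plain NMS loop. This is an illustration under stated assumptions: boxes are (x1, y1, x2, y2) tuples, overlap is measured by intersection over union, and the threshold of 0.5 and the function names are choices made for the example, not values from the application.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def merge_proposals(proposals, iou_threshold=0.5, top_n=None):
    """Drop proposals that overlap a higher-scoring one, then keep the top N.

    proposals: list of (box, score) pairs.
    """
    ordered = sorted(proposals, key=lambda p: p[1], reverse=True)
    kept = []
    for box, score in ordered:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept[:top_n]                       # a top_n of None keeps all survivors


proposals = [((0, 0, 10, 10), 0.9),
             ((1, 1, 10, 10), 0.8),           # overlaps the first box heavily
             ((20, 20, 30, 30), 0.7)]
kept = merge_proposals(proposals)
best = merge_proposals(proposals, top_n=1)
```

Because the survivors are already sorted by score, selecting the N best Proposals is a simple slice of the kept list.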
ROI-ALIGN模块用于根据所述RPN模块预测得到的区域,从所述FPN提供的一个特征图中扣取出所述候选2D框所在区域的特征;也就是说,ROI-ALIGN模块主要根据RPN模块提供的Proposal,在某个特征图上把每个Proposal所在的区域的特征扣取出来,并且resize到固定的大小,得到每个Proposal的特征。可以理解的是,ROI-ALIGN模块可以使用但不局限于ROI-POOLING(感兴趣区域池化)/ROI-ALIGN(感兴趣区域提取)/PS-ROIPOOLING(位置敏感的感兴趣区域池化)/PS-ROIALIGN(位置敏感的感兴趣区域提取)等特征抽取方法。
RCNN模块用于通过神经网络对所述候选2D框所在区域的特征进行卷积处理,得到所述候选2D框属于各个物体类别的置信度;通过神经网络对所述候选区域2D框的坐标进行调整,使得调整后的2D候选框比所述候选2D框与实际物体的形状更加匹配,并选择置信度大于预设阈值的调整后的2D候选框作为所述区域的2D框。也就是说,RCNN模块主要是对ROI-ALIGN模块提出的每个Proposal的特征进行细化处理,得到每个Proposal的属于各个类别置信度(比如对于车这个任务,会给出Backgroud/Car/Truck/Bus 4个分数),同时对Proposal的2D框的坐标进行调整,输出更加紧致的2D框。这些2D框经过非极大值抑制(non maximum suppression,NMS)合并后,作为最后的2D框输出。
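The fine classification of 2D candidate regions is mainly performed by the RCNN module of the Header in FIG. 14. Based on the features of each Proposal extracted by the ROI-ALIGN module, it further regresses tighter 2D box coordinates, classifies the Proposal, and outputs the confidence that it belongs to each category. The RCNN can be implemented in many forms, one of which is shown in FIG. 13. The features output by the ROI-ALIGN module may have a size of N*14*14*256 (Feature of proposals). In the RCNN module they are first processed by convolution module 4 of Resnet18 (Res18-Conv5), giving features of size N*7*7*512, and then by a Global Avg Pool (average pooling layer), which averages the 7*7 features within each channel of the input features to obtain N*512 features, where each 1*512-dimensional feature vector represents the features of one Proposal. Next, two fully connected (FC) layers respectively regress the precise coordinates of the box (outputting an N*4 vector whose four values respectively represent the x/y coordinates of the box center and the width and height of the box) and the confidence of the box category (in Header0, scores for Backgroud/Car/Truck/Bus must be given). Finally, a box merging operation selects the several boxes with the largest scores and removes duplicate boxes through an NMS operation, to obtain tight box outputs.
In some practical application scenarios, the perception network may further include other Headers, which can perform 3D/Mask/Keypoint detection on the basis of the detected 2D boxes. Taking 3D as an example, the ROI-ALIGN module extracts, from the feature map output by the FPN, the features of the region in which each 2D box is located, according to the accurate 2D boxes provided by the Header. Assuming the number of 2D boxes is M, the features output by the ROI-ALIGN module have a size of M*14*14*256. They are first processed by convolution module 5 of Resnet18 (for example, Res18-Conv5), giving features of size M*7*7*512, and then by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel of the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box. Next, three fully connected (FC) layers respectively regress the orientation angle of the object in the box (orientation, an M*1 vector), the centroid coordinates (centroid, an M*2 vector whose two values represent the x/y coordinates of the centroid), and the length, width, and height (dimension).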
本申请实施例中,header包括至少是一个卷积模块,每个卷积模块包括至少一个卷积层,每个卷积模块可以对输入的特征图进行卷积处理,本申请实施例中header包括的第三卷积层为header包括的多个卷积层中的一个。
需要说明的是,图13和图14示出的header仅为一种实现方式,并不构成对本申请的限定。
接下来进行本申请实施例中提供的感知网络的效果描述:
本申请实施例在图像分类数据集CIFAR10的结果可以如表2和表3所示,相比于ResNet等其他方法,本申请实施例中提供的感知网络结构在计算量和参数量最小的情况下达到了很高的精度,其中,表2为图像分类数据集CIFAR10上的结果,表3为图像分类数据集CIFAR10上与现有轻量级网络的对比。
表2
Figure PCTCN2022077881-appb-000013
表3
Figure PCTCN2022077881-appb-000014
本申请实施例中提供的感知网络在图像分类数据集ImageNet的结果可以如表4、表5和表6所示,对比具有同样推理速度的基线网络ResNet,本申请实施例中提供的感知网络可以在该大型分类数据集上提升1.4%的推理精度。相比于其他一系列轻量级推理网络,本申请实施例中提供的感知网络能同时达到最快的推理速度和较高的推理精度,表4为图像分类数据集ImageNet上与基线网络ResNet的对比,表5为图像分类数据集ImageNet上与另一基线网络RegNet的对比,表6为图像分类数据集ImageNet上与其他轻量级网络的对比。
表4
Figure PCTCN2022077881-appb-000015
表5
Figure PCTCN2022077881-appb-000016
表6
Figure PCTCN2022077881-appb-000017
以任务网络实现的任务为目标检测为例,实验结果可以如表7所示,本申请实施例中提供的感知网络在达到最高mAP的同时还具有最快的推理速度,达到每秒25.9帧。
表7
Figure PCTCN2022077881-appb-000018
本申请实施例提供了一种感知网络,所述感知网络包括:特征提取网络,所述特征提取网络包括第一块block、串行连接的至少一个第二block、目标操作以及拼接操作,所述第一block以及所述M个第二block为所述特征提取网络中同一个阶段stage内的block,且所述目标操作的参数量小于所述M个第二block的参数量;所述第一block用于对输入数据进行卷积处理,以得到M个目标特征图,每个目标特征图对应于一个通道;所述至少一个第二block用于对所述M个目标特征图中的M1个目标特征图进行卷积处理,以得到M1个第一特征图,所述M1小于所述M;所述目标操作用于对所述M个目标特征图中的M2个目标特征图进行处理,以得到M2个第二特征图,所述M2小于所述M;所述拼接操作用于将所述M1个第一特征图和所述M2个第二特征图进行拼接,以得到拼接后的特征图。通过上述方式,利用相同stage之间跨层的目标操作来让感知网络生成这些与关键特征相似性高的特征,降低了模型的参数量,以此提高在GPU设备、TPU设备以及NPU设备上的模型的运行速度。
接下来从产品应用的角度,介绍几种本申请实施例的应用场景。
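The embodiments of this application are mainly applied in computer-vision-based fields such as mobile phone terminals, cloud services, detection, segmentation, and low-level vision. Faced with parallel computation over massive data, computer vision and its related tasks place ever higher demands on computing power, and therefore on the computing speed and power consumption of hardware. GPU and NPU processing units are now being deployed in large numbers on more and more terminal devices, such as mobile phone chips. The lightweight network model proposed in the embodiments of this application can greatly improve inference speed on GPUs, and can serve as a backbone that directly replaces the base feature extraction network of an existing model such as an object detector, for deployment in scenarios such as autonomous driving. In practice, it can be adapted to a wide range of application scenarios and devices (such as mobile phone terminals and cloud servers), using this fast inference network for tasks such as data processing and image retrieval.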
在一种场景中,本申请实施例可以部署在手机终端上,为用户带来高效准确的推理。如手机拍照后的图像处理、图片识别、手机端目标检测等。
在一种场景中,本发明可以在云服务上进行轻量级部署,为用户提供高效数据处理的服务,助力深度学习提速增效。用户上传待处理的数据,即可通过云服务上的推理模型进行快速的数据处理。
在一种场景中,自动驾驶任务中,对视野范围内的行人、车辆等目标进行实时的检测对车辆做出正确的驾驶决策至关重要。本发明可以直接替换现有目标检测器的特征提取模块,压缩并加速检测器的推理过程。
参照图4,本申请实施例还提供了一种数据处理方法,所述方法包括:
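1401. Obtain a feature extraction network, where the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation; the first block and the M second blocks are blocks within a same stage of the feature extraction network, and a parameter count of the target operation is less than a parameter count of the M second blocks.
For a detailed description of step 1401, refer to the description of the feature extraction network in the foregoing embodiments. Details are not repeated here.
1402. Perform convolution processing on input data by using the first block to obtain M target feature maps, where each target feature map corresponds to one channel.
For a detailed description of step 1402, refer to the description of the first block in the foregoing embodiments. Details are not repeated here.
1403. Perform convolution processing on M1 target feature maps of the M target feature maps by using the at least one second block to obtain M1 first feature maps, where M1 is less than M.
For a detailed description of step 1403, refer to the description of the at least one second block in the foregoing embodiments. Details are not repeated here.
1404. Process M2 target feature maps of the M target feature maps by using the target operation to obtain M2 second feature maps, where M2 is less than M.
For a detailed description of step 1404, refer to the description of the target operation in the foregoing embodiments. Details are not repeated here.
1405. Concatenate the M1 first feature maps and the M2 second feature maps by using the concatenation operation to obtain a concatenated feature map.
For a detailed description of step 1405, refer to the description of the concatenation operation in the foregoing embodiments. Details are not repeated here.
In a possible implementation, the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.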
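Steps 1403 to 1405 with a disjoint split can be sketched as follows. This is a dependency-free illustration, not the implementation described in the source: feature maps are nested lists (channels, rows, values), each second block is stood in for by a 1x1 convolution, and the names `pointwise_conv` and `stage_forward` and the toy weights are hypothetical.

```python
def pointwise_conv(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [[sum(w * ch[y][x] for w, ch in zip(out_weights, channels))
          for x in range(width)]
         for y in range(height)]
        for out_weights in weights
    ]


def stage_forward(target_maps, m1, second_block_weights, target_op):
    """Steps 1403 to 1405 on the M target feature maps produced by the first block."""
    group_1, group_2 = target_maps[:m1], target_maps[m1:]  # disjoint, M1 + M2 = M
    first_maps = group_1
    for weights in second_block_weights:   # serially connected second blocks
        first_maps = pointwise_conv(first_maps, weights)
    second_maps = target_op(group_2)       # cheaper target operation
    return first_maps + second_maps        # concatenation along the channel axis


# M = 4 target feature maps of size 2x2; channel c is filled with the value c + 1.
target_maps = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
double = [[2.0, 0.0], [0.0, 2.0]]          # toy weights for one second block
output = stage_forward(target_maps, 2, [double, double], lambda maps: maps)
print(len(output))  # 4
```

Because the two groups do not overlap and together cover all M channels, the concatenated output again has M channels, so the next block in the stage can consume it without any shape change.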
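In a possible implementation, the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or,
the target operation is a residual connection operation from the output of the first block to the output of the concatenation operation.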
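The two alternatives for the target operation can be contrasted in a small sketch. The concrete forms are assumptions: the cheap convolution is modelled as a depthwise 1x1 convolution (one weight per channel), and the residual connection as an identity shortcut that forwards the M2 maps unchanged. The factory names `make_cheap_conv` and `make_shortcut` are hypothetical.

```python
def make_cheap_conv(channel_weights):
    """Depthwise 1x1 convolution with one weight per channel (assumed form)."""
    def op(maps):
        return [[[w * v for v in row] for row in ch]
                for w, ch in zip(channel_weights, maps)]
    return op, len(channel_weights)


def make_shortcut():
    """Residual-style shortcut: forwards the maps unchanged and has no weights."""
    def op(maps):
        return [[row[:] for row in ch] for ch in maps]
    return op, 0


maps = [[[1.0, 2.0]], [[3.0, 4.0]]]        # M2 = 2 maps of size 1x2
cheap_op, cheap_params = make_cheap_conv([0.5, 2.0])
shortcut_op, shortcut_params = make_shortcut()

print(cheap_op(maps))     # [[[0.5, 1.0]], [[6.0, 8.0]]]
print(shortcut_op(maps))  # [[[1.0, 2.0]], [[3.0, 4.0]]]
```

Either variant satisfies the condition that the target operation carries fewer parameters than the second blocks; the shortcut carries none at all.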
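In a possible implementation, the at least one second block is configured to perform convolution processing on the M1 target feature maps of the M target feature maps to obtain a feature map output by each second block, where the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps;
the method further includes:
fusing, by using a fusion operation, the feature maps output by each second block to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps; and performing an addition operation on the fused feature map and the M2 second feature maps to obtain processed M2 second feature maps;
the concatenating the M1 first feature maps and the M2 second feature maps by using the concatenation operation includes:
concatenating the M1 first feature maps and the processed M2 second feature maps to obtain the concatenated feature map.
In a possible implementation, the fusing, by using a fusion operation, the feature maps output by each second block includes:
performing, by using the fusion operation, concatenation and dimension reduction on the output of each second block to obtain the fused feature map whose size is the same as that of the M2 second feature maps.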
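The fusion path can be sketched as below. The source only states that the outputs are concatenated and reduced in dimension; using a 1x1 convolution for the reduction is an assumption, and the function names and toy values are hypothetical.

```python
def pointwise_conv(channels, weights):
    """1x1 convolution: each output channel is a weighted sum of the input channels."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [[sum(w * ch[y][x] for w, ch in zip(out_weights, channels))
          for x in range(width)]
         for y in range(height)]
        for out_weights in weights
    ]


def add_maps(a, b):
    """Element-wise addition of two equally sized stacks of feature maps."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def fuse_and_concat(block_outputs, second_maps, reduce_weights):
    """Fuse every second block's output into the M2 branch, then concatenate."""
    stacked = [ch for output in block_outputs for ch in output]  # concatenation
    fused = pointwise_conv(stacked, reduce_weights)              # reduce to M2 channels
    processed = add_maps(fused, second_maps)                     # addition operation
    return block_outputs[-1] + processed                         # last block gives the M1 maps


# Two second blocks with M1 = 1 channel each, M2 = 1, maps of size 1x2.
block_outputs = [[[[1.0, 1.0]]], [[[2.0, 2.0]]]]
second_maps = [[[10.0, 10.0]]]
reduce_weights = [[0.5, 0.5]]              # 2 stacked channels -> 1 channel
result = fuse_and_concat(block_outputs, second_maps, reduce_weights)
print(result)  # [[[2.0, 2.0]], [[11.5, 11.5]]]
```

The reduction step is what makes the addition well defined: the stacked outputs have N x M1 channels, while the second feature maps have only M2.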
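In a possible implementation, the first block and the M second blocks are blocks within a target stage of the feature extraction network, and the concatenated feature map is used as the output feature map of the target stage in the feature extraction network; or,
the target stage further includes at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
In a possible implementation, the feature extraction network is configured to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
the method further includes:
performing, by using a task network, processing of a corresponding task based on the feature map of the input image to obtain a processing result.
In a possible implementation, the task includes object detection, image segmentation, or image classification.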
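As one illustration of a task network, the sketch below attaches an image classification head to the feature maps. The head design (global average pooling followed by a linear layer and an argmax) is a common choice assumed here for illustration; the source does not specify the task network's structure, and all names and weights are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each channel to the mean of its values."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]


def classify(feature_maps, class_weights):
    """Linear classifier on pooled features; returns (label, scores)."""
    pooled = global_average_pool(feature_maps)
    scores = [sum(w * p for w, p in zip(weights, pooled))
              for weights in class_weights]
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores


feature_maps = [[[1.0, 3.0]], [[4.0, 0.0]]]    # 2 channels of size 1x2
class_weights = [[1.0, 0.0], [0.5, 1.0]]       # 2 classes x 2 channels
label, scores = classify(feature_maps, class_weights)
print(label, scores)  # 1 [2.0, 3.0]
```

A detection or segmentation head would consume the same feature maps but keep their spatial layout instead of pooling it away.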
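Referring to Figure 15, Figure 15 is a schematic diagram of a data processing apparatus 1500 provided in an embodiment of this application. As shown in Figure 15, the data processing apparatus 1500 provided in this application includes:
an obtaining module 1501, configured to obtain a feature extraction network, where the feature extraction network includes a first block, at least one second block connected in series, a target operation, and a concatenation operation; the first block and the M second blocks are blocks within a same stage of the feature extraction network, and a parameter count of the target operation is less than a parameter count of the M second blocks;
a convolution processing module 1502, configured to: perform convolution processing on input data by using the first block to obtain M target feature maps, where each target feature map corresponds to one channel;
perform convolution processing on M1 target feature maps of the M target feature maps by using the at least one second block to obtain M1 first feature maps, where M1 is less than M; and
process M2 target feature maps of the M target feature maps by using the target operation to obtain M2 second feature maps, where M2 is less than M; and
a concatenation module 1503, configured to concatenate the M1 first feature maps and the M2 second feature maps by using the concatenation operation to obtain a concatenated feature map.
In a possible implementation, the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.
In a possible implementation, the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or,
the target operation is a residual connection operation from the output of the first block to the output of the concatenation operation.
In a possible implementation, the at least one second block is configured to perform convolution processing on the M1 target feature maps of the M target feature maps to obtain a feature map output by each second block, where the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps;
the apparatus further includes:
a fusion module, configured to: fuse, by using a fusion operation, the feature maps output by each second block to obtain a fused feature map, where the size of the fused feature map is the same as the size of the M2 second feature maps; and perform an addition operation on the fused feature map and the M2 second feature maps to obtain processed M2 second feature maps;
the concatenation module is configured to concatenate the M1 first feature maps and the processed M2 second feature maps to obtain the concatenated feature map.
In a possible implementation, the fusion module is configured to perform, by using the fusion operation, concatenation and dimension reduction on the output of each second block to obtain the fused feature map whose size is the same as that of the M2 second feature maps.
In a possible implementation, the first block and the M second blocks are blocks within a target stage of the feature extraction network, and the concatenated feature map is used as the output feature map of the target stage in the feature extraction network; or,
the target stage further includes at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
In a possible implementation, the feature extraction network is configured to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
the apparatus further includes:
a task processing module, configured to perform, by using a task network, processing of a corresponding task based on the feature map of the input image to obtain a processing result.
In a possible implementation, the task includes object detection, image segmentation, or image classification.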
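The following describes an executing device provided in an embodiment of this application. Refer to Figure 16, which is a schematic structural diagram of the executing device provided in an embodiment of this application. The executing device 1600 may specifically be a mobile phone, a tablet, a laptop computer, a smart wearable device, a server, or the like, which is not limited here. The data processing apparatus described in the embodiment corresponding to Figure 15 may be deployed on the executing device 1600 to implement the data processing function in the embodiment corresponding to Figure 15. Specifically, the executing device 1600 includes a receiver 1601, a transmitter 1602, a processor 1603, and a memory 1604 (the executing device 1600 may have one or more processors 1603; Figure 16 uses one processor as an example), where the processor 1603 may include an application processor 16031 and a communication processor 16032. In some embodiments of this application, the receiver 1601, the transmitter 1602, the processor 1603, and the memory 1604 may be connected by a bus or in another manner.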
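The memory 1604 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1603. A part of the memory 1604 may further include a non-volatile random access memory (NVRAM). The memory 1604 stores operation instructions for the processor, executable modules, or data structures, or a subset or an extended set thereof, where the operation instructions may include various instructions used to implement various operations.
The processor 1603 controls the operation of the executing device. In a specific application, the components of the executing device are coupled together through a bus system, where the bus system may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like. For clarity of description, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of this application may be applied to the processor 1603 or implemented by the processor 1603. The processor 1603 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 1603 or by instructions in the form of software. The processor 1603 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or microcontroller, or a processor suited to AI computation such as a vision processing unit (VPU) or a tensor processing unit (TPU), and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1603 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of this application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1604, and the processor 1603 reads information from the memory 1604 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 1601 may be configured to receive input digit or character information and to generate signal input related to the settings and function control of the executing device. The transmitter 1602 may be configured to output digit or character information through a first interface; the transmitter 1602 may further be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 1602 may further include a display device such as a display screen.
The executing device may run the perception network described in Figure 6, or perform the data processing method in the embodiment corresponding to Figure 14.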
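An embodiment of this application further provides a training device. Refer to Figure 17, which is a schematic structural diagram of the training device provided in an embodiment of this application. Specifically, the training device 1700 is implemented by one or more servers. The training device 1700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 1717 (for example, one or more processors), a memory 1732, and one or more storage media 1730 (for example, one or more mass storage devices) that store application programs 1742 or data 1744. The memory 1732 and the storage medium 1730 may be transient storage or persistent storage. A program stored in the storage medium 1730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Further, the central processing unit 1717 may be configured to communicate with the storage medium 1730 and to execute, on the training device 1700, the series of instruction operations in the storage medium 1730.
The training device 1700 may further include one or more power supplies 1726, one or more wired or wireless network interfaces 1750, and one or more input/output interfaces 1758; or one or more operating systems 1741, such as Windows Server™, Mac OS X™, Unix™, Linux™, and FreeBSD™.
The model training apparatus 1700 described in Figure 17 may be a module in the training device, and the processor in the training device may perform model training to obtain the perception network described in Figure 6.
An embodiment of this application further provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the foregoing executing device, or causes the computer to perform the steps performed by the foregoing training device.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program for signal processing that, when run on a computer, causes the computer to perform the steps performed by the foregoing executing device, or causes the computer to perform the steps performed by the foregoing training device.
The executing device, training device, or terminal device provided in the embodiments of this application may specifically be a chip. The chip includes a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip in the executing device performs the data processing method described in the foregoing embodiments, or so that the chip in the training device performs the data processing method described in the foregoing embodiments. Optionally, the storage unit is a storage unit within the chip, such as a register or a cache. The storage unit may alternatively be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).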
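Specifically, refer to Figure 18, which is a schematic structural diagram of a chip provided in an embodiment of this application. The chip may be embodied as a neural network processing unit, NPU 1800. The NPU 1800 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the operation circuit 1803; the controller 1804 controls the operation circuit 1803 to fetch matrix data from memory and perform multiplication.
Through cooperation among its internal components, the NPU 1800 may implement the model training method provided in the embodiment described in Figure 6, or perform inference with the trained model.
The operation circuit 1803 in the NPU 1800 may perform the steps of obtaining a first neural network model and performing model training on the first neural network model.
More specifically, in some implementations, the operation circuit 1803 in the NPU 1800 internally includes multiple processing units (process engines, PEs). In some implementations, the operation circuit 1803 is a two-dimensional systolic array. The operation circuit 1803 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1803 is a general-purpose matrix processor.
For example, assume there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 1802 and caches it on each PE in the operation circuit. The operation circuit fetches the data of matrix A from the input memory 1801, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in the accumulator 1808.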
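The role of the accumulator can be illustrated in software. The sketch below computes C = A x B by taking slices of B along the inner dimension (standing in for weights cached on the PEs) and adding each slice's partial products into an accumulator. It is a functional analogy only, with a hypothetical tile size and function name, and does not model the systolic array's timing or data flow.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute a x b by accumulating partial results over tiles of the inner dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0.0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        weight_tile = b[k0:k0 + tile]      # slice of B held "on the PEs"
        for i in range(rows):
            for j in range(cols):
                # Partial result for this tile, added to what is already stored.
                accumulator[i][j] += sum(
                    a[i][k0 + k] * weight_tile[k][j]
                    for k in range(len(weight_tile))
                )
    return accumulator


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
B = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
C = matmul_accumulate(A, B)
print(C)  # [[4.0, 5.0], [10.0, 11.0]]
```

After the first tile the accumulator holds only partial results; it holds the final matrix once the last tile has been processed, which mirrors the "partial or final results" wording above.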
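The unified memory 1806 is used to store input data and output data. Weight data is transferred directly to the weight memory 1802 by the direct memory access controller (DMAC) 1805. Input data is also transferred to the unified memory 1806 by the DMAC.
The bus interface unit (BIU) 1810 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1809.
The bus interface unit 1810 is used by the instruction fetch buffer 1809 to obtain instructions from external memory, and is also used by the DMAC 1805 to obtain the original data of input matrix A or weight matrix B from external memory.
The DMAC is mainly used to transfer input data from the external DDR memory to the unified memory 1806, to transfer weight data to the weight memory 1802, or to transfer input data to the input memory 1801.
The vector computing unit 1807 includes multiple operation processing units and, when necessary, further processes the output of the operation circuit 1803, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for computation in network layers other than convolutional and fully connected layers, such as batch normalization, pixel-wise summation, and upsampling of feature planes.
In some implementations, the vector computing unit 1807 can store the processed output vector in the unified memory 1806. For example, the vector computing unit 1807 may apply a linear function or a nonlinear function to the output of the operation circuit 1803, for example to perform linear interpolation on feature planes extracted by a convolutional layer, or to a vector of accumulated values to generate activation values. In some implementations, the vector computing unit 1807 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as activation input to the operation circuit 1803, for example for use in a subsequent layer of the neural network.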
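The kind of post-processing attributed to the vector computing unit can be sketched with three element-wise routines: inference-time batch normalization, a ReLU nonlinearity that turns accumulated values into activation values, and a pixel-wise sum. The choice of ReLU and all parameter values are assumptions for illustration; the source names only the general categories of operation.

```python
import math


def batch_norm(vector, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in vector]


def relu(vector):
    """Nonlinear function applied to accumulated values to produce activations."""
    return [max(0.0, v) for v in vector]


def elementwise_sum(a, b):
    """Pixel-wise summation of two equally long vectors."""
    return [x + y for x, y in zip(a, b)]


accumulated = [-2.0, 0.0, 2.0]             # e.g. one row read from the accumulator
normalized = batch_norm(accumulated, mean=0.0, var=4.0, eps=0.0)
activations = relu(normalized)
summed = elementwise_sum(activations, [1.0, 1.0, 1.0])
print(normalized)   # [-1.0, 0.0, 1.0]
print(activations)  # [0.0, 0.0, 1.0]
print(summed)       # [1.0, 1.0, 2.0]
```

The resulting activation vector is the kind of value that could be written back to the unified memory or fed to the operation circuit as input for a subsequent layer.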
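The instruction fetch buffer 1809 connected to the controller 1804 is used to store instructions used by the controller 1804.
The unified memory 1806, the input memory 1801, the weight memory 1802, and the instruction fetch buffer 1809 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control execution of the foregoing programs.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the accompanying drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus necessary general-purpose hardware, and of course may also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure used to implement a given function may take various forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For this application, however, a software implementation is in most cases the preferred implementation. Based on this understanding, the technical solutions of this application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of this application.
The foregoing embodiments may be implemented entirely or partly by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented entirely or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced entirely or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).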

Claims (19)

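  1. A perception network, wherein the perception network comprises a feature extraction network, and the feature extraction network comprises a first block, at least one second block connected in series, a target operation, and a concatenation operation; the first block and the M second blocks are blocks within a same stage of the feature extraction network, and a parameter count of the target operation is less than a parameter count of the M second blocks;
    the first block is configured to perform convolution processing on input data to obtain M target feature maps, wherein each target feature map corresponds to one channel;
    the at least one second block is configured to perform convolution processing on M1 target feature maps of the M target feature maps to obtain M1 first feature maps, wherein M1 is less than M;
    the target operation is configured to process M2 target feature maps of the M target feature maps to obtain M2 second feature maps, wherein M2 is less than M; and
    the concatenation operation is configured to concatenate the M1 first feature maps and the M2 second feature maps to obtain a concatenated feature map.
  2. The perception network according to claim 1, wherein the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.
  3. The perception network according to claim 1 or 2, wherein the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or,
    the target operation is a residual connection operation from the output of the first block to the output of the concatenation operation.
  4. The perception network according to any one of claims 1 to 3, wherein the at least one second block is configured to perform convolution processing on the M1 target feature maps of the M target feature maps to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps;
    the feature extraction network further comprises:
    a fusion operation, configured to fuse the feature maps output by each second block to obtain a fused feature map, wherein the size of the fused feature map is the same as the size of the M2 second feature maps; and
    perform an addition operation on the fused feature map and the M2 second feature maps to obtain processed M2 second feature maps;
    the concatenation operation is configured to concatenate the M1 first feature maps and the processed M2 second feature maps to obtain the concatenated feature map.
  5. The perception network according to claim 4, wherein the fusion operation is configured to perform concatenation and dimension reduction on the output of each second block to obtain the fused feature map whose size is the same as that of the M2 second feature maps.
  6. The perception network according to any one of claims 1 to 5, wherein the first block and the M second blocks are blocks within a target stage of the feature extraction network, and the concatenated feature map is used as the output feature map of the target stage in the feature extraction network; or,
    the target stage further comprises at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
  7. The perception network according to any one of claims 1 to 6, wherein the feature extraction network is configured to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
    the perception network further comprises:
    a task network, configured to perform processing of a corresponding task based on the feature map of the input image to obtain a processing result.
  8. The perception network according to claim 7, wherein the task comprises object detection, image segmentation, or image classification.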
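  9. A data processing method, wherein the method comprises:
    obtaining a feature extraction network, wherein the feature extraction network comprises a first block, at least one second block connected in series, a target operation, and a concatenation operation; the first block and the M second blocks are blocks within a same stage of the feature extraction network, and a parameter count of the target operation is less than a parameter count of the M second blocks;
    performing convolution processing on input data by using the first block to obtain M target feature maps, wherein each target feature map corresponds to one channel;
    performing convolution processing on M1 target feature maps of the M target feature maps by using the at least one second block to obtain M1 first feature maps, wherein M1 is less than M;
    processing M2 target feature maps of the M target feature maps by using the target operation to obtain M2 second feature maps, wherein M2 is less than M; and
    concatenating the M1 first feature maps and the M2 second feature maps by using the concatenation operation to obtain a concatenated feature map.
  10. The method according to claim 9, wherein the intersection of the M1 target feature maps and the M2 target feature maps is empty, the sum of M1 and M2 is M, and the number of channels of the concatenated feature map is M.
  11. The method according to claim 9 or 10, wherein the target operation is a convolution operation whose parameter count is less than that of the at least one second block; or,
    the target operation is a residual connection operation from the output of the first block to the output of the concatenation operation.
  12. The method according to any one of claims 9 to 11, wherein the at least one second block is configured to perform convolution processing on the M1 target feature maps of the M target feature maps to obtain a feature map output by each second block, wherein the output of the second block farthest from the first block among the at least one second block is the M1 first feature maps;
    the method further comprises:
    fusing, by using a fusion operation, the feature maps output by each second block to obtain a fused feature map, wherein the size of the fused feature map is the same as the size of the M2 second feature maps; and performing an addition operation on the fused feature map and the M2 second feature maps to obtain processed M2 second feature maps;
    the concatenating the M1 first feature maps and the M2 second feature maps by using the concatenation operation comprises:
    concatenating the M1 first feature maps and the processed M2 second feature maps to obtain the concatenated feature map.
  13. The method according to claim 12, wherein the fusing, by using a fusion operation, the feature maps output by each second block comprises:
    performing, by using the fusion operation, concatenation and dimension reduction on the output of each second block to obtain the fused feature map whose size is the same as that of the M2 second feature maps.
  14. The method according to any one of claims 9 to 13, wherein the first block and the M second blocks are blocks within a target stage of the feature extraction network, and the concatenated feature map is used as the output feature map of the target stage in the feature extraction network; or,
    the target stage further comprises at least one third block, and the at least one third block is configured to perform a convolution operation on the concatenated feature map to obtain the output feature map of the target stage.
  15. The method according to any one of claims 9 to 14, wherein the feature extraction network is configured to obtain an input image, perform feature extraction on the input image, and output a feature map of the input image;
    the method further comprises:
    performing, by using a task network, processing of a corresponding task based on the feature map of the input image to obtain a processing result.
  16. The method according to claim 15, wherein the task comprises object detection, image segmentation, or image classification.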
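  17. A data processing apparatus, wherein the apparatus comprises a memory and a processor; the memory stores code, and the processor is configured to obtain the code to run the perception network according to any one of claims 1 to 8.
  18. A computer storage medium, wherein the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to run the perception network according to any one of claims 1 to 8.
  19. A computer product comprising code, wherein the code, when executed, is used to run the perception network according to any one of claims 1 to 8.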
PCT/CN2022/077881 2021-02-27 2022-02-25 Perception network and data processing method WO2022179599A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22758963.7A EP4296896A4 (en) 2021-02-27 2022-02-25 PERCEPTION NETWORK AND DATA PROCESSING METHODS
US18/456,312 US20230401826A1 (en) 2021-02-27 2023-08-25 Perception network and data processing method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221934.8 2021-02-27
CN202110221934.8A CN113065637B (zh) 2021-02-27 2021-02-27 Perception network and data processing method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/456,312 Continuation US20230401826A1 (en) 2021-02-27 2023-08-25 Perception network and data processing method

Publications (1)

Publication Number Publication Date
WO2022179599A1 true WO2022179599A1 (zh) 2022-09-01

Family

ID=76559200

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077881 WO2022179599A1 (zh) 2021-02-27 2022-02-25 Perception network and data processing method

Country Status (4)

Country Link
US (1) US20230401826A1 (zh)
EP (1) EP4296896A4 (zh)
CN (2) CN117172285A (zh)
WO (1) WO2022179599A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024148473A1 (zh) * 2023-01-09 2024-07-18 Oppo广东移动通信有限公司 编码方法及装置、编码器、码流、设备、存储介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117172285A (zh) * 2021-02-27 2023-12-05 华为技术有限公司 Perception network and data processing method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298262A (zh) * 2019-06-06 2019-10-01 华为技术有限公司 物体识别方法及装置
WO2021018245A1 (zh) * 2019-07-30 2021-02-04 华为技术有限公司 图像分类方法及装置
CN112396002A (zh) * 2020-11-20 2021-02-23 重庆邮电大学 一种基于SE-YOLOv3的轻量级遥感目标检测方法
CN113065637A (zh) * 2021-02-27 2021-07-02 华为技术有限公司 一种感知网络及数据处理方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866B (zh) * 2017-06-15 2021-03-23 清华大学 基于新型神经网络的多任务场景语义理解模型及其应用
TWI709107B (zh) * 2018-05-21 2020-11-01 國立清華大學 影像特徵提取方法及包含其顯著物體預測方法
CN109086779B (zh) * 2018-07-28 2021-11-09 天津大学 一种基于卷积神经网络的注意力目标识别方法
CN109583517A (zh) * 2018-12-26 2019-04-05 华东交通大学 一种适用于小目标检测的增强的全卷积实例语义分割算法
CN110263705B (zh) * 2019-06-19 2023-07-07 上海交通大学 面向遥感技术领域两期高分辨率遥感影像变化检测系统
CN110765886B (zh) * 2019-09-29 2022-05-03 深圳大学 一种基于卷积神经网络的道路目标检测方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4296896A4

Also Published As

Publication number Publication date
CN113065637A (zh) 2021-07-02
CN117172285A (zh) 2023-12-05
US20230401826A1 (en) 2023-12-14
EP4296896A4 (en) 2024-08-14
EP4296896A1 (en) 2023-12-27
CN113065637B (zh) 2023-09-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22758963

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022758963

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022758963

Country of ref document: EP

Effective date: 20230918

NENP Non-entry into the national phase

Ref country code: DE