CN111784659B - Image detection method, device, electronic equipment and storage medium - Google Patents
Image detection method, device, electronic equipment and storage medium
- Publication number
- CN111784659B CN202010605474.4A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The application discloses an image detection method, an image detection device, electronic equipment and a storage medium, and relates to the fields of automatic driving, computer vision and deep learning. The specific implementation scheme is as follows: acquiring an image to be detected acquired by a camera in an environment to be detected; fusing the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected to obtain fusion features of the image to be detected; and predicting the depth of the obstacle in the image to be detected from the camera according to the fusion features of the image to be detected. Compared with the prior art, predicting the depth of the obstacle in the image from the camera in combination with the depth map features of the environment to be detected improves the accuracy and robustness of image detection.
Description
Technical Field
The embodiment of the application relates to the fields of computer vision, automatic driving and deep learning in computer technology, in particular to a method, a device, electronic equipment and a storage medium for image detection.
Background
Unmanned driving technology senses the environment around a vehicle with sensors and controls the steering and speed of the vehicle according to the road, vehicle position, obstacles and other information obtained by sensing, so that the vehicle can travel safely and reliably on the road. Three-dimensional vehicle detection is used to detect obstacles around the vehicle and is critical to unmanned driving technology.
Currently, three-dimensional vehicle detection in road scenes is mainly based on images from vehicle-mounted binocular cameras or on radar data. For three-dimensional vehicle detection in a fixed monitoring scene, a network can directly predict the projections of the eight vertices of the three-dimensional detection frame on the image, or the length, width, height and orientation angle information, so as to detect obstacles; alternatively, the image combined with depth information from monocular depth estimation can be converted into a pseudo point cloud, and obstacles can then be detected with a 3D point cloud detection method.
However, methods relying on a binocular camera place high accuracy requirements on obstacle depth estimation, have high computational complexity, and cannot meet real-time and robustness requirements. Methods relying on radar do not fit the application scenario under a monitoring camera, the point cloud generated by radar is sparse, and long-range detection precision is low. Methods based on two-dimensional images are often affected by the near-large, far-small effect caused by perspective projection, so the estimated 3D detection frame is inaccurate and detection precision is insufficient. Therefore, existing three-dimensional vehicle detection methods cannot meet precision and robustness requirements at the same time.
Disclosure of Invention
The application provides an image detection method, an image detection device, electronic equipment and a storage medium.
According to a first aspect of the present application there is provided a method of image detection comprising:
Acquiring an image to be detected acquired by a camera in an environment to be detected;
fusing the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected to obtain fused features of the image to be detected;
And predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
According to a second aspect of the present application, there is provided an apparatus for image detection, comprising:
the acquisition module is used for acquiring an image to be detected acquired by the camera in the environment to be detected;
The fusion module is used for fusing the global feature of the image to be detected, the local feature of the image to be detected and the depth map feature of the environment to be detected to obtain the fusion feature of the image to be detected;
And the prediction module is used for predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present application there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the first aspect described above.
According to a fifth aspect of the present application, there is provided a method of image detection, comprising:
fusing global features of an image to be detected, local features of the image to be detected and depth map features of an environment to be detected to obtain fused features of the image to be detected;
And predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
The technology solves the technical problem that three-dimensional vehicle detection in the prior art cannot meet precision and robustness requirements at the same time. Compared with the prior art, predicting the depth of the obstacle in the image from the camera in combination with the depth map features of the environment to be detected improves the accuracy and robustness of image detection.
According to a sixth aspect of the present application there is provided a computer program product comprising: a computer program stored in a readable storage medium, from which it can be read by at least one processor of an electronic device, the at least one processor executing the computer program causing the electronic device to perform the method of the first aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
Fig. 1 is a schematic view of a scene of an image detection method according to an embodiment of the present application;
Fig. 2 is a flowchart of an image detection method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of image detection according to an embodiment of the present application;
Fig. 4 is a flowchart of another image detection method according to an embodiment of the present application;
Fig. 5 is a flowchart of yet another image detection method according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an image detection device according to an embodiment of the present application;
Fig. 7 is a block diagram of an electronic device for implementing an image detection method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Currently, three-dimensional vehicle detection in road scenes is mainly based on images from vehicle-mounted binocular cameras or on radar data. For three-dimensional vehicle detection in a fixed monitoring scene, a network can directly predict the projections of the eight vertices of the three-dimensional detection frame on the image, or the length, width, height and orientation angle information, so as to detect obstacles; alternatively, the image combined with depth information from monocular depth estimation can be converted into a pseudo point cloud, and obstacles can then be detected with a 3D point cloud detection method.
However, methods relying on a binocular camera place high accuracy requirements on obstacle depth estimation, have high computational complexity, and cannot meet real-time and robustness requirements. Methods relying on radar do not fit the application scenario under a monitoring camera, the point cloud generated by radar is sparse, and long-range detection precision is low. Methods based on two-dimensional images are often affected by the near-large, far-small effect caused by perspective projection, so the estimated 3D detection frame is inaccurate and detection precision is insufficient. Therefore, existing three-dimensional vehicle detection methods cannot meet precision and robustness requirements at the same time.
The application provides an image detection method, an image detection device, electronic equipment and a storage medium, which are applied to the fields of computer vision, automatic driving and deep learning in computer technology, so as to solve the technical problem that three-dimensional vehicle detection cannot meet the requirements of precision and robustness at the same time, and achieve the effect of improving the precision and the robustness of image detection. The application is characterized in that: when image detection is carried out, the depth map features are fused on the basis of the existing global features and local features.
In order to clearly understand the technical scheme of the present application, the following explains the terms related to the present application:
monocular camera: a camera with only one vision sensor.
Binocular camera: a camera composed of two vision sensors. A binocular camera can obtain depth information of a scene using the principle of triangulation, and can reconstruct the three-dimensional shape and position of the surrounding scene.
Depth map: also called a range image, an image in which the distance from the image collector to each point in the scene is taken as the pixel value.
Camera internal parameters: parameters related to the characteristics of the camera itself, such as the focal length of the camera, the pixel size, etc.
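For illustration only, the camera internal parameters mentioned above are commonly collected into a 3x3 matrix of the pinhole camera model; the notation below (focal lengths f_x, f_y in pixels and principal point offsets c_x, c_y) follows the usual convention and reappears in the formulas later in this description:

```latex
K =
\begin{pmatrix}
f_x & 0   & c_x \\
0   & f_y & c_y \\
0   & 0   & 1
\end{pmatrix},
\qquad
\begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
\sim
K \begin{pmatrix} X/Z \\ Y/Z \\ 1 \end{pmatrix}
```

where (X, Y, Z) is a point in camera coordinates and (u, v) is its pixel position.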
The use of the present application will be described below.
Fig. 1 is a schematic view of a scene of an image detection method according to an embodiment of the present application. As shown in fig. 1, a camera 102 on a vehicle 101 collects an image of the surrounding environment of the vehicle and sends the image to a server 103. The server 103 detects the image, determines the depth of an obstacle in the image from the camera 102, and generates indication information for the vehicle 101 to control the steering and speed of the vehicle 101 for automatic driving.
The camera 102 may be a monocular camera. The server 103 may be a standalone server or a server in a cloud service platform.
It should be noted that, the application scenario of the technical solution of the present application may be an automatic driving scenario in fig. 1, but is not limited thereto, and may be applied to other scenarios requiring image detection.
It may be understood that the method for detecting an image may be implemented by the apparatus for detecting an image provided by the embodiment of the present application, where the apparatus for detecting an image may be part or all of a certain device, for example, may be a server or a processor in the server.
The following describes in detail a technical solution of an embodiment of the present application with specific embodiments by taking a server integrated with or installed with related execution codes as an example. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 2 is a flow chart of a method for image detection according to an embodiment of the present application, and this embodiment relates to a process of image detection. As shown in fig. 2, the method includes:
S201, acquiring an image to be detected acquired by a camera in an environment to be detected.
The camera may be a monocular camera, and correspondingly, the image to be measured may be a monocular image.
In the application, the camera can acquire the image to be measured in the environment to be measured in real time, and the server can then receive the image to be measured sent by the camera. In an automatic driving scene, the camera can be installed at any position on the vehicle. For example, a camera may be mounted in front of the vehicle to capture an image to be measured in front of the vehicle, or the camera may be mounted behind the vehicle to capture an image to be measured behind the vehicle.
The application is not limited to the environment to be tested, and by taking automatic driving as an example, if the vehicle is in a road scene, the environment to be tested can be a road; if the vehicle is in a fixed monitoring scene, the environment to be tested can be a parking lot under fixed monitoring, etc.
S202, fusing global features of the image to be detected, local features of the image to be detected and depth map features of the environment to be detected to obtain fusion features of the image to be detected.
In this step, after the server acquires the image to be measured acquired by the camera in the environment to be measured, the global feature of the image to be measured, the local feature of the image to be measured and the depth map feature of the environment to be measured may be fused, so as to acquire the fusion feature of the image to be measured.
Among other things, global features may be overall attributes of an image; exemplary global features include color features, texture features, shape features, and the like. Global features have characteristics such as good invariance, simple calculation, and intuitive representation.
Local features may be local representations of image features. Exemplary local features include scale-invariant feature transform (SIFT) features, speeded up robust features (SURF), DAISY features, and the like.
The embodiment of the application does not limit how to acquire the global feature and the local feature of the image to be measured. In an alternative implementation, the image to be measured can be input into a backbone network (Backbone), and the local feature and the global feature output by the Backbone can be acquired. The Backbone is a neural network model for target detection, and may be, for example, ResNet or DenseNet. The embodiment of the application does not limit the Backbone, which can be set according to actual conditions.
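As a non-limiting sketch of this step (assuming a recent torchvision and a ResNet-50 trunk; the actual Backbone and the exact definition of the global and local features may differ), the feature extraction could look as follows:

```python
# Sketch only: the Backbone, layer choice and feature definitions are assumptions.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
# Keep the convolutional trunk, drop the average-pooling and classification layers.
trunk = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 384, 1280)        # image to be detected (N, C, H, W)
local_feat = trunk(image)                   # local feature map, here (1, 2048, 12, 40)
global_feat = local_feat.mean(dim=(2, 3))   # global feature vector, here (1, 2048)
```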
In an alternative embodiment, the server extracts depth map features of the environment to be measured from the ground depth map of the environment to be measured before acquiring the fused features of the image to be measured. Similarly, the server can input the ground depth map of the environment to be tested into the backbone network, and acquire the depth map characteristics of the environment to be tested output by the backbone network.
The embodiment of the application also does not limit how to fuse the global features, the local features and the depth map features. For example, if the Backbone used in feature extraction is ResNet, feature fusion is correspondingly performed in an element-wise sum manner, where element-wise sum combines multiple features into a composite vector by adding them element by element.
For another example, if the Backbone used in feature extraction is DenseNet, feature fusion is correspondingly performed in a concat manner, where concat directly concatenates multiple features.
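A minimal sketch of the two fusion manners mentioned above, assuming the three features have already been brought to the same shape (the tensor sizes are purely illustrative):

```python
import torch

# Assumed common shape (N, C, H, W); for concat only the spatial size must match.
global_feat = torch.randn(1, 256, 12, 40)
local_feat = torch.randn(1, 256, 12, 40)
depth_map_feat = torch.randn(1, 256, 12, 40)

# Element-wise sum fusion (e.g. with a ResNet-style Backbone).
fused_sum = global_feat + local_feat + depth_map_feat                       # (1, 256, 12, 40)

# Concat fusion (e.g. with a DenseNet-style Backbone).
fused_concat = torch.cat([global_feat, local_feat, depth_map_feat], dim=1)  # (1, 768, 12, 40)
```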
In addition, before extracting the depth map feature of the environment to be measured from the ground depth map of the environment to be measured, the server may determine the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system. Then, a ground depth map of the environment to be measured is established according to the depth of at least one point on the ground of the environment to be measured from the camera.
In the application, the features can be extracted rapidly and accurately through the backbone network. In addition, since regressing three-dimensional positions for different cameras is ambiguous, adding the depth map feature of the environment to be detected can eliminate this ambiguity and improve generalization, so that the accuracy and robustness requirements of image detection are met.
S203, predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
In this step, when the server fuses the global feature of the image to be measured, the local feature of the image to be measured and the depth map feature of the environment to be measured, after obtaining the fused feature of the image to be measured, the depth of the obstacle in the image to be measured from the camera can be predicted according to the fused feature of the image to be measured.
In some embodiments, the server may input the fusion features of the image to be measured into the neural network model, and obtain the depth of the obstacle in the image to be measured output by the neural network model from the camera.
The neural network model is a convolutional neural network model or a fully-connected neural network model.
It should be noted that, the embodiment of the application does not limit the building process of the neural network model, and the neural network can be built by adopting a common convolution layer or a full connection layer.
Fig. 3 is a schematic diagram of image detection according to an embodiment of the present application. Fig. 3 shows a monocular 3D region proposal network for object detection (M3D-RPN). In the M3D-RPN, on one hand, after acquiring the image to be detected, the server inputs the image to be detected into the Backbone and acquires the global features and local features of the image to be detected output by the Backbone. On the other hand, the server inputs the pre-acquired ground depth map into the Backbone and acquires the depth map features of the environment to be detected output by the Backbone. Then, the server performs feature fusion on the global features, the local features and the depth map features, and outputs the fused features to the prediction module for depth prediction to obtain a depth prediction result. In addition, the prediction module can obtain the original 3D prediction result of the existing M3D-RPN according to the global features and local features, and combine the original 3D prediction result and the depth prediction result into a new 3D prediction result.
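The overall flow of fig. 3 could be sketched as follows; the module names, channel counts and the form of the prediction head are hypothetical and do not reproduce the actual M3D-RPN implementation:

```python
import torch
import torch.nn as nn

class DepthAwareDetector(nn.Module):
    """Illustrative only: fuses image features with ground depth map features and
    predicts, per location, the depth of the obstacle from the camera."""

    def __init__(self, backbone: nn.Module, channels: int = 256):
        super().__init__()
        self.backbone = backbone                          # shared feature extractor (e.g. a ResNet trunk)
        self.depth_head = nn.Sequential(                  # hypothetical convolutional prediction head
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),        # predicted depth of the obstacle from the camera
        )

    def forward(self, image, ground_depth_map):
        # One feature map stands in for the global and local features in this sketch;
        # the single-channel depth map may need channel adaptation in practice.
        img_feat = self.backbone(image)
        depth_feat = self.backbone(ground_depth_map)
        fused = img_feat + depth_feat                     # element-wise sum fusion
        return self.depth_head(fused)                     # depth prediction result
```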
According to the image detection method provided by the embodiment of the application, the image to be detected acquired by the camera in the environment to be detected is first acquired, and then the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected are fused to obtain the fusion features of the image to be detected. Finally, the depth of the obstacle in the image to be detected from the camera is predicted according to the fusion features of the image to be detected. Compared with the prior art, predicting the depth of the obstacle in the image from the camera in combination with the depth map features of the environment to be detected improves the accuracy and robustness of image detection.
On the basis of the above embodiment, a description will be given below of how to acquire depth map features of an environment to be measured. Fig. 4 is a flowchart of another image detection method according to an embodiment of the present application, where, as shown in fig. 4, the image detection method includes:
S301, determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system.
In some embodiments, the server may obtain coordinates of at least one point on the ground of the environment under test under the image coordinate system. And then determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system, the internal parameters of the camera and the ground equation.
Specifically, the server may back-project the coordinates of at least one point on the ground of the environment to be measured in the image coordinate system according to the internal parameters of the camera and the ground equation, so as to calculate the depth of the at least one point on the ground of the environment to be measured from the camera.
The internal parameters of the camera and the ground equation are calibrated in advance, each point in image coordinates is a known coordinate, and in the early stage of constructing the ground depth map, each point can be assumed to be a point on the ground.
Illustratively, a point on the ground of the environment to be measured is Corner(x, y), the camera internal parameter matrix K is given by formula (1), and the ground plane equation is given by formula (2). Accordingly, the depth of the point on the ground of the environment to be measured from the camera can be calculated by formulas (3)-(5). Formulas (2) and (4) are:

ax + by + cz + d = 0 ......(2)

point_cam = K⁻¹ · Img_p ......(4)

where a, b, c and d are adjustable parameters; f_x is the number of pixels that the focal length f spans in the x-axis direction of the imaging plane; f_y is the number of pixels that the focal length f spans in the y-axis direction of the imaging plane; c_x is the offset of the origin of the physical imaging plane in the x-axis direction; and c_y is the offset of the origin of the physical imaging plane in the y-axis direction.
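To make the back projection concrete, the following sketch evaluates formulas (2) and (4) for one pixel; the calibration values (K and the ground plane coefficients) are hypothetical examples, not data from the application:

```python
import numpy as np

def ground_point_depth(u, v, K, plane):
    """Depth from the camera of the ground point seen at pixel (u, v).

    K     : 3x3 camera internal parameter matrix
    plane : (a, b, c, d) with a*x + b*y + c*z + d = 0 in camera coordinates
    """
    a, b, c, d = plane
    img_p = np.array([u, v, 1.0])
    ray = np.linalg.inv(K) @ img_p                 # point_cam = K^-1 * Img_p, as in formula (4)
    # Scale the ray so the point lies on the ground plane: a*s*rx + b*s*ry + c*s*rz + d = 0
    s = -d / (a * ray[0] + b * ray[1] + c * ray[2])
    point_cam = s * ray
    return point_cam[2]                            # depth along the camera z axis

# Hypothetical calibration: focal length 1000 px, principal point (640, 360),
# flat ground 1.5 m below the camera (camera y axis pointing down).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
plane = (0.0, -1.0, 0.0, 1.5)
depth = ground_point_depth(640.0, 500.0, K, plane)  # about 10.7 m for this pixel
```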
According to the application, the depth of at least one point on the ground from the camera can be rapidly and accurately determined through the coordinates of at least one point on the ground of the environment to be detected under the image coordinate system, the internal parameters of the camera and the ground equation, and then a ground depth map of the environment to be detected is established.
S302, establishing a ground depth map of the environment to be detected according to the depth of at least one point on the ground of the environment to be detected from the camera.
In the step, after determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of at least one point on the ground of the environment to be measured in the image coordinate system, the server may establish a ground depth map of the environment to be measured according to the depth of at least one point on the ground of the environment to be measured from the camera.
In some embodiments, the server may take as pixel values the depth of at least one point on the ground of the environment to be measured from the camera, thereby creating a ground depth map of the environment to be measured.
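As a hedged sketch of this construction (again with hypothetical image size and calibration values), every pixel is treated as a ground point and its depth from the camera becomes the pixel value of the ground depth map:

```python
import numpy as np

def build_ground_depth_map(height, width, K, plane):
    """Ground depth map: each pixel value is the depth from the camera of the
    ground point seen at that pixel (every pixel is assumed to lie on the ground)."""
    a, b, c, d = plane
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x (H*W)
    rays = np.linalg.inv(K) @ pixels                                       # back-projected rays
    denom = a * rays[0] + b * rays[1] + c * rays[2]
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = (-d / denom) * rays[2]                                     # z of the ray/plane intersection
    depth = np.where(np.isfinite(depth) & (depth > 0), depth, 0.0)         # mask horizon and sky pixels
    return depth.reshape(height, width).astype(np.float32)

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
plane = (0.0, -1.0, 0.0, 1.5)                                              # hypothetical calibration
ground_depth_map = build_ground_depth_map(720, 1280, K, plane)
```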
S303, extracting depth map features of the environment to be detected from the ground depth map of the environment to be detected.
The server may input the ground depth map of the environment to be measured into the backbone network, and obtain the depth map feature of the environment to be measured output by the backbone network.
S304, acquiring an image to be detected acquired by a camera in the environment to be detected.
S305, fusing the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected to obtain fusion features of the image to be detected.
S306, predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
The terminology, effects, features and alternative embodiments of S304-S306 may be understood with reference to S201-S203 shown in fig. 2, and the repeated content will not be described here again.
On the basis of the above-described embodiments, a description will be given below of how to predict the depth of an obstacle from a camera in an image. Fig. 5 is a flowchart of another method for detecting an image according to an embodiment of the present application, where, as shown in fig. 5, the method for detecting an image includes:
S401, acquiring an image to be detected, which is acquired by a camera in an environment to be detected.
S402, fusing the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected to obtain fusion features of the image to be detected.
S403, inputting the fusion characteristics of the image to be detected into a neural network model, and obtaining the depth of the obstacle in the image to be detected, which is output by the neural network model, from the camera.
The neural network model is a convolutional neural network model or a fully-connected neural network model.
It should be noted that, the embodiment of the application does not limit the building process of the neural network model, and the neural network can be built by adopting a common convolution layer or a full connection layer.
In the application, the depth of the obstacle from the camera can be predicted from the fusion features by using an existing neural network model, so that additional computing power can be avoided and the timeliness of image detection is improved.
According to the image detection method provided by the embodiment of the application, the image to be detected acquired by the camera in the environment to be detected is first acquired, and then the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected are fused to obtain the fusion features of the image to be detected. Finally, the fusion features of the image to be detected are input into the neural network model, and the depth from the camera of the obstacle in the image to be detected output by the neural network model is obtained. Compared with the prior art, predicting the depth of the obstacle in the image from the camera in combination with the depth map features of the environment to be detected improves the accuracy and robustness of image detection.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium, and when the program is executed, the steps of the above method embodiments are performed; the aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disk.
Fig. 6 is a schematic structural diagram of an image detection device according to an embodiment of the present application. The image detection means may be implemented by software, hardware or a combination of both, and may be, for example, the above-mentioned server or a chip in the server, for performing the above-mentioned image detection method. As shown in fig. 6, the apparatus 500 for image detection includes: the system comprises an acquisition module 501, a fusion module 502, a prediction module 503, an extraction module 504, a drawing module 505 and a calculation module 506.
An acquisition module 501, configured to acquire an image to be measured acquired by a camera in an environment to be measured;
The fusion module 502 is configured to fuse the global feature of the image to be detected, the local feature of the image to be detected, and the depth map feature of the environment to be detected, and obtain a fusion feature of the image to be detected;
and the prediction module 503 is configured to predict a depth of an obstacle in the image to be detected from the camera according to the fusion feature of the image to be detected.
In an alternative embodiment, the apparatus 500 for image detection further includes:
The extracting module 504 is configured to extract a depth map feature of the environment to be detected from a ground depth map of the environment to be detected.
In an alternative embodiment, the extracting module 504 is specifically configured to input the ground depth map of the environment to be measured into the backbone network, and obtain the depth map feature of the environment to be measured output by the backbone network.
In an alternative embodiment, the apparatus 500 for image detection further includes:
the drawing module 505 is configured to build a ground depth map of the environment to be measured according to the depth of at least one point on the ground of the environment to be measured from the camera.
In an alternative embodiment, the apparatus 500 for image detection further includes:
the computing module 506 is configured to determine a depth of at least one point on the ground of the environment to be measured from the camera according to coordinates of the at least one point on the ground of the environment to be measured in the image coordinate system.
In an alternative embodiment, the calculating module 506 is specifically configured to obtain coordinates of at least one point on the ground of the environment to be measured under the image coordinate system; and determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system, the internal parameters of the camera and the ground equation.
In an alternative embodiment, the calculating module 506 is specifically configured to back-project the coordinates of at least one point on the ground of the environment to be measured in the image coordinate system according to the internal parameters of the camera and the ground equation, and calculate the depth of the at least one point on the ground of the environment to be measured from the camera.
In an alternative embodiment, the prediction module 503 is specifically configured to input the fusion feature of the image to be measured into the neural network model, and obtain the depth of the obstacle in the image to be measured output by the neural network model from the camera.
In an alternative embodiment, the neural network model is a convolutional neural network model or a fully-connected neural network model.
In an alternative embodiment, the image to be measured is a monocular image.
The image detection device provided by the embodiment of the application can execute the actions of the image detection method in the method embodiment, and the implementation principle and the technical effect are similar, and are not repeated here.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
According to an embodiment of the present application, there is also provided a computer program product comprising: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
As shown in fig. 7, there is a block diagram of an electronic device of a method of image detection according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 601, memory 602, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 601 is illustrated in fig. 7.
The memory 602 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of image detection provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of image detection provided by the present application.
The memory 602 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the acquisition module 501, the fusion module 502, the prediction module 503, the extraction module 504, the drawing module 505, and the calculation module 506 shown in fig. 6) corresponding to the image detection method in the embodiment of the present application. The processor 601 executes various functional applications of the server and data processing, i.e., a method of implementing image detection in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 602.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created from the use of the electronic device for image detection, and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 602 may optionally include memory remotely located relative to processor 601, which may be connected to the image detection electronics via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of image detection may further include: an input device 603 and an output device 604. The processor 601, memory 602, input device 603 and output device 604 may be connected by a bus or otherwise, for example in fig. 7.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the image-detected electronic device, such as a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, joystick, and the like. The output means 604 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASIC (application specific integrated circuit), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The embodiment of the application also provides a chip which comprises a processor and an interface. Wherein the interface is used for inputting and outputting data or instructions processed by the processor. The processor is configured to perform the methods provided in the method embodiments above. The chip can be applied to a server.
The present application also provides a computer-readable storage medium, which may include: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media in which program code can be stored; in particular, the computer-readable storage medium stores program instructions for the above-mentioned method.
The embodiment of the present application also provides a program for executing the method of image detection provided in the above method embodiment when executed by a processor.
The present application also provides a program product, such as a computer readable storage medium, having instructions stored therein, which when run on a computer, cause the computer to perform the method of image detection provided by the method embodiments described above.
According to the technical scheme of the embodiment of the application, the image to be detected acquired by the camera in the environment to be detected is first acquired, and then the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected are fused to obtain the fusion features of the image to be detected. Finally, the depth of the obstacle in the image to be detected from the camera is predicted according to the fusion features of the image to be detected. Compared with the prior art, predicting the depth of the obstacle in the image from the camera in combination with the depth map features of the environment to be detected improves the accuracy and robustness of image detection.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.
Claims (24)
1. A method of image detection, comprising:
Acquiring an image to be detected, which is acquired by a camera in an environment to be detected, wherein the camera is a monocular camera;
fusing the global features of the image to be detected, the local features of the image to be detected and the depth map features of the environment to be detected to obtain fused features of the image to be detected;
Predicting the depth of an obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected;
The depth map features of the environment to be measured are obtained according to a ground depth map of the environment to be measured, the ground depth map of the environment to be measured is established according to the depth of at least one point on the ground of the environment to be measured from a camera, and the depth of at least one point on the ground of the environment to be measured from the camera is determined according to the coordinates of at least one point on the ground of the environment to be measured under an image coordinate system, the internal parameters of the camera and ground equations.
2. The method of claim 1, further comprising, prior to the acquiring the fused feature of the image under test:
Extracting depth map features of the environment to be detected from the ground depth map of the environment to be detected.
3. The method of claim 2, wherein the extracting depth map features of the environment to be measured from a ground depth map of the environment to be measured comprises:
Inputting the ground depth map of the environment to be tested into a backbone network, and obtaining the depth map characteristics of the environment to be tested, which are output by the backbone network.
4. The method of claim 2, further comprising, prior to the extracting the depth map features of the environment under test from the ground depth map of the environment under test:
and establishing a ground depth map of the environment to be tested according to the depth of at least one point on the ground of the environment to be tested from the camera.
5. The method of claim 4, further comprising, prior to said creating a ground depth map of the environment under test:
And determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system.
6. The method of claim 5, wherein the determining the depth of the at least one point on the surface of the environment to be measured from the camera based on the coordinates of the at least one point on the surface of the environment to be measured in the image coordinate system comprises:
Acquiring coordinates of at least one point on the ground of the environment to be detected under an image coordinate system;
And determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system, the internal parameters of the camera and the ground equation.
7. The method of claim 6, wherein the determining the depth of the at least one point on the surface of the environment to be measured from the camera based on coordinates of the at least one point on the surface of the environment to be measured in the image coordinate system, the internal parameters of the camera, and the surface equation comprises:
and back-projecting the coordinates of at least one point on the ground of the environment to be detected in the image coordinate system according to the internal parameters of the camera and the ground equation, and calculating the depth of the at least one point on the ground of the environment to be detected from the camera.
8. The method of any of claims 1-7, wherein the predicting the depth of an obstacle in the image to be measured from the camera based on the fused features of the image to be measured comprises:
inputting the fusion characteristics of the image to be detected into a neural network model, and obtaining the depth of the obstacle in the image to be detected, which is output by the neural network model, from the camera.
9. The method of claim 8, wherein the neural network model is a convolutional neural network model or a fully-connected neural network model.
10. The method of any of claims 1-7, wherein the image to be measured is a monocular image.
11. An apparatus for image detection, comprising:
The acquisition module is used for acquiring an image to be detected acquired by a camera in an environment to be detected, wherein the camera is a monocular camera;
The fusion module is used for fusing the global feature of the image to be detected, the local feature of the image to be detected and the depth map feature of the environment to be detected to obtain the fusion feature of the image to be detected; the depth map features of the environment to be measured are obtained according to a ground depth map of the environment to be measured, the ground depth map of the environment to be measured is established according to the depth of at least one point on the ground of the environment to be measured from a camera, and the depth of at least one point on the ground of the environment to be measured from the camera is determined according to the coordinates of at least one point on the ground of the environment to be measured under an image coordinate system, the internal parameters of the camera and a ground equation;
And the prediction module is used for predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
12. The apparatus of claim 11, further comprising:
And the extraction module is used for extracting the depth map features of the environment to be detected from the ground depth map of the environment to be detected.
13. The device of claim 12, wherein the extracting module is specifically configured to input a ground depth map of the environment to be measured into a backbone network, and obtain a depth map feature of the environment to be measured output by the backbone network.
14. The apparatus of claim 12, further comprising:
And the drawing module is used for establishing a ground depth map of the environment to be tested according to the depth of at least one point on the ground of the environment to be tested from the camera.
15. The apparatus of claim 14, further comprising:
And the computing module is used for determining the depth of at least one point on the ground of the environment to be tested from the camera according to the coordinates of the at least one point on the ground of the environment to be tested under the image coordinate system.
16. The apparatus of claim 15, wherein the computing module is specifically configured to obtain coordinates of at least one point on the ground of the environment under test in an image coordinate system; and determining the depth of at least one point on the ground of the environment to be measured from the camera according to the coordinates of the at least one point on the ground of the environment to be measured under the image coordinate system, the internal parameters of the camera and the ground equation.
17. The device of claim 16, wherein the computing module is specifically configured to calculate the depth of the at least one point on the ground of the environment to be measured from the camera by back-projecting the coordinates of the at least one point on the ground of the environment to be measured in the image coordinate system according to the internal parameters of the camera and the ground equation.
18. The device according to any one of claims 11-17, wherein the prediction module is specifically configured to input the fusion feature of the image to be measured into a neural network model, and obtain a depth of an obstacle in the image to be measured output by the neural network model from the camera.
19. The apparatus of claim 18, wherein the neural network model is a convolutional neural network model or a fully-connected neural network model.
20. The apparatus of any of claims 11-17, wherein the image to be measured is a monocular image.
21. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-10.
23. A method of image detection, comprising:
Fusing global features of an image to be detected, local features of the image to be detected and depth map features of an environment to be detected to obtain fusion features of the image to be detected, wherein the image to be detected is acquired by a monocular camera, the depth map features of the environment to be detected are obtained according to a ground depth map of the environment to be detected, the ground depth map of the environment to be detected is established according to the depth of at least one point on the ground of the environment to be detected from the camera, and the depth of the at least one point on the ground of the environment to be detected from the camera is determined according to the coordinates of the at least one point on the ground of the environment to be detected under an image coordinate system, internal parameters of the camera and a ground equation;
And predicting the depth of the obstacle in the image to be detected from the camera according to the fusion characteristics of the image to be detected.
24. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010605474.4A CN111784659B (en) | 2020-06-29 | | Image detection method, device, electronic equipment and storage medium
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010605474.4A CN111784659B (en) | 2020-06-29 | | Image detection method, device, electronic equipment and storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN111784659A CN111784659A (en) | 2020-10-16 |
CN111784659B true CN111784659B (en) | 2024-11-12 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171212A (en) * | 2018-01-19 | 2018-06-15 | 百度在线网络技术(北京)有限公司 | For detecting the method and apparatus of target |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108171212A (en) * | 2018-01-19 | 2018-06-15 | 百度在线网络技术(北京)有限公司 | For detecting the method and apparatus of target |
Non-Patent Citations (1)
Title |
---|
Depth Map Prediction from a Single Image using a Multi-Scale Deep Network; David Eigen et al.; arXiv; Section 3 of the text *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111612760B (en) | Method and device for detecting obstacles | |
CN111401208B (en) | Obstacle detection method and device, electronic equipment and storage medium | |
EP3869399A2 (en) | Vehicle information detection method and apparatus, electronic device, storage medium and program | |
CN111274343B (en) | Vehicle positioning method and device, electronic equipment and storage medium | |
CN113362444B (en) | Point cloud data generation method and device, electronic equipment and storage medium | |
CN112652016B (en) | Point cloud prediction model generation method, pose estimation method and pose estimation device | |
CN111324115B (en) | Obstacle position detection fusion method, obstacle position detection fusion device, electronic equipment and storage medium | |
CN112241718B (en) | Vehicle information detection method, detection model training method and device | |
CN111612753B (en) | Three-dimensional object detection method and device, electronic equipment and readable storage medium | |
EP3968266B1 (en) | Obstacle three-dimensional position acquisition method and apparatus for roadside computing device | |
CN112184914B (en) | Method and device for determining three-dimensional position of target object and road side equipment | |
CN111797745B (en) | Training and predicting method, device, equipment and medium for object detection model | |
CN112487979B (en) | Target detection method, model training method, device, electronic equipment and medium | |
CN111666876B (en) | Method and device for detecting obstacle, electronic equipment and road side equipment | |
CN111767843B (en) | Three-dimensional position prediction method, device, equipment and storage medium | |
CN111721281B (en) | Position identification method and device and electronic equipment | |
CN112700486B (en) | Method and device for estimating depth of road surface lane line in image | |
CN115147809B (en) | Obstacle detection method, device, equipment and storage medium | |
CN112509126B (en) | Method, device, equipment and storage medium for detecting three-dimensional object | |
JP2022050311A (en) | Method for detecting lane change of vehicle, system, electronic apparatus, storage medium, roadside machine, cloud control platform, and computer program | |
CN113887400B (en) | Obstacle detection method, model training method and device and automatic driving vehicle | |
CN111784659B (en) | Image detection method, device, electronic equipment and storage medium | |
CN116129422A (en) | Monocular 3D target detection method, monocular 3D target detection device, electronic equipment and storage medium | |
CN111968071B (en) | Method, device, equipment and storage medium for generating spatial position of vehicle | |
CN111814634B (en) | Real-time distance determining method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |