
CN115019048B - Three-dimensional scene segmentation method, model training method and device and electronic equipment - Google Patents

Three-dimensional scene segmentation method, model training method and device and electronic equipment Download PDF

Info

Publication number
CN115019048B
Authority
CN
China
Prior art keywords: feature, point, sub, points, obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210806894.8A
Other languages
Chinese (zh)
Other versions
CN115019048A
Inventor
叶晓青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210806894.8A priority Critical patent/CN115019048B/en
Publication of CN115019048A publication Critical patent/CN115019048A/en
Application granted granted Critical
Publication of CN115019048B publication Critical patent/CN115019048B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a three-dimensional scene segmentation method, a model training method and apparatus, and an electronic device, relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision, deep learning, and the like, and can be applied to scenes such as 3D vision and augmented reality. The implementation scheme is as follows: obtaining point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene including a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of points from each instance of the plurality of instances; obtaining a first feature based on the point cloud data; and obtaining a segmentation result of the target three-dimensional scene based on the first feature, the segmentation result indicating, for each point in the set of points, the instance to which the point belongs among the plurality of instances.

Description

Three-dimensional scene segmentation method, model training method and device and electronic equipment
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of image processing, computer vision, deep learning, and the like, may be applied to scenes such as 3D vision and augmented reality, and more particularly relates to a three-dimensional scene segmentation method, a three-dimensional scene segmentation model training method, corresponding apparatuses, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology, and the like.
Artificial-intelligence-based three-dimensional vision techniques have penetrated various fields. For example, in a three-dimensional road scene, segmenting the instances in the scene based on its point cloud data makes it possible to identify objects such as pedestrians and automobiles, so that a vehicle can understand the road environment.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a three-dimensional scene segmentation method, apparatus, electronic device, computer-readable storage medium, and computer program product.
According to an aspect of the present disclosure, there is provided a three-dimensional scene segmentation method, including: obtaining point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of points from each instance of the plurality of instances; obtaining a first feature based on the point cloud data, wherein the first feature comprises a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the first sub-feature of each point in the set of points having a first relative relationship with a plurality of first global features corresponding to a plurality of first subsets of the set of points, the respective first sub-features of the plurality of points in each of the plurality of first subsets indicating a same category in the plurality of categories; and obtaining a segmentation result of the target three-dimensional scene based on the first feature, wherein the segmentation result indicates, for each point in the set of points, the instance to which the point belongs among the plurality of instances.
According to another aspect of the present disclosure, there is provided a three-dimensional scene segmentation model training method, including: obtaining training point cloud data corresponding to a set of points in a training three-dimensional scene, the training three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of subsets, each subset comprising a plurality of points from a same instance of the plurality of instances; obtaining a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtaining a second feature based on the training point cloud data using the trained first model, wherein the first feature comprises a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the second feature comprises a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories; obtaining a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets based on the first feature; obtaining a second relative relationship between a second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets based on the second feature; obtaining a first loss based on a first relative relationship and a second relative relationship corresponding to each point in the point set; and adjusting parameters of the three-dimensional scene segmentation model based on at least the first loss.
According to another aspect of the present disclosure, there is provided a three-dimensional scene segmentation apparatus including: a point cloud data acquisition unit configured to obtain point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of points from each of the plurality of instances; a first feature obtaining unit configured to obtain a first feature based on the point cloud data, wherein the first feature includes a first sub-feature corresponding to each point in the point set, the first sub-feature indicates a respective category of the respective point in the plurality of categories, the first sub-feature of each point in the point set has a first relative relationship with a plurality of first global features corresponding to a plurality of first subsets in the point set, the respective first sub-features of the plurality of points of each first subset in the plurality of first subsets indicate a same category in the plurality of categories; and a segmentation result obtaining unit configured to obtain a segmentation result of the target three-dimensional scene based on the first feature, the segmentation result indicating an instance of each point in the set of points among the plurality of instances.
According to another aspect of the present disclosure, there is provided a three-dimensional scene segmentation model training device, including: a training data acquisition unit configured to obtain training point cloud data corresponding to a set of points in a training three-dimensional scene, the training three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of subsets, each subset including a plurality of points from a same instance of the plurality of instances; a feature obtaining unit configured to obtain a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtain a second feature based on the training point cloud data using the trained first model, the first feature including a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the second feature including a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories; a first relative relationship obtaining unit, configured to obtain, based on the first feature, a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets; a second relative relationship obtaining unit, configured to obtain, based on the second feature, a second relative relationship between a second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets; a first loss calculation unit configured to obtain a first loss based on a first relative relationship and a second relative relationship corresponding to each point in the point set; and a parameter adjusting unit configured to adjust a parameter of the three-dimensional scene segmentation model based on at least the first loss.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program realizes the method of an embodiment of the present disclosure when executed by a processor.
According to one or more embodiments of the present disclosure, the amount of calculation can be reduced, and the segmentation accuracy of a three-dimensional scene can be improved.
According to one or more embodiments of the present disclosure, the segmentation accuracy of a three-dimensional scene may be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
Fig. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with embodiments of the present disclosure;
FIG. 2 shows a flow diagram of a three-dimensional scene segmentation method according to an embodiment of the present disclosure;
fig. 3 shows a flow chart of a process of obtaining a first feature based on point cloud data in a three-dimensional scene segmentation method according to an embodiment of the present disclosure;
fig. 4 shows a flow chart of a process of obtaining a first feature based on voxelized data in a three-dimensional scene segmentation method according to an embodiment of the present disclosure;
FIG. 5 shows a flowchart of a three-dimensional scene segmentation model training method according to an embodiment of the present disclosure;
fig. 6 shows a process flow diagram for obtaining a first relative relationship between a first sub-feature of each point in a point set and a plurality of first global features corresponding to a plurality of subsets in a three-dimensional scene segmentation model training method according to an embodiment of the present disclosure;
fig. 7 shows a flowchart of a process that may be implemented to obtain a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to a plurality of subsets and a first sub-feature of each point in the point set in the three-dimensional scene segmentation model training method according to an embodiment of the present disclosure;
fig. 8 shows a block diagram of a three-dimensional scene segmentation apparatus according to an embodiment of the present disclosure;
fig. 9 shows a block diagram of a three-dimensional scene segmentation model training apparatus according to an embodiment of the present disclosure; and
FIG. 10 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, it will be recognized by those of ordinary skill in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to define a positional relationship, a temporal relationship, or an importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, there may be one or more of that element. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, the server 120 may run one or more services or software applications that enable the segmentation method of the three-dimensional scene to be performed.
In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may receive the segmentation results of the three-dimensional scene using client devices 101, 102, 103, 104, 105, and/or 106. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-range servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data to and from the databases in response to the commands.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with this disclosure.
According to an aspect of the present disclosure, a three-dimensional scene segmentation method is provided. Referring to fig. 2, a three-dimensional scene segmentation method 200, according to some embodiments of the present disclosure, includes:
step S210: obtaining point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of points from each instance of the plurality of instances;
step S220: obtaining a first feature based on the point cloud data, wherein the first feature comprises a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the first sub-feature of each point in the set of points having a first relative relationship with a plurality of first global features corresponding to a plurality of first subsets of the set of points, the respective first sub-features of the plurality of points in each of the plurality of first subsets indicating a same category in the plurality of categories; and
step S230: obtaining a segmentation result of the target three-dimensional scene based on the first feature, the segmentation result indicating, for each point in the set of points, the instance to which the point belongs among the plurality of instances.
Point cloud data corresponding to a point set in the target three-dimensional scene are obtained, and the first feature is obtained based on the point cloud data such that each first sub-feature in the first feature has a first relative relationship with the plurality of first global features of the plurality of first subsets in the point set. The first global feature of every first subset is therefore taken into account while the first feature is obtained, which improves the semantic segmentation accuracy without increasing the amount of computation at prediction time.
In the related art, in the process of segmenting the three-dimensional scene, only the sub-features corresponding to each point are considered for the features obtained based on the point cloud data, so that the obtained segmentation result of the three-dimensional scene is not accurate enough.
In the embodiments of the present disclosure, when segmenting the target three-dimensional scene, not only the sub-feature corresponding to each point is considered, but also the global feature of the subset formed by the points whose corresponding sub-features indicate the same category among the plurality of categories. The resulting segmentation therefore incorporates global distribution consistency, so that the accuracy of the segmentation result can be improved while processing the same amount of point cloud data.
In some embodiments, the target three-dimensional scene is a three-dimensional scene that has been determined to need segmentation. In some embodiments, the three-dimensional scene may be any indoor or outdoor scene, for example, the three-dimensional space of a single classroom or of a football stadium.
In some embodiments, the point cloud data set of the target three-dimensional scene may be a data set acquired by a three-dimensional scanning device scanning the target three-dimensional scene. In some embodiments, the target three-dimensional scene includes a plurality of instances, and each piece of data in the point cloud data set corresponds to a point scanned by the three-dimensional scanning device on a respective instance of the plurality of instances. An instance in the target three-dimensional scene, i.e. an object located in the three-dimensional scene that can be scanned to obtain corresponding point cloud data, may be, for example, a table, a chair, a car, or a person, but is not limited thereto.
In some embodiments, the three-dimensional scanning device includes a 2D/3D lidar, a stereo camera, a time-of-flight camera, and the like.
In some embodiments, each piece of point cloud data in the point cloud data set indicates the position information, color information, gray value information, etc. of its corresponding point.
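For illustration only, the sketch below shows one possible in-memory layout for such per-point records, assuming each point carries xyz coordinates, RGB color, and a gray value; the field names, dtypes, and the toy generator are hypothetical and not prescribed by the disclosure.

```python
import numpy as np

# Hypothetical record layout for one point of the point cloud data set.
point_dtype = np.dtype([
    ("xyz", np.float32, (3,)),   # position information
    ("rgb", np.uint8, (3,)),     # color information
    ("gray", np.float32),        # gray value information
])

def random_point_cloud(num_points: int = 1024) -> np.ndarray:
    """Generate a toy point set standing in for three-dimensional scanner output."""
    pts = np.zeros(num_points, dtype=point_dtype)
    pts["xyz"] = np.random.uniform(-10.0, 10.0, size=(num_points, 3)).astype(np.float32)
    pts["rgb"] = np.random.randint(0, 256, size=(num_points, 3), dtype=np.uint8)
    pts["gray"] = pts["rgb"].mean(axis=1).astype(np.float32) / 255.0
    return pts
```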
In some embodiments, the first feature is obtained directly based on the point cloud data.
In some embodiments, as shown in fig. 3, obtaining a first feature based on the point cloud data comprises:
step S310: voxelizing the point cloud data to obtain voxelized data, the voxelized data comprising a plurality of voxels, each voxel of the plurality of voxels corresponding to at least one point in the set of points; and
step S320: obtaining the first feature based on the voxelized data.
After the point cloud data is subjected to voxelization, the first feature is obtained based on the voxelized data, so that the calculation amount is further reduced.
In some embodiments, the point cloud data is voxelized based on the coordinate locations of a set of points in the target three-dimensional scene.
In some embodiments, the point cloud data is voxelized using a trained neural network.
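As one possible realization of the coordinate-based voxelization mentioned above (not the trained-network variant), the sketch below assigns every point to a cubic voxel by flooring its coordinates; the voxel size and the helper name are assumptions.

```python
import numpy as np

def voxelize(xyz: np.ndarray, voxel_size: float = 0.05):
    """Assign each point to a voxel of edge length voxel_size.

    xyz: (N, 3) point coordinates. Returns the (V, 3) integer coordinates of the
    occupied voxels and an (N,) array mapping each point to its voxel, so that
    every voxel corresponds to at least one point in the set.
    """
    grid = np.floor(xyz / voxel_size).astype(np.int64)
    voxels, point_to_voxel = np.unique(grid, axis=0, return_inverse=True)
    return voxels, point_to_voxel.reshape(-1)  # flatten for safety across NumPy versions
```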
In some embodiments, as shown in fig. 4, obtaining the first feature based on the voxelized data comprises:
step S410: performing feature extraction on the voxelized data to obtain voxel data features, the voxel data features comprising voxel features corresponding to each of the plurality of voxels, the voxel features indicating corresponding classes of the corresponding voxels in the plurality of classes; and
step S420: determining a voxel characteristic corresponding to each voxel in the plurality of voxels as a first sub-characteristic of each point in at least one point corresponding to the voxel.
The voxel data features are obtained by performing feature extraction on the voxelized data, and the voxel data features are then mapped to the points corresponding to each voxel, thereby obtaining the first feature.
In some embodiments, after feature extraction is performed on the voxelized data by using a sparse convolutional encoder, features output by the sparse convolutional encoder are obtained, and semantic prediction is performed on the features output by the sparse convolutional encoder by using a multilayer perceptron (MLP) network to obtain voxel data features.
In some embodiments, the voxel characteristic to which each voxel corresponds includes a probability that the voxel corresponds to each of a plurality of classes.
In some embodiments, the voxel characteristics corresponding to each voxel include a probability and a number of channels that the voxel corresponds to each of a plurality of classes.
It is to be understood that the first sub-feature corresponding to each point in the first feature obtained based on the voxel data feature may also include a probability that the point corresponds to each of the plurality of classes, or a probability and a channel number that the point corresponds to each of the plurality of classes. For example, the first feature is represented as a matrix of dimensions N × C, where C is the number of channels and N is the number of points in the set of points in the target three-dimensional scene. Wherein each element in the matrix indicates a probability that the respective point corresponds to each of the plurality of categories.
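A minimal sketch of how the N × C first feature could be assembled from per-voxel features, assuming a voxel-to-point mapping such as the one produced by the voxelization sketch above; this is illustrative rather than the prescribed implementation.

```python
import numpy as np

def scatter_voxel_features_to_points(voxel_feat: np.ndarray,
                                     point_to_voxel: np.ndarray) -> np.ndarray:
    """voxel_feat: (V, C) per-voxel features (e.g. class probabilities);
    point_to_voxel: (N,) index of the voxel containing each point.

    Each point inherits the feature of the voxel it falls into, yielding the
    N x C matrix of first sub-features described above.
    """
    return voxel_feat[point_to_voxel]
```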
In some embodiments, the first sub-feature corresponding to each point in the first feature comprises a probability that the point corresponds to each of the plurality of categories, and then a maximum probability of the plurality of probabilities comprised by the first sub-feature of each point in the first subset corresponds to a same category of the plurality of categories.
In some embodiments, each of the plurality of first global features comprises at least one of:
obtaining a global average pooling feature based on the first feature of each point in the first subset corresponding to the first global feature; and
and obtaining the global maximum pooling feature based on the first feature of each point in the first subset corresponding to the first global feature.
In some embodiments, the subset feature corresponding to each first subset is globally average pooled through a global average pooling network to obtain the first global feature corresponding to that first subset. The subset feature corresponding to each subset is the combination of the first sub-features corresponding to the points in that subset.
In some embodiments, the subset feature corresponding to each first subset is globally max pooled through a global max pooling network to obtain the first global feature corresponding to that first subset. The subset feature corresponding to each subset is the combination of the first sub-features corresponding to the points in that subset.
In some examples, the first global feature is represented as a matrix having a dimension n × C, where n is a number of the plurality of classes and C is a number of channels. Each element in the matrix indicates a global feature of a first subset of points corresponding to each category.
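The class-level global features could, for example, be computed as below; the sketch assumes per-point class assignments are already available and covers both the global average pooling and the global max pooling variants listed above. Names and tensor shapes are assumptions.

```python
import torch

def class_global_features(point_feat: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """point_feat: (N, C) first sub-features; labels: (N,) class assigned to each point.

    Returns (n, C) global average-pooled and (n, C) global max-pooled features,
    one row per class, i.e. per first subset. Rows of empty classes stay zero.
    """
    C = point_feat.shape[1]
    avg = torch.zeros(num_classes, C, dtype=point_feat.dtype)
    mx = torch.zeros(num_classes, C, dtype=point_feat.dtype)
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            avg[k] = point_feat[mask].mean(dim=0)        # global average pooling
            mx[k] = point_feat[mask].max(dim=0).values   # global max pooling
    return avg, mx
```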
In some embodiments, each of the first global features may include a probability that the respective first subset corresponds to each of the plurality of categories, or a probability and a number of channels that the respective first subset corresponds to each of the plurality of categories.
In some embodiments, the first relative relationship comprises a similarity between the first sub-feature of the respective point and each of the first global features of the plurality of first global features. In the process of obtaining the first features, the similarity between each first sub-feature and each first global feature is considered, and the accuracy of the segmentation result of the target three-dimensional scene is further improved.
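One way to realize this similarity is the cosine similarity between each point's first sub-feature and every class-level first global feature, as sketched below; the disclosure does not fix a particular metric, so this choice is an assumption.

```python
import torch
import torch.nn.functional as F

def relative_relationship(point_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    """point_feat: (N, C) per-point sub-features; global_feat: (n, C) global features.

    Returns the (N, n) matrix of cosine similarities, i.e. one relative
    relationship row per point."""
    p = F.normalize(point_feat, dim=1)
    g = F.normalize(global_feat, dim=1)
    return p @ g.t()
```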
In some embodiments, obtaining the first feature based on the voxelized data comprises:
inputting the voxelized data to a first feature extraction network to obtain the first feature; wherein,
the first feature extraction network is obtained by training based on guidance of a depth model including a second feature extraction network, a number of parameters of the second feature extraction network being greater than a number of parameters of the first feature extraction network and capable of obtaining a second feature based on the voxelized data, the second feature including a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories, the second sub-feature of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each second subset of the plurality of second subsets indicating a same category of the plurality of categories; wherein,
a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
The voxelized data are input into the first feature extraction network to obtain the first feature. The first feature extraction network is trained under the guidance of a depth model that includes the second feature extraction network, and the number of parameters of the second feature extraction network included in the depth model is greater than that of the first feature extraction network. The first feature extracted by the first feature extraction network is therefore made as similar as possible to the second feature extracted by the second feature extraction network, so that the first feature extraction network with few parameters can also extract features similar to those extracted by the second feature extraction network with many parameters, which improves the accuracy of the extracted features.
Meanwhile, each second sub-feature in the second feature extracted by the second feature extraction network has a second relative relationship with the plurality of second global features of the plurality of second subsets, and the similarity between this second relative relationship and the first relative relationship of the corresponding point in the point set is greater than a preset threshold. For each point in the point set, the distribution between the point and the global features characterized by the second feature therefore differs little from the distribution between the point and the global features characterized by the first feature; that is, the second feature and the first feature are consistent in global distribution. This further improves the similarity between the features extracted by the first and second feature extraction networks, and thus the accuracy of the features extracted by the first feature extraction network.
In some embodiments, the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
In some embodiments, obtaining a segmentation result of the target three-dimensional scene based on the first feature comprises:
obtaining the plurality of first subsets based on a first feature; and
determining each of the plurality of first subsets as one of a plurality of instances.
In some embodiments, the first feature is further processed to determine the plurality of first subsets. For example, when the first sub-feature corresponding to each point includes the probability that the point corresponds to each of the plurality of categories, the maximum of those probabilities is taken, the category corresponding to that maximum probability is determined as the category of the point, and points whose determined categories are the same are added to the same first subset.
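A minimal sketch of this argmax-based grouping, assuming the first sub-features are per-point class probabilities; the helper name is hypothetical.

```python
import torch

def segment_by_predicted_class(point_probs: torch.Tensor):
    """point_probs: (N, n) per-point probabilities over the n categories.

    Each point is assigned the category with maximum probability; points sharing
    a category form one first subset, which is then reported as one instance."""
    labels = point_probs.argmax(dim=1)                                  # (N,)
    subsets = {int(k): (labels == k).nonzero(as_tuple=True)[0]          # point indices per subset
               for k in labels.unique().tolist()}
    return labels, subsets
```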
According to another aspect of the present disclosure, a three-dimensional scene segmentation model training method is also provided. As shown in fig. 5, method 500 includes:
step S510: obtaining training point cloud data corresponding to a set of points in a training three-dimensional scene, the training three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of subsets, each subset comprising a plurality of points from a same instance of the plurality of instances;
step S520: obtaining a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtaining a second feature based on the training point cloud data using the trained first model, wherein the first feature comprises a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the second feature comprises a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories;
step S530: obtaining a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets based on the first feature;
step S540: obtaining a second relative relationship between a second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets based on the second feature;
step S550: obtaining a first loss based on a first relative relationship and a second relative relationship corresponding to each point in the point set; and
step S560: adjusting parameters of the three-dimensional scene segmentation model based on at least the first loss.
In the process of training the three-dimensional scene segmentation model, a trained first model is first obtained and used to guide the training. During training, the second relative relationship of each second sub-feature, obtained by the first model from the training point cloud data, with respect to the plurality of second global features of the plurality of subsets is obtained, and the first relative relationship of each first sub-feature, obtained by the three-dimensional scene segmentation model from the training point cloud data, with respect to the plurality of first global features of the plurality of subsets is obtained. A first loss is computed from the first and second relative relationships of each point, and the parameters of the three-dimensional scene segmentation model are adjusted based on the first loss. As a result, for each point in the point set, the distribution between the point and the global features characterized by the first feature stays close to that characterized by the second feature, i.e., the two features remain consistent in global distribution, which improves the similarity between the features extracted by the first model and the three-dimensional scene segmentation model and thus the accuracy of the trained three-dimensional scene segmentation model.
In the related art, when a three-dimensional scene segmentation model is trained based on a trained model, the trained model is used as a teacher model and the three-dimensional scene segmentation model as a student model, and during training a loss is computed between the per-point features obtained by the teacher network and the student network. The training therefore only attends to the point-by-point, one-to-one difference between the features extracted by the teacher model and by the student model, that is, only to the difference in the feature dimension of each point. Even if a trained model with a larger number of parameters is used as the teacher model, the accuracy of the three-dimensional scene segmentation model cannot be further improved by such training.
In the embodiments of the present disclosure, the model training process also attends to the difference, between the first model and the three-dimensional scene segmentation model, in the relative distribution of each point with respect to the global features of the extracted features. Training therefore considers not only the one-to-one per-point difference between the features extracted by the first model and the three-dimensional scene segmentation model, but also the difference in global feature distribution, so that the features extracted by the trained three-dimensional scene segmentation model are similar to the features extracted by the first model in multiple dimensions (the per-point dimension and the overall distribution dimension), which improves the accuracy of the three-dimensional scene segmentation model.
In some embodiments, the training three-dimensional scene is an arbitrary three-dimensional scene, wherein each point in the three-dimensional scene is labeled with a respective label.
In some embodiments, the training point cloud data is voxelized using a trained neural network.
In some embodiments, the three-dimensional scene segmentation model comprises a first feature extraction network, the first model comprises a second feature extraction network, and wherein a number of parameters of the second feature extraction network is greater than a number of parameters of the first feature extraction network.
Because the number of parameters of the second feature extraction network included in the first model is greater than the number of parameters of the first feature extraction network in the three-dimensional scene segmentation model, the accuracy of the features obtained by the first model is higher than that of the three-dimensional scene segmentation model. The training of the three-dimensional scene segmentation model is guided by the first model, so that the first feature extracted by the first feature extraction network is as similar as possible to the second feature extracted by the second feature extraction network, and the first feature extraction network with few parameters can extract features similar to those extracted by the second feature extraction network with many parameters, which improves the accuracy of the extracted features.
In some embodiments, the first and second feature extraction networks comprise a sparse convolutional encoder network and a multi-layered perceptron network, respectively.
The sparse convolutional encoder performs feature extraction on the voxelized data to obtain the features output by the sparse convolutional encoder, and a multilayer perceptron (MLP) network performs semantic prediction on these features to obtain the voxel data features.
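Purely to make the parameter-count relationship concrete, the sketch below builds two stand-in extractors of different capacity; a plain MLP replaces the sparse convolutional encoder plus MLP head only so the example stays self-contained, and the input dimension, widths, and class count are hypothetical.

```python
import torch
import torch.nn as nn

def make_extractor(in_dim: int, hidden: int, num_classes: int) -> nn.Module:
    """Toy per-voxel feature extractor: in_dim voxel features -> num_classes scores."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )

student = make_extractor(in_dim=7, hidden=64, num_classes=20)    # first feature extraction network
teacher = make_extractor(in_dim=7, hidden=256, num_classes=20)   # second (larger) feature extraction network
assert sum(p.numel() for p in teacher.parameters()) > sum(p.numel() for p in student.parameters())
```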
In some embodiments, the first feature is obtained by mapping the voxel data feature onto various points in a training three-dimensional scene.
In some embodiments, the first sub-feature corresponding to each point in the first feature comprises a probability that the point corresponds to each of the plurality of categories, or a probability and a number of channels that the point corresponds to each of the plurality of categories.
Likewise, the second sub-feature corresponding to each point in the second feature includes a probability that the point corresponds to each of the plurality of categories, or a probability and a number of channels that the point corresponds to each of the plurality of categories.
In some embodiments, the obtaining a first feature based on the training point cloud data using the three-dimensional scene segmentation model comprises:
performing voxelization on the training point cloud data to obtain training voxelization data corresponding to the training point cloud data, wherein the training voxelization data comprises a plurality of voxels, and each voxel in the plurality of voxels corresponds to at least one point in the point set; and
and inputting the training voxelization data into the three-dimensional scene segmentation model respectively to obtain the first characteristics.
After the point cloud data are voxelized, the first feature is obtained based on the voxelized data, which further reduces the amount of computation.
In some embodiments, as shown in fig. 6, obtaining a first relative relationship between the first sub-feature of each point in the set of points and the plurality of first global features corresponding to the plurality of subsets comprises:
step S610: for each of the plurality of subsets, obtaining a plurality of first sub-features corresponding to a plurality of points of the subset;
step S620: for each of the plurality of subsets, globally pooling the plurality of first sub-features to obtain a first global feature corresponding to the subset, where the first global feature is a global pooled feature of the plurality of first sub-features; and
step S630: obtaining a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set.
Global pooling is performed over the plurality of first sub-features corresponding to the plurality of points included in each subset, thereby obtaining the first global feature corresponding to that subset.
Similarly, in the process of obtaining the second global features, the process described in steps S610 to S630 may be adopted: for the plurality of second sub-features corresponding to the plurality of points included in each subset, global pooling is performed to obtain the second global feature corresponding to that subset.
In some embodiments, as shown in fig. 7, the obtaining a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set includes:
step S710: calculating a similarity between a first sub-feature of each point in the set of points and each of the plurality of first global features; and
step S720: and obtaining a first relative relation corresponding to each point in the point set based on a plurality of similarities corresponding to the first sub-feature of each point in the point set.
The first relative relationship is obtained by calculating the similarity between the first sub-feature of each point and each first global feature.
Similarly, in the process of obtaining the second relative relationship, the process described in steps S710-S720 may be adopted: the second relative relationship is obtained by calculating the similarity between the second sub-feature of each point and each second global feature.
In some embodiments, the obtained plurality of first relative relationships corresponding to the plurality of points in the point set are represented as a first matrix having dimension N × n, where N is the number of points in the point set and n is the number of the plurality of categories, and each element in the matrix represents the similarity between the corresponding point and the first global feature corresponding to the points of the respective category.
Likewise, in some embodiments, the obtained second relative relationships corresponding to the plurality of points in the point set are represented as a second matrix having dimension N × n, where N is the number of points in the point set and n is the number of the plurality of categories, and each element in the matrix represents the similarity between the corresponding point and the second global feature corresponding to the points of the respective category.
In some embodiments, the first loss is calculated by calculating a KL divergence between the first matrix and the second matrix.
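A sketch of such a first loss, assuming each row of the N × n relative-relationship matrix is normalized with a softmax before taking the KL divergence; the softmax normalization and the temperature are assumptions rather than details fixed by the disclosure.

```python
import torch
import torch.nn.functional as F

def first_loss(rel_student: torch.Tensor, rel_teacher: torch.Tensor,
               temperature: float = 1.0) -> torch.Tensor:
    """rel_student / rel_teacher: (N, n) first / second relative-relationship matrices.

    Each row is treated as a distribution over the n class-level global features,
    and the student's rows are pulled toward the teacher's via KL divergence."""
    log_p = F.log_softmax(rel_student / temperature, dim=1)
    q = F.softmax(rel_teacher / temperature, dim=1)
    return F.kl_div(log_p, q, reduction="batchmean")
```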
In some embodiments, the three-dimensional scene segmentation model training method according to the present disclosure further includes:
obtaining a prediction result of the three-dimensional point cloud segmentation model, wherein the prediction result indicates a corresponding category of each point in the point set in a plurality of categories;
obtaining a second loss based on the prediction result and the category, among the plurality of categories, of the instance to which each point in the point set belongs; and
and adjusting parameters of the three-dimensional scene segmentation model based on the second loss.
Through this process, semantic ground-truth labels for each point are used to supervise the training of the three-dimensional scene segmentation model, further improving the accuracy of the model.
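For illustration, a minimal sketch of the second loss as a per-point cross entropy between the prediction result and the semantic ground-truth labels is given below; the function names and the way the two losses are combined are assumptions.

```python
import torch
import torch.nn.functional as F

def second_loss(class_logits: torch.Tensor,   # (N, M) per-point class scores
                point_labels: torch.Tensor    # (N,) ground-truth category indices
                ) -> torch.Tensor:
    return F.cross_entropy(class_logits, point_labels)

# The two losses may then be combined, e.g.:
#   total = second_loss(logits, labels) + lambda_distill * first_loss(m1, m2)
# where lambda_distill is an assumed weighting hyperparameter.
```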
According to another aspect of the present disclosure, there is also provided a three-dimensional scene segmentation apparatus, as shown in fig. 8, the apparatus 800 includes: a point cloud data obtaining unit 810 configured to obtain point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of points from each of the plurality of instances; a first feature obtaining unit 820 configured to obtain a first feature based on the point cloud data, wherein the first feature includes a first sub-feature corresponding to each point in the point set, the first sub-feature indicates a corresponding category of the corresponding point in the plurality of categories, the first sub-feature of each point in the point set has a first relative relationship with a plurality of corresponding first global features in a plurality of first subsets in the point set, the corresponding first sub-feature of the plurality of points in each of the plurality of first subsets indicates a same category in the plurality of categories; and a segmentation result obtaining unit 830 configured to obtain a segmentation result of the target three-dimensional scene based on the first feature, the segmentation result indicating an instance of each point in the set of points among the plurality of instances.
In some embodiments, the first feature obtaining unit 820 includes: a voxelization unit configured to voxelize the point cloud data to obtain voxelized data, the voxelized data comprising a plurality of voxels, each of the plurality of voxels corresponding to at least one point in the set of points; and a first obtaining subunit configured to obtain the first feature based on the voxelized data.
In some embodiments, the first acquisition subunit comprises: a first feature extraction subunit configured to perform feature extraction on the voxelized data to obtain voxel data features, the voxel data features comprising voxel features corresponding to each of the plurality of voxels, the voxel features indicating corresponding classes of the corresponding voxels in the plurality of classes; and a second feature extraction subunit configured to determine a voxel feature corresponding to each voxel in the plurality of voxels as a first sub-feature of each point of the at least one point corresponding to the voxel.
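The following sketch illustrates one possible voxelization and the assignment of each voxel's feature to the points that voxel contains; the grid resolution, function names, and tensor layouts are illustrative assumptions.

```python
import torch

def voxelize(points_xyz: torch.Tensor, voxel_size: float):
    # points_xyz: (N, 3) point coordinates.
    # Returns the unique occupied voxel coordinates (V, 3) and, for each point,
    # the index of the voxel it falls into (N,).
    coords = torch.floor(points_xyz / voxel_size).long()
    unique_coords, point_to_voxel = torch.unique(coords, dim=0,
                                                 return_inverse=True)
    return unique_coords, point_to_voxel

def voxel_features_to_points(voxel_features: torch.Tensor,  # (V, C)
                             point_to_voxel: torch.Tensor   # (N,)
                             ) -> torch.Tensor:
    # Each point inherits the feature of its voxel as its first sub-feature.
    return voxel_features[point_to_voxel]
```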
In some embodiments, each of the plurality of first global features comprises at least one of: obtaining a global average pooling characteristic based on the first characteristic of each point in the first subset corresponding to the first global characteristic; and obtaining a global maximum pooled feature based on the first feature of each point in the first subset corresponding to the first global feature.
In some embodiments, the first relative relationship comprises a similarity between the first sub-feature of the respective point and each of the first global features of the plurality of first global features.
In some embodiments, the first acquisition subunit comprises: an input unit configured to input the voxelized data to a first feature extraction network to obtain the first feature; wherein the first feature extraction network is obtained by training based on guidance of a depth model comprising a second feature extraction network having a larger number of parameters than the first feature extraction network and capable of obtaining second features based on the voxelized data, the second features comprising second sub-features corresponding to each point in the set of points, the second sub-features indicating respective classes of the respective point in the plurality of classes, the second sub-features of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each second subset of the plurality of second subsets indicating a same class in the plurality of classes; wherein a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
In some embodiments, the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
According to another aspect of the present disclosure, a three-dimensional scene segmentation model training device is also provided. As shown in fig. 9, the apparatus 900 includes: a training data obtaining unit 910 configured to obtain training point cloud data corresponding to a point set in a training three-dimensional scene, the training three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the point set including a plurality of subsets, each subset including a plurality of points from a same instance in the plurality of instances; a feature obtaining unit 920 configured to obtain a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtain a second feature based on the training point cloud data using the trained first model, wherein the first feature includes a first sub-feature corresponding to each point in the point set, the first sub-feature indicates a respective category of the respective point in the plurality of categories, the second feature includes a second sub-feature corresponding to each point in the point set, the second sub-feature indicates a respective category of the respective point in the plurality of categories; a first relative relationship obtaining unit 930 configured to obtain, based on the first feature, a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets; a second relative relationship obtaining unit 940, configured to obtain, based on the second feature, second relative relationships between the second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets; a first loss calculation unit 950 configured to obtain a first loss based on the first relative relationship and the second relative relationship corresponding to each point in the point set; and a parameter adjusting unit 960 configured for adjusting parameters of the three-dimensional scene segmentation model based on at least the first loss.
In some embodiments, the feature obtaining unit 920 includes: a voxelization unit configured to voxelize the training point cloud data to obtain training voxelization data corresponding to the training point cloud data, the training voxelization data including a plurality of voxels, each of the plurality of voxels corresponding to at least one point in the set of points; and an input unit configured to input the training voxelized data to the three-dimensional scene segmentation model, respectively, to obtain the first features.
In some embodiments, the three-dimensional scene segmentation model comprises a first feature extraction network, the first model comprises a second feature extraction network, and wherein a number of parameters of the second feature extraction network is greater than a number of parameters of the first feature extraction network.
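For illustration, the following sketch shows one possible arrangement in which the trained first model (a teacher with more parameters) guides the three-dimensional scene segmentation model (the student) during a single training step; the module interfaces, the combined loss function, and the optimizer usage are assumptions.

```python
import torch

def distillation_step(student: torch.nn.Module,
                      teacher: torch.nn.Module,
                      voxel_data: torch.Tensor,
                      optimizer: torch.optim.Optimizer,
                      loss_fn) -> torch.Tensor:
    teacher.eval()
    with torch.no_grad():                  # the trained first model stays fixed
        second_features = teacher(voxel_data)
    first_features = student(voxel_data)   # the segmentation model being trained
    loss = loss_fn(first_features, second_features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```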
In some embodiments, the first relative relationship obtaining unit 930 includes: a first obtaining subunit, configured to obtain, for each of the plurality of subsets, a plurality of first sub-features corresponding to a plurality of points of the subset; the pooling unit is configured to perform global pooling on the plurality of first sub-features for each of the plurality of subsets to obtain first global features corresponding to the subset, wherein the first global features are global pooled features of the plurality of first sub-features; the second obtaining subunit is configured to obtain a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set.
In some embodiments, the second obtaining subunit includes: a first calculation unit configured to calculate a similarity between a first sub-feature of each point in the set of points and each first global feature of the plurality of first global features; and a third obtaining subunit configured to obtain, based on a plurality of similarities corresponding to the first sub-feature of each point in the point set, a first relative relationship corresponding to each point in the point set.
In some embodiments, the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the personal information of related users comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, an electronic device, a readable storage medium, and a computer program product are also provided.
Referring to fig. 10, a block diagram of an electronic device 1000, which may be a server or a client of the present disclosure and which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006, an output unit 1007, a storage unit 1008, and a communication unit 1009. The input unit 1006 may be any type of device capable of inputting information to the electronic device 1000; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 1007 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1008 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth(TM) devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1001 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the method 200 in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical aspects of the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples, but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (28)

1. A three-dimensional scene segmentation method, comprising:
obtaining point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of points from each instance of the plurality of instances;
obtaining a first feature based on the point cloud data, wherein the first feature comprises a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the first sub-feature of each point in the set of points having a first relative relationship with a corresponding plurality of first global features in a plurality of first subsets of the set of points, the respective first sub-features of the plurality of points in each of the plurality of first subsets indicating a same category in the plurality of categories; and
based on the first features, obtaining a segmentation result of the target three-dimensional scene, the segmentation result indicating an instance of each point in the set of points among the plurality of instances; wherein the obtaining a first feature based on the point cloud data comprises:
obtaining the first feature based on the point cloud data using a three-dimensional scene segmentation model; wherein,
the three-dimensional scene segmentation model is obtained by training based on guidance of a second model, the second model being capable of obtaining a second feature based on the point cloud data, the second feature including a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories, the second sub-feature of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each of the plurality of second subsets indicating a same category of the plurality of categories; wherein a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
2. The method of claim 1, wherein the obtaining a first feature based on the point cloud data comprises:
voxelizing the point cloud data to obtain voxelized data, the voxelized data comprising a plurality of voxels, each voxel of the plurality of voxels corresponding to at least one point in the set of points; and
obtaining the first feature based on the voxelized data.
3. The method of claim 2, wherein the obtaining the first feature based on the voxelized data comprises:
performing feature extraction on the voxelized data to obtain voxel data features, the voxel data features comprising voxel features corresponding to each of the plurality of voxels, the voxel features indicating corresponding classes of the corresponding voxels in the plurality of classes; and
determining a voxel characteristic corresponding to each voxel in the plurality of voxels as a first sub-characteristic of each point in at least one point corresponding to the voxel.
4. The method of claim 1, wherein each of the plurality of first global features comprises at least one of:
obtaining a global average pooling feature based on the first feature of each point in the first subset corresponding to the first global feature; and
and obtaining a global maximum pooling feature based on the first feature of each point in the first subset corresponding to the first global feature.
5. The method of claim 1, wherein the first relative relationship comprises a similarity between the first sub-feature of the respective point and each of the plurality of first global features.
6. The method of claim 2, wherein the obtaining the first feature based on the voxelized data comprises:
inputting the voxelized data to a first feature extraction network to obtain the first feature; wherein,
the first feature extraction network is obtained by training based on guidance of a depth model comprising a second feature extraction network, a number of parameters of the second feature extraction network being larger than a number of parameters of the first feature extraction network and being capable of obtaining a second feature based on the voxelized data, the second feature comprising a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories, the second sub-feature of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each second subset of the plurality of second subsets indicating a same category of the plurality of categories; wherein a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
7. The method of claim 6, wherein the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
8. A three-dimensional scene segmentation model training method, the method comprising:
obtaining training point cloud data corresponding to a set of points in a training three-dimensional scene, the training three-dimensional scene comprising a plurality of instances, each instance of the plurality of instances having a respective category in a plurality of categories, the set of points comprising a plurality of subsets, each subset comprising a plurality of points from a same instance of the plurality of instances;
obtaining a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtaining a second feature based on the training point cloud data using the trained first model, the first feature comprising a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, the second feature comprising a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories;
obtaining a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets based on the first feature;
obtaining a second relative relationship between a second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets based on the second feature;
obtaining a first loss based on a first relative relationship and a second relative relationship corresponding to each point in the point set; and
adjusting parameters of the three-dimensional scene segmentation model based on at least the first loss.
9. The method of claim 8, wherein the obtaining a first feature based on the training point cloud data using the three-dimensional scene segmentation model comprises:
performing voxelization on the training point cloud data to obtain training voxelization data corresponding to the training point cloud data, wherein the training voxelization data comprises a plurality of voxels, and each voxel in the plurality of voxels corresponds to at least one point in the point set; and
inputting the training voxelization data into the three-dimensional scene segmentation model respectively to obtain the first features.
10. The method of claim 8, wherein the three-dimensional scene segmentation model comprises a first feature extraction network, the first model comprising a second feature extraction network, and wherein a number of parameters of the second feature extraction network is greater than a number of parameters of the first feature extraction network.
11. The method of claim 8, wherein the obtaining a first relative relationship between a first sub-feature of each point in the set of points and a plurality of first global features corresponding to the plurality of subsets comprises:
for each of the plurality of subsets,
obtaining a plurality of first sub-features corresponding to a plurality of points of the subset;
globally pooling the plurality of first sub-features to obtain a first global feature corresponding to the subset, wherein the first global feature is a global pooled feature of the plurality of first sub-features; and
obtaining a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set.
12. The method of claim 11, wherein the obtaining a first relative relationship corresponding to each point in the point set based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set comprises:
calculating a similarity between a first sub-feature of each point in the set of points and each of the plurality of first global features; and
and obtaining a first relative relation corresponding to each point in the point set based on a plurality of similarities corresponding to the first sub-feature of each point in the point set.
13. The method of claim 10, wherein the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
14. A three-dimensional scene segmentation apparatus comprising:
a point cloud data acquisition unit configured to obtain point cloud data corresponding to a set of points in a target three-dimensional scene, the target three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of points from each of the plurality of instances;
a first feature obtaining unit configured to obtain a first feature based on the point cloud data, wherein the first feature includes a first sub-feature corresponding to each point in the point set, the first sub-feature indicates a respective category of the respective point in the plurality of categories, the first sub-feature of each point in the point set has a first relative relationship with a plurality of first global features corresponding to a plurality of first subsets of the point set, and the respective first sub-features of the plurality of points of each first subset in the plurality of first subsets indicate a same category in the plurality of categories; and
a segmentation result obtaining unit configured to obtain a segmentation result of the target three-dimensional scene based on the first feature, the segmentation result indicating an instance of each point in the set of points among the plurality of instances; wherein the obtaining a first feature based on the point cloud data comprises:
obtaining the first feature based on the point cloud data using a three-dimensional scene segmentation model; wherein,
the three-dimensional scene segmentation model is obtained by training based on guidance of a second model, the second model being capable of obtaining a second feature based on the point cloud data, the second feature including a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories, the second sub-feature of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each of the plurality of second subsets indicating a same category of the plurality of categories; wherein a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
15. The apparatus of claim 14, wherein the first feature acquisition unit comprises:
a voxelization unit configured for voxelizing the point cloud data to obtain voxelized data, the voxelized data comprising a plurality of voxels, each of the plurality of voxels corresponding to at least one point in the set of points; and
a first obtaining subunit configured to obtain the first feature based on the voxelized data.
16. The apparatus of claim 15, wherein the first acquisition subunit comprises:
a first feature extraction subunit configured to perform feature extraction on the voxelized data to obtain voxel data features, the voxel data features comprising voxel features corresponding to each of the plurality of voxels, the voxel features indicating corresponding classes of the corresponding voxels in the plurality of classes; and
and a second feature extraction subunit configured to determine a voxel feature corresponding to each voxel in the plurality of voxels as a first sub-feature of each point in at least one point corresponding to the voxel.
17. The apparatus of claim 14, wherein each of the plurality of first global features comprises at least one of:
obtaining a global average pooling feature based on the first feature of each point in the first subset corresponding to the first global feature; and
and obtaining the global maximum pooling feature based on the first feature of each point in the first subset corresponding to the first global feature.
18. The apparatus of claim 14, wherein the first relative relationship comprises a similarity between the first sub-feature of the respective point and each of the plurality of first global features.
19. The apparatus of claim 16, wherein the first acquisition subunit comprises:
an input unit configured to input the voxelized data to a first feature extraction network to obtain the first feature; wherein,
the first feature extraction network is obtained by training based on guidance of a depth model including a second feature extraction network, a number of parameters of the second feature extraction network being greater than a number of parameters of the first feature extraction network and capable of obtaining a second feature based on the voxelized data, the second feature including a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories, the second sub-feature of each point in the set of points having a second relative relationship with a plurality of second global features corresponding to a plurality of second subsets of the set of points, the respective second sub-features of the plurality of points of each second subset of the plurality of second subsets indicating a same category of the plurality of categories; wherein,
a similarity between the first relative relationship and the second relative relationship of each point in the set of points is greater than a preset threshold.
20. The apparatus of claim 19, wherein the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
21. A three-dimensional scene segmentation model training device comprises:
a training data acquisition unit configured to obtain training point cloud data corresponding to a set of points in a training three-dimensional scene, the training three-dimensional scene including a plurality of instances, each of the plurality of instances having a respective category in a plurality of categories, the set of points including a plurality of subsets, each subset including a plurality of points from a same instance of the plurality of instances;
a feature obtaining unit configured to obtain a first feature based on the training point cloud data using the three-dimensional scene segmentation model, and obtain a second feature based on the training point cloud data using the trained first model, wherein the first feature includes a first sub-feature corresponding to each point in the set of points, the first sub-feature indicating a respective category of the respective point in the plurality of categories, and the second feature includes a second sub-feature corresponding to each point in the set of points, the second sub-feature indicating a respective category of the respective point in the plurality of categories;
a first relative relationship obtaining unit, configured to obtain, based on the first feature, a first relative relationship between a first sub-feature of each point in the point set and a plurality of first global features corresponding to the plurality of subsets;
a second relative relationship obtaining unit, configured to obtain, based on the second feature, a second relative relationship between a second sub-feature of each point in the point set and a plurality of second global features corresponding to the plurality of subsets;
a first loss calculation unit configured to obtain a first loss based on a first relative relationship and a second relative relationship corresponding to each point in the point set; and
a parameter adjusting unit configured to adjust a parameter of the three-dimensional scene segmentation model based on at least the first loss.
22. The apparatus of claim 21, wherein the feature acquisition unit comprises:
a voxelization unit configured to voxelize the training point cloud data to obtain training voxelization data corresponding to the training point cloud data, the training voxelization data including a plurality of voxels, each of the plurality of voxels corresponding to at least one point in the set of points; and
an input unit configured to input the training voxelized data to the three-dimensional scene segmentation model, respectively, to obtain the first features.
23. The apparatus of claim 21, wherein the three-dimensional scene segmentation model comprises a first feature extraction network, the first model comprising a second feature extraction network, and wherein a number of parameters of the second feature extraction network is greater than a number of parameters of the first feature extraction network.
24. The apparatus according to claim 21, wherein the first relative relationship acquisition unit includes:
a first obtaining subunit, configured to obtain, for each of the plurality of subsets, a plurality of first sub-features corresponding to a plurality of points of the subset; and
a pooling unit configured to, for each of the plurality of subsets, globally pool the plurality of first sub-features to obtain first global features corresponding to the subset, where the first global features are global pooled features of the plurality of first sub-features;
a second obtaining subunit, configured to obtain, based on a plurality of first global features corresponding to the plurality of subsets and the first sub-feature of each point in the point set, a first relative relationship corresponding to each point in the point set.
25. The apparatus of claim 24, wherein the second acquisition subunit comprises:
a first calculation unit configured to calculate a similarity between a first sub-feature of each point in the set of points and each of the plurality of first global features;
a third obtaining subunit, configured to obtain, based on a plurality of similarities corresponding to the first sub-feature of each point in the point set, a first relative relationship corresponding to each point in the point set.
26. The apparatus of claim 23, wherein the first feature extraction network comprises a sparse convolutional encoder network and a multi-layered perceptron network.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-13.
28. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-13.
CN202210806894.8A 2022-07-08 2022-07-08 Three-dimensional scene segmentation method, model training method and device and electronic equipment Active CN115019048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210806894.8A CN115019048B (en) 2022-07-08 2022-07-08 Three-dimensional scene segmentation method, model training method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115019048A CN115019048A (en) 2022-09-06
CN115019048B true CN115019048B (en) 2023-04-07

Family

ID=83080311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210806894.8A Active CN115019048B (en) 2022-07-08 2022-07-08 Three-dimensional scene segmentation method, model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115019048B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968121A (en) * 2020-08-03 2020-11-20 电子科技大学 Three-dimensional point cloud scene segmentation method based on instance embedding and semantic fusion
CN114494610A (en) * 2022-04-14 2022-05-13 清华大学 Intelligent understanding system and device for real-time reconstruction of large scene light field

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110160502B (en) * 2018-10-12 2022-04-01 腾讯科技(深圳)有限公司 Map element extraction method, device and server
CN110660062B (en) * 2019-08-31 2022-10-18 南京理工大学 Point cloud instance segmentation method and system based on PointNet
US11281917B2 (en) * 2019-10-31 2022-03-22 Aptiv Technologies Limited Multi-domain neighborhood embedding and weighting of point cloud data
CN111583263B (en) * 2020-04-30 2022-09-23 北京工业大学 Point cloud segmentation method based on joint dynamic graph convolution
CN112949647B (en) * 2021-02-26 2023-04-07 中国科学院自动化研究所 Three-dimensional scene description method and device, electronic equipment and storage medium
CN114693923A (en) * 2022-03-09 2022-07-01 南京大学 Three-dimensional point cloud semantic segmentation method based on context and attention

Also Published As

Publication number Publication date
CN115019048A (en) 2022-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant