CN118133845A - A fusion method, device, equipment and storage medium for multi-channel semantic understanding - Google Patents
A fusion method, device, equipment and storage medium for multi-channel semantic understanding
- Publication number
- CN118133845A (application CN202410559450.8A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- text
- sequence
- channel
- standard
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
Description
Technical Field

The present application relates to the field of computer technology, and in particular to a fusion method, device, equipment and storage medium for multi-channel semantic understanding.
Background

Since Bolt et al. introduced the concept of the multi-channel user interface in the 1980s, multi-channel interfaces have seen increasingly wide adoption thanks to their efficient and natural interaction style, and a series of multi-channel interaction systems have been developed: from QuickSet, which combines speech and pen gestures, to systems that combine speech with visual gaze, and multi-channel interaction on handheld mobile devices. An important task in these systems is the joint processing of the information input through each channel. There are two main architectures for this joint processing: information fusion at the feature level (also called early fusion) and information fusion at the semantic level (also called late fusion). Feature fusion is generally considered better suited to closely coupled, synchronized channels, such as speech and lip movement, because both input channels provide corresponding information about the same semantic unit. However, when each channel provides information whose content and time stamps are insufficient on their own, such as speech and pen input, the channels supply different but complementary information, and semantic-level fusion is required.

With both architectures above, feature fusion becomes extremely difficult when each channel provides information with insufficient content and time stamps, or when the information provided by the channels is only loosely coupled. In the scenario considered here, the channels to be fused are mainly the voice, glove and back-clip channels; the correlation between them is weak, which makes feature fusion difficult. Conventional multi-channel semantic understanding simply fills the semantic features of each channel into word-slot positions: when all elements of a command cannot be fully expressed by a single channel, the result is an incomplete semantic slot that must be supplemented by the semantics of other channels. However, when multiple commands conflict with or duplicate one another, slot-based approaches usually resort to hand-written rules to disambiguate the conflicts and redundancy; in complex scenarios this strategy leads to an unmanageable number of rules that are difficult to apply.
Summary of the Invention

In view of the above technical problems, it is necessary to provide a fusion method, device, equipment and storage medium for multi-channel semantic understanding that can fuse multi-channel semantic texts efficiently and conveniently.

A fusion method for multi-channel semantic understanding, the method comprising:

Building a semantic understanding fusion model.

Acquiring multi-channel semantic texts as a training sample data set, the training sample data set including voice information, visual information and text information.

The semantic understanding fusion model performs NSP task training on the training sample data set to obtain semantic sequences to be labeled, and performs channel labeling on the sequences to be labeled based on the Transformer architecture to obtain labeled-sequence semantic text.

At the decoding layer, a seq2seq model generates standard instructions from the labeled-sequence semantic text; the standard instructions structure the labeled-sequence semantic text to obtain standard semantic text.
In one embodiment, the method further comprises: building the semantic understanding fusion model using unsupervised data sampling and data universal-annotation techniques.

In one embodiment, the method further comprises: recognizing multi-channel original instructions through the semantic understanding fusion model and acquiring the semantic texts corresponding to the original instructions as the training sample data set.

In one embodiment, the method further comprises: the semantic understanding fusion model performing NSP task training between any two consecutive sentences in the training sample data set to obtain multi-channel semantic sequences to be labeled; and building a semantic cross-fusion model from the sequences to be labeled, the semantic cross-fusion model tagging the sequence of each channel with a segment_id based on the Transformer architecture to obtain labeled-sequence semantic text.

In one embodiment, the semantic understanding fusion model includes a decoding layer, and the method further comprises: extracting keyword groups from the labeled-sequence semantic text through the decoding layer, masking the keyword groups as the dictionary to be decoded to obtain a candidate decoding dictionary, and generating the original instructions from the candidate decoding dictionary sequence-to-sequence.

In one embodiment, the method further comprises: at the decoding layer, the seq2seq model parsing the original instructions from the labeled-sequence semantic text to obtain original instruction tasks, and generating standard instructions from the original instruction tasks, the standard instructions using sequence labeling to structure the labeled-sequence semantic text into standard semantic text.

In one embodiment, the method further comprises: training the semantic understanding fusion model on the multi-channel semantic texts and the standard semantic texts to obtain a trained semantic understanding fusion model, and performing multi-channel semantic understanding fusion with the trained model.
A fusion device for multi-channel semantic understanding, the device comprising:

A semantic understanding fusion model building module, configured to build a semantic understanding fusion model.

A sample data set acquisition module, configured to acquire multi-channel semantic texts as a training sample data set, the training sample data set including voice information, visual information and text information.

A sequence semantic text labeling module, configured for the semantic understanding fusion model to perform NSP task training on the training sample data set to obtain semantic sequences to be labeled, and to perform channel labeling on the sequences based on the Transformer architecture to obtain labeled-sequence semantic text.

A semantic understanding fusion module, configured to generate standard instructions from the labeled-sequence semantic text at the decoding layer using a seq2seq model, the standard instructions structuring the labeled-sequence semantic text to obtain standard semantic text.
A computer device, comprising a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:

Building a semantic understanding fusion model.

Acquiring multi-channel semantic texts as a training sample data set, the training sample data set including voice information, visual information and text information.

The semantic understanding fusion model performs NSP task training on the training sample data set to obtain semantic sequences to be labeled, and performs channel labeling on the sequences to be labeled based on the Transformer architecture to obtain labeled-sequence semantic text.

At the decoding layer, a seq2seq model generates standard instructions from the labeled-sequence semantic text; the standard instructions structure the labeled-sequence semantic text to obtain standard semantic text.
A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the following steps:

Building a semantic understanding fusion model.

Acquiring multi-channel semantic texts as a training sample data set, the training sample data set including voice information, visual information and text information.

The semantic understanding fusion model performs NSP task training on the training sample data set to obtain semantic sequences to be labeled, and performs channel labeling on the sequences to be labeled based on the Transformer architecture to obtain labeled-sequence semantic text.

At the decoding layer, a seq2seq model generates standard instructions from the labeled-sequence semantic text; the standard instructions structure the labeled-sequence semantic text to obtain standard semantic text.

In the above fusion method, device, equipment and storage medium for multi-channel semantic understanding, the texts of the different channels are tagged on top of a pre-trained semantic understanding fusion model, which increases the distinction between channels, helps to sharpen the semantic differences between them, and improves the model's ability to express semantic features. Compared with feature-level multi-channel fusion, this reduces the amount of data annotation required and the difficulty of fusion. In addition, constrained decoding is adopted in the semantic fusion stage, which effectively improves the generation of standard instructions and ensures that the generated instructions are correct and easy to understand. With the understanding ability of the pre-trained language model, redundant instructions and instructions with missing single-channel information can complement one another, which helps to guarantee accuracy and achieves efficient, convenient fusion of multi-channel semantic texts.
Brief Description of the Drawings

FIG. 1 is a flow chart of a fusion method for multi-channel semantic understanding in one embodiment;

FIG. 2 is a structural block diagram of a fusion device for multi-channel semantic understanding in one embodiment;

FIG. 3 is a diagram of the internal structure of a computer device in one embodiment.

Detailed Description of the Embodiments

To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present application, not to limit it.
In one embodiment, as shown in FIG. 1, a fusion method for multi-channel semantic understanding is provided, comprising the following steps:

Step 102: build a semantic understanding fusion model.

Step 104: acquire multi-channel semantic texts as a training sample data set.

The training sample data set includes voice information, visual information and text information.

Step 106: the semantic understanding fusion model performs NSP task training on the training sample data set to obtain semantic sequences to be labeled, and performs channel labeling on the sequences to be labeled based on the Transformer architecture to obtain labeled-sequence semantic text.

Specifically, current semantic understanding fusion models use an identifier called segment_id to distinguish different sentences or segments. In multi-sentence input scenarios, the sentences are marked appropriately so that the model can understand and process the relationships among them. For Transformer-based models (such as BERT and GPT), the input is delimited with special tokens: a start token ([CLS]) and separator tokens ([SEP]). When two sentences are processed, two segment_id values distinguish them: the identifier of the first sentence is set to 0 and that of the second to 1. Accordingly, each channel's sequence is tagged with a segment_id as the marker of that channel's semantic text, which increases the distinction between channel texts and improves the model's ability to understand them.
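As a minimal sketch of the per-channel segment_id tagging described above (the tokens and the helper function are hypothetical, for illustration only), each channel's tokens can be concatenated into one sequence with a segment_id per channel:

```python
# Hypothetical sketch: wrap multi-channel token lists with [CLS]/[SEP]
# markers and give every token the segment_id of its source channel.

def build_multichannel_input(channel_texts):
    """channel_texts: list of token lists, one list per input channel."""
    tokens, segment_ids = ["[CLS]"], [0]
    for channel_idx, channel in enumerate(channel_texts):
        for tok in channel:
            tokens.append(tok)
            segment_ids.append(channel_idx)   # one segment_id per channel
        tokens.append("[SEP]")                # channel boundary marker
        segment_ids.append(channel_idx)
    return tokens, segment_ids

# voice / gesture / action channels from the example in the text
tokens, seg = build_multichannel_input(
    [["1组", "前进", "100米"], ["前进"], ["前进"]]
)
```

The segment_id sequence then lets a Transformer encoder tell apart tokens with the same surface form (here "前进") coming from different channels.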
Step 108: at the decoding layer, a seq2seq model generates standard instructions from the labeled-sequence semantic text; the standard instructions structure the labeled-sequence semantic text to obtain standard semantic text.
Specifically, a text generation model based on constrained-decoding seq2seq implements the standard instruction generation task. At the decoding layer, the decoding dictionary is masked so that only the words contained in the input text, together with a special character set, remain as candidate words in the decoding dictionary:

mask_seq[i] = 1 if dict[i] ∈ (input tokens ∪ special tokens), otherwise 0

dictCandEmb = mask_seq ⊙ dictAllEmb

where dictCandEmb is the candidate decoding dictionary obtained, mask_seq holds the mask value of each word in the dictionary and has length equal to the size of the full dictionary, dictAllEmb is the embedding of every word in the dictionary, and each embedding is a vector.
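The constrained-decoding mask can be sketched as follows (a simplified illustration over a toy dictionary; the function names and the logit values are hypothetical): only words seen in the input or in the special character set stay decodable, and all other dictionary entries are suppressed before the decoder samples.

```python
import math

def build_mask_seq(dictionary, input_tokens, special_tokens):
    """mask value per dictionary word: 1 if the word may be decoded."""
    allowed = set(input_tokens) | set(special_tokens)
    return [1 if w in allowed else 0 for w in dictionary]

def apply_mask(logits, mask_seq):
    """Suppress disallowed words by setting their logits to -inf."""
    return [l if m == 1 else -math.inf for l, m in zip(logits, mask_seq)]

dictionary = ["1组", "前进", "100米", "后退", "[SEP]"]
mask_seq = build_mask_seq(dictionary, ["1组", "前进", "100米"], ["[SEP]"])
masked = apply_mask([0.5, 1.2, 0.1, 2.0, 0.0], mask_seq)
```

Even though "后退" has the highest raw logit, its mask value is 0, so greedy decoding over `masked` can only choose words that actually occur in the input, preserving the original semantic content.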
Further, the remaining content is constructed according to a sequence-to-sequence generation model to obtain the labeled-sequence semantic text, and the standard instructions structure the labeled-sequence semantic text to obtain standard semantic text.

Further, the semantic understanding fusion model is trained on the multi-channel semantic texts and the standard semantic texts to obtain a trained semantic understanding fusion model, and multi-channel semantic understanding fusion is performed with the trained model.

In the above fusion method for multi-channel semantic understanding, the texts of the different channels are tagged on top of a pre-trained semantic understanding fusion model, which increases the distinction between channels, helps to sharpen the semantic differences between them, and improves the model's ability to express semantic features. Compared with feature-level multi-channel fusion, this reduces the amount of data annotation required and the difficulty of fusion. In addition, constrained decoding is adopted in the semantic fusion stage, which effectively improves the generation of standard instructions and ensures that the generated instructions are correct and easy to understand. With the understanding ability of the pre-trained language model, redundant instructions and instructions with missing single-channel information can complement one another, which helps to guarantee accuracy and achieves efficient, convenient fusion of multi-channel semantic texts.
In one embodiment, the semantic understanding fusion model is built using unsupervised data sampling and data back-annotation techniques.

It should be noted that the pre-trained semantic understanding fusion model is trained on massive data in a multi-task manner, with the help of unsupervised data sampling and data back-annotation, to learn representations of text sequences. Training typically involves predicting MASKed words in the original text and judging whether a pair of sentences from the original text are consecutive. The pre-trained model has strong deep semantic understanding and can recognize erroneous text entered through a channel; because of its good text feature representation, it can identify incorrect semantic instructions and correct them, providing accurate instruction data to support semantic parsing. For example, the voice channel outputs the instruction data "一对前进" (a homophone transcription error); after fusion by the pre-trained semantic understanding fusion model of this solution, it is corrected to "一队前进" ("one squad, advance").
In one embodiment, multi-channel original instructions are recognized through the semantic understanding fusion model, and the semantic texts corresponding to the original instructions are acquired as the training sample data set.

In one embodiment, the semantic understanding fusion model performs NSP task training between any two consecutive sentences in the training sample data set to obtain multi-channel semantic sequences to be labeled. A semantic cross-fusion model is built from the sequences to be labeled; based on the Transformer architecture, it tags the sequence of each channel with a segment_id to obtain the labeled-sequence semantic text.

It should be noted that in multi-channel text semantic fusion, each channel can be regarded as a sentence, and the texts of all channels are fed to the model as one sequence. The semantic information of some channels is incomplete, so a pre-trained language model is needed to strengthen language understanding. Pre-trained semantic understanding fusion models usually do not mark different sentences directly; instead, the Next Sentence Prediction (hereinafter NSP) task trains the model to capture the relationship between two consecutive sentences, i.e., to judge whether they are closely related or mutually independent. The input multi-channel text therefore needs sentence-level marking to further improve semantic understanding. Tagging the input data of each channel increases the distinction between channel texts, which helps the model learn the features of the different texts.
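The NSP-style pairing of consecutive sentences described above can be sketched as follows (a simplified illustration; the corpus and the 50/50 sampling scheme are hypothetical assumptions, chosen to mirror common NSP training setups):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sent_a, sent_b, is_next) training pairs from an ordered
    corpus: roughly half true consecutive pairs, half random negatives."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))  # true next
        else:
            j = rng.randrange(len(sentences))
            if j == i + 1:          # avoid accidentally sampling the true next
                j = i
            pairs.append((sentences[i], sentences[j], 0))      # negative
    return pairs

corpus = ["语音:1组前进100米", "手势:前进", "动作:前进", "语音:停止"]
pairs = make_nsp_pairs(corpus)
```

A binary classifier head over the encoder's [CLS] representation is then trained on the `is_next` label, which is how the model learns whether two channel sentences are closely related or independent.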
In one embodiment, the semantic understanding fusion model includes a decoding layer. Keyword groups are extracted from the labeled-sequence semantic text by the decoding layer and masked as the dictionary to be decoded, yielding a candidate decoding dictionary from which the original instructions are generated sequence-to-sequence.

In one embodiment, at the decoding layer the seq2seq model parses the original instructions from the labeled-sequence semantic text to obtain original instruction tasks, and standard instructions are generated from these tasks; the standard instructions use sequence labeling to structure the labeled-sequence semantic text into standard semantic text.

It should be noted that fusion mainly targets the voice transcription channel, the gesture semantics channel and the wearable-device action semantics channel, and the three channels may conflict with, duplicate or complement one another. For example, for the three channel texts [voice: group 1 advance 100 meters], [gesture: advance] and [action: advance], the channel semantics are redundant; to understand the semantic information across the three channels accurately, semantic fusion is required. The semantic understanding ability of the generative pre-trained fusion model, optimized and tuned on domain data, generates standard instructions that provide a normalized basis for semantic understanding. Multi-channel semantic fusion adopts a seq2seq sequence generation method that combines a generative pre-trained language model with constrained decoding to generate the standard instructions. Thus, to fully preserve the semantic information of the original text, constrained-decoding sequence generation is adopted during instruction generation, improving the generation of standard instructions during fusion.
In one embodiment, the semantic understanding fusion model is trained on the multi-channel semantic texts and the standard semantic texts to obtain a trained semantic understanding fusion model, and multi-channel semantic understanding fusion is performed with the trained model.

It should be noted that the multi-channel semantic texts yield fused standard instructions through the fusion model, but a machine model cannot interpret these instructions directly; the standard instructions must be converted into fields suitable for semantic understanding. In this solution, attributes, objects and actions serve as the semantic objects in the design of instruction semantic understanding. For example, the instruction "迅速追击正北方向之敌" ("swiftly pursue the enemy due north") is converted into the semantic form [attribute: swiftly], [object: enemy due north], [action: pursue]. The semantic understanding of an instruction is an information extraction task, implemented with a pre-trained language model plus sub-classifiers. Taking the three channel texts [voice: group 1 advance 100 meters], [gesture: advance] and [action: advance] as an example, the three channel texts are converted into three sentences and composed into one sequence [voice: group 1 advance 100 meters; gesture: advance; action: advance], and this sequence is used as the input of the pre-trained language model.
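The composition of the per-channel texts into a single model input, and the structured target it maps to, can be sketched as follows (the helper function and the dictionary-style target representation are illustrative assumptions, not the patented implementation):

```python
def compose_channel_sequence(channels):
    """Join per-channel texts into one model input sequence,
    e.g. {"语音": "...", "手势": "...", "动作": "..."} ->
    "语音:...;手势:...;动作:..." """
    return ";".join(f"{name}:{text}" for name, text in channels.items())

seq = compose_channel_sequence(
    {"语音": "1组前进100米", "手势": "前进", "动作": "前进"}
)

# A structured attribute/object/action target for the instruction
# "迅速追击正北方向之敌" might then look like:
target = {"属性": "迅速", "对象": "正北方向之敌", "动作": "追击"}
```

The sub-classifiers mentioned above would each predict one of these fields from the encoder representation of `seq`.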
It should be understood that, although the steps in the flowchart of FIG. 1 are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and the steps may be performed in other orders. Moreover, at least some of the steps in FIG. 1 may comprise multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times; nor must their execution proceed sequentially, as they may be performed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 2, a fusion device for multi-channel semantic understanding is provided, including a semantic-understanding fusion-model building module 202, a sample-dataset acquisition module 204, a sequence-semantic-text tagging module 206, and a semantic-understanding fusion module 208, wherein:

The semantic-understanding fusion-model building module 202 is used to build the semantic-understanding fusion model.

The sample-dataset acquisition module 204 is used to acquire multi-channel semantic texts as a training sample dataset, the dataset including speech information, visual information, and text information.

The sequence-semantic-text tagging module 206 is used for the fusion model to perform NSP (next-sentence-prediction) task training on the training sample dataset to obtain semantic sequences to be tagged, and to perform channel tagging on those sequences based on the Transformer architecture to obtain tagged-sequence semantic texts.

The semantic-understanding fusion module 208 is used to generate standard instructions from the tagged-sequence semantic texts with a seq2seq model at the decoding layer; the standard instructions structure the tagged-sequence semantic texts into standard semantic texts.
For the specific limitations of the fusion device for multi-channel semantic understanding, refer to the limitations of the fusion method for multi-channel semantic understanding above; they are not repeated here. Each module of the fusion device may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in hardware form in, or independent of, a processor of a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in FIG. 3. The computer device includes a processor, a memory, a network interface, a display screen, and an input apparatus connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running that operating system and computer program. The network interface of the computer device communicates with external terminals over a network connection. When executed by the processor, the computer program implements a fusion method for multi-channel semantic understanding. The display screen may be a liquid-crystal display or an electronic-ink display; the input apparatus may be a touch layer covering the display screen, a key, trackball, or touchpad on the housing of the computer device, or an external keyboard, touchpad, mouse, or the like.

Those skilled in the art will understand that the structures shown in FIGS. 2-3 are merely block diagrams of the partial structures relevant to the scheme of the present application and do not limit the computer device to which the scheme is applied; a specific computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, implements the following steps:

Build a semantic-understanding fusion model.

Acquire multi-channel semantic texts as a training sample dataset, the dataset including speech information, visual information, and text information.

The semantic-understanding fusion model performs NSP task training on the training sample dataset to obtain semantic sequences to be tagged, and performs channel tagging on those sequences based on the Transformer architecture to obtain tagged-sequence semantic texts.

At the decoding layer, a seq2seq model generates standard instructions from the tagged-sequence semantic texts; the standard instructions structure the tagged-sequence semantic texts into standard semantic texts.
In one embodiment, when executing the computer program, the processor further implements the following steps: the semantic-understanding fusion model performs NSP task training between every pair of consecutive sentences in the training sample dataset to obtain multi-channel semantic sequences to be tagged; a semantic cross-fusion model is constructed from those sequences and, based on the Transformer architecture, marks each channel's semantic sequence with a segment_id, yielding the tagged-sequence semantic texts.
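A minimal sketch of the segment_id channel marking is shown below, assuming one integer id per channel and whitespace tokenization (both are assumptions for illustration). In a real implementation these ids would be passed to a Transformer's segment (token-type) embedding so the model can tell the channels apart.

```python
# Illustrative segment_id assignment: each channel's sentence gets its
# own integer id, producing parallel token and segment_id lists.

def tag_segments(channel_sentences):
    """Return (tokens, segment_ids), one segment_id per channel."""
    tokens, segment_ids = [], []
    for seg_id, sentence in enumerate(channel_sentences):
        for tok in sentence.split():
            tokens.append(tok)
            segment_ids.append(seg_id)
    return tokens, segment_ids
```

The parallel lists mirror how BERT-style models pair input ids with token-type ids before the embedding layer.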
In one embodiment, when executing the computer program, the processor further implements the following steps: at the decoding layer, the seq2seq model parses the original instruction from the tagged-sequence semantic text to obtain the original instruction task; a standard instruction is generated from that task, and the standard instruction uses sequence labeling to structure the tagged-sequence semantic text into the standard semantic text.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps:

Build a semantic-understanding fusion model.

Acquire multi-channel semantic texts as a training sample dataset, the dataset including speech information, visual information, and text information.

The semantic-understanding fusion model performs NSP task training on the training sample dataset to obtain semantic sequences to be tagged, and performs channel tagging on those sequences based on the Transformer architecture to obtain tagged-sequence semantic texts.

At the decoding layer, a seq2seq model generates standard instructions from the tagged-sequence semantic texts; the standard instructions structure the tagged-sequence semantic texts into standard semantic texts.
In one embodiment, when executed by the processor, the computer program further implements the following steps: the semantic-understanding fusion model performs NSP task training between every pair of consecutive sentences in the training sample dataset to obtain multi-channel semantic sequences to be tagged; a semantic cross-fusion model is constructed from those sequences and, based on the Transformer architecture, marks each channel's semantic sequence with a segment_id, yielding the tagged-sequence semantic texts.

In one embodiment, when executed by the processor, the computer program further implements the following steps: at the decoding layer, the seq2seq model parses the original instruction from the tagged-sequence semantic text to obtain the original instruction task; a standard instruction is generated from that task, and the standard instruction uses sequence labeling to structure the tagged-sequence semantic text into the standard semantic text.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be completed by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random-access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double-data-rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, as long as a combination involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410559450.8A CN118133845B (en) | 2024-05-08 | 2024-05-08 | Fusion method, device, equipment and storage medium for multi-channel semantic understanding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118133845A true CN118133845A (en) | 2024-06-04 |
CN118133845B CN118133845B (en) | 2024-07-16 |
Family
ID=91230580
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410559450.8A Active CN118133845B (en) | 2024-05-08 | 2024-05-08 | Fusion method, device, equipment and storage medium for multi-channel semantic understanding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118133845B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210192142A1 (en) * | 2020-01-15 | 2021-06-24 | Beijing Baidu Netcom Science Technology Co., Ltd. | Multimodal content processing method, apparatus, device and storage medium |
CN113536812A (en) * | 2021-06-30 | 2021-10-22 | 平安科技(深圳)有限公司 | Speech translation method, device, device and storage medium |
US20220327809A1 (en) * | 2021-07-12 | 2022-10-13 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method, device and storage medium for training model based on multi-modal data joint learning |
CN116069931A (en) * | 2023-02-14 | 2023-05-05 | 成都数之联科技股份有限公司 | Hierarchical label text classification method, system, equipment and storage medium |
CN116661603A (en) * | 2023-06-02 | 2023-08-29 | 南京信息工程大学 | User Intent Recognition Method Based on Multimodal Fusion in Complex Human-Computer Interaction Scenarios |
CN117893948A (en) * | 2024-01-30 | 2024-04-16 | 桂林电子科技大学 | Multimodal sentiment analysis method based on multi-granularity feature comparison and fusion framework |
CN117994622A (en) * | 2024-03-05 | 2024-05-07 | 江苏云幕智造科技有限公司 | Multi-mode perception fusion emotion recognition method and robot emotion interaction method |
Non-Patent Citations (1)

Title |
---|
SHENG Xuechen, "Research on a Text Classification Model Based on Distributed Machine Learning", China Master's Theses Full-text Database, 15 May 2024 (2024-05-15) * |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |