
CN118350416A - Multimodal semantic communication method, system, device and medium based on large model - Google Patents


Info

Publication number: CN118350416A
Authority: CN (China)
Prior art keywords: semantic, model, trained, codec, multimodal
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202410773829.9A
Other languages: Chinese (zh)
Other versions: CN118350416B
Inventors: 秦志金, 倪万里, 陶晓明
Current Assignee: Tsinghua University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Tsinghua University
Application filed by Tsinghua University; priority to CN202410773829.9A
Published as CN118350416A; application granted and published as CN118350416B
Current legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a multimodal semantic communication method, system, device and medium based on a large model, relating to the technical field of semantic communication. The method acquires a pre-trained large language model codec and multimodal data; constructs a transceiver-side semantic large model to be trained from the modality-related codecs and projectors to be trained together with the pre-trained large language model codec; keeps the parameters of the pre-trained large language model codec fixed while cyclically and alternately training the modality-related codecs, the projectors and the channel codec to obtain the transceiver-side semantic large model; and uses that model to transmit and reconstruct the multimodal data. This achieves unified semantic extraction and fused representation of multimodal data and ensures accurate, stable semantic mapping, thereby eliminating information redundancy between modalities, enabling efficient semantic alignment and fusion of multimodal data, and improving the generality and scalability of the communication system.

Description

Multimodal Semantic Communication Method, System, Device and Medium Based on a Large Model

Technical Field

The present application relates to the field of semantic communication technology, and in particular to a multimodal semantic communication method, system, device and medium based on a large model.

Background Art

Existing semantic communication schemes are mainly based on deep learning: they use neural networks to accurately extract meaning or features from source data and transmit the required semantic information in a highly compact form. In practical applications, multimodal data often provides more comprehensive and richer information, helping to understand a user's intentions and needs more accurately.

However, most existing semantic communication schemes consider only single-modality scenarios, performing semantic extraction and representation with models designed for specific tasks such as text processing, image recognition or speech recognition. Although this approach can achieve good results on a specific task, it suffers from insufficient semantic extraction and limited generality and scalability when processing multimodal data.

At present, most multimodal semantic communication schemes rely on multimodal fusion: data from different modalities are fused, and deep learning techniques are then applied for semantic extraction and representation. These schemes still face a key problem: because data from different modalities differ and complement one another, effectively fusing this information and extracting accurate semantic information is difficult. Existing multimodal fusion methods often fail to fully exploit the correlation and complementarity between modalities, resulting in redundant or inaccurate extracted semantic information.

Therefore, in view of these shortcomings of existing semantic communication schemes in multimodal scenarios, new schemes are needed that achieve unified semantic information extraction and representation for multimodal data and improve the accuracy and efficiency of semantic communication.

Summary of the Invention

The present application provides a multimodal semantic communication method, system, device and medium based on a large model, to solve the problem of semantic alignment and fused transmission among multimodal data in semantic communication.

A first aspect of the embodiments of the present application proposes a multimodal semantic communication method based on a large model, the method comprising:

obtaining a pre-trained large language model codec and multimodal data including text, images, audio and video, the pre-trained large language model codec having been pre-trained on a corpus;

constructing a large-model-based semantic codec to be trained from the modality-related codecs and modality-related projectors to be trained together with the pre-trained large language model codec, and constructing a transceiver-side semantic large model to be trained from the channel codec to be trained together with the large-model-based semantic codec to be trained;

while keeping the parameters of the pre-trained large language model codec fixed, cyclically and alternately training the modality-related codecs and projectors to be trained and the channel codec to be trained, to obtain the transceiver-side semantic large model;

transmitting and reconstructing the multimodal data using the transceiver-side semantic large model.

The method specifically includes:

the transmitter jointly encodes the multimodal data from the source through the large-model-based semantic encoder in the transceiver-side semantic large model, capturing and fusing the semantic associations between modalities to generate a high-dimensional semantic space representation shared across modalities;

the transmitter converts this cross-modal shared high-dimensional representation into semantic symbols through the adaptive channel encoder in the model, selecting a coding rate matched to the channel state for encoding and a transmission strategy matched to the channel state for transmission;

the receiver restores the received semantic symbols into the cross-modal shared high-dimensional semantic space representation through the adaptive channel decoder in the model, selecting a decoding rate matched to the channel state for decoding and an error-correction strategy matched to the channel state for error correction;

the receiver reconstructs the restored representation through the large-model-based semantic decoder in the model to obtain the reconstructed multimodal data.
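As a deliberately simplified illustration of this transmit/receive chain, the sketch below stands in for each stage with small linear maps. The dimensions, the truncation-based rate matching, the AWGN channel and the rate table keyed on channel state are all assumptions for illustration, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

D_SEM = 16                        # assumed width of the shared semantic space
RATES = {"good": 12, "poor": 8}   # assumed coding rates per channel state

W_enc = rng.normal(size=(8, D_SEM)) / np.sqrt(8)     # toy semantic encoder
W_dec = rng.normal(size=(D_SEM, 8)) / np.sqrt(D_SEM) # toy semantic decoder

def semantic_encode(x):
    # stand-in for the large-model semantic encoder: map source features
    # into the cross-modal shared semantic space
    return np.tanh(x @ W_enc)

def channel_encode(s, state):
    # adaptive channel encoder: pick a rate matched to the channel state
    k = RATES[state]
    return s[:, :k], k            # truncation as a toy rate-matching scheme

def channel(sym, snr_db=20.0):
    # AWGN channel
    noise_std = 10 ** (-snr_db / 20)
    return sym + noise_std * rng.normal(size=sym.shape)

def channel_decode(sym, k):
    # adaptive channel decoder: restore the semantic-space representation
    s = np.zeros((sym.shape[0], D_SEM))
    s[:, :k] = sym
    return s

def semantic_decode(s):
    # stand-in for the large-model semantic decoder
    return s @ W_dec

x = rng.normal(size=(4, 8))       # toy multimodal source features
sym, k = channel_encode(semantic_encode(x), "good")
x_hat = semantic_decode(channel_decode(channel(sym), k))
print(x_hat.shape)                # (4, 8): same shape as the source features
```

A real system would replace the linear stand-ins with the trained semantic and channel codecs; only the staging (semantic encode, rate-matched channel encode, channel, channel decode, semantic decode) mirrors the description above.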

The training process of the transceiver-side semantic large model is as follows:

First-stage training: freeze the parameters of the pre-trained large language model codec and train the channel codec alone, so that it adapts to the channel state.

Second-stage training: freeze the parameters of the partially trained channel codec and train the parameters of the modality-related codecs and modality-related projectors.

The first-stage and second-stage training alternate cyclically, optimizing the channel codec, the modality-related codecs and the modality-related projectors, until the loss function of the transceiver-side semantic large model converges or a predetermined performance target is met.
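The alternating schedule can be sketched as follows; the toy loss, learning rate and parameter blocks are assumptions, but the freeze/train pattern mirrors the two stages above, with the pre-trained large language model codec never updated:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy parameter blocks; the LLM codec block stays frozen throughout
params = {
    "llm_codec": rng.normal(size=4),       # frozen (pre-trained)
    "modality_codec": rng.normal(size=4),  # trained in stage 2
    "projector": rng.normal(size=4),       # trained in stage 2
    "channel_codec": rng.normal(size=4),   # trained in stage 1
}

def loss(p):
    # toy quadratic loss over the trainable blocks only
    return sum(float(v @ v) for name, v in p.items() if name != "llm_codec")

def sgd_step(p, names, lr=0.2):
    # update only the named blocks (gradient of v @ v is 2 v)
    for n in names:
        p[n] = p[n] - lr * 2 * p[n]

frozen = params["llm_codec"].copy()
history = []
for epoch in range(20):
    # stage 1: freeze everything but the channel codec
    sgd_step(params, ["channel_codec"])
    # stage 2: freeze the channel codec, train modality codec + projector
    sgd_step(params, ["modality_codec", "projector"])
    history.append(loss(params))

print(history[0] > history[-1])  # True: loss fell while the LLM stayed frozen
```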

The training process of the transceiver-side semantic large model further includes:

during training, computing the difference between the reconstructed communication intent of the reconstructed multimodal data and the actual communication intent of the multimodal data from the source;

evaluating, according to this difference, whether the transceiver-side semantic large model preserves semantic consistency;

adjusting the parameters and training strategy of the model based on the evaluation result, so as to reduce the difference.
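One way to realize such a difference measure is sketched below; the patent does not fix a particular metric, so the cosine distance, the intent vectors and the consistency threshold are illustrative assumptions:

```python
import numpy as np

TAU = 0.05  # assumed consistency threshold (not specified in the patent)

def intent_gap(intent_src, intent_rec):
    # one possible difference measure: 1 - cosine similarity between the
    # intent vectors of the source data and the reconstructed data
    a = intent_src / np.linalg.norm(intent_src)
    b = intent_rec / np.linalg.norm(intent_rec)
    return 1.0 - float(a @ b)

def is_consistent(intent_src, intent_rec):
    # semantic consistency holds when the intent gap stays below TAU
    return intent_gap(intent_src, intent_rec) < TAU

src_intent = np.array([0.9, 0.1, 0.4])   # actual communication intent
good_rec = src_intent + 0.01             # faithful reconstruction
bad_rec = np.array([0.1, 0.9, -0.4])     # reconstruction whose semantics drifted

print(is_consistent(src_intent, good_rec), is_consistent(src_intent, bad_rec))  # True False
```

When the gap exceeds the threshold, the training loop would adjust parameters or the training strategy until the gap shrinks, as described above.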

Before jointly encoding the multimodal data from the source, the method includes:

preprocessing the multimodal data so that the preprocessed data conforms to a preset data format.

The joint encoding of the multimodal data from the source is performed by the large-model-based semantic encoder, which comprises a modality encoder, an input projector and a large language model encoder:

the modality encoder extracts features from the preprocessed multimodal data;

the input projector aligns the extracted features so that data from different modalities can be effectively integrated in the same feature space;

the large language model encoder maps data from different modalities into a unified semantic space, fusing the semantic information of the multimodal data.
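A minimal sketch of this encoder side, assuming illustrative per-modality feature widths and linear projectors: each modality's features are projected to one shared width so they can be handed to the large language model encoder as a single sequence:

```python
import numpy as np

rng = np.random.default_rng(2)
D_SHARED = 32  # assumed width of the unified feature space

# assumed per-modality feature widths after the modality encoders
FEAT_DIM = {"text": 16, "image": 64, "audio": 24}

# one input projector per modality aligns features to the shared width
projectors = {m: rng.normal(size=(d, D_SHARED)) / np.sqrt(d)
              for m, d in FEAT_DIM.items()}

# toy modality-encoder outputs: 5 feature vectors per modality
features = {m: rng.normal(size=(5, d)) for m, d in FEAT_DIM.items()}

def align_and_fuse(features):
    # project each modality into the same space, then stack the results
    # into one sequence for the large language model encoder to fuse
    tokens = [features[m] @ projectors[m] for m in FEAT_DIM]
    return np.concatenate(tokens, axis=0)

fused_input = align_and_fuse(features)
print(fused_input.shape)  # (15, 32): 5 aligned vectors per modality
```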

The large-model-based semantic decoder comprises a modality decoder, an output projector and a large language model decoder. Reconstructing the restored cross-modal shared high-dimensional semantic space representation into the reconstructed multimodal data includes:

the large language model decoder restores the high-dimensional semantic space representation from the unified semantic space into first features of the different modalities;

the output projector decodes the first features into second features of the different modalities;

the modality decoder reconstructs the second features of the different modalities into the original multimodal data.
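Mirroring the encoder, this three-stage decoder chain can be sketched with per-modality linear stand-ins; the intermediate and output widths are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
D_SEM, D_FIRST, D_SECOND = 32, 24, 24   # assumed intermediate widths
OUT_DIM = {"text": 16, "image": 64}     # assumed reconstructed data widths

# toy linear stand-ins for the three decoder stages, per modality
W_llm = {m: rng.normal(size=(D_SEM, D_FIRST)) for m in OUT_DIM}    # LLM decoder
W_proj = {m: rng.normal(size=(D_FIRST, D_SECOND)) for m in OUT_DIM}  # output projector
W_mod = {m: rng.normal(size=(D_SECOND, d)) for m, d in OUT_DIM.items()}  # modality decoder

def reconstruct(sem_repr, modality):
    f1 = sem_repr @ W_llm[modality]    # unified space -> first features
    f2 = f1 @ W_proj[modality]         # first features -> second features
    return f2 @ W_mod[modality]        # second features -> reconstructed data

sem = rng.normal(size=(5, D_SEM))      # restored shared representation
text_hat = reconstruct(sem, "text")
image_hat = reconstruct(sem, "image")
print(text_hat.shape, image_hat.shape)  # (5, 16) (5, 64)
```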

After transmitting and reconstructing the multimodal data with the transceiver-side semantic large model, the method includes:

executing the corresponding downstream tasks on the reconstructed multimodal data with a downstream-task execution component, obtaining output results;

optimizing and adjusting the parameters of the downstream-task execution component based on the output results, obtaining an adjusted downstream-task execution component;

executing the corresponding task through the adjusted downstream-task execution component, based on the user's interactive instructions and the reconstructed multimodal data.
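As a toy stand-in for such a component, the sketch below treats the downstream task as a linear classifier over reconstructed features and adjusts its parameters from its own outputs with a perceptron-style correction; the data, dimensions and update rule are all illustrative assumptions rather than the patent's design:

```python
import numpy as np

rng = np.random.default_rng(4)

# toy "reconstructed multimodal data" features with task labels
X = rng.normal(size=(40, 6))
true_w = rng.normal(size=6)
y = (X @ true_w > 0).astype(float)

w = np.zeros(6)  # parameters of the downstream-task execution component

def run_task(w, X):
    # execute the downstream task (here: a linear threshold decision)
    return (X @ w > 0).astype(float)

def adjust(w, X, y, lr=0.1, epochs=50):
    # optimize the component's parameters based on its own output results
    for _ in range(epochs):
        out = run_task(w, X)
        w = w + lr * X.T @ (y - out)   # perceptron-style correction
    return w

acc_before = float((run_task(w, X) == y).mean())
w = adjust(w, X, y)
acc_after = float((run_task(w, X) == y).mean())
print(acc_before, "->", acc_after)
```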

A second aspect of the embodiments of the present application proposes a multimodal semantic communication system based on a large model, the system comprising:

an acquisition module for obtaining a pre-trained large language model codec and multimodal data including text, images, audio and video, the pre-trained large language model codec having been pre-trained on a corpus;

a construction module for constructing a large-model-based semantic codec to be trained from the modality-related codecs and projectors to be trained together with the pre-trained large language model codec, and constructing a transceiver-side semantic large model to be trained from the channel codec to be trained together with the large-model-based semantic codec to be trained;

a training module for cyclically and alternately training the modality-related codecs and projectors to be trained and the channel codec to be trained, while keeping the parameters of the pre-trained large language model codec fixed, to obtain the transceiver-side semantic large model;

a transmission and reconstruction module for transmitting and reconstructing the multimodal data using the transceiver-side semantic large model.

A third aspect of the embodiments of the present application proposes an electronic device comprising a memory, a processor and a computer program stored in the memory, the processor executing the computer program to implement the large-model-based multimodal semantic communication method of any one of the first aspect.

A fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program/instructions which, when executed by a processor, implements the large-model-based multimodal semantic communication method of any one of the first aspect.

The present application offers the following advantages. A multimodal semantic communication method, system, device and medium based on a large model are provided. A pre-trained large language model codec and multimodal data including text, images, audio and video are obtained, the codec having been pre-trained on a corpus. A large-model-based semantic codec to be trained is constructed from the modality-related codecs and projectors to be trained together with the pre-trained large language model codec, and a transceiver-side semantic large model to be trained is constructed from the channel codec to be trained together with that semantic codec. While keeping the parameters of the pre-trained large language model codec fixed, the modality-related codecs and projectors and the channel codec are trained cyclically and alternately to obtain the transceiver-side semantic large model, which is then used to transmit and reconstruct the multimodal data. This realizes unified semantic extraction and fused representation of multimodal data, ensures accurate and stable semantic mapping, eliminates information redundancy between modalities, enables efficient semantic alignment and fusion of multimodal data, and improves the generality and scalability of the communication system.

Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of the steps of a large-model-based multimodal semantic communication method provided in an embodiment of the present application;

FIG. 2 is a schematic flow diagram of a large-model-based multimodal semantic communication method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of the training process of a transceiver-side semantic large model provided in an embodiment of the present application;

FIG. 4 is an architecture diagram of a large-model-based multimodal semantic communication system provided in an embodiment of the present application;

FIG. 5 is a schematic diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of this application.

Existing semantic communication schemes are mainly based on deep learning: neural networks accurately extract meaning or features from source data, and the required semantic information is transmitted in a highly compact form. The core of this technology is to distill key semantic information from massive data and transmit it in a highly optimized way, greatly saving bandwidth resources and improving network capacity and communication efficiency. Here, multimodal data refers to data sets composed of different modalities (such as text, images, audio and video). In practical applications, multimodal data often provides more comprehensive and richer information, helping to understand a user's intentions and needs more accurately.

However, most existing semantic communication schemes consider only single-modality scenarios, performing semantic extraction and representation with models designed for specific tasks such as text processing, image recognition or speech recognition. While effective on a specific task, this approach suffers from insufficient semantic extraction and limited generality and scalability when processing multimodal data. Specifically, because data of different modalities express information differently, single-modality semantic communication may fail to fully capture and convey all the semantic information of the original data. Although single-modality semantic communication is relatively simple in model design and processing flow, complex pre- and post-processing may still be required in certain tasks or scenarios. When information from multiple modalities must be processed, the model may need to be redesigned or adjusted, which increases development and maintenance effort.

At present, most multimodal semantic communication schemes rely on multimodal fusion: data from different modalities are fused, and deep learning techniques are then applied for semantic extraction and representation. These schemes still face a key problem: because data from different modalities differ and complement one another, effectively fusing this information and extracting accurate semantics is difficult. Existing fusion methods often fail to fully exploit the correlation and complementarity between modalities, resulting in redundant or inaccurate semantic information. Different modalities are also represented differently; for example, text is usually represented by word vectors, while images are represented by pixel values or feature maps. This makes the fusion and semantic extraction of multimodal data difficult and hinders a unified representation.

The causes of these problems are: (1) the complexity of multimodal data, which comes from different sources and representations; this complexity and diversity make unified semantic extraction and representation difficult; and (2) the limitations of current technology: although deep learning has made significant progress on single-modality data, it still faces many challenges on multimodal data and cannot yet fully exploit the correlation and complementarity between modalities to achieve accurate semantic extraction and representation.

Accordingly, the present application proposes a large-model-based multimodal semantic communication method that addresses these shortcomings of existing schemes in multimodal scenarios, achieving unified semantic information extraction and representation for multimodal data and improving the accuracy and efficiency of semantic communication.

A first aspect of the embodiments of the present application provides a multimodal semantic communication method based on a large model. Referring to FIG. 1, a flowchart of the steps of the method provided in an embodiment of the present application, the method comprises the following steps:

Step 101: obtain a pre-trained large language model codec and multimodal data including text, images, audio and video, the pre-trained large language model codec having been pre-trained on a corpus;

Step 102: construct a large-model-based semantic codec to be trained from the modality-related codecs and projectors to be trained together with the pre-trained large language model codec, and construct a transceiver-side semantic large model to be trained from the channel codec to be trained together with the large-model-based semantic codec to be trained;

Step 103: while keeping the parameters of the pre-trained large language model codec fixed, cyclically and alternately train the modality-related codecs and projectors to be trained and the channel codec to be trained, to obtain the transceiver-side semantic large model;

Step 104: transmit and reconstruct the multimodal data using the transceiver-side semantic large model.

In the embodiments of the present application, a large model is a machine learning model with large-scale parameters and a complex computational structure. Such models are usually built from deep neural networks and contain billions or even hundreds of billions of parameters. Large models aim to improve expressiveness and predictive performance so as to handle more complex tasks and data; by training on massive data they learn complex patterns and features and generalize more strongly. A large language model (LLM) is a large model trained on massive text data. It can not only generate natural language text but also deeply understand text and handle various natural language tasks such as summarization, question answering and translation. Using multi-layer neural networks to model the statistical regularities and latent semantics of language, a large language model learns from and abstracts over large amounts of text to generate logical, coherent language output.

Regarding multimodal data: any distinct form of existence or source of information can be called a modality, so data composed of two or more modalities is multimodal data, which may include text, images, audio, video and mixed data.

Semantic communication is a communication paradigm whose core idea is to extract, at the source, the semantic information of the message to be sent, rather than transmitting the message's bit stream. Semantic communication is thus concerned not only with the literal content of a message but also with the meaning or features behind it. It can understand the service requirements and environment in advance, understand, extract and transmit the semantic features of the source, and ensure that the destination understands the received semantic features so that the semantic-level source information can be successfully recovered. Compared with traditional syntactic communication, semantic communication does not require the decoded sequence to match the encoded sequence exactly; it only requires that the semantic information recovered at the receiver match the transmitted semantic information. This helps reduce the transmission bandwidth requirements of high-bandwidth services, improves communication efficiency and improves the user experience.

为了清楚的说明本申请实施例提供的一种基于大模型的多模态语义通信方法,接下来结合图2进行详细说明,图2是本申请实施例提供的一种基于大模型的多模态语义通信方法的流程示意图。In order to clearly illustrate a multimodal semantic communication method based on a large model provided in an embodiment of the present application, a detailed description is given below in conjunction with Figure 2, which is a flow chart of a multimodal semantic communication method based on a large model provided in an embodiment of the present application.

具体实施步骤101时,本申请提出的基于大模型的多模态语义通信方法的核心在于预训练的大语言模型编解码器,通过语料库对大语言模型编解码器进行预训练,使预训练的大语言模型编解码器具有良好的语言理解和生成能力。同时,获取包括文本、图像、音频和视频的多模态数据用于进行语义通信。When implementing step 101, the core of the multimodal semantic communication method based on a large model proposed in this application is the pre-trained large language model codec, which is pre-trained through a corpus so that the pre-trained large language model codec has good language understanding and generation capabilities. At the same time, multimodal data including text, images, audio and video are obtained for semantic communication.

具体实施步骤102时，通过待训练的与模态相关的编解码器和与模态相关的投影器，以及预训练的大语言模型编解码器，构建得到待训练的基于大模型的语义编解码器，通过待训练的信道编解码器和构建得到的待训练的基于大模型的语义编解码器，构建得到待训练的收发端语义大模型。其中，所述与模态相关的编解码器包括：文本编解码器、图像编解码器、音频编解码器和视频编解码器，所述与模态相关的投影器包括：图像输入/输出投影器、音频输入/输出投影器和视频输入/输出投影器。When step 102 is specifically implemented, a semantic codec based on a large model to be trained is constructed by using the modality-related codec and modality-related projector to be trained together with the pre-trained large language model codec, and a semantic large model of the transceiver to be trained is constructed by using the channel codec to be trained and the constructed semantic codec based on the large model to be trained. The modality-related codecs include text codecs, image codecs, audio codecs, and video codecs, and the modality-related projectors include image input/output projectors, audio input/output projectors, and video input/output projectors.

具体实施步骤103时,在保持上述预训练的大语言模型编解码器的参数不变的前提下,循环交替训练待训练的信道编解码器、与模态相关的编解码器和与模态相关的投影器,直至训练结束,得到收发端语义大模型。When step 103 is specifically implemented, under the premise of keeping the parameters of the pre-trained large language model codec unchanged, the channel codec to be trained, the codec related to the modality and the projector related to the modality are trained alternately in a loop until the training is completed to obtain a large semantic model of the transceiver.

具体实施步骤104时,在收发端语义大模型训练好之后,将其用于理解和表示不同模态数据的语义信息,并实现多模态数据的融合、转换、传输和重建。When step 104 is specifically implemented, after the semantic big model at the transmitting and receiving end is trained, it is used to understand and represent the semantic information of data of different modalities, and realize the fusion, conversion, transmission and reconstruction of multimodal data.

在本申请可选的一实施例中,通过发射端将多模态数据进行传输至接收端,其中,在发射端,所述收发端语义大模型至少包含语义-信道联合编码模块,语义-信道联合编码模块包括基于大模型的语义编码器和自适应信道编码器,在接收端,所述收发端语义大模型至少包含语义-信道联合解码模块,语义-信道联合解码模块包括自适应信道解码器和基于大模型的语义解码器。所述方法具体包括:在发射端,通过收发端语义大模型中的基于大模型的语义编码器,对来自于信源的多模态数据进行联合编码,捕获并融合不同模态之间的语义关联,生成跨模态共享的高维语义空间表示,然后通过收发端语义大模型中的自适应信道编码器,将跨模态共享的高维语义空间表示转化为语义符号,选择与信道状态相匹配的编码速率进行编码,并选择与信道状态相匹配的传输策略进行语义符号的传输。在接收端,通过收发端语义大模型中的自适应信道解码器将收到的语义符号恢复成跨模态共享的高维语义空间表示,选择与信道状态相匹配的解码速率进行解码,并选择与信道状态相匹配的纠错策略进行纠错,确保解码后多模态数据的准确性和完整性;最后通过收发端语义大模型中基于大模型的语义解码器将恢复出的跨模态共享的高维语义空间表示进行重建,得到重建的多模态数据。In an optional embodiment of the present application, multimodal data is transmitted to a receiving end through a transmitting end, wherein, at the transmitting end, the semantic large model of the transceiver includes at least a semantic-channel joint encoding module, the semantic-channel joint encoding module includes a semantic encoder based on a large model and an adaptive channel encoder, and at the receiving end, the semantic large model of the transceiver includes at least a semantic-channel joint decoding module, the semantic-channel joint decoding module includes an adaptive channel decoder and a semantic decoder based on a large model. The method specifically includes: at the transmitting end, the multimodal data from the source is jointly encoded by a semantic encoder based on a large model in the semantic large model of the transceiver, the semantic associations between different modalities are captured and integrated, and a high-dimensional semantic space representation shared across modalities is generated, and then the high-dimensional semantic space representation shared across modalities is converted into a semantic symbol through an adaptive channel encoder in the semantic large model of the transceiver, a coding rate matching the channel state is selected for encoding, and a transmission strategy matching the channel state is selected for transmission of semantic symbols. 
At the receiving end, the received semantic symbols are restored into a high-dimensional semantic space representation shared across modalities through the adaptive channel decoder in the semantic big model of the transceiver. The decoding rate matching the channel state is selected for decoding, and the error correction strategy matching the channel state is selected for error correction to ensure the accuracy and integrity of the decoded multimodal data. Finally, the restored high-dimensional semantic space representation shared across modalities is reconstructed through the big model-based semantic decoder in the semantic big model of the transceiver to obtain the reconstructed multimodal data.
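The adaptive rate and strategy selection described above can be sketched as follows. This is a minimal illustrative sketch only: the SNR thresholds, the code-rate table, and the strategy names are assumptions introduced for the example, not values specified in this application.

```python
# Hypothetical sketch of channel-state-matched rate/strategy selection.
# Thresholds and rates below are illustrative assumptions.

def select_channel_config(snr_db: float) -> dict:
    """Pick a coding rate and transmission strategy matching the channel state."""
    if snr_db >= 20.0:   # clean channel: light protection, high rate
        return {"code_rate": 5 / 6, "strategy": "high_order_modulation"}
    if snr_db >= 10.0:   # moderate channel: balanced protection
        return {"code_rate": 1 / 2, "strategy": "standard"}
    # poor channel: heavy redundancy plus retransmission
    return {"code_rate": 1 / 3, "strategy": "repetition_with_retransmit"}

def transmit(semantic_symbols: list, snr_db: float) -> dict:
    """Encode semantic symbols under the selected configuration."""
    cfg = select_channel_config(snr_db)
    # Coded length grows as the code rate drops (more redundancy added).
    coded_len = round(len(semantic_symbols) / cfg["code_rate"])
    return {"config": cfg, "coded_length": coded_len}
```

The receiving end would mirror this table to pick the matching decoding rate and error-correction strategy for the same channel state.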

在本申请可选的一实施例中,在对来自于信源的多模态数据进行联合编码之前,对获取的多模态数据进行预处理,以使预处理后的多模态数据具有预设的数据格式,其中,多模态数据可以为文本、图像、音频、视频等等。具体地,对获取的多模态数据进行统一预设格式的转换、清洗、标准化处理以及特征提取,去除噪声和无关信息。通过深度学习方法,将原始非结构化的多模态数据转化为机器可理解和处理的形式,比如对文本数据进行分词和词性标注,对图像数据进行缩放、裁剪、归一化并转换为视觉特征向量,对音频数据进行采样、量化、特征提取并转换为声学特征序列等。In an optional embodiment of the present application, before the multimodal data from the information source is jointly encoded, the acquired multimodal data is preprocessed so that the preprocessed multimodal data has a preset data format, wherein the multimodal data can be text, image, audio, video, etc. Specifically, the acquired multimodal data is converted, cleaned, standardized, and feature extracted in a unified preset format to remove noise and irrelevant information. Through deep learning methods, the original unstructured multimodal data is converted into a form that can be understood and processed by a machine, such as word segmentation and part-of-speech tagging of text data, scaling, cropping, normalization, and conversion of image data into visual feature vectors, sampling, quantization, feature extraction, and conversion of audio data into acoustic feature sequences, etc.
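The per-modality preprocessing steps above (word segmentation for text, crop/normalize for images, downsampling and quantization for audio) can be sketched as below. All helper functions are simplified stand-ins invented for illustration; a real system would use trained tokenizers and vision/audio front ends.

```python
# Illustrative preprocessing stand-ins; function names and parameters
# are assumptions, not components defined by this application.

def preprocess_text(text: str) -> list[str]:
    """Whitespace word segmentation as a stand-in for a trained tokenizer."""
    return text.lower().split()

def preprocess_image(pixels: list[list[int]], size: int = 2) -> list[float]:
    """Center-crop to size x size, then normalize 0-255 values into [0, 1]."""
    top = (len(pixels) - size) // 2
    left = (len(pixels[0]) - size) // 2
    crop = [row[left:left + size] for row in pixels[top:top + size]]
    return [v / 255.0 for row in crop for v in row]

def preprocess_audio(samples: list[float], hop: int = 2) -> list[float]:
    """Downsample and coarsely quantize into an acoustic feature sequence."""
    return [round(samples[i], 2) for i in range(0, len(samples), hop)]
```

Each helper emits a numeric (or token) sequence in a fixed format, which is the "preset data format" consumed by the joint encoding step that follows.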

在本申请可选的一实施例中，在对多模态数据进行预处理之后，采用基于大模型的语义编码器对预处理之后的多模态特征进行联合编码，捕获并融合不同模态之间的内在语义关联，生成跨模态共享的高维语义空间表示。上述基于大模型的语义编码器包括模态编码器、输入投影器和大语言模型编码器；其中，模态编码器对预处理后的多模态数据进行特征提取，例如，模态编码器可以为文本编码器、图像编码器、音频编码器、视频编码器等；输入投影器对提取的特征进行对齐，以使不同模态的数据能够在同一特征空间内被有效整合，例如，输入投影器可以为图像输入投影器、音频输入投影器、视频输入投影器等；大语言模型编码器（对应图2中的LLM编码器）将不同模态的数据映射到统一的语义空间中，实现多模态数据的语义信息融合（参见图2中的语义大模型编码器部分）。In an optional embodiment of the present application, after preprocessing the multimodal data, a semantic encoder based on a large model is used to jointly encode the preprocessed multimodal features, capture and fuse the intrinsic semantic associations between different modalities, and generate a high-dimensional semantic space representation shared across modalities. The above-mentioned semantic encoder based on the large model includes a modality encoder, an input projector, and a large language model encoder; wherein the modality encoder extracts features from the preprocessed multimodal data, for example, the modality encoder can be a text encoder, an image encoder, an audio encoder, a video encoder, etc.; the input projector aligns the extracted features so that data of different modalities can be effectively integrated in the same feature space, for example, the input projector can be an image input projector, an audio input projector, a video input projector, etc.; the large language model encoder (corresponding to the LLM encoder in Figure 2) maps data of different modalities into a unified semantic space to achieve semantic information fusion of multimodal data (see the semantic large model encoder part in Figure 2).
For example, when the data after multimodal data preprocessing is text, image, audio and video, after being encoded by the text encoder, image encoder, audio encoder and video encoder respectively, except for the text features, the remaining features corresponding to the image, audio and video are projected to the same feature space through the image input projector, audio input projector and video input projector respectively, and then mapped to a unified semantic space through the LLM encoder to realize the semantic information fusion of multimodal data.
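The encoder pipeline just described (modality encoder, then input projector for non-text modalities, then LLM encoder fusion) can be sketched as below. The dimensions, the chunk-averaging "feature extraction", and the averaging "fusion" are all toy assumptions standing in for learned networks.

```python
# Minimal sketch of the encoder path; all components here are toy stand-ins
# for the trained modality encoders, input projectors, and LLM encoder.

TEXT_DIM = 4  # assumed width of the shared feature space

def modality_encoder(raw: list[float], out_dim: int) -> list[float]:
    """Toy feature extractor: chunk-average the input down to out_dim values."""
    chunk = max(1, len(raw) // out_dim)
    return [sum(raw[i:i + chunk]) / chunk for i in range(0, chunk * out_dim, chunk)]

def input_projector(feat: list[float]) -> list[float]:
    """Align a non-text feature to the TEXT_DIM-wide space (pad or truncate)."""
    return (feat + [0.0] * TEXT_DIM)[:TEXT_DIM]

def llm_encode(features: dict[str, list[float]]) -> list[float]:
    """Fuse aligned per-modality features into one shared semantic vector."""
    fused = [0.0] * TEXT_DIM
    for feat in features.values():
        for i, v in enumerate(feat):
            fused[i] += v / len(features)
    return fused

# Text features enter the LLM encoder directly; image (and audio/video)
# features are projected into the same feature space first.
text_feat = modality_encoder([0.25, 0.5, 0.75, 1.0], TEXT_DIM)
image_feat = input_projector(modality_encoder([1.0] * 8, 2))
semantic = llm_encode({"text": text_feat, "image": image_feat})
```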

在本申请可选的一实施例中,在接收端的基于大模型的语义解码器包括模态解码器、输出投影器和大语言模型解码器,上述基于大模型的语义解码器将恢复出的跨模态共享的高维语义空间表示进行重建,得到重建的多模态数据,具体包括:通过大语言模型解码器(对应图2中的LLM解码器)将高维语义空间表示从统一的语义空间恢复成不同模态数据的第一特征;输出投影器对不同模态数据的第一特征进行解码,得到不同模态数据的第二特征,其中,输出投影器可以为图像输出投影器、音频输出投影器、视频输出投影器等;模态解码器将不同模态数据的第二特征重建为原始的多模态数据,其中,模态解码器可以为文本解码器、图像解码器、音频解码器、视频解码器等(参见图2中的语义大模型解码器部分)。通过基于大模型的语义解码器对联合语义空间理解和重建,从而使得接收端能够准确理解发送端的通信意图。例如,LLM解码器将包括文本、图像、音频和视频的多模态数据的高维语义空间表示从统一的语义空间恢复成文本特征、图像特征、音频特征和视频特征,然后图像特征、音频特征和视频特征分别经过图像输出投影器、音频输出投影器、视频输出投影器解码为对应模态的特征,再分别经过图像解码器、音频解码器、视频解码器,重建为图像数据、音频数据和视频数据,其中,文本特征直接经过文本解码器重建为文本数据。In an optional embodiment of the present application, the semantic decoder based on the large model at the receiving end includes a modality decoder, an output projector and a large language model decoder. The semantic decoder based on the large model reconstructs the restored high-dimensional semantic space representation shared across modalities to obtain reconstructed multimodal data, specifically including: restoring the high-dimensional semantic space representation from a unified semantic space to the first feature of different modal data through a large language model decoder (corresponding to the LLM decoder in Figure 2); the output projector decodes the first feature of the different modal data to obtain the second feature of the different modal data, wherein the output projector can be an image output projector, an audio output projector, a video output projector, etc.; the modality decoder reconstructs the second feature of the different modal data into the original multimodal data, wherein the modality decoder can be a text decoder, an image decoder, an audio decoder, a video decoder, etc. (see the semantic large model decoder part in Figure 2). The joint semantic space is understood and reconstructed by the semantic decoder based on the large model, so that the receiving end can accurately understand the communication intention of the sending end. 
For example, the LLM decoder recovers the high-dimensional semantic space representation of multimodal data including text, images, audio and video from a unified semantic space into text features, image features, audio features and video features, and then the image features, audio features and video features are decoded into features of the corresponding modalities through the image output projector, audio output projector and video output projector respectively, and then reconstructed into image data, audio data and video data through the image decoder, audio decoder and video decoder respectively, among which the text features are directly reconstructed into text data through the text decoder.
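The decoder path mirrors the encoder: an LLM-decoder stand-in recovers per-modality first features from the shared representation, output projectors map them into each modality's feature space, and modality decoders rebuild the data. In this sketch, broadcasting the shared vector to every modality and the nearest-index text lookup are illustrative assumptions, not the application's actual decoding scheme.

```python
# Toy sketch of the decoder path; the split and lookup rules are assumptions.

def llm_decode(shared: list[float], modalities: list[str]) -> dict[str, list[float]]:
    """Recover one 'first feature' per modality from the shared semantic vector."""
    return {m: list(shared) for m in modalities}

def output_projector(first_feat: list[float], target_dim: int) -> list[float]:
    """Map a first feature into the target modality's feature space (second feature)."""
    return (first_feat + [0.0] * target_dim)[:target_dim]

def text_decoder(feat: list[float], vocab: list[str]) -> list[str]:
    """Rebuild text by nearest-index lookup; text skips the output projector."""
    return [vocab[min(int(v * len(vocab)), len(vocab) - 1)] for v in feat]

shared = [0.1, 0.9]
firsts = llm_decode(shared, ["text", "image"])
image_second = output_projector(firsts["image"], 4)   # second feature
words = text_decoder(firsts["text"], ["low", "mid", "high"])
```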

在本申请可选的一实施例中,在利用上述收发端语义大模型,进行所述多模态数据的传输和重建之后,还可以将重建后的多模态数据用于下游任务执行(参阅图2中下游任务部分),具体为,基于收发端语义大模型重建后的多模态数据,采用下游任务执行功能组件执行相应的下游任务,得到输出结果;基于输出结果对下游任务执行功能组件的参数进行优化和调整,得到调整后的下游任务执行功能组件;通过调整后的下游任务执行功能组件基于用户的交互指令和重建后的多模态数据,执行对应的任务。In an optional embodiment of the present application, after using the above-mentioned semantic big model of the transceiver to transmit and reconstruct the multimodal data, the reconstructed multimodal data can also be used for downstream task execution (see the downstream task part in Figure 2). Specifically, based on the multimodal data reconstructed by the semantic big model of the transceiver, the downstream task execution functional component is used to execute the corresponding downstream task to obtain an output result; based on the output result, the parameters of the downstream task execution functional component are optimized and adjusted to obtain an adjusted downstream task execution functional component; and the adjusted downstream task execution functional component is used to execute the corresponding task based on the user's interactive instructions and the reconstructed multimodal data.

上述用于执行下游任务的功能组件执行的下游任务可以包括各种应用场景,如数据恢复、视频会议、智能家居、虚拟现实等。通过与用户的交互和指令,以及解码后的多模态数据,实现智能问答、智能控制、虚拟交互等功能。还可以根据用户反馈和系统性能进行优化和调整,以提供更加个性化和高效的服务。在下游任务执行阶段,可以根据具体任务需求选择合适的算法和模型。例如,在问答任务中,可以使用基于注意力机制的深度学习模型来捕捉问题和答案之间的关联;在情感分析任务中,可以利用情感词典和机器学习算法来识别文本中的情感倾向。The downstream tasks performed by the functional components for performing downstream tasks described above may include various application scenarios, such as data recovery, video conferencing, smart home, virtual reality, etc. Through interaction and instructions with users, as well as decoded multimodal data, functions such as intelligent question and answer, intelligent control, and virtual interaction are realized. It can also be optimized and adjusted based on user feedback and system performance to provide more personalized and efficient services. In the downstream task execution stage, appropriate algorithms and models can be selected according to specific task requirements. For example, in question and answer tasks, a deep learning model based on an attention mechanism can be used to capture the association between questions and answers; in sentiment analysis tasks, sentiment dictionaries and machine learning algorithms can be used to identify sentiment tendencies in texts.

本申请实施例提出的收发端语义大模型的训练过程是通过多个模块协同完成的,包括多模态数据采集模块、多模态数据预处理模块、收发端语义大模型构建模块、收发端语义大模型训练模块、语义大模型输出模块和语义大模型性能评价模块,参阅图3,图3是本申请实施例提供的一种收发端语义大模型的训练过程示意图,接下来对训练过程中,以上每个模块的具体内容进行描述。The training process of the semantic big model at the transceiver end proposed in the embodiment of the present application is completed through the collaboration of multiple modules, including a multimodal data acquisition module, a multimodal data preprocessing module, a semantic big model construction module at the transceiver end, a semantic big model training module at the transceiver end, a semantic big model output module and a semantic big model performance evaluation module. Refer to Figure 3, which is a schematic diagram of the training process of the semantic big model at the transceiver end provided in the embodiment of the present application. Next, the specific content of each of the above modules during the training process is described.

通过多模态数据采集模块从真实世界或模拟环境中获取多样性的多模态数据,包括文本文档、语音对话记录、图像文件、视频等多种类型的数据资源。对于每种模态,收发端语义大模型都需确保数据的多样性、丰富性和实时性,以便在后续训练阶段能够涵盖各种可能的语义和情境,促使待训练的收发端语义大模型能够更好地学习和理解不同模态之间的关联和差异。The multimodal data acquisition module is used to obtain diverse multimodal data from the real world or simulated environment, including text documents, voice conversation records, image files, videos and other types of data resources. For each modality, the sender-receiver semantic big model must ensure the diversity, richness and real-time nature of the data, so that it can cover all possible semantics and situations in the subsequent training stage, so that the sender-receiver semantic big model to be trained can better learn and understand the associations and differences between different modalities.

采用多模态数据预处理模块对多模态数据采集模块采集到的语音、图像和文本数据进行基础处理和特征提取。如,对图像信息进行图像分类和特征提取,识别出图像中的关键元素;对文本信息进行分词、词性标注等处理,提取出文本中的关键信息。同时,还可以通过多模态数据预处理模块对多模态数据进行情感特征识别,提取出情感特征。预处理完成后,每个特征的表现都被数值量化,为后续模型训练提供统一的数据格式。The multimodal data preprocessing module is used to perform basic processing and feature extraction on the voice, image and text data collected by the multimodal data acquisition module. For example, image information is classified and feature extracted to identify the key elements in the image; text information is processed by word segmentation and part-of-speech tagging to extract the key information in the text. At the same time, the multimodal data preprocessing module can also be used to identify the emotional features of multimodal data and extract emotional features. After the preprocessing is completed, the performance of each feature is numerically quantified to provide a unified data format for subsequent model training.

进一步地,通过收发端语义大模型构建模块构建待训练的收发端语义大模型,利用模态编解码器、输入/输出投影器和大语言模型编解码,构建能够处理多模态数据的语义编解码器。该模块需要能够理解和表示不同模态数据的语义信息,实现多模态数据的融合和转换。其中基于大模型的语义编码器,用于将多模态输入映射到一个共同的高维语义空间;而基于大模型的语义解码器则负责将这个空间中的表征解码回相应的模态输出。同时,该模块还需要构建信道编解码器,用于在有噪声和信号失真的通信环境下优化信息传输质量。通过训练使信道编解码器具备选择与信道状态相匹配的编码速率进行编码,并选择与信道状态相匹配的传输策略进行语义符号的传输的能力,使信道解码器具有选择与信道状态相匹配的解码速率进行解码,并选择与信道状态相匹配的纠错策略进行纠错的能力。Furthermore, the semantic big model of the transceiver to be trained is constructed through the transceiver semantic big model construction module, and the semantic codec capable of processing multimodal data is constructed by using the modal codec, input/output projector and large language model codec. This module needs to be able to understand and represent the semantic information of different modal data to achieve the fusion and conversion of multimodal data. The semantic encoder based on the big model is used to map the multimodal input to a common high-dimensional semantic space; while the semantic decoder based on the big model is responsible for decoding the representation in this space back to the corresponding modal output. At the same time, this module also needs to build a channel codec to optimize the information transmission quality in a communication environment with noise and signal distortion. Through training, the channel codec has the ability to select a coding rate matching the channel state for encoding, and select a transmission strategy matching the channel state for the transmission of semantic symbols, so that the channel decoder has the ability to select a decoding rate matching the channel state for decoding, and select an error correction strategy matching the channel state for error correction.

在待训练收发端语义大模型构建之后,基于迭代优化和分步训练的方式,通过收发端语义大模型训练模块对该模型进行训练,其训练过程具体如下:在第一阶段的训练中,冻结预训练的大语言模型编解码器的参数,单独训练待训练的信道编解码器,以适应信道状态,并优化其抗噪声能力;在第二阶段的训练中,当待训练的信道编解码器训练到可以在一定程度上根据信道状态动态地调整传输策略时,冻结训练到一定程度的信道编解码器的参数,训练待训练的与模态相关的编解码器和与模态相关的投影器的参数,提升多模态数据的语义理解和重建能力。循环交替以上第一阶段训练和第二阶段训练,优化待训练的信道编解码器、待训练的与模态相关的编解码器和与模态相关的投影器的性能,直至收发端语义大模型的损失函数值收敛或满足预定的性能指标。After the semantic big model of the transceiver to be trained is constructed, the model is trained by the semantic big model training module of the transceiver based on iterative optimization and step-by-step training. The training process is as follows: In the first stage of training, the parameters of the pre-trained big language model codec are frozen, and the channel codec to be trained is trained separately to adapt to the channel state and optimize its anti-noise ability; in the second stage of training, when the channel codec to be trained is trained to a certain extent to dynamically adjust the transmission strategy according to the channel state, the parameters of the channel codec trained to a certain extent are frozen, and the parameters of the modality-related codec and the modality-related projector to be trained are trained to improve the semantic understanding and reconstruction capabilities of multimodal data. The first stage training and the second stage training are alternately cycled to optimize the performance of the channel codec to be trained, the modality-related codec to be trained, and the modality-related projector until the loss function value of the semantic big model of the transceiver converges or meets the predetermined performance indicators.
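The alternating two-stage schedule above can be sketched schematically: the LLM codec stays frozen throughout, stage 1 updates only the channel codec, stage 2 updates only the modality codecs and projectors, and the cycle repeats until the loss converges. The parameter bookkeeping and the toy monotone loss are assumptions standing in for real gradient updates and a real loss function.

```python
# Schematic of the cyclic alternating training schedule; the integer
# "updates" and halving loss are illustrative stand-ins.

def train_alternating(params: dict, max_rounds: int = 50, tol: float = 1e-3) -> int:
    """Return the number of alternation rounds run before convergence."""
    loss, rounds = 1.0, 0
    for _ in range(max_rounds):
        rounds += 1
        # Stage 1: freeze everything except the channel codec.
        params["channel_codec"] += 1          # stands in for gradient steps
        # Stage 2: freeze the channel codec; update modality codecs/projectors.
        params["modality_codecs"] += 1
        params["projectors"] += 1
        # "llm_codec" is never touched: its pre-trained weights stay frozen.
        new_loss = loss * 0.5                 # toy monotone loss decay
        if loss - new_loss < tol:             # convergence criterion
            break
        loss = new_loss
    return rounds

state = {"llm_codec": 0, "channel_codec": 0, "modality_codecs": 0, "projectors": 0}
rounds = train_alternating(state)
```

After training, the frozen `llm_codec` entry is untouched while every trainable group has received one update per round, matching the freeze-and-alternate description above.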

在训练收发端语义大模型的过程中,预训练的大语言模型编解码器的参数是一直被冻结的,这是由于大语言模型编解码器已经在语料库上进行了预训练,具有良好的语言理解和生成能力。通过冻结大语言模型编解码器的参数,可以确保在训练过程中不改变其原有的能力,同时专注于训练与模态相关的编解码器、输入/输出投影器,以适应多模态数据的通信需求。通过分步训练和迭代优化,模型能够在较短时间内达到较好的性能,提高模型训练效率。保持预训练好的大语言模型编解码器的参数不变,仅训练与模态相关的编解码器和与模态相关的投影器。这可以确保大语言模型编解码器强大的语言处理能力不被破坏,同时使模型能够更好地适应多模态数据的语义提取与处理需求。In the process of training the semantic large model of the transmitter and receiver, the parameters of the pre-trained large language model codec are always frozen. This is because the large language model codec has been pre-trained on the corpus and has good language understanding and generation capabilities. By freezing the parameters of the large language model codec, it can be ensured that its original capabilities are not changed during the training process, while focusing on training the codec and input/output projectors related to the modality to meet the communication needs of multimodal data. Through step-by-step training and iterative optimization, the model can achieve better performance in a shorter time and improve the efficiency of model training. Keep the parameters of the pre-trained large language model codec unchanged, and only train the codec related to the modality and the projector related to the modality. This ensures that the powerful language processing capabilities of the large language model codec are not destroyed, while enabling the model to better adapt to the semantic extraction and processing needs of multimodal data.

在语义大模型输出模块中,语义大模型接收到多模态数据输入后,通过语义编码器将多模态信息编码为统一的语义表示,经由信道编码器适配无线信道后发送出去。接收端接收到信号后,通过信道解码器恢复出原始的语义信息,再由语义解码器将其还原为预期的多模态数据或生成相应的下游任务输出。In the semantic big model output module, after receiving the multimodal data input, the semantic big model encodes the multimodal information into a unified semantic representation through the semantic encoder, and sends it out after adapting to the wireless channel through the channel encoder. After receiving the signal, the receiving end recovers the original semantic information through the channel decoder, and then the semantic decoder restores it to the expected multimodal data or generates the corresponding downstream task output.

在收发端语义大模型的训练过程中,通过语义大模型性能评价模块对训练过程中的收发端语义大模型的性能进行评估,具体为,在收发端语义大模型的训练过程中,计算重建的多模态数据的重建通信意图与来自信源的多模态数据的实际通信意图之间的差异;根据计算得到的差异,评价收发端语义大模型是否保持语义一致;将评价结果反馈至收发端语义大模型训练模块,用以指导收发端语义大模型进一步优化,并对收发端语义大模型的参数和训练策略进行调整,以降低重建的多模态数据的重建通信意图与来自信源的多模态数据的实际通信意图之间的差异,从而提升其整体性能。During the training process of the semantic big model at the transceiver end, the performance of the semantic big model at the transceiver end during the training process is evaluated through the semantic big model performance evaluation module. Specifically, during the training process of the semantic big model at the transceiver end, the difference between the reconstructed communication intention of the reconstructed multimodal data and the actual communication intention of the multimodal data from the information source is calculated; based on the calculated difference, whether the semantic big model at the transceiver end maintains semantic consistency is evaluated; the evaluation result is fed back to the semantic big model training module at the transceiver end to guide further optimization of the semantic big model at the transceiver end, and the parameters and training strategies of the semantic big model at the transceiver end are adjusted to reduce the difference between the reconstructed communication intention of the reconstructed multimodal data and the actual communication intention of the multimodal data from the information source, thereby improving its overall performance.
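The evaluation step above can be sketched as a distance between intent representations: assuming the actual and reconstructed communication intents are available as embedding vectors, their gap can be measured as one minus cosine similarity and compared against a consistency threshold. The embeddings, the metric choice, and the threshold value are assumptions for illustration.

```python
# Hedged sketch of semantic-consistency evaluation between the actual and
# reconstructed communication intents (assumed given as embedding vectors).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def intent_gap(actual: list[float], reconstructed: list[float]) -> float:
    """Difference fed back to the training module (0 = identical intent)."""
    return 1.0 - cosine_similarity(actual, reconstructed)

def is_semantically_consistent(actual, reconstructed, threshold=0.1) -> bool:
    """Evaluate whether the model kept semantics within an assumed tolerance."""
    return intent_gap(actual, reconstructed) <= threshold

# Parallel intent vectors should show (near-)zero gap.
gap_same = intent_gap([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The resulting gap is the quantity the training module would then drive down by adjusting parameters and training strategy.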

本申请实施例提供一种基于大模型的多模态语义通信方法,获取预训练的大语言模型编解码器以及包括文本、图像、音频和视频的多模态数据,所述预训练的大语言模型编解码器是通过语料库进行预训练得到的;利用待训练的与模态相关的编解码器和与模态相关的投影器,以及所述预训练的大语言模型编解码器,构建待训练的基于大模型的语义编解码器,以及利用待训练的信道编解码器和所述待训练的基于大模型的语义编解码器,构建待训练的收发端语义大模型;在保持所述预训练的大语言模型编解码器的参数不变的前提下,循环交替训练所述待训练的与模态相关的编解码器和与模态相关的投影器,以及所述待训练的信道编解码器,得到收发端语义大模型;利用所述收发端语义大模型,进行所述多模态数据的传输和重建。通过上述方法实现对多模态数据的统一语义提取与融合表征,确保语义映射的准确性和稳定性,从而消除模态间的信息冗余,实现高效的多模态数据语义对齐与融合,提升通信系统的通用性和扩展性。The embodiment of the present application provides a multimodal semantic communication method based on a large model, which obtains a pre-trained large language model codec and multimodal data including text, image, audio and video, wherein the pre-trained large language model codec is obtained by pre-training with a corpus; a codec related to the modality to be trained and a projector related to the modality are used, as well as the pre-trained large language model codec, to construct a semantic codec based on a large model to be trained, and a channel codec to be trained and the semantic codec based on the large model to be trained are used to construct a semantic large model of the transceiver to be trained; on the premise of keeping the parameters of the pre-trained large language model codec unchanged, the codec related to the modality to be trained and the projector related to the modality, as well as the channel codec to be trained are cyclically and alternately trained to obtain a semantic large model of the transceiver; and the multimodal data is transmitted and reconstructed using the semantic large model of the transceiver. The above method can realize unified semantic extraction and fusion representation of multimodal data, ensure the accuracy and stability of semantic mapping, eliminate information redundancy between modalities, realize efficient semantic alignment and fusion of multimodal data, and improve the versatility and scalability of communication systems.

在本申请实施例第二方面提供一种基于大模型的多模态语义通信系统，参阅图4，图4是本申请实施例提供的一种基于大模型的多模态语义通信系统的架构图，所述系统包括：In a second aspect of an embodiment of the present application, a multimodal semantic communication system based on a large model is provided. Referring to FIG. 4, FIG. 4 is an architecture diagram of a multimodal semantic communication system based on a large model provided in an embodiment of the present application, and the system includes:

获取模块401,用于获取预训练的大语言模型编解码器以及包括文本、图像、音频和视频的多模态数据,所述预训练的大语言模型编解码器是通过语料库进行预训练得到的;An acquisition module 401 is used to acquire a pre-trained large language model codec and multimodal data including text, image, audio and video, wherein the pre-trained large language model codec is obtained by pre-training with a corpus;

构建模块402,用于利用待训练的与模态相关的编解码器和与模态相关的投影器,以及所述预训练的大语言模型编解码器,构建待训练的基于大模型的语义编解码器,以及利用待训练的信道编解码器和所述待训练的基于大模型的语义编解码器,构建待训练的收发端语义大模型;A construction module 402 is used to construct a semantic codec based on a large model to be trained by using the codec and projector related to the modality to be trained, and the pre-trained large language model codec, and to construct a semantic large model of the transceiver to be trained by using the channel codec to be trained and the semantic codec based on the large model to be trained;

训练模块403,用于在保持所述预训练的大语言模型编解码器的参数不变的前提下,循环交替训练所述待训练的与模态相关的编解码器和与模态相关的投影器,以及所述待训练的信道编解码器,得到收发端语义大模型;The training module 403 is used to cyclically and alternately train the codec and the projector related to the modality to be trained, and the channel codec to be trained, under the premise of keeping the parameters of the pre-trained large language model codec unchanged, to obtain a large semantic model of the transceiver;

传输重建模块404,用于利用所述收发端语义大模型,进行所述多模态数据的传输和重建。The transmission and reconstruction module 404 is used to transmit and reconstruct the multimodal data using the semantic big model of the transmitting and receiving end.

其中,所述系统具体包括:The system specifically includes:

第一编码模块,用于发射端通过所述收发端语义大模型中的基于大模型的语义编码器,对来自于信源的多模态数据进行联合编码,捕获并融合不同模态之间的语义关联,生成跨模态共享的高维语义空间表示;A first encoding module is used for the transmitter to jointly encode the multimodal data from the source through a semantic encoder based on a large model in the semantic large model of the transmitter and receiver, capture and fuse the semantic associations between different modalities, and generate a high-dimensional semantic space representation shared across modalities;

第二编码模块,用于发射端通过所述收发端语义大模型中的自适应信道编码器,将所述跨模态共享的高维语义空间表示转化为语义符号,选择与信道状态相匹配的编码速率进行编码,并选择与信道状态相匹配的传输策略进行传输;A second encoding module is used for the transmitter to convert the cross-modal shared high-dimensional semantic space representation into semantic symbols through an adaptive channel encoder in the semantic large model of the transmitter and receiver, select a coding rate matching the channel state for encoding, and select a transmission strategy matching the channel state for transmission;

第一解码模块,用于接收端通过所述收发端语义大模型中的自适应信道解码器将收到的所述语义符号恢复成跨模态共享的高维语义空间表示,选择与信道状态相匹配的解码速率进行解码,并选择与信道状态相匹配的纠错策略进行纠错;A first decoding module is used for the receiving end to restore the received semantic symbols into a high-dimensional semantic space representation shared across modalities through an adaptive channel decoder in the semantic large model of the transceiver end, select a decoding rate matching the channel state for decoding, and select an error correction strategy matching the channel state for error correction;

第二解码模块,用于接收端通过所述收发端语义大模型中基于大模型的语义解码器将恢复出的跨模态共享的高维语义空间表示进行重建,得到重建的多模态数据。The second decoding module is used for the receiving end to reconstruct the restored cross-modal shared high-dimensional semantic space representation through the semantic decoder based on the large model in the semantic large model of the transmitting and receiving end to obtain reconstructed multimodal data.

其中,所述训练模块包括:Wherein, the training module includes:

第一训练子模块,用于冻结所述预训练的大语言模型编解码器的参数,单独训练所述待训练的信道编解码器,以适应信道状态;A first training submodule, used for freezing the parameters of the pre-trained large language model codec, and separately training the channel codec to be trained to adapt to the channel state;

第二训练子模块,用于冻结训练到一定程度的信道编解码器的参数,训练所述待训练的与模态相关的编解码器和与模态相关的投影器的参数;A second training submodule is used to freeze the parameters of the channel codec trained to a certain degree, and train the parameters of the codec and projector related to the modality to be trained;

循环交替训练子模块,用于循环交替以上第一训练子模块和第二训练子模块,优化所述待训练的信道编解码器、所述待训练的与模态相关的编解码器和与模态相关的投影器的性能,直至所述收发端语义大模型的损失函数值收敛或满足预定的性能指标。A cyclic alternating training submodule is used to cyclically alternate the first training submodule and the second training submodule to optimize the performance of the channel codec to be trained, the modality-related codec to be trained, and the modality-related projector until the loss function value of the semantic large model of the transceiver converges or meets the predetermined performance index.

其中,所述训练模块还包括:Wherein, the training module also includes:

差异计算子模块,用于在训练过程中,计算重建的多模态数据的重建通信意图与来自信源的多模态数据的实际通信意图之间的差异;A difference calculation submodule, used for calculating the difference between the reconstructed communication intent of the reconstructed multimodal data and the actual communication intent of the multimodal data from the source during the training process;

评价子模块,用于根据所述差异,评价所述收发端语义大模型是否保持语义一致;An evaluation submodule, used for evaluating whether the semantic large model of the transmitting and receiving ends maintains semantic consistency according to the difference;

调整子模块,用于基于评价结果对所述收发端语义大模型的参数和训练策略进行调整,以降低所述差异。The adjustment submodule is used to adjust the parameters and training strategy of the semantic large model of the transmitting and receiving ends based on the evaluation results to reduce the difference.

其中,所述系统还包括:Wherein, the system further comprises:

预处理模块,用于对多模态数据进行预处理,以使预处理后的多模态数据具有预设的数据格式;A preprocessing module, used to preprocess the multimodal data so that the preprocessed multimodal data has a preset data format;

所述第一编码模块是通过所述基于大模型的语义编码器实现的,所述基于大模型的语义编码器包括模态编码器、输入投影器和大语言模型编码器;所述模态编码器,用于对所述预处理后的多模态数据进行特征提取;所述输入投影器,用于对提取的特征进行对齐,以使不同模态的数据能够在同一特征空间内被有效整合;所述大语言模型编码器,用于将不同模态的数据映射到统一的语义空间中,实现多模态数据的语义信息融合。The first encoding module is implemented by the large model-based semantic encoder, which includes a modal encoder, an input projector and a large language model encoder; the modal encoder is used to extract features from the preprocessed multimodal data; the input projector is used to align the extracted features so that data of different modalities can be effectively integrated in the same feature space; the large language model encoder is used to map data of different modalities into a unified semantic space to achieve semantic information fusion of multimodal data.

其中,所述第二解码模块中的所述基于大模型的语义解码器包括模态解码器、输出投影器和大语言模型解码器,所述大语言模型解码器,用于将所述高维语义空间表示从统一的语义空间恢复成不同模态数据的第一特征;所述输出投影器,用于对所述不同模态数据的第一特征进行解码,得到不同模态数据的第二特征;所述模态解码器,用于将所述不同模态数据的第二特征重建为原始的所述多模态数据。Among them, the large model-based semantic decoder in the second decoding module includes a modality decoder, an output projector and a large language model decoder, the large language model decoder is used to restore the high-dimensional semantic space representation from a unified semantic space into a first feature of different modal data; the output projector is used to decode the first feature of the different modal data to obtain a second feature of the different modal data; the modality decoder is used to reconstruct the second feature of the different modal data into the original multimodal data.
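与上面的解码流程对应，下面给出一个“大语言模型解码器 → 输出投影器 → 模态解码器”的简化示例（各变换为玩具占位，维度与还原方式均为假设，仅用于说明三级结构的衔接）。Mirroring the decoding flow above, the following is a simplified "large language model decoder → output projector → modality decoder" sketch (the transforms are toy stand-ins; the dimensions and restoration scheme are assumptions, used only to show how the three stages connect).

```python
# Illustrative decoder-side sketch: the shared semantic representation is
# split into per-modality first features, projected back to each
# modality's width (second features), then reconstructed per modality.

def llm_decoder(semantic_repr, n_modalities):
    # Stand-in: hand every modality branch a copy of the shared
    # representation as its "first features".
    return [list(semantic_repr) for _ in range(n_modalities)]

def output_projector(first_feats, modality_width):
    # Stand-in: trim the shared width back to the modality's own width,
    # producing the "second features".
    return first_feats[:modality_width]

def modality_decoder(second_feats):
    # Stand-in reconstruction: identity over the modality features.
    return list(second_feats)

semantic_repr = [0.5, 0.5, 0.2, 0.0]          # as produced on the encoder side
branches  = llm_decoder(semantic_repr, n_modalities=2)
text_out  = modality_decoder(output_projector(branches[0], modality_width=2))
image_out = modality_decoder(output_projector(branches[1], modality_width=3))
```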

其中,所述系统还包括:Wherein, the system further comprises:

输出结果获取模块,用于基于所述收发端语义大模型重建后的多模态数据,采用下游任务执行功能组件执行相应的下游任务,得到输出结果;An output result acquisition module is used to execute corresponding downstream tasks using a downstream task execution function component based on the multimodal data reconstructed by the large semantic model of the transmitting and receiving end to obtain an output result;

优化模块,用于基于所述输出结果对所述下游任务执行功能组件的参数进行优化和调整,得到调整后的下游任务执行功能组件;An optimization module, used for optimizing and adjusting the parameters of the downstream task execution functional component based on the output result to obtain an adjusted downstream task execution functional component;

下游任务执行模块,用于通过所述调整后的下游任务执行功能组件基于用户的交互指令和重建后的多模态数据,执行对应的任务。The downstream task execution module is used to execute the corresponding task based on the user's interactive instructions and the reconstructed multimodal data through the adjusted downstream task execution functional component.
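上述“执行下游任务 → 基于输出结果优化组件参数 → 按交互指令执行任务”的闭环可以用如下简化示例说明（任务、评分函数与待调参数均为假设的占位，仅体现基于输出结果的参数调整思路）。The closed loop above ("execute the downstream task → optimize the component's parameters based on the output → execute per user instruction") can be illustrated by the following simplified sketch (the task, scoring function, and tunable parameter are hypothetical placeholders that only convey the idea of output-driven parameter adjustment).

```python
# Illustrative sketch: score the downstream task's output, pick the
# component parameter that scores best, then serve the user's request
# with the adjusted component.

def run_task(data, gain):
    # Hypothetical downstream task: scale the reconstructed data.
    return [x * gain for x in data]

def score(output, target):
    # Negative squared error: higher is better.
    return -sum((o - t) ** 2 for o, t in zip(output, target))

def tune(data, target, gains):
    """Pick the gain whose output scores best — a stand-in for the
    optimization module adjusting the component from its outputs."""
    return max(gains, key=lambda g: score(run_task(data, g), target))

reconstructed = [1.0, 2.0, 3.0]
target        = [2.0, 4.0, 6.0]
best_gain = tune(reconstructed, target, gains=[0.5, 1.0, 2.0, 3.0])
result = run_task(reconstructed, best_gain)   # executed per user instruction
```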

基于同一申请构思，本申请实施例在第三方面公开了一种电子设备，图5示出了本申请实施例公开的一种电子设备示意图，如图5所示，电子设备100包括：存储器110和处理器120，所述电子设备的存储器不少于12G，处理器主频不低于2.4GHz，存储器110与处理器120之间通过总线通信连接，存储器110中存储有计算机程序，该计算机程序可在处理器120上运行，以实现本申请实施例公开的一种基于大模型的多模态语义通信方法。Based on the same application concept, a third aspect of the embodiments of the present application discloses an electronic device. Figure 5 shows a schematic diagram of the electronic device disclosed in the embodiments of the present application. As shown in Figure 5, the electronic device 100 includes a memory 110 and a processor 120; the memory of the electronic device is no less than 12 GB and the processor clock frequency is no less than 2.4 GHz. The memory 110 and the processor 120 communicate over a bus. The memory 110 stores a computer program that can run on the processor 120 to implement the large-model-based multimodal semantic communication method disclosed in the embodiments of the present application.

基于同一申请构思,本申请实施例在第四方面公开了一种计算机可读存储介质,其上存储有计算机程序/指令,该计算机程序/指令被处理器执行时实现本申请实施例公开的一种基于大模型的多模态语义通信方法。Based on the same application concept, the fourth aspect of the embodiment of the present application discloses a computer-readable storage medium, on which a computer program/instruction is stored. When the computer program/instruction is executed by a processor, a multimodal semantic communication method based on a large model disclosed in the embodiment of the present application is implemented.

本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the various embodiments can be referenced to each other.

本申请实施例是参照根据本申请实施例的方法、装置、电子设备和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The embodiments of the present application are described with reference to the flowcharts and/or block diagrams of the methods, devices, electronic devices, and computer program products according to the embodiments of the present application. It should be understood that each process and/or box in the flowchart and/or block diagram, as well as the combination of the processes and/or boxes in the flowchart and/or block diagram, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing terminal device to generate a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device generate a device for implementing the functions specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device so that a series of operating steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。Although the preferred embodiments of the present application have been described, those skilled in the art may make additional changes and modifications to these embodiments once they have learned the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should be noted that, in this article, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device including a series of elements includes not only those elements, but also other elements not explicitly listed, or also includes elements inherent to such process, method, article or terminal device. In the absence of further restrictions, the elements defined by the sentence "including one..." do not exclude the existence of other identical elements in the process, method, article or terminal device including the elements.

以上对本申请所提供的一种基于大模型的多模态语义通信方法、系统、设备及介质,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。The above is a detailed introduction to a multimodal semantic communication method, system, device and medium based on a large model provided by the present application. Specific examples are used in this article to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the method of the present application and its core idea; at the same time, for general technical personnel in this field, according to the idea of the present application, there will be changes in the specific implementation method and application scope. In summary, the content of this specification should not be understood as a limitation on the present application.

Claims (10)

1. 一种基于大模型的多模态语义通信方法，其特征在于，所述方法包括：A multimodal semantic communication method based on a large model, characterized in that the method comprises:

获取预训练的大语言模型编解码器以及包括文本、图像、音频和视频的多模态数据，所述预训练的大语言模型编解码器是通过语料库进行预训练得到的；obtaining a pre-trained large language model codec and multimodal data including text, image, audio, and video, wherein the pre-trained large language model codec is obtained by pre-training on a corpus;

利用待训练的与模态相关的编解码器和与模态相关的投影器，以及所述预训练的大语言模型编解码器，构建待训练的基于大模型的语义编解码器，以及利用待训练的信道编解码器和所述待训练的基于大模型的语义编解码器，构建待训练的收发端语义大模型；using the modality-related codec and modality-related projector to be trained, together with the pre-trained large language model codec, constructing a large-model-based semantic codec to be trained, and using the channel codec to be trained and the large-model-based semantic codec to be trained, constructing a transceiver semantic large model to be trained;

在保持所述预训练的大语言模型编解码器的参数不变的前提下，循环交替训练所述待训练的与模态相关的编解码器和与模态相关的投影器，以及所述待训练的信道编解码器，得到收发端语义大模型；with the parameters of the pre-trained large language model codec kept unchanged, cyclically and alternately training the modality-related codec and projector to be trained and the channel codec to be trained, to obtain the transceiver semantic large model;

利用所述收发端语义大模型，进行所述多模态数据的传输和重建。transmitting and reconstructing the multimodal data using the transceiver semantic large model.

2. 根据权利要求1所述的基于大模型的多模态语义通信方法，其特征在于，所述方法具体包括：The multimodal semantic communication method based on a large model according to claim 1, characterized in that the method specifically comprises:

发射端通过所述收发端语义大模型中的基于大模型的语义编码器，对来自于信源的多模态数据进行联合编码，捕获并融合不同模态之间的语义关联，生成跨模态共享的高维语义空间表示；the transmitter jointly encoding the multimodal data from the source through the large-model-based semantic encoder in the transceiver semantic large model, capturing and fusing the semantic associations between different modalities to generate a cross-modal shared high-dimensional semantic space representation;

发射端通过所述收发端语义大模型中的自适应信道编码器，将所述跨模态共享的高维语义空间表示转化为语义符号，选择与信道状态相匹配的编码速率进行编码，并选择与信道状态相匹配的传输策略进行传输；the transmitter converting the cross-modal shared high-dimensional semantic space representation into semantic symbols through the adaptive channel encoder in the transceiver semantic large model, selecting a coding rate matching the channel state for encoding and a transmission strategy matching the channel state for transmission;

接收端通过所述收发端语义大模型中的自适应信道解码器将收到的所述语义符号恢复成跨模态共享的高维语义空间表示，选择与信道状态相匹配的解码速率进行解码，并选择与信道状态相匹配的纠错策略进行纠错；the receiver restoring the received semantic symbols into the cross-modal shared high-dimensional semantic space representation through the adaptive channel decoder in the transceiver semantic large model, selecting a decoding rate matching the channel state for decoding and an error-correction strategy matching the channel state for error correction;

接收端通过所述收发端语义大模型中基于大模型的语义解码器将恢复出的跨模态共享的高维语义空间表示进行重建，得到重建的多模态数据。the receiver reconstructing the restored cross-modal shared high-dimensional semantic space representation through the large-model-based semantic decoder in the transceiver semantic large model to obtain reconstructed multimodal data.

3. 根据权利要求1所述的基于大模型的多模态语义通信方法，其特征在于，所述收发端语义大模型的训练过程如下：The multimodal semantic communication method based on a large model according to claim 1, characterized in that the training process of the transceiver semantic large model is as follows:

第一阶段训练：冻结所述预训练的大语言模型编解码器的参数，单独训练所述待训练的信道编解码器，以适应信道状态；first-stage training: freezing the parameters of the pre-trained large language model codec and separately training the channel codec to be trained to adapt to the channel state;

第二阶段训练：冻结训练到一定程度的信道编解码器的参数，训练所述待训练的与模态相关的编解码器和与模态相关的投影器的参数；second-stage training: freezing the parameters of the channel codec trained to a certain degree, and training the parameters of the modality-related codec and projector to be trained;

循环交替以上第一阶段训练和第二阶段训练，优化所述待训练的信道编解码器、所述待训练的与模态相关的编解码器和与模态相关的投影器的性能，直至所述收发端语义大模型的损失函数值收敛或满足预定的性能指标。cyclically alternating the above first-stage and second-stage training to optimize the performance of the channel codec to be trained and the modality-related codec and projector to be trained, until the loss function value of the transceiver semantic large model converges or meets a predetermined performance index.

4. 根据权利要求3所述的基于大模型的多模态语义通信方法，其特征在于，所述收发端语义大模型的训练过程还包括：The multimodal semantic communication method based on a large model according to claim 3, characterized in that the training process of the transceiver semantic large model further comprises:

在训练过程中，计算重建的多模态数据的重建通信意图与来自信源的多模态数据的实际通信意图之间的差异；during training, calculating the difference between the reconstructed communication intent of the reconstructed multimodal data and the actual communication intent of the multimodal data from the source;

根据所述差异，评价所述收发端语义大模型是否保持语义一致；evaluating, according to the difference, whether the transceiver semantic large model maintains semantic consistency;

基于评价结果对所述收发端语义大模型的参数和训练策略进行调整，以降低所述差异。adjusting the parameters and training strategy of the transceiver semantic large model based on the evaluation result to reduce the difference.

5. 根据权利要求2所述的基于大模型的多模态语义通信方法，其特征在于，在对来自于信源的多模态数据进行联合编码之前，所述方法包括：The multimodal semantic communication method based on a large model according to claim 2, characterized in that before jointly encoding the multimodal data from the source, the method comprises:

对多模态数据进行预处理，以使预处理后的多模态数据具有预设的数据格式；preprocessing the multimodal data so that the preprocessed multimodal data has a preset data format;

所述对来自于信源的多模态数据进行联合编码是通过所述基于大模型的语义编码器实现的，所述基于大模型的语义编码器包括模态编码器、输入投影器和大语言模型编码器；wherein the joint encoding of the multimodal data from the source is implemented by the large-model-based semantic encoder, which includes a modality encoder, an input projector, and a large language model encoder;

所述模态编码器，用于对所述预处理后的多模态数据进行特征提取；the modality encoder is used to extract features from the preprocessed multimodal data;

所述输入投影器，用于对提取的特征进行对齐，以使不同模态的数据能够在同一特征空间内被有效整合；the input projector is used to align the extracted features so that data of different modalities can be effectively integrated in the same feature space;

所述大语言模型编码器，用于将不同模态的数据映射到统一的语义空间中，实现多模态数据的语义信息融合。the large language model encoder is used to map data of different modalities into a unified semantic space to achieve semantic information fusion of the multimodal data.

6. 根据权利要求2所述的基于大模型的多模态语义通信方法，其特征在于，所述基于大模型的语义解码器包括模态解码器、输出投影器和大语言模型解码器，所述基于大模型的语义解码器将恢复出的跨模态共享的高维语义空间表示进行重建，得到重建的多模态数据，包括：The multimodal semantic communication method based on a large model according to claim 2, characterized in that the large-model-based semantic decoder includes a modality decoder, an output projector, and a large language model decoder, and the reconstruction of the restored cross-modal shared high-dimensional semantic space representation into reconstructed multimodal data comprises:

所述大语言模型解码器，用于将所述高维语义空间表示从统一的语义空间恢复成不同模态数据的第一特征；the large language model decoder is used to restore the high-dimensional semantic space representation from the unified semantic space into first features of the different modal data;

所述输出投影器，用于对所述不同模态数据的第一特征进行解码，得到不同模态数据的第二特征；the output projector is used to decode the first features of the different modal data to obtain second features of the different modal data;

所述模态解码器，用于将所述不同模态数据的第二特征重建为原始的所述多模态数据。the modality decoder is used to reconstruct the second features of the different modal data into the original multimodal data.

7. 根据权利要求1所述的基于大模型的多模态语义通信方法，其特征在于，在利用所述收发端语义大模型，进行所述多模态数据的传输和重建之后，所述方法包括：The multimodal semantic communication method based on a large model according to claim 1, characterized in that after transmitting and reconstructing the multimodal data using the transceiver semantic large model, the method comprises:

基于所述收发端语义大模型重建后的多模态数据，采用下游任务执行功能组件执行相应的下游任务，得到输出结果；based on the multimodal data reconstructed by the transceiver semantic large model, executing corresponding downstream tasks with a downstream task execution functional component to obtain an output result;

基于所述输出结果对所述下游任务执行功能组件的参数进行优化和调整，得到调整后的下游任务执行功能组件；optimizing and adjusting the parameters of the downstream task execution functional component based on the output result to obtain an adjusted downstream task execution functional component;

通过所述调整后的下游任务执行功能组件基于用户的交互指令和重建后的多模态数据，执行对应的任务。executing the corresponding task through the adjusted downstream task execution functional component based on the user's interactive instructions and the reconstructed multimodal data.

8. 一种基于大模型的多模态语义通信系统，其特征在于，所述系统包括：A multimodal semantic communication system based on a large model, characterized in that the system comprises:

获取模块，用于获取预训练的大语言模型编解码器以及包括文本、图像、音频和视频的多模态数据，所述预训练的大语言模型编解码器是通过语料库进行预训练得到的；an acquisition module, used to acquire a pre-trained large language model codec and multimodal data including text, image, audio, and video, wherein the pre-trained large language model codec is obtained by pre-training on a corpus;

构建模块，用于利用待训练的与模态相关的编解码器和与模态相关的投影器，以及所述预训练的大语言模型编解码器，构建待训练的基于大模型的语义编解码器，以及利用待训练的信道编解码器和所述待训练的基于大模型的语义编解码器，构建待训练的收发端语义大模型；a construction module, used to construct a large-model-based semantic codec to be trained from the modality-related codec and projector to be trained together with the pre-trained large language model codec, and to construct a transceiver semantic large model to be trained from the channel codec to be trained and the large-model-based semantic codec to be trained;

训练模块，用于在保持所述预训练的大语言模型编解码器的参数不变的前提下，循环交替训练所述待训练的与模态相关的编解码器和与模态相关的投影器，以及所述待训练的信道编解码器，得到收发端语义大模型；a training module, used to cyclically and alternately train the modality-related codec and projector to be trained and the channel codec to be trained, while keeping the parameters of the pre-trained large language model codec unchanged, to obtain the transceiver semantic large model;

传输重建模块，用于利用所述收发端语义大模型，进行所述多模态数据的传输和重建。a transmission and reconstruction module, used to transmit and reconstruct the multimodal data using the transceiver semantic large model.

9. 一种电子设备，包括存储器、处理器及存储在所述存储器上的计算机程序，其特征在于，所述处理器执行所述计算机程序以实现权利要求1-7中任一项所述的基于大模型的多模态语义通信方法。An electronic device comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the large-model-based multimodal semantic communication method of any one of claims 1 to 7.
10.一种计算机可读存储介质,其特征在于,其上存储有计算机程序/指令,该计算机程序/指令被处理器执行时实现权利要求1-7中任一项所述的基于大模型的多模态语义通信方法。10. A computer-readable storage medium, characterized in that a computer program/instruction is stored thereon, and when the computer program/instruction is executed by a processor, the multimodal semantic communication method based on a large model described in any one of claims 1 to 7 is implemented.
CN202410773829.9A 2024-06-17 2024-06-17 Multi-mode semantic communication method, system, equipment and medium based on large model Active CN118350416B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410773829.9A CN118350416B (en) 2024-06-17 2024-06-17 Multi-mode semantic communication method, system, equipment and medium based on large model

Publications (2)

Publication Number Publication Date
CN118350416A true CN118350416A (en) 2024-07-16
CN118350416B CN118350416B (en) 2024-08-20

Family

ID=91817518

Country Status (1)

Country Link
CN (1) CN118350416B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119142366A (en) * 2024-11-11 2024-12-17 吉林大学 Automatic driving interpretation text determining method based on large visual language model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023172153A1 (en) * 2022-03-09 2023-09-14 Huawei Technologies Co., Ltd. Method of video coding by multi-modal processing
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication
CN117421591A (en) * 2023-10-16 2024-01-19 长春理工大学 Multi-modal characterization learning method based on text-guided image block screening
CN117475278A (en) * 2023-10-30 2024-01-30 安徽大学 Guided vehicle-centered multi-modal pre-training system and method based on structural information
CN117671426A (en) * 2023-12-07 2024-03-08 北京智源人工智能研究院 Concept distillation and CLIP-based hintable segmentation model pre-training method and system
US20240144568A1 (en) * 2022-09-06 2024-05-02 Nvidia Corporation Speech-driven animation using one or more neural networks
CN118039056A (en) * 2023-12-13 2024-05-14 中国科学技术大学苏州高等研究院 Pre-training method, system and application of context-aware medical visual language model

Also Published As

Publication number Publication date
CN118350416B (en) 2024-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant