
CN118585614A - Data processing method, device, electronic equipment and computer readable storage medium - Google Patents

Data processing method, device, electronic equipment and computer readable storage medium

Info

Publication number
CN118585614A
Authority
CN
China
Prior art keywords
matrix
position data
attention
target position
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410426150.2A
Other languages
Chinese (zh)
Inventor
Zhao Xiangyu (赵翔宇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202410426150.2A priority Critical patent/CN118585614A/en
Publication of CN118585614A publication Critical patent/CN118585614A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a data processing device, electronic equipment and a computer readable storage medium, and relates to the technical field of large models. Wherein the method comprises the following steps: acquiring target position data in a target query sequence; acquiring a query matrix, a key matrix and a numerical matrix corresponding to the target position data; before performing attention computation on the target position data, transposing the numerical matrix to obtain a transposed matrix corresponding to the target position data; and executing attention calculation on the target position data based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data. The application solves the technical problem of low attention calculation efficiency in the related technology.

Description

Data processing method, device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of large model technologies, and in particular, to a data processing method, apparatus, electronic device, and computer readable storage medium.
Background
Large language models (Large Language Model, LLM) based on the Transformer architecture occupy a central position in the current artificial intelligence field, particularly in the natural language processing (Natural Language Processing, NLP) direction: pre-trained language models such as LLaMA and ChatGLM exhibit powerful language understanding and generation capabilities. However, the huge parameter count and complex structure of such models place extremely high demands on computing and memory resources in practical deployment and application.
Attention computation operators are the core components of such models, performing the attention computation operations that extract key information by computing the degree of association between query-key pairs. During attention computation in the LLM decoding stage, historical input information, such as the key matrices and numerical matrices (i.e., value matrices) of earlier inputs, must be analyzed. In addition, in the related art, the LLM model is deployed through GGML, a deployment library for large language models, but this approach does not consider cache optimization during the attention computation process, so the attention computation is costly and attention computation efficiency is low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, electronic equipment and a computer readable storage medium, which are used for at least solving the technical problem of low attention computing efficiency in the related technology.
According to an aspect of an embodiment of the present application, there is provided a data processing method including: acquiring target position data in a target query sequence; acquiring a query matrix, a key matrix and a numerical matrix corresponding to the target position data; before performing attention computation on the target position data, transposing the numerical matrix to obtain a transposed matrix corresponding to the target position data; and executing the attention calculation on the target position data based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data.
According to another aspect of the embodiment of the present application, there is also provided a data processing apparatus including: the first acquisition module is used for acquiring target position data in a target query sequence; the second acquisition module is used for acquiring a query matrix, a key matrix and a numerical matrix corresponding to the target position data; the matrix transposition module is used for transposing the numerical matrix before executing attention calculation on the target position data to obtain a transposed matrix corresponding to the target position data; and the attention calculating module is used for executing the attention calculation on the target position data based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data.
According to another aspect of the embodiment of the present application, there is also provided an electronic device, including: a memory storing an executable program; and the processor is used for running the program, wherein the program executes any one of the data processing methods when running.
According to another aspect of the embodiment of the present application, there is also provided a computer readable storage medium, where the computer readable storage medium includes a stored executable program, and when the executable program runs, the computer readable storage medium is controlled to execute any one of the data processing methods.
In the embodiment of the application, a transpose-ahead approach for the numerical matrix is adopted: target position data in a target query sequence is obtained; a query matrix, a key matrix, and a numerical matrix corresponding to the target position data are obtained; before attention computation is performed on the target position data, the numerical matrix is transposed to obtain a transpose matrix corresponding to the target position data; and attention calculation is performed on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data. This achieves the aim of completing the transposition of the numerical matrix before the attention calculation is performed, thereby realizing the technical effects of reducing the attention calculation overhead and improving attention calculation efficiency, and solving the technical problem of low attention calculation efficiency in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application, as claimed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a schematic view of an application scenario of a data processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a data processing method according to an embodiment of the application;
FIG. 3 is a flow chart of attention computation in accordance with the prior art;
FIG. 4 is a schematic diagram of a similarity calculation process according to the prior art;
FIG. 5 is a schematic diagram of an attention calculation process according to the prior art;
FIG. 6 is a schematic diagram of a transposed multiplication calculation process in accordance with the prior art;
FIG. 7 is a schematic diagram of an alternative attention calculation process according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical scheme provided by the application is mainly realized by adopting large model technology, where a large model refers to a deep learning model with large-scale model parameters, typically containing hundreds of millions, billions, hundreds of billions, or even trillions of model parameters. A large model may also be called a Foundation Model: it is pre-trained on large-scale unlabeled corpora, producing a pre-trained model with more than one hundred million parameters that can adapt to a wide range of downstream tasks and has good generalization capability, such as the large-scale language model LLM and multi-modal pre-training models.
It should be noted that, when a large model is actually applied, the pre-trained model can be fine-tuned with a small number of samples so that the large model can be applied to different tasks. For example, large models are widely applicable to fields such as natural language processing (NLP), computer vision, and speech processing; they can be applied to computer vision tasks such as visual question answering (Visual Question Answering, VQA for short), image description (IC for short), and image generation, and also to natural language processing tasks such as text-based emotion classification, text summarization, and machine translation. Accordingly, the main application scenarios of large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, electronic commerce, and intelligent design. In the embodiment of the application, data processing through a large language model in scenarios of information query and content generation over large data volumes is taken as the example for explanation.
First, partial terms or terminology appearing in the course of describing embodiments of the application are applicable to the following explanation:
Large language models (Large Language Model, LLM), a class of deep learning models that are widely used in the field of natural language processing (Natural Language Processing, NLP). These models are typically trained based on a large amount of text data, aimed at learning the structure and rules of the language, so that various NLP tasks can be performed, such as text generation, language understanding, questions and answers, etc.
Prefill (prefill) stage, the stage from inputting text into the model to outputting the first word.
The decoding stage, the stage of continuous word output after the first word is produced, which may include generating the next word, the next sentence, or completing generation of the entire text using the model. In this process, the model generates a text sequence based on the input context information and the model's parameters.
Attention weight, used in the attention mechanism of deep learning models, which simulates the attention behavior of humans when processing information and helps models focus on the relevant parts when processing sequence data, thereby improving model performance. In the attention mechanism, the model learns how to assign attention weights to inputs at different locations so as to attend to the input to different degrees at each step. Such a mechanism enables the model to better handle long sequences and complex data relationships.
Multi-head attention calculation, a method of calculating attention weights using multiple attention heads in an attention mechanism. In deep learning, attention mechanisms are widely used in natural language processing and other sequential data processing tasks to capture correlations between different positions in an input sequence.
LLaMA, generally referring to a natural language processing framework characterized by multi-language support, including English, Chinese, etc.
ChatGLM, a large language model focused on dialogue generation. It has strong text generation and understanding capabilities and can carry out smooth and natural dialogue with a user. ChatGLM learns from a large amount of dialogue data during training so that it can understand and generate text conforming to grammatical and semantic specifications, thereby realizing high-quality dialogue interaction.
The Transformer architecture, a neural network architecture for sequence-to-sequence learning, with wide application in natural language processing and related fields. Its core principle mainly involves the attention mechanism and multi-head attention. In a Transformer, the attention mechanism allows the model to shift and focus information between different positions in the sequence, helping the model find and focus on information relevant to the current processing position. This mechanism is implemented by attention layers, including self-attention layers: an attention layer calculates weights for each position and applies these weights to the input sequence, enabling the model to adaptively integrate information according to different contexts.
GGML, a deployment library for large language models. GGML is a modified version of the GPT-3 (Generative Pre-trained Transformer 3) model that can be used to train and deploy large language models for various natural language processing tasks such as text generation, text classification, and sentiment analysis.
Example 1
According to an embodiment of the present application, there is provided a data processing method, it being noted that the steps shown in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that herein.
Considering that the model parameters of the large model are huge and the operation resources of the mobile terminal are limited, the data processing method provided by the embodiment of the application can be applied to the application scenario shown in fig. 1, but is not limited thereto. In the application scenario illustrated in fig. 1, the large model is deployed in a server 10, and the server 10 may connect to one or more client devices 20 via a local area network connection, a wide area network connection, an internet connection, or other type of data network, where the client devices 20 may include, but are not limited to: smart phones, tablet computers, notebook computers, palm computers, personal computers, smart home devices, vehicle-mounted devices and the like. The client device 20 can interact with a user through a graphical user interface to realize the invocation of the large model, thereby realizing the method provided by the embodiment of the application.
In an embodiment of the present application, a system formed by a client device and a server may perform the following steps: the client device performs input of query information (such as a target query sequence), and the server performs acquisition of target position data in the target query sequence; acquiring a query matrix, a key matrix and a numerical matrix corresponding to the target position data; before performing attention computation on the target position data, transposing the numerical matrix to obtain a transposed matrix corresponding to the target position data; and executing attention calculation on the target position data based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data. It should be noted that, in the case that the operation resource of the client device can meet the deployment and operation conditions of the large model, the embodiment of the present application may be performed in the client device.
In the above-described operating environment, the present application provides a data processing method as shown in fig. 2. Fig. 2 is a flowchart of a data processing method according to embodiment 1 of the present application. As shown in fig. 2, the method may include the steps of:
step S202, obtaining target position data in a target query sequence.
Optionally, the target query sequence is a query sequence input into a large language model, and the target query sequence may be, but is not limited to, text, such as a news article, an email, a social media comment, a text question, and the like; voice recordings, such as telephone recordings, conference recordings, voice messages, etc.; images such as photographs, drawings, maps, etc.; time series data such as stock price trend, air temperature change, traffic flow statistics, etc.; numerical data such as student performance, demographic data, financial statements, and the like; symbol sequences such as musical scores, programming codes, passwords, etc.; other forms of data input such as biometric information data, geographic information data, sensor data, and the like. The target position data is any one position data in the target query sequence.
Step S204, obtaining a query matrix, a key matrix and a numerical matrix corresponding to the target position data.
Optionally, each position data in the target query sequence corresponds to a separate query vector, key matrix and numerical vector, and is used for performing attention calculation on the corresponding position data, wherein the query matrix is used for representing that for each position data (such as target position data) in the target query sequence, a query vector is calculated according to the position data, and the query vector represents information about which the position data needs to be focused when generating output; the key matrix is obtained based on the conversion of the corresponding position data and is used for calculating the association degree with the query matrix; the numerical matrix contains specific information related to the corresponding position data, which information will be weighted and aggregated into the output according to the attention weight. The query matrix, the key matrix, and the numerical matrix are important indexes required for performing attention calculation, so that the query matrix, the key matrix, and the numerical matrix corresponding to the target position data need to be acquired when performing attention calculation on the target position data.
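As a minimal illustration of how these three matrices arise (a sketch only, not the patent's implementation: the projection weights W_q, W_k, W_v, the dimensions, and all variable names are assumptions for the example), the query, key, and numerical vectors of one position can be obtained as linear projections of that position's embedding:

```python
import numpy as np

head_dim = 64  # assumed dimension of one attention head
rng = np.random.default_rng(0)

# Assumed projection weights; in a real model these are learned parameters.
W_q = rng.standard_normal((head_dim, head_dim))
W_k = rng.standard_normal((head_dim, head_dim))
W_v = rng.standard_normal((head_dim, head_dim))

x = rng.standard_normal((1, head_dim))  # embedding of the target position data

Q = x @ W_q  # query matrix of the target position data, [1, head_dim]
K = x @ W_k  # key matrix of the target position data, [1, head_dim]
V = x @ W_v  # numerical matrix of the target position data, [1, head_dim]
```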
In step S206, before performing attention computation on the target position data, the numerical matrix is transposed to obtain a transposed matrix corresponding to the target position data.
Alternatively, the LLM model mainly runs in two stages, a pre-filling (prefill) stage and a decoding stage: the prefill stage can be understood as the stage from inputting the query into the large language model to outputting the first word, and the decoding stage can be understood as the continuous text-output stage after the first word is output. The attention calculation may be the multi-head attention calculation performed in the decoding stage of the large language model; the target query sequence may be divided into multiple groups of position data, and each group of position data may serve as one head for the multi-head attention calculation.
Alternatively, the attention calculation phase may be divided into two phases, namely: a first stage of starting to execute attention calculation based on the query matrix and the key matrix to obtain an attention weight matrix corresponding to the target position data; and obtaining a second stage of the attention result corresponding to the target position data based on the attention weight matrix and the transpose matrix corresponding to the target position data, wherein the stage before the attention calculation can be the stage before the output of the first stage result.
Optionally, the stage before the attention calculation may be a data-preparation stage carried out before the attention calculation operation is executed. For example, before performing attention computation on the target position data, the query matrix, key matrix, and numerical matrix of the target position data need to be prepared; and when performing attention computation in the LLM decoding stage, historical input information such as the key matrices and numerical matrices of the historical position data needs to be analyzed, e.g., the key matrices respectively corresponding to one or more historical position data and the numerical matrices respectively corresponding to one or more historical position data need to be acquired. After this preparation work is done, the attention calculation process proper is entered.
In the related art, during attention computation, operations such as splicing and transposition must be performed on the key matrices and numerical matrices corresponding to the historical position data and the target position data. When the historical input information is large, the computation required for the transposition operation is substantial, so the time spent on matrix transposition grows, the attention computation overhead and time consumption increase, computation efficiency is low, and the large language model produces output inefficiently. Based on this, the present application moves the matrix transposition operation ahead: only the numerical matrix of the target position data is transposed before attention calculation is performed on it. Because the numerical matrix of a single target position is small, the time consumed by this transposition is negligible, so the attention calculation overhead can be greatly reduced and attention calculation efficiency improved.
Specifically, when performing attention calculation on target position data, a numerical-matrix transposition process is required; in the related art, the numerical matrix is transposed inside the calculation process of the attention mechanism. Fig. 3 is an attention calculation flow chart according to the prior art. As shown in fig. 3, the query matrix, key matrix, and numerical matrix input for target position data in a target query sequence each have size [1, head_dim], where head_dim is the dimension of the attention head corresponding to the target position data. For the key matrix and the numerical matrix of the target position data, since past information, i.e., the key matrices and numerical matrices corresponding to the historical position data, must be included in the attention calculation, the key matrices and numerical matrices corresponding to one or more historical position data need to be spliced. Specifically, the key matrices corresponding to the one or more historical position data are spliced to obtain a spliced key matrix K_cache, and the numerical matrices corresponding to the one or more historical position data are spliced to obtain a spliced numerical matrix V_cache; the dimensions of both K_cache and V_cache are [cache, head_dim], where cache is the total number of historical position data. The key matrix and the numerical matrix of the target position data are then spliced with K_cache and V_cache respectively, yielding a new spliced key matrix K [sk, head_dim] and a new spliced numerical matrix V [sk, head_dim], where sk = cache + 1. At this point, the spliced key matrix K_cache is synchronously updated to the new spliced key matrix K [sk, head_dim], the spliced numerical matrix V_cache is synchronously updated to the new spliced numerical matrix V [sk, head_dim], and the count of historical position data is updated as cache = cache + 1, completing the cache-update process.
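The cache-update step just described can be sketched as follows (a hedged numpy sketch of the related-art splicing; the names K_cache, V_cache, sk, and head_dim follow the description above, while the function itself is an assumed illustration):

```python
import numpy as np

def update_kv_cache(K_cache, V_cache, k_new, v_new):
    """Splice the target position's key and numerical matrices, each of
    shape [1, head_dim], onto the cached [cache, head_dim] matrices,
    yielding the new spliced matrices of shape [sk, head_dim],
    with sk = cache + 1 (related-art layout)."""
    K = np.concatenate([K_cache, k_new], axis=0)  # new spliced key matrix
    V = np.concatenate([V_cache, v_new], axis=0)  # new spliced numerical matrix
    return K, V
```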
Then, based on the query matrix input for the target position data, the new spliced key matrix, and the new spliced numerical matrix, attention calculation is performed in the following manner:

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$

where Q represents the query matrix input for the target position data, K represents the new spliced key matrix, softmax() is the softmax function, and $d_k$ represents the vector dimension of the new spliced key matrix K and the new spliced numerical matrix V. In the decoding stage, the dimension of the attention-computation input Q is [sq, head_dim]; the dimension of K is [sk, head_dim]; the dimension of V is [sk, head_dim]. The result of $Q \times K^{T}$ has dimensions [sq, sk]. In the decoding phase, sq is typically 1 and sk is a value much greater than 1.
In the attention calculation, res = Q × K^T is first computed, where res represents the similarity between the query matrix Q input for the target position data and the new spliced key matrix K. Fig. 4 is a schematic diagram of a similarity calculation process according to the prior art; as shown in fig. 4, the similarity res is obtained by multiplying the query matrix Q input for the target position data with the new spliced key matrix K row by row and accumulating, and the dimension of res is [1, sk]. In this calculation, the memory arrangement of the query matrix Q input for the target position data and of the new spliced key matrix K is contiguous, so the similarity res is computed efficiently using single instruction multiple data (Single Instruction Multiple Data, SIMD) vector acceleration.
Further, the attention weight matrix of the target position data, whose dimension is also [1, sk], is calculated as follows:

$\mathrm{context\_res}=\mathrm{softmax}\left(\frac{res}{\sqrt{d_k}}\right)$

where context_res represents the attention weight matrix of the target position data.
Further, context_res × V is calculated to obtain the attention result of the target position data. Fig. 5 is a schematic diagram of an attention calculation process according to the prior art. If the matrix multiplication is performed directly, as shown in fig. 5, the memory accesses when reading values from the new spliced numerical matrix are discontinuous; the multiplication is then inefficient and difficult to accelerate with SIMD vectors. Therefore, the new spliced numerical matrix needs to be transposed before the calculation, ensuring that the multiplication context_res × V is memory-contiguous. Fig. 6 is a schematic diagram of a transpose multiplication calculation process according to the prior art. As shown in fig. 6, as sk grows, that is, as the new spliced numerical matrix grows, the length of the cache keeps increasing, and the time for transposing the new spliced numerical matrix increases linearly with the cache; the transpose overhead is large, so the attention calculation process becomes excessively expensive and attention calculation efficiency is low.
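Putting figs. 3 to 6 together, the related-art decode-step attention can be sketched as below (an illustrative numpy sketch under the shapes stated above; note the transpose of V executed inside every attention call, which is exactly the overhead the application targets):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_related_art(q, K, V):
    """q: [1, head_dim]; K, V: [sk, head_dim] (new spliced matrices).
    The numerical matrix V is transposed on every call, which costs time
    linear in sk and an sk * head_dim * sizeof(float) temporary buffer."""
    d_k = K.shape[1]
    res = q @ K.T / np.sqrt(d_k)    # similarity, [1, sk], memory-contiguous
    context_res = softmax(res)      # attention weight matrix, [1, sk]
    V_t = V.T                       # per-step transpose, [head_dim, sk]
    return (V_t @ context_res.T).T  # attention result, [1, head_dim]
```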
Based on this, the present application advances the transposition of the numerical matrix: as shown in fig. 7, the transposition for the target position data is moved out of the attention-calculation process into the earlier data-preparation stage. The numerical matrix of the corresponding position data, such as the target position data, is transposed so that its dimension changes from [1, head_dim] to [head_dim, 1], and the dimension of the new spliced numerical matrix V_cache is correspondingly adjusted to [head_dim, sk]. Thus, before the attention calculation process is executed, the dimensions of the query matrix Q, the new spliced key matrix K, and the new spliced numerical matrix V corresponding to the target position data are as follows: Q: [1, head_dim]; K: [sk, head_dim]; V: [head_dim, sk]. With this processing, no additional transposition is needed in the subsequent execution of the attention calculation.
It should be noted that, since sk >> 1, the time to transpose the numerical matrix corresponding to a single position data in the present application is much smaller than in the related-art scheme, and data movement of size (sk − 1) × head_dim is avoided; the time consumed by attention calculation can therefore be greatly reduced when sk >> 1, for example when sk is greater than 200. Within the attention calculation, the time to transpose the numerical matrix of a single position data from [1, head_dim] to [head_dim, 1] is negligible. In addition, a temporary memory, i.e., a buffer, is required to store intermediate information during the transpose calculation: the buffer size required in the related art is sk × head_dim × sizeof(float), whereas in the present scheme the temporary memory applied for during the transpose calculation is only head_dim × sizeof(float), i.e., 1/sk of that in the related art. When sk is large, repeatedly applying for a large memory block is time-consuming, while a small memory block can be allocated much faster.
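Under the same assumptions, a minimal sketch of the application's transpose-ahead scheme follows: the single [1, head_dim] row of the target position is transposed at data-preparation time and appended as a column to a numerical cache kept in [head_dim, sk] layout, so the attention kernel itself contains no transpose (the shapes follow the description; everything else is an assumed illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def append_transposed_value(V_cache_t, v_new):
    """Transpose-ahead: v_new is [1, head_dim]; only this single row is
    transposed, needing just a head_dim * sizeof(float) buffer.
    V_cache_t is the history splicing matrix, [head_dim, cache]."""
    return np.concatenate([V_cache_t, v_new.T], axis=1)  # [head_dim, sk]

def attention_transpose_ahead(q, K, V_t):
    """q: [1, head_dim]; K: [sk, head_dim]; V_t: [head_dim, sk].
    No transposition is performed inside the attention computation."""
    d_k = K.shape[1]
    context_res = softmax(q @ K.T / np.sqrt(d_k))  # [1, sk]
    return (V_t @ context_res.T).T                 # [1, head_dim]
```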
In an alternative embodiment, the amount of data included in the target query sequence is greater than or equal to 200.
It should be noted that, as attention calculation proceeds, the length of the cache keeps increasing; the time for transposing the new spliced numerical matrix V then grows linearly with the cache, and when cache > 200 the transpose can occupy about 30% of the total time consumed by attention calculation. Therefore, when the query sequence input into the large language model is a target query sequence containing 200 or more data items, the transposition of the numerical matrix corresponding to the target position data is moved ahead, and the attention calculation for the target position data is performed based on the resulting transpose matrix of the target position data. If the query sequence input into the large language model is short, the position data in the query sequence need not undergo the numerical-matrix transposition ahead of attention calculation. This reduces the attention calculation overhead and improves the execution efficiency of attention calculation.
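The length gate described here could look like the following sketch (the threshold of 200 comes from the description; the function name and example calls are assumptions):

```python
def should_transpose_ahead(seq_len, threshold=200):
    """Use the transpose-ahead numerical-matrix layout only when the query
    sequence is long enough for the per-step transpose of the related art
    to dominate (about 30% of attention time once the cache exceeds 200)."""
    return seq_len >= threshold

# Example: should_transpose_ahead(256) -> True; should_transpose_ahead(50) -> False
```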
Step S208, based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data, attention calculation is performed on the target position data, and an attention result corresponding to the target position data is obtained.
Optionally, the attention computation is an attention computation corresponding to a decoding stage in the large language model, where the decoding stage is a stage in which the large language model outputs the first data and then continuously outputs the data. Taking the target query sequence as an input text sequence as an example, the decoding stage is a stage of continuously outputting characters after outputting the first character by the large language model. The attention result may be a context vector corresponding to the target location data. By the method, attention calculation is directly performed on the target position data based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data, and the transpose operation is not required to be performed in the process of performing the attention calculation on the target position data, so that the attention calculation overhead is reduced, and the attention calculation execution efficiency is improved.
In an alternative embodiment, the attention computation is a multi-headed attention computation in a large language model, and the target location data is a head for which the multi-headed attention computation is directed.
It will be appreciated that the multi-headed attentiveness mechanism enables the model to capture different features and patterns in the target query sequence in parallel by dividing the input target query sequence into a plurality of heads. The parallel processing mode remarkably improves the calculation efficiency and the processing capacity of the large language model, so that the large language model can better understand and generate complex natural language texts; second, the multi-headed attentiveness mechanism helps the large language model to understand the meaning and context information of the text more deeply. By performing the attention calculations on different attention heads, a large language model is enabled to focus on the importance of different parts of the target query sequence and thereby generate successive sequence generation results (e.g., text generation results) that are more in accordance with the grammatical and semantic specifications; in addition, the multi-head attention mechanism also helps to promote the generalization capability and robustness of the large language model. Because each attention head can independently learn and capture different characteristics, the large language model can more comprehensively understand information implicit in the target query sequence, and the dependence on a specific data set is reduced.
In an alternative embodiment, performing attention computation on the target location data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target location data to obtain an attention result corresponding to the target location data, including: determining one or more historical location data in the target query sequence that precede the target location data; acquiring a history splicing matrix corresponding to one or more pieces of history position data, wherein the history splicing matrix is obtained by splicing transposed matrixes of numerical matrixes corresponding to the one or more pieces of history position data respectively; based on the query matrix, the key matrix, the transpose matrix corresponding to the target position data, and one or more history splicing matrices corresponding to the history position data, performing attention calculation on the target position data to obtain an attention result corresponding to the target position data.
Optionally, if the target position data is the first position data in the target query sequence, attention calculation is directly performed on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, obtaining the attention result corresponding to the target position data. If the target position data is not the first position data in the target query sequence, then to ensure that the attention result obtained can be better combined with the context information in the target query sequence (for example, when the target query sequence is an input text sequence, to ensure the attention result can be better combined with its contextual semantic information), the past information must be included in the attention calculation, i.e., the key matrices and numerical matrices corresponding to the historical position data must be combined, where the historical position data is the position data located before the target position data in the target query sequence. It should be noted that, since the position data in the target query sequence are input into the large language model one by one, each position data corresponds to a different time step, and the historical position data may also be understood as position data whose time steps precede the target position data in the target query sequence.
In addition, when performing the attention calculation, it is necessary to splice the numerical matrix corresponding to the target position data and the numerical matrix corresponding to the history position data. According to the application, firstly, a numerical matrix corresponding to target position data is transposed, and the transposed matrix corresponding to the target position data is spliced with a history splicing matrix, wherein the history splicing matrix is obtained by splicing one or more transposed matrices (obtained by transposing the corresponding numerical matrices) corresponding to the history position data respectively, and is stored in a cache. When the attention calculation is carried out, the history splicing matrix, the transpose matrix combined with the target position data, the query matrix and the key matrix are directly obtained from the cache, and the attention calculation operation is carried out. In the above way, the transposition operation of the numerical matrix is not needed to be executed in the process of attention calculation, so that the attention calculation cost is reduced, and the execution efficiency of the attention calculation is improved.
In an alternative embodiment, the history splicing matrix corresponding to the one or more historical position data is pre-stored in a cache, and is retrieved directly from the cache when the attention calculation is carried out.
Optionally, starting from a first data position in the target query sequence, before performing attention calculation on the first data position, transposing a numerical matrix corresponding to the first data position, and storing the obtained transpose matrix corresponding to the first data position into a cache as a history splicing matrix; before performing attention computation on the next data position, the numerical matrix corresponding to the next data is also required to be transposed, the obtained transposed matrix corresponding to the next data position is stored in the buffer, the transposed matrix corresponding to the first data position stored in the buffer is spliced to be used as a new history splicing matrix, and the like, before performing attention computation on the next data position, the attention computation is performed again, the transposed matrix is stored in the buffer, and the updating operation of the history splicing matrix is performed, which is not repeated here.
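Built in sequence order as described, the history splicing matrix update might look like this sketch (hedged; the [head_dim, n] column-append layout follows the description, while the loop scaffolding and names are assumptions):

```python
import numpy as np

head_dim = 64
rng = np.random.default_rng(0)
# Assumed stand-in: one [1, head_dim] numerical matrix per position, in sequence order.
position_value_rows = [rng.standard_normal((1, head_dim)) for _ in range(3)]

V_cache_t = np.empty((head_dim, 0))  # history splicing matrix, initially empty
for v in position_value_rows:
    # Transpose before this position's attention computation and append it
    # as a new column, updating the history splicing matrix in the cache.
    V_cache_t = np.concatenate([V_cache_t, v.T], axis=1)
    # ... the attention calculation for this position uses V_cache_t directly ...
```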
In an alternative embodiment, performing attention computation on the target location data based on the query matrix, the key matrix, the transpose matrix corresponding to the target location data, and the history stitching matrix corresponding to the one or more history location data, to obtain an attention result corresponding to the target location data, including: splicing the transpose matrix corresponding to the target position data to the history splicing matrix to obtain a target splicing matrix; and based on the query matrix, the key matrix and the target splicing matrix, performing attention calculation on the target position data to obtain an attention result corresponding to the target position data.
Optionally, as shown in fig. 7, when performing attention computation, the history splicing matrix V_cache [head_dim, cache] and the transpose matrix V [head_dim, 1] of the target position data are obtained directly from the cache and spliced to obtain the target splicing matrix V [head_dim, sk]. In this way, no transposition of the numerical matrix needs to be executed during the attention calculation, which reduces the attention calculation overhead and improves the execution efficiency of attention calculation.
Optionally, when performing attention computation on target position data, in addition to the query matrix, key matrix, and target splicing matrix of the target position data, a new spliced key matrix K [sk, head_dim] must also be combined, where the new spliced key matrix is obtained by splicing the key matrix of the target position data with the key matrices corresponding to one or more historical position data. The specific splicing process is as follows: first, the key matrices corresponding to the one or more historical position data are spliced to obtain a spliced key matrix; this spliced key matrix is then further spliced with the key matrix of the target position data to obtain the new spliced key matrix. The specific attention calculation process is shown in fig. 7.
In an alternative embodiment, obtaining a history splice matrix corresponding to one or more history location data includes: before performing attention computation on one or more pieces of historical position data according to a sequence in a target query sequence, transposing a numerical matrix corresponding to the one or more pieces of historical position data to obtain a corresponding transposed matrix; and splicing the transposed matrixes corresponding to the one or more historical position data according to the sequence order to obtain a historical splicing matrix.
Optionally, the attention operations performed on the location data included in the target query sequence are performed in sequence order in the target query sequence. The attention calculations for one or more historical location data that precede the target location data are also performed sequentially in a sequential order. Before performing attention calculation on each historical position data, the numerical matrix of the historical position data needs to be transposed, and the obtained transposed matrix corresponding to one or more historical position data is spliced in sequence according to the sequence order, so that a historical splicing matrix is obtained.
For example, starting from a first data position in the target query sequence according to the sequence order in the target query sequence, transposing a numerical matrix corresponding to the first data position before attention calculation is performed on the first data position, and storing the obtained transpose matrix corresponding to the first data position into a cache as a history splicing matrix; before performing attention computation on the next data position, the numerical matrix corresponding to the next data is also required to be transposed, the obtained transposed matrix corresponding to the next data position is stored in the buffer, the transposed matrix corresponding to the first data position stored in the buffer is spliced to be used as a new history splicing matrix, and the like, before performing attention computation on the next data position, the attention computation is performed again according to the transposition of the advanced numerical matrix, and the transposed matrix is stored in the buffer to perform updating operation of the history splicing matrix, which is not repeated here.
In an alternative embodiment, performing attention computation on the target location data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target location data to obtain an attention result corresponding to the target location data, including: starting to execute attention calculation based on the query matrix and the key matrix to obtain an attention weight matrix corresponding to the target position data; and obtaining the attention result corresponding to the target position data based on the attention weight matrix and the transpose matrix corresponding to the target position data.
Optionally, when performing attention computation on the target position data, attention computation is first started based on the query matrix and the key matrix corresponding to the target position data. Specifically, the key matrix corresponding to the target position data is spliced with one or more pieces of history position data, and a new spliced key matrix is obtained. And calculating based on the query matrix corresponding to the target position data and the new spliced key matrix to obtain the attention weight matrix corresponding to the target position data, wherein the specific calculation mode is the same as the above, and the detailed description is omitted. And multiplying the attention weight matrix with a transpose matrix corresponding to the target position data to obtain an attention result corresponding to the target position data. By the method, continuity of the multiplication calculation process can be guaranteed, meanwhile, a matrix transposition process is not required to be executed in the attention calculation process, and attention calculation overhead is reduced.
In an alternative embodiment, the method further comprises: the method comprises the steps of obtaining attention results of other position data in the same mode as the attention results corresponding to the target position data, wherein the other position data are position data except the target position data in a target query sequence; and obtaining a query result corresponding to the target query sequence based on the attention result corresponding to the target position data and the attention result of other position data.
Optionally, in the case that the target query sequence is an input text sequence, the corresponding sequence generation result is a text generation result. When information query is performed based on a large language model, attention results of target positions are required to be considered, attention results of other data positions are required to be considered, namely query results are generated according to attention results respectively corresponding to all position data in a target query sequence, consideration factors are comprehensive, and the obtained sequence generation results are accurate and reliable.
In the present application, the acquisition mode of the attention result of other position data is the same as the acquisition mode of the attention result of the target position sequence, and the specific acquisition mode may be: acquiring a query matrix, a key matrix and a numerical matrix corresponding to other position data; before performing attention computation on other position data, transposing a numerical matrix of the other position data to obtain a transposed matrix corresponding to the other position data; and executing attention calculation on the other position data based on the query matrix, the key matrix and the transpose matrix corresponding to the other position data to obtain attention results corresponding to the other position data.
Based on the above embodiment and the optional embodiment, the present application proposes an implementation of an optional data processing method, which includes:
step S1, determining a target query sequence, which specifically comprises the following sub-steps:
step S11, inputting a query sequence in the large language model, wherein the query sequence can be a text sequence;
step S12, judging whether the data volume of the query sequence is greater than or equal to 200;
and step S13, if the data volume of the query sequence is greater than or equal to 200, the query sequence is taken as a target query sequence.
Step S2, sequentially taking the position data included in the target query sequence as target position data, and obtaining attention results respectively corresponding to the position data included in the target query sequence in the following manner:
Step S21, acquiring a query matrix, a key matrix and a numerical matrix corresponding to the target position data; performing transposition processing on the numerical matrix of the target position data to obtain a transposed matrix of the target position data; detecting whether the target position data is the first position data in the target query sequence, if so, executing the step S22, otherwise, executing the step S23;
step S22, based on the query matrix, the key matrix and the transpose matrix corresponding to the target position data, performing attention calculation on the target position data to obtain an attention result corresponding to the target position data;
Step S23, determining one or more historical position data before the target position data in the target query sequence;
Step S24, acquiring one or more history splicing matrixes corresponding to the history position data, wherein the history splicing matrixes are obtained by splicing transposed matrixes of numerical matrixes corresponding to the history position data respectively;
Step S25, splicing the transpose matrix corresponding to the target position data to the history splicing matrix to obtain a target splicing matrix;
step S26, based on the query matrix, the key matrix and the target splicing matrix, attention calculation is performed on the target position data, and an attention result corresponding to the target position data is obtained.
And S3, generating texts based on the attention results respectively corresponding to the position data included in the target query sequence, and obtaining text generation results.
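Steps S1 to S3 could be assembled as in the following end-to-end sketch (illustrative only: it reuses the attention_transpose_ahead helper sketched earlier, and the per-position (q, k, v) inputs are assumed stand-ins for the model's projections rather than the patent's implementation):

```python
import numpy as np

def process_query_sequence(qkv_rows, threshold=200):
    """qkv_rows: per-position (q, k, v) tuples, each matrix [1, head_dim].
    Steps S12/S13: only sequences with >= threshold positions take this
    transpose-ahead path; the short-sequence branch is omitted here."""
    assert len(qkv_rows) >= threshold
    head_dim = qkv_rows[0][0].shape[1]
    K_cache = np.empty((0, head_dim))    # spliced key matrix
    V_cache_t = np.empty((head_dim, 0))  # history splicing matrix
    results = []
    for q, k, v in qkv_rows:             # step S2: each position in turn
        K_cache = np.concatenate([K_cache, k], axis=0)        # new spliced key matrix
        V_cache_t = np.concatenate([V_cache_t, v.T], axis=1)  # steps S21/S25
        results.append(attention_transpose_ahead(q, K_cache, V_cache_t))  # S22/S26
    return results  # step S3: attention results feeding text generation
```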
Through the above steps S1 to S3, when the amount of data input to the large language model at one time is large, the numerical vector of each position data in the target query sequence is transposed before attention calculation is performed on that position data, and the corresponding history splicing matrix likewise consists of transposed matrices, so no transposition operation needs to be performed during the attention calculation for the target position data, thereby reducing the attention calculation overhead and improving the execution efficiency of attention calculation.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus a necessary general hardware platform, but that it may also be implemented by means of hardware. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
Example 2
According to an embodiment of the present application, there is also provided a data processing apparatus for implementing the above data processing method, as shown in fig. 8, the apparatus including: a first acquisition module 800, a second acquisition module 802, a matrix transpose module 804, an attention calculation module 806, wherein,
A first obtaining module 800, configured to obtain target location data in a target query sequence;
a second obtaining module 802, coupled to the first obtaining module 800, for obtaining a query matrix, a key matrix, and a numerical matrix corresponding to the target position data;
The matrix transpose module 804 is connected to the second obtaining module 802, and is configured to transpose the numerical matrix before performing attention calculation on the target position data, so as to obtain a transpose matrix corresponding to the target position data;
The attention calculating module 806 is connected to the matrix transpose module 804, and is configured to perform attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, so as to obtain an attention result corresponding to the target position data.
In the embodiment of the present application, the first obtaining module 800 is configured to obtain target location data in a target query sequence; a second obtaining module 802, coupled to the first obtaining module 800, for obtaining a query matrix, a key matrix, and a numerical matrix corresponding to the target position data; the matrix transpose module 804 is connected to the second obtaining module 802, and is configured to transpose the numerical matrix before performing attention calculation on the target position data, so as to obtain a transpose matrix corresponding to the target position data; the attention calculating module 806 is connected to the matrix transpose module 804, and is configured to perform attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, so as to obtain an attention result corresponding to the target position data, thereby achieving the purpose of completing the transpose of the numerical matrix before performing the attention calculation, so as to reduce the attention calculation overhead, thereby realizing the technical effects of reducing the attention calculation overhead and improving the attention calculation efficiency, and further solving the technical problem of low attention calculation efficiency in the related art.
In an alternative embodiment, the attention calculation module includes: a first determining unit 810, configured to determine one or more historical position data preceding the target position data in the target query sequence; a first obtaining unit 812, configured to obtain a history splicing matrix corresponding to the one or more historical position data, where the history splicing matrix is obtained by splicing the transpose matrices of the numerical matrices respectively corresponding to the one or more historical position data; and a first calculating unit 814, configured to perform the attention calculation on the target position data based on the query matrix, the key matrix, the transpose matrix corresponding to the target position data, and the history splicing matrix corresponding to the one or more historical position data, to obtain the attention result corresponding to the target position data.
In an alternative embodiment, the first calculating unit includes: a first splicing unit, used for splicing the transpose matrix corresponding to the target position data onto the history splicing matrix to obtain a target splicing matrix; and a second calculation unit, used for performing the attention calculation on the target position data based on the query matrix, the key matrix, and the target splicing matrix, to obtain the attention result corresponding to the target position data.
In an alternative embodiment, the first obtaining unit includes: a first transposing unit, used for transposing the numerical matrices corresponding to the one or more historical position data, before the attention calculation is performed on the one or more historical position data according to their sequence order in the target query sequence, to obtain the corresponding transpose matrices; and a second splicing unit, used for splicing the transpose matrices corresponding to the one or more historical position data according to the sequence order to obtain the history splicing matrix.
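As one hedged illustration of how such a history splicing matrix could be maintained (the column-wise layout and all names are assumptions, not the claimed data structure):

```python
import numpy as np

d = 64                                     # illustrative hidden dimension
history = np.empty((d, 0))                 # history splicing matrix, shape (d, t)

def splice_transposed_value(history, v_new):
    # Transpose the (1, d) value row of the newest position and append it as
    # the next column, preserving the sequence order of the positions.
    return np.concatenate([history, v_new.T], axis=1)   # (d, t + 1)

rng = np.random.default_rng(1)
for _ in range(3):                         # three historical positions, in order
    history = splice_transposed_value(history, rng.normal(size=(1, d)))
# 'history' now holds the transposed values of all past positions and can be
# kept in a cache, as in the embodiment above.
```

Because each column is transposed when it is produced, later attention steps read the cache directly and never transpose the numerical matrix again.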
In an alternative embodiment, the history splicing matrix corresponding to the one or more historical position data is pre-stored in a cache.
In an alternative embodiment, the first calculating unit includes: a third calculation unit, configured to begin the attention calculation based on the query matrix and the key matrix to obtain an attention weight matrix corresponding to the target position data; and a second acquisition unit, configured to obtain the attention result corresponding to the target position data based on the attention weight matrix and the transpose matrix corresponding to the target position data.
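In conventional attention notation (an illustrative restatement rather than a formula taken from the source), this two-stage computation can be written as

$$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \qquad O = A V = \left(V^{\top} A^{\top}\right)^{\top},$$

so the attention result $O$ is obtained directly from the pre-transposed value matrix $V^{\top}$, and no transpose of the numerical matrix is needed inside the attention step.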
In an alternative embodiment, the attention calculation is a multi-head attention calculation in a large language model, and the target position data is one head for which the multi-head attention calculation is performed.
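A hedged sketch of the multi-head case, under the assumption that each head keeps its own pre-transposed value cache (the head count, dimensions, and names are hypothetical):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

h, d_h, n = 4, 16, 8                       # hypothetical head count, head dim, cache length
rng = np.random.default_rng(2)
q = rng.normal(size=(h, 1, d_h))           # the target position's query, per head
k = rng.normal(size=(h, n, d_h))           # cached keys, per head
v_t = rng.normal(size=(h, d_h, n))         # cached values, already transposed per head

heads = []
for i in range(h):                         # each head runs the same single-head routine
    w = softmax(q[i] @ k[i].T / np.sqrt(d_h))   # (1, n) weights for this head
    heads.append((v_t[i] @ w.T).T)              # (1, d_h) result in row layout
out = np.concatenate(heads, axis=-1)            # (1, h * d_h) multi-head output
```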
In an alternative embodiment, the amount of data included in the target query sequence is greater than or equal to 200.
In an alternative embodiment, the apparatus further includes: a third obtaining unit, configured to obtain attention results of other position data in the same manner as the attention result corresponding to the target position data, where the other position data are position data in the target query sequence other than the target position data; and a sequence generating unit, configured to obtain a query result corresponding to the target query sequence based on the attention result corresponding to the target position data and the attention results of the other position data.
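Putting the pieces together, the following end-to-end sketch processes every position of a sequence in order, transposing and splicing each value before its attention step and collecting the per-position results into the query result; again, all sizes and names are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d, n = 64, 6                               # illustrative hidden dim and sequence length
rng = np.random.default_rng(3)
k_cache = np.empty((0, d))                 # cached keys, in sequence order
vt_cache = np.empty((d, 0))                # history splicing matrix (transposed values)
results = []

for _ in range(n):                         # one pass per position of the sequence
    q, k, v = (rng.normal(size=(1, d)) for _ in range(3))
    k_cache = np.concatenate([k_cache, k], axis=0)
    vt_cache = np.concatenate([vt_cache, v.T], axis=1)   # transpose BEFORE attention
    w = softmax(q @ k_cache.T / np.sqrt(d))              # (1, t) attention weights
    results.append((vt_cache @ w.T).T)                   # (1, d) attention result

query_result = np.concatenate(results, axis=0)           # (n, d) result for the sequence
```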
It should be noted that the first acquisition module 800, the second acquisition module 802, the matrix transpose module 804, and the attention calculation module 806 correspond to steps S202 to S208 in Example 1; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of Example 1. It should also be noted that the above modules or units may be hardware components, or software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), or they may be part of an apparatus and run in the computer terminal 10 provided in Example 1.
It should be noted that the preferred implementations, application scenarios, and implementation processes of this embodiment are the same as those provided in Example 1, but are not limited to what is provided in Example 1.
Example 3
Embodiments of the present application may provide a computer terminal, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the above computer terminal may be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the above computer terminal may be located in at least one of a plurality of network devices of a computer network.
In this embodiment, the above computer terminal may execute program code of the following steps of the data processing method: acquiring target position data in a target query sequence; acquiring a query matrix, a key matrix, and a numerical matrix corresponding to the target position data; before performing attention calculation on the target position data, transposing the numerical matrix to obtain a transpose matrix corresponding to the target position data; and performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, to obtain an attention result corresponding to the target position data.
Optionally, fig. 9 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 9, the computer terminal may include: one or more processors 102 (only one is shown), a memory 104, a memory controller, and a peripheral interface, where the peripheral interface is connected to a radio frequency module, an audio module, and a display.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the data processing method and apparatus in the embodiments of the present application; the processor executes the software programs and modules stored in the memory, thereby performing various functional applications and data processing, that is, implementing the data processing method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located relative to the processor, and such remote memory may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring target position data in a target query sequence; acquiring a query matrix, a key matrix, and a numerical matrix corresponding to the target position data; before performing attention calculation on the target position data, transposing the numerical matrix to obtain a transpose matrix corresponding to the target position data; and performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, to obtain an attention result corresponding to the target position data.
Optionally, the above processor may further execute program code for: Determining one or more historical position data preceding the target position data in the target query sequence; acquiring a history splicing matrix corresponding to the one or more historical position data, wherein the history splicing matrix is obtained by splicing the transpose matrices of the numerical matrices respectively corresponding to the one or more historical position data; and performing the attention calculation on the target position data based on the query matrix, the key matrix, the transpose matrix corresponding to the target position data, and the history splicing matrix corresponding to the one or more historical position data, to obtain an attention result corresponding to the target position data.
Optionally, the above processor may further execute program code for: splicing the transpose matrix corresponding to the target position data to the history splicing matrix to obtain a target splicing matrix; and based on the query matrix, the key matrix and the target splicing matrix, performing attention calculation on the target position data to obtain an attention result corresponding to the target position data.
Optionally, the above processor may further execute program code for: Before performing the attention calculation on the one or more historical position data according to their sequence order in the target query sequence, transposing the numerical matrices corresponding to the one or more historical position data to obtain the corresponding transpose matrices; and splicing the transpose matrices corresponding to the one or more historical position data according to the sequence order to obtain the history splicing matrix.
Optionally, the above processor may further execute program code for: starting to execute attention calculation based on the query matrix and the key matrix to obtain an attention weight matrix corresponding to the target position data; and obtaining the attention result corresponding to the target position data based on the attention weight matrix and the transpose matrix corresponding to the target position data.
Optionally, the above processor may further execute program code for: Obtaining attention results of other position data in the same manner as the attention result corresponding to the target position data, wherein the other position data are position data in the target query sequence other than the target position data; and obtaining a query result corresponding to the target query sequence based on the attention result corresponding to the target position data and the attention results of the other position data.
By adopting the embodiments of the present application, a data processing scheme is provided: target position data in a target query sequence is acquired; a query matrix, a key matrix, and a numerical matrix corresponding to the target position data are acquired; before attention calculation is performed on the target position data, the numerical matrix is transposed to obtain a transpose matrix corresponding to the target position data; and the attention calculation is performed on the target position data based on the query matrix, the key matrix, and the transpose matrix, to obtain an attention result corresponding to the target position data. The transposition of the numerical matrix is thus completed before the attention calculation, reducing the attention calculation overhead and solving the technical problem of low attention calculation efficiency in the related art.
It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely illustrative; the computer terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, or the like. Fig. 9 does not limit the structure of the above electronic device. For example, the computer terminal may also include more or fewer components (e.g., a network interface, a display device, etc.) than shown in fig. 9, or have a configuration different from that shown in fig. 9.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.
Example 4
The embodiment of the present application also provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store program code for executing the data processing method provided in Example 1.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring target position data in a target query sequence; acquiring a query matrix, a key matrix, and a numerical matrix corresponding to the target position data; before performing attention calculation on the target position data, transposing the numerical matrix to obtain a transpose matrix corresponding to the target position data; and performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, to obtain an attention result corresponding to the target position data.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (12)

1. A data processing method, comprising:
acquiring target position data in a target query sequence;
acquiring a query matrix, a key matrix, and a numerical matrix corresponding to the target position data;
before performing attention calculation on the target position data, transposing the numerical matrix to obtain a transpose matrix corresponding to the target position data; and
performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, to obtain an attention result corresponding to the target position data.
2. The method of claim 1, wherein performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data to obtain the attention result corresponding to the target position data comprises:
determining one or more historical position data preceding the target position data in the target query sequence;
acquiring a history splicing matrix corresponding to the one or more historical position data, wherein the history splicing matrix is obtained by splicing transpose matrices of numerical matrices respectively corresponding to the one or more historical position data; and
performing the attention calculation on the target position data based on the query matrix, the key matrix, the transpose matrix corresponding to the target position data, and the history splicing matrix corresponding to the one or more historical position data, to obtain the attention result corresponding to the target position data.
3. The method of claim 2, wherein performing the attention calculation on the target position data based on the query matrix, the key matrix, the transpose matrix corresponding to the target position data, and the history splicing matrix corresponding to the one or more historical position data to obtain the attention result corresponding to the target position data comprises:
splicing the transpose matrix corresponding to the target position data onto the history splicing matrix to obtain a target splicing matrix; and
performing the attention calculation on the target position data based on the query matrix, the key matrix, and the target splicing matrix, to obtain the attention result corresponding to the target position data.
4. The method of claim 2, wherein acquiring the history splicing matrix corresponding to the one or more historical position data comprises:
before performing the attention calculation on the one or more historical position data according to their sequence order in the target query sequence, transposing the numerical matrices corresponding to the one or more historical position data to obtain corresponding transpose matrices; and
splicing the transpose matrices corresponding to the one or more historical position data according to the sequence order to obtain the history splicing matrix.
5. The method of claim 2, wherein the history splicing matrix corresponding to the one or more historical position data is pre-stored in a cache.
6. The method of claim 1, wherein performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data to obtain the attention result corresponding to the target position data comprises:
beginning the attention calculation based on the query matrix and the key matrix to obtain an attention weight matrix corresponding to the target position data; and
obtaining the attention result corresponding to the target position data based on the attention weight matrix and the transpose matrix corresponding to the target position data.
7. The method of claim 1, further comprising:
obtaining attention results of other position data in the same manner as the attention result corresponding to the target position data, wherein the other position data are position data in the target query sequence other than the target position data; and
obtaining a query result corresponding to the target query sequence based on the attention result corresponding to the target position data and the attention results of the other position data.
8. The method of any one of claims 1 to 7, wherein the attention calculation is a multi-head attention calculation in a large language model, and the target position data is one head for which the multi-head attention calculation is performed.
9. The method of claim 8, wherein the amount of data included in the target query sequence is greater than or equal to 200.
10. A data processing apparatus, comprising:
the first acquisition module is used for acquiring target position data in a target query sequence;
the second acquisition module is used for acquiring a query matrix, a key matrix, and a numerical matrix corresponding to the target position data;
the matrix transpose module is used for transposing the numerical matrix before performing attention calculation on the target position data, to obtain a transpose matrix corresponding to the target position data; and
the attention calculation module is used for performing the attention calculation on the target position data based on the query matrix, the key matrix, and the transpose matrix corresponding to the target position data, to obtain an attention result corresponding to the target position data.
11. An electronic device, comprising:
A memory storing an executable program;
A processor for executing the program, wherein the program when executed performs the data processing method of any one of claims 1 to 9.
12. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored executable program, wherein the executable program when run controls a device in which the computer readable storage medium is located to perform the data processing method according to any one of claims 1 to 9.