CN115525263A - Training method of code completion model, code completion method and device
- Publication number
- CN115525263A (application CN202110706253.0A)
- Authority
- CN
- China
- Prior art keywords
- token
- type
- vocabulary
- sequence
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/33—Intelligent editors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
Description
Technical Field
The present application relates to the field of artificial intelligence, and more specifically, to a training method for a code completion model, a code completion method, and an apparatus thereof.
Background
Code completion technology has long been one of the research hotspots in the field of software engineering. It greatly reduces programmers' workload and improves development efficiency, quality, and experience; automatic code completion is among the core features of all current mainstream integrated development environments (integrated development environment, IDE).
The code completion systems included in traditional IDEs consider only the type compatibility and referenceability of application programming interfaces (application programming interface, API). When facing complex software frameworks, such systems recommend APIs with low accuracy. The main reason is that such recommendation methods simply screen all APIs according to static analysis results, recommend a large number of methods or fields, and finally present the recommendation results sorted alphabetically or by frequency statistics.
At present, the industry has no mature solution for intelligent whole-line code completion, mainly because programming languages have many characteristics that distinguish them from natural languages, such as the out-of-vocabulary (OOV) problem, the diversity problem, and the long-range dependency problem, all of which pose difficulties for traditional natural language processing techniques. In addition, out of consideration for code privacy protection, the deep model needs to be deployed on a local processor, which places more stringent requirements on the runtime efficiency of the algorithm.
Summary
The present application provides a training method for a code completion model, a code completion method, and an apparatus thereof, which help to implement code completion accurately and quickly.
In a first aspect, a training method for a code completion model is provided, including: obtaining first training data, where the first training data includes program code, and the program code includes a lexeme sequence represented by a dual-line parallel sequence, the dual-line parallel sequence including a type line (Type-line) and a token line (Token-line), where the Type-line indicates structured syntax information of the program code and the Token-line indicates semantic information of the program code; and inputting the first training data into a neural network model and training the neural network model based on a multi-layer neural network of a customized Transformer to obtain a target code completion model.
According to the technical solution of the present application, the program code is represented by a dual-line parallel sequence, which yields better sample distribution quality and better supports the subsequent inference-accelerated model training process.
Furthermore, the target code completion model obtained according to the technical solution of the present application can generate whole-line code completions end to end, with accurate prediction and fast inference, which helps to improve users' development efficiency and experience.
With reference to the first aspect, in some implementations of the first aspect, inputting the first training data into the neural network model includes: constructing a vocabulary, where the vocabulary includes a type (Type) vocabulary and a token (Token) vocabulary, the Type vocabulary is constructed from the complete set of Type words, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and mapping the lexeme sequence in the first training data to an integer sequence according to the vocabulary and inputting the integer sequence into the neural network model.
With reference to the first aspect, in some implementations of the first aspect, the lexeme sequence in the first training data includes out-of-vocabulary (OOV) words, and the method further includes: filling the OOV words into the Token vocabulary from its head, in order of their distance from the current cursor position; and/or filling the OOV words into the Token vocabulary from its tail.
According to the technical solution of the present application, introducing a dynamic-position vocabulary solves the OOV problem simply and efficiently, which helps to train the target code completion model quickly and accurately.
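As an illustration, such a dynamic-position vocabulary could be maintained roughly as follows. This is a minimal sketch, assuming that context OOV words fill the dynamic range from the head (ordered by distance from the cursor) while OOV words from dependency files fill it from the tail; the index bounds match the [30001, 30500] dynamic range given in the detailed description below, and the head/tail assignment of the two sources is an assumption.

```python
def build_dynamic_vocab(context_oov, dependency_oov, head=30001, tail=30500):
    # context_oov: OOV words ordered by distance from the current cursor,
    # filled from the head of the dynamic index range upward.
    # dependency_oov: OOV words from dependency files, filled from the tail
    # downward. (Which source maps to which end is assumed for illustration.)
    vocab = {}
    lo, hi = head, tail
    for word in context_oov:
        if word not in vocab and lo <= hi:
            vocab[word] = lo
            lo += 1
    for word in dependency_oov:
        if word not in vocab and lo <= hi:
            vocab[word] = hi
            hi -= 1
    return vocab
```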
With reference to the first aspect, in some implementations of the first aspect, training the neural network model based on the Transformer-based multi-layer neural network includes: embedding the integer sequences representing Type and Token as high-dimensional real-valued vector sequences and concatenating them after encoding into an 88-dimensional input vector sequence; performing feature encoding on the input vector sequence using a first transform network to obtain a feature map; and performing syntax decoding on the feature map using a second transform network and semantic decoding on the feature map using a third transform network to train the neural network, where the depth of the first transform network is greater than the depths of the second and third transform networks, and the depth of the third transform network is greater than the depth of the second transform network.
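A minimal PyTorch sketch of this topology is given below. Only the 88-dimensional concatenated input comes from the paragraph above; the 24 + 64 embedding split, the layer counts (chosen merely to respect the stated depth ordering: encoder deepest, syntax decoder shallowest), and the head count are illustrative assumptions, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class DualLineCompletionModel(nn.Module):
    def __init__(self, type_vocab=185, token_vocab=30501,
                 type_dim=24, token_dim=64,          # assumed 24 + 64 = 88 split
                 n_enc=6, n_syn=1, n_sem=3, n_head=8):
        super().__init__()
        self.type_emb = nn.Embedding(type_vocab, type_dim)
        self.token_emb = nn.Embedding(token_vocab, token_dim)
        d_model = type_dim + token_dim               # 88-dimensional input vectors
        enc = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_head, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_enc)           # first network (deepest)
        self.syntax_decoder = nn.TransformerDecoder(dec, n_syn)    # second network (shallowest)
        self.semantic_decoder = nn.TransformerDecoder(dec, n_sem)  # third network (middle depth)
        self.type_head = nn.Linear(d_model, type_vocab)
        self.token_head = nn.Linear(d_model, token_vocab)

    def embed(self, type_ids, token_ids):
        # Embed both integer sequences and concatenate along the feature axis.
        return torch.cat([self.type_emb(type_ids), self.token_emb(token_ids)], dim=-1)

    def forward(self, ctx_type, ctx_token, tgt_type, tgt_token):
        memory = self.encoder(self.embed(ctx_type, ctx_token))     # feature map
        tgt = self.embed(tgt_type, tgt_token)                      # shifted targets
        type_logits = self.type_head(self.syntax_decoder(tgt, memory))      # syntax decoding
        token_logits = self.token_head(self.semantic_decoder(tgt, memory))  # semantic decoding
        return type_logits, token_logits
```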
According to the technical solution of the present application, training the neural network model on a Transformer-based multi-layer neural network provides better learning of long-range dependencies, and its suitability for parallel computation also speeds up the training of the target code completion model.
Optionally, when training the target model, multi-task joint learning may be used, which speeds up model convergence and improves the model's generalization ability.
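For instance, a joint objective could simply combine cross-entropy losses over the two output heads; the weighting below is an assumption, as the loss form is not specified here.

```python
import torch.nn.functional as F

def joint_loss(type_logits, token_logits, type_target, token_target, alpha=0.5):
    # Multi-task joint learning: optimize the syntax (Type) and semantic
    # (Token) prediction tasks together; alpha is an assumed task weighting.
    loss_type = F.cross_entropy(type_logits.transpose(1, 2), type_target)
    loss_token = F.cross_entropy(token_logits.transpose(1, 2), token_target)
    return alpha * loss_type + (1 - alpha) * loss_token
```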
In a second aspect, a code completion method is provided, including: obtaining program code input by a user; and inputting the program code into a target code completion model to obtain a code completion result, where the target code completion model is obtained by training a neural network model with first training data based on a multi-layer neural network structure of a variant Transformer, the first training data includes program code, and the program code includes a lexeme sequence represented by a dual-line parallel sequence, the dual-line parallel sequence including a type line (Type-line) and a token line (Token-line), where the Type-line indicates structured syntax information of the program code and the Token-line indicates semantic information of the program code.
According to the technical solution of the present application, the target code completion model can generate whole-line code completions end to end, with accurate prediction and fast inference, which helps to improve users' development efficiency and experience.
With reference to the second aspect, in some implementations of the second aspect, inputting the program code into the target code completion model to obtain the code completion result includes: converting the program code into a lexeme sequence represented by a dual-line parallel sequence; mapping the lexeme sequence to an integer sequence according to a vocabulary, where the vocabulary includes a type (Type) vocabulary and a token (Token) vocabulary, the Type vocabulary is constructed from the complete set of Type words, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and obtaining the code completion result according to the integer sequence.
According to the technical solution of the present application, the program code is represented by a dual-line parallel sequence, which yields better sample distribution quality and supports better inference over the sequence to obtain the code completion result.
Further, according to the technical solution of the present application, introducing a dynamic-position vocabulary solves the OOV problem simply and efficiently, which helps to obtain code completion results quickly and accurately.
With reference to the second aspect, in some implementations of the second aspect, obtaining the code completion result according to the integer sequence includes: performing single-step inference on the integer sequence representing Type to obtain a single-step inference result for Type; performing single-step inference on the integer sequence representing Token to obtain a single-step inference result for Token; and obtaining the code completion result according to the single-step inference result for Type and the single-step inference result for Token.
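A greedy version of this loop might look as follows; `model.step` is a hypothetical helper returning next-position logits for Type and Token, and the end-of-line convention is assumed.

```python
def complete_line(model, type_ids, token_ids, eol_type_id, max_steps=32):
    # Single-step inference: at each step predict the next Type and the next
    # Token, feed both back in, and stop at an end-of-line type or the budget.
    completion = []
    for _ in range(max_steps):
        type_logits, token_logits = model.step(type_ids, token_ids)  # hypothetical API
        next_type = int(type_logits.argmax())
        next_token = int(token_logits.argmax())
        if next_type == eol_type_id:
            break
        completion.append((next_type, next_token))
        type_ids.append(next_type)
        token_ids.append(next_token)
    return completion
```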
Optionally, the method further includes: performing static analysis on the integer sequence to obtain a static analysis result. Obtaining the code completion result according to the single-step inference results for Type and Token then includes: using the static analysis result to perform conditional pruning on the single-step inference results for Type and Token to obtain the code completion result.
According to the technical solution of the present application, using the static analysis result to perform conditional pruning on the single-step inference results helps to guarantee the grammatical correctness of the inference results and, further, can also improve the model's inference speed.
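One way to realize such conditional pruning is to mask the single-step logits with a validity set produced by static analysis before any candidate is selected; the interface below is a hedged sketch, not the patent's concrete mechanism.

```python
import numpy as np

def prune_with_static_analysis(logits, legal_ids):
    # legal_ids: candidate indexes that static analysis deems grammatically
    # valid in the current context (an assumed interface). Invalid candidates
    # are driven to -inf so they can never be selected, which both guards
    # syntactic correctness and shrinks the search space at each step.
    legal = list(legal_ids)
    pruned = np.full_like(logits, -np.inf)
    pruned[legal] = logits[legal]
    return pruned
```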
With reference to the second aspect, in some implementations of the second aspect, the single-step inference results for Type and Token include multiple branch groups whose output results have a similarity below a preset value, and the code completion result includes the output results of the multiple groups.
According to the technical solution of the present application, outputting results whose similarity is below a preset value in the form of multiple branch groups helps to improve the diversity of the code completion results.
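A simple way to obtain such groups is to select candidates greedily, admitting a new group only when its similarity to every already-selected group falls below the preset value; the Jaccard similarity over token sets used below is an assumed measure.

```python
def select_diverse_groups(candidates, preset=0.5, num_groups=3):
    # candidates: completions (token lists) sorted by model score, best first.
    def jaccard(a, b):                       # assumed similarity measure
        a, b = set(a), set(b)
        return len(a & b) / max(len(a | b), 1)

    groups = []
    for cand in candidates:
        if all(jaccard(cand, g) < preset for g in groups):
            groups.append(cand)
        if len(groups) == num_groups:
            break
    return groups
```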
In a third aspect, a training apparatus for a code completion model is provided, including: an obtaining unit, configured to obtain first training data, where the first training data includes program code, and the program code includes a lexeme sequence represented by a dual-line parallel sequence, the dual-line parallel sequence including a type line (Type-line) and a token line (Token-line), where the Type-line indicates structured syntax information of the program code and the Token-line indicates semantic information of the program code; and a training unit, configured to input the first training data into a neural network model and train the neural network model based on a multi-layer neural network of a customized Transformer to obtain a target code completion model.
With reference to the third aspect, in some implementations of the third aspect, the training unit is specifically configured to: construct a vocabulary, where the vocabulary includes a type (Type) vocabulary and a token (Token) vocabulary, the Type vocabulary is constructed from the complete set of Type words, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and map the lexeme sequence in the first training data to an integer sequence according to the vocabulary and input the integer sequence into the neural network model.
With reference to the third aspect, in some implementations of the third aspect, the training unit is further configured to: fill OOV words into the Token vocabulary from its head, in order of their distance from the current cursor position; and/or fill OOV words into the Token vocabulary from its tail.
With reference to the third aspect, in some implementations of the third aspect, the training unit is specifically configured to: embed the integer sequences representing Type and Token as high-dimensional real-valued vector sequences and concatenate them after encoding into an 88-dimensional input vector sequence; perform feature encoding on the input vector sequence using a first transform network to obtain a feature map; and perform syntax decoding on the feature map using a second transform network and semantic decoding on the feature map using a third transform network to train the neural network model, where the depth of the first transform network is greater than the depths of the second and third transform networks, and the depth of the third transform network is greater than the depth of the second transform network.
In a fourth aspect, a code completion apparatus is provided, including: an obtaining unit, configured to obtain program code input by a user; and a processing unit, configured to input the program code into a target code completion model to obtain a code completion result, where the target code completion model is obtained by training a neural network model with first training data based on a multi-layer neural network structure of a variant Transformer, the first training data includes program code, and the program code includes a lexeme sequence represented by a dual-line parallel sequence, the dual-line parallel sequence including a type line (Type-line) and a token line (Token-line), where the Type-line indicates structured syntax information of the program code and the Token-line indicates semantic information of the program code.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing unit is specifically configured to: convert the program code into a lexeme sequence represented by a dual-line parallel sequence; map the lexeme sequence to an integer sequence according to a vocabulary, where the vocabulary includes a type (Type) vocabulary and a token (Token) vocabulary, the Type vocabulary is constructed from the complete set of Type words, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and obtain the code completion result according to the integer sequence.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing unit is specifically configured to: perform single-step inference on the integer sequence representing Type to obtain a single-step inference result for Type; perform single-step inference on the integer sequence representing Token to obtain a single-step inference result for Token; and obtain the code completion result according to the single-step inference result for Type and the single-step inference result for Token.
With reference to the fourth aspect, in some implementations of the fourth aspect, the processing unit is further configured to: perform static analysis on the integer sequence to obtain a static analysis result, where obtaining the code completion result according to the single-step inference results for Type and Token includes: using the static analysis result to perform conditional pruning on the single-step inference results for Type and Token to obtain the code completion result.
With reference to the fourth aspect, in some implementations of the fourth aspect, the single-step inference results for Type and Token include multiple branch groups whose output results have a similarity below a preset value, and the code completion result includes the output results of the multiple groups.
In a fifth aspect, a training apparatus for a code completion model is provided, where the training apparatus includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the training method of the first aspect.
In a sixth aspect, a code completion apparatus is provided, where the apparatus includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to execute the method of the second aspect.
In a seventh aspect, a computer-readable storage medium is provided, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the method of the first aspect or the second aspect.
In an eighth aspect, a computer program product is provided, where, when the computer program is executed on a computer, the computer is caused to execute the method of the first aspect or the second aspect.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an artificial intelligence main framework according to the present application.
Fig. 2 is a schematic diagram of an existing code completion method based on rule templates.
Fig. 3 is a schematic diagram of a system architecture according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a convolutional neural network according to the present application.
Fig. 5 is a schematic diagram of the hardware structure of a chip according to the present application.
Fig. 6 is a schematic flowchart of an example of the code completion model training method of the present application.
Fig. 7 is a schematic structural diagram of a network structure applicable to the present application.
Fig. 8 is a schematic flowchart of an example of the code completion method of the present application.
Fig. 9 is a schematic block diagram of the training apparatus for the code completion model of the present application.
Fig. 10 is a schematic diagram of the hardware structure of the training apparatus for the code completion model of the present application.
Fig. 11 is a schematic block diagram of the code completion apparatus of the present application.
Fig. 12 is a schematic diagram of the hardware structure of the code completion apparatus of the present application.
Detailed Description
The technical solutions in the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an artificial intelligence main framework according to an embodiment of the present application. The main framework describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field.
The artificial intelligence framework above is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "information technology (information technology, IT) value chain" (vertical axis).
The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data-information-knowledge-wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (provided and processed by technical implementations) to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform.
The infrastructure can communicate with the outside through sensors, and its computing power can be provided by smart chips.
The smart chip here may be a hardware acceleration chip such as a central processing unit (central processing unit, CPU), a neural-network processing unit (neural-network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), an application-specific integrated circuit (application specific integrated circuit, ASIC), or a field programmable gate array (field programmable gate array, FPGA).
The basic platform of the infrastructure may include related platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection networks, and the like.
For example, the infrastructure can obtain data through sensors and external communication and then provide the data to the smart chips in the distributed computing system provided by the basic platform for computation.
(2) Data:
The data at the layer above the infrastructure represents the data sources of the artificial intelligence field. The data involves at least one of graphics, images, speech, text, and other information. The data differs across application fields and can take different forms. For example, in the Internet of Things field, the content of the data is related to the specific connected IoT terminal and may include, for example, sensed data such as force, displacement, liquid level, temperature, or humidity.
In the embodiments of the present application, the data is, for example, code data, and the code data may be a large-scale code corpus crawled from websites.
(3) Data processing:
The data processing above usually includes processing methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.
Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on the data.
Reasoning refers to the process, in a computer or intelligent system, of simulating human intelligent reasoning and using formalized information to perform machine thinking and problem solving according to reasoning control strategies; its typical functions are search and matching.
Decision-making refers to the process of making decisions after reasoning over the intelligent information, and usually provides functions such as classification, sorting, and prediction.
(4) General capabilities:
After the data undergoes the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example, an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications:
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, safe cities, smart terminals, and the like.
The embodiments of the present application can be applied in the field of intelligent code completion, for example, in program development and program modification.
At present, various code completion methods exist in the prior art; the existing methods are briefly introduced below.
(1) The most widely used intelligent code completion technique at present is the code completion method based on a machine learning classification model. Representative products using this technique include Intellicode, the official plug-in for VS Code, and Kite, a third-party commercial product. This method first performs feature engineering based on expert knowledge to extract specific features, then builds a machine learning classification model on the selected features, and uses the model's prediction results to complete the subsequent code. A typical code completion scenario for this technique is, for example, that when "." is typed after "if path", the technique pops up a completion list offering "StartsWith", "Length", "Replace", "EndsWith", and so on for the user to choose from. However, this technique is limited by the weak expressive power of machine learning models: it cannot learn complex semantics and conventions, the classification model has no ability to learn sequential position relationships, and its prediction quality for more than one token (token) is very poor. In addition, completing a whole line with this method requires triggering multiple predictions and multiple user selections, which is inefficient and gives a poor user experience.
(2) The code completion method based on rule templates. As shown in Fig. 2, this method first judges the intent of the current statement in a rule-driven or data-driven manner, then generates, according to the rules, a statement template with blank slots for the current intent, and finally uses a deep learning model to fill the blank slots to complete the whole line. However, formulating rule templates is very time-consuming and the templates are not general: when the language is switched or updated, the porting workload is heavy. Moreover, because completion is based on rule templates, the completion results are rather mechanical and fixed, with a weak ability to capture contextual semantics; when the context changes, the completion results of this technique may not be updated in time to fit the current context. In addition, the method is obviously limited by the rule templates: statement completion outside the templates is extremely poor, and the method cannot effectively learn code style and programming patterns.
(3) The completion method based on a deep learning sequence generation model. This method uses a seq2seq model to directly map an input code fragment to a continuous code sequence and can complete any whole line of code at any position. Chinese invention patent application No. 201810231329.7 proposes a code completion method based on a long short-term memory network (long short-term memory, LSTM). Addressing shortcomings of existing code completion techniques such as low whole-line completion accuracy and failure to effectively incorporate syntax tree information, the method parses the source code using an abstract syntax tree (abstract syntax tree, AST), then trains a language model using an LSTM, and finally uses a character-level LSTM in the inference and prediction process to achieve code completion. The feature extractor chosen by this method is the LSTM, a sequential network whose biggest drawback is that it cannot efficiently parallelize computation and reuse parameters and therefore cannot handle a large-scale vocabulary. As a result, the method cannot use a token-level vocabulary and instead adopts a vocabulary composed of characters plus identifiers. However, since one characteristic of code languages is character-level discontinuity (that is, a change of one character in a line of code may cause a semantic "jump"), the method cannot accurately learn code sequences end to end and has to rely heavily on rule-based post-processing to ensure that the generated results are reasonable, which increases the coupling of the method and reduces the actual prediction accuracy. In addition, LSTM networks cannot handle long-span dependencies in sequences, which are especially common in code (for example, the definition of a variable or the previous branch of a logic statement may trace back more than a dozen lines). Model inference also takes a long time, so the model cannot be deployed on local devices. Most importantly, the method does not effectively combine static analysis for syntax filtering, so the completion results contain many syntax errors.
For the above reasons, the present application proposes a training method for a code completion model and a code completion method and apparatus, which help to implement code completion accurately and quickly.
Fig. 3 is a schematic diagram of a system architecture according to an embodiment of the present application, which can be used to train a neural network model, for example, a code completion model. As shown in Fig. 3, a data collection device 160 is used to collect training data. For the method of the embodiments of the present application, the training data may include training sequences and the completion results corresponding to the training sequences, where the completion results of the training sequences may be manually pre-labeled. For the training of the code completion model of the embodiments of the present application, a training sequence may be an integer sequence obtained by mapping a lexeme sequence through a pre-built dynamic vocabulary.
After collecting the training data, the data collection device 160 stores the training data in a database 130, and a training device 120 obtains a target model/rule 101 by training on the training data maintained in the database 130. Herein, "A/B" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A/B may indicate the three cases of A alone, both A and B, and B alone.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data. In one case, the training device 120 processes the category feature vector of the input sequence and compares the output category with the labeled category until the accuracy of the category output by the training device 120 is greater than or equal to a certain threshold, thereby completing the training of the target model/rule 101. In this case, the code completion model of the embodiments of the present application can be obtained by training.
The target model/rule 101 can be used to implement the method of the embodiments of the present application. The target model/rule 101 in the embodiments of the present application may specifically be a neural network. It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily all collected by the data collection device 160 and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 entirely on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training. The above description should not be taken as a limitation on the embodiments of the present application.
The target model/rule 101 obtained by training with the training device 120 can be applied to different systems or devices, for example, to the execution device 110 shown in Fig. 3. The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, or a vehicle-mounted terminal, or may be a server, a cloud, or the like. In Fig. 3, the execution device 110 is configured with an input/output (input/output, I/O) interface 112 for data interaction with external devices, and a user can input data to the I/O interface 112 through a client device 140; in the embodiments of the present application, the input data may include an original sequence input by the client device.
A preprocessing module 113 and a preprocessing module 114 are configured to perform preprocessing according to the input data (such as the original sequence) received by the I/O interface 112. In the embodiments of the present application, the preprocessing module 113 and the preprocessing module 114 may also be absent (or only one of them may be present), and the computing module 111 may be used directly to process the input data.
When the execution device 110 preprocesses the input data, or when the computing module 111 of the execution device 110 performs computation or other related processing, the execution device 110 can call data, code, and the like in a data storage system 150 for the corresponding processing and can also store the data, instructions, and the like obtained by the corresponding processing into the data storage system 150.
Finally, the I/O interface 112 returns the processing result to the client device 140 and thus provides it to the user.
It is worth noting that the training device 120 can generate, for different goals or tasks, corresponding target models/rules 101 based on different training data, and the corresponding target models/rules 101 can then be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
In the case shown in Fig. 3, the user can manually specify the input data, and this manual specification can be operated through an interface provided by the I/O interface 112. In another case, the client device 140 can automatically send input data to the I/O interface 112; if the client device 140 is required to obtain the user's authorization in advance before automatically sending input data, the user can set the corresponding permission in the client device 140. The user can view, on the client device 140, the results output by the execution device 110, and the specific presentation form may be display, action, or another specific manner. The client device 140 can also serve as a data collection end, collecting the input data fed into the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data and storing them in the database 130. Of course, the collection may also bypass the client device 140, with the I/O interface 112 directly storing the input data fed into the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data into the database 130.
It should be noted that Fig. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in Fig. 3, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed inside the execution device 110.
As shown in Fig. 3, the target model/rule 101 is obtained by training with the training device 120. The target model/rule 101 may be a neural network obtained by the method of the embodiments of the present application; specifically, the neural network of the embodiments of the present application may be a convolutional neural network (convolutional neural network, CNN) that can be used for code completion, a deep convolutional neural network (deep convolutional neural networks, DCNN), or the like.
Since the CNN is a very common neural network and is the neural network that the embodiments of the present application focus on, the structure of the CNN is introduced in detail below with reference to Fig. 4. As stated in the introduction of basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning (deep learning) architecture; a deep learning architecture refers to performing learning at multiple levels of abstraction through machine learning algorithms. As a deep learning architecture, the CNN is a feed-forward (feed-forward) artificial neural network in which each neuron can respond to the sequence input into it.
In one implementation, the structure of the neural network specifically adopted by the basic neural network in the code completion method of the embodiments of the present application may be as shown in Fig. 4.
Fig. 4 is a schematic structural diagram of a convolutional neural network. In Fig. 4, a convolutional neural network (CNN) 200 may include an input layer 210, layers 220 (the layers 220 may include convolutional layers and pooling layers, or may include convolutional layers without pooling layers), and a fully connected layer 230. The input layer 210 can obtain the code input by the user or the code of a project file and hand the obtained code to be completed over to the layers 220 and the subsequent fully connected layer 230 for processing, so as to obtain the code completion result. The internal layer structure of the CNN 200 in Fig. 4 is introduced in detail below.
Layers 220:
Convolutional layers:
Taking Fig. 4 as an example, the layers 220 may include layers 221 to 226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can serve as the input of a subsequent pooling layer or as the input of another convolutional layer to continue the convolution operation. The numbers and positions of the convolutional and pooling layers here are merely examples: there may be more or fewer convolutional and pooling layers, and the pooling layers may also be omitted.
The following takes the convolutional layer 221 as an example to introduce the inner working principle of one convolutional layer.
The convolutional layer 221 may include many convolution operators, also called kernels, whose role in code processing is equivalent to a filter that extracts specific information from the input program code. A convolution operator may essentially be a weight matrix, which is usually predefined. In the process of performing convolution operations on program code, the weight matrix usually parses a piece of program code into a syntax tree and then processes the syntax tree through low-semantic node filtering, rich-semantic node subdivision, redundant node compression, structured information extraction, and separation from system libraries, thereby improving data quality; finally, a depth-first traversal can be used to obtain the node sequence.
In practical applications, the weight values in these weight matrices need to be obtained through extensive training, and the weight matrices formed by the trained weight values can be used to extract information from the input code, enabling the convolutional neural network 200 to make correct predictions.
When the convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (for example, 221) often extract more general features, which may also be called low-level features; as the depth of the convolutional neural network 200 increases, the features extracted by later convolutional layers (for example, 226) become increasingly complex, for example, high-level semantic features, and features with higher-level semantics are more suitable for the problem to be solved.
Pooling layers:
Since it is often necessary to reduce the number of training parameters, pooling layers can be introduced periodically after convolutional layers. In layers 221 to 226 shown as 220 in Fig. 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
Fully connected layer (fully connected) 230:
After the processing by the layers 220, the convolutional neural network 200 is not yet sufficient to output the required output information. To generate the final output information (the required class information or other related information), the convolutional neural network 200 further uses the fully connected layer 230 to generate one output or a set of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231 and 232 to 23n as shown in Fig. 4) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained by pre-training on training data related to a specific task type; for example, the task type may include the OOV problem, syntax checking, diversity, and the like.
After the multiple hidden layers in the fully connected layer 230, that is, as the last layer of the entire convolutional neural network 200, comes the output layer 240, which has a loss function similar to categorical cross-entropy and is specifically used to compute the prediction error. Once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from 210 to 240 in Fig. 4) is completed, the back propagation (propagation in the direction from 240 to 210 in Fig. 4) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network shown in Fig. 4 is only an example of one possible convolutional neural network serving as the basic neural network of the code completion method of the embodiments of the present application; in specific applications, the convolutional neural network used by the basic neural network of the code completion method of the embodiments of the present application may also exist in the form of other network models.
Fig. 5 is a schematic diagram of the hardware structure of a chip according to an embodiment of the present application. The chip includes a neural network processor (NPU 300 in the figure). The chip may be provided in the execution device 110 shown in Fig. 3 to complete the computation work of the computing module 111, or in the training device 120 shown in Fig. 3 to complete the training work of the training device 120 and output the target model/rule 101. The algorithms of all the layers in the convolutional neural network shown in Fig. 4 can be implemented in the chip shown in Fig. 5.
The NPU 300 is mounted as a coprocessor on a host central processing unit (central processing unit, CPU) (host CPU), which assigns tasks. The core part of the NPU is an arithmetic circuit 303; a controller 304 controls the arithmetic circuit 303 to fetch data from memory (a weight memory or an input memory) and perform operations.
In some implementations, the arithmetic circuit 303 internally includes multiple processing engines (process engine, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 301, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator (accumulator) 308.
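Functionally, the dataflow just described computes an ordinary matrix product; the plain-Python reference below shows what the systolic array produces (it is a semantic reference only, not a model of the hardware's timing).

```python
def matmul_reference(A, B):
    # B is held in the PEs (fetched from the weight memory), rows of A stream
    # in (fetched from the input memory), and partial sums build up in the
    # accumulator until C = A x B is complete.
    n, m, k = len(A), len(B), len(B[0])
    C = [[0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            acc = 0                      # accumulator register
            for p in range(m):
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C
```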
A vector computation unit 307 can further process the output of the arithmetic circuit, for example, vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. For example, the vector computation unit 307 can be used for network computations of the non-convolutional/non-FC layers in the neural network, such as pooling (pooling), batch normalization (batch normalization), and local response normalization (local response normalization).
In some implementations, the vector computation unit 307 can store the processed output vectors to a unified buffer 305. For example, the vector computation unit 307 can apply a nonlinear function to the output of the arithmetic circuit 303, for example, to a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 307 generates normalized values, merged values, or both. In some implementations, the processed output vectors can be used as activation inputs to the arithmetic circuit 303, for example, for use in subsequent layers of the neural network.
The unified memory 305 is used to store input data and output data.
A direct memory access controller (direct memory access controller, DMAC) transfers input data in an external memory to the input memory 301 and/or the unified memory 305, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 305 into the external memory.
A bus interface unit (bus interface unit, BIU) 310 is configured to implement interaction between the host CPU, the DMAC, and an instruction fetch memory 309 through a bus.
The instruction fetch memory (instruction fetch buffer) 309 connected to the controller 304 is used to store instructions used by the controller 304.
The controller 304 is configured to invoke the instructions cached in the instruction fetch memory 309 to control the working process of the computing accelerator.
Depending on the actual application, the data here may be, for example, input program code or other illustrative data.
Optionally, the unified memory 305, the input memory 301, the weight memory 302, and the instruction fetch memory 309 are all on-chip (on-chip) memories, and the external memory is a memory outside the NPU. The external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
The operations of all the layers in the convolutional neural network shown in Fig. 4 may be performed by the arithmetic circuit 303 or the vector computation unit 307.
The execution device 110 in Fig. 3 introduced above can execute each step of the code completion method or the code completion model training method of the embodiments of the present application, and the CNN model shown in Fig. 4 and the chip shown in Fig. 5 can also be used to execute each step of the code completion method of the embodiments of the present application.
图6是本申请实施例的代码补全模型的训练方法的一例示意性流程图,下面对图6的各个步骤进行介绍。FIG. 6 is a schematic flowchart of an example of a code completion model training method according to an embodiment of the present application. The steps in FIG. 6 will be introduced below.
S410,获取第一训练数据。S410. Acquire first training data.
其中,第一训练数据包括程序代码,程序代码可以是经过处理后使用双行平行序列表示的词素序列,其中双行平行序列包括类型行Type-line和令牌行Token-line,Type-line用于指示所述程序代码的结构化语法信息,所述Token-line用于指示所述程序代码的语义信息。Wherein, the first training data includes program code, and the program code can be a morpheme sequence represented by a double-line parallel sequence after processing, wherein the double-line parallel sequence includes a type line Type-line and a token line Token-line, and Type-line uses In order to indicate the structured syntax information of the program code, the Token-line is used to indicate the semantic information of the program code.
具体地,可以获取大规模的代码语料库,首先将其解析为程序结构接口(program structure interface,PSI)语法树,然后对该PSI树进行低语义节点过滤、富语义节点细分、冗余节点压缩、结构化信息提取与系统库分离等一系列预处理操作以提升数据质量,最后使用深度优先遍历得到节点序列,其中每一节点包含Type与Token两个字段(对于部分不含Token字段的节点,将人工标记Token为"Empty"),于是可以得到包含Type-line与Token-line的双行平行序列结构。其中,Type-line表征了代码的结构化语法信息,Token-line表征了代码的语义信息。Specifically, a large-scale code corpus can be obtained and first parsed into a program structure interface (PSI) syntax tree; the PSI tree then undergoes a series of preprocessing operations, such as low-semantic node filtering, rich-semantic node subdivision, redundant node compression, structured information extraction, and system library separation, to improve data quality. Finally, a depth-first traversal yields the node sequence, where each node contains two fields, Type and Token (for nodes without a Token field, the Token is manually marked as "Empty"), so that a double-line parallel sequence structure including the Type-line and the Token-line is obtained. The Type-line represents the structured syntax information of the code, and the Token-line represents its semantic information.
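As a rough illustration (not the patent's PSI implementation), the following Python sketch shows how a depth-first traversal of a toy syntax tree yields the aligned Type-line and Token-line; the `Node` class and the node labels (METHOD_CALL, VAR_ID, etc.) are assumptions for illustration only.

```python
# A rough sketch: pre-order depth-first traversal of a toy syntax tree
# into the aligned Type-line / Token-line parallel sequences.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    type: str                          # structural label of the node
    token: str = "Empty"               # lexical text; "Empty" if absent
    children: List["Node"] = field(default_factory=list)

def to_parallel_lines(root: Node) -> Tuple[List[str], List[str]]:
    """Depth-first traversal producing the two aligned lines."""
    type_line, token_line = [], []
    stack = [root]
    while stack:
        node = stack.pop()
        type_line.append(node.type)
        token_line.append(node.token)
        stack.extend(reversed(node.children))   # keep left-to-right order
    return type_line, token_line

# Toy tree for the expression `file.exists()`
tree = Node("METHOD_CALL", children=[
    Node("VAR_ID", "file"),
    Node("DOT"),                        # structure-only node -> "Empty"
    Node("METHOD_ID", "exists"),
])
types, tokens = to_parallel_lines(tree)
# types  == ["METHOD_CALL", "VAR_ID", "DOT", "METHOD_ID"]
# tokens == ["Empty", "file", "Empty", "exists"]
```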
根据本申请的技术方案,采用双行平行序列对程序代码进行表示,从而具有更好的样本分布质量,同时能更好地支持后续推理加速模型训练的过程。According to the technical solution of the application, the program code is represented by a double-line parallel sequence, thereby having better sample distribution quality, and at the same time better supporting the process of subsequent reasoning acceleration model training.
S420,将第一训练数据输入神经网络模型并基于定制Transformer的多层神经网络训练神经网络模型,得到目标代码补全模型。S420. Input the first training data into the neural network model and train the neural network model based on the multi-layer neural network of the customized Transformer to obtain the target code completion model.
其中,上述将第一训练数据输入神经网络模型,还可以包括:构建词汇表,词汇表包括类型Type词汇表和令牌Token词汇表,Type词汇表通过Type词汇全集构建,Token词汇表中索引对应的键值不固定;根据词汇表将第一训练数据中的词素序列映射为整数序列,将整数序列输入神经网络模型。Wherein, inputting the first training data into the neural network model may further include: constructing a vocabulary, where the vocabulary includes a Type vocabulary and a Token vocabulary, the Type vocabulary is constructed from the complete set of Type vocabulary, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and mapping the morpheme sequence in the first training data into an integer sequence according to the vocabulary, and inputting the integer sequence into the neural network model.
可选地,第一训练数据中的词素序列包括词汇表外OOV词汇,上述方法还包括:将OOV词汇按照距离当前光标位置的顺序从Token词汇表头部依次填入;和/或,将OOV词汇从Token词汇表尾部依次填入。Optionally, the morpheme sequence in the first training data includes out-of-vocabulary (OOV) words, and the above method further includes: filling the OOV words into the Token vocabulary from its head in order of distance from the current cursor position; and/or filling the OOV words into the Token vocabulary from its tail in sequence.
具体地,由于Type词汇表不存在OOV问题,可直接通过Type词汇全集构建词汇表,本实施例中Type词汇表的维度为185,而对于Token词汇表,可以将其分为静态词汇表和动态词汇表,其中,实施例中静态词汇表的索引范围为[0,30000],动态词汇表的索引范围为[30001,30500]。静态词汇表中各索引的键值固定,主要由语言内建词汇(如关键字、运算符等)、内置库词汇(如Java中JDK内置的类名、方法名、变量名)、训练语料库高频词汇以及辅助词汇(如"pad"、"Empty"、"UNK"等)构成。与静态词汇表不同,动态词汇表中各索引对应的键值不固定,而是根据一定规则将运行时上下文与依赖文件中出现的OOV词汇动态地填入词汇表。其具体规则如下:Specifically, since the Type vocabulary has no OOV problem, it can be constructed directly from the complete set of Type vocabulary; in this embodiment the dimension of the Type vocabulary is 185. The Token vocabulary can be divided into a static vocabulary and a dynamic vocabulary, where in this embodiment the index range of the static vocabulary is [0, 30000] and the index range of the dynamic vocabulary is [30001, 30500]. The key value of each index in the static vocabulary is fixed, consisting mainly of language built-in vocabulary (such as keywords and operators), built-in library vocabulary (such as the class names, method names, and variable names built into the JDK in Java), high-frequency vocabulary from the training corpus, and auxiliary vocabulary (such as "pad", "Empty", and "UNK"). Unlike the static vocabulary, the key values corresponding to the indexes in the dynamic vocabulary are not fixed; instead, the OOV words appearing in the runtime context and dependency files are dynamically filled into the vocabulary according to certain rules. The specific rules are as follows (a minimal sketch follows the list):
a)对于上下文所引入的OOV词汇,按照距离当前光标位置由近及远顺序从动态词表头部(此例中即30001位置)依次填入。例如“stringBuilder”将被填入30001位置,“string”将被填入30002位置,以此类推。a) For the OOV words introduced by the context, fill them in order from the head of the dynamic vocabulary (position 30001 in this example) according to the order from near to far from the current cursor position. For example, "stringBuilder" will be filled in position 30001, "string" will be filled in position 30002, and so on.
b)对于依赖文件所引入的OOV词汇,以任意顺序从动态词表尾部(本申请实施例中即30500位置)依次填入。b) For the OOV words introduced by dependency files, fill them in sequentially from the tail of the dynamic vocabulary (position 30500 in the embodiment of the present application) in any order.
c)填充完所有OOV词汇或动态词表被全部占满即停止。c) Stop when all OOV vocabulary is filled or the dynamic vocabulary is fully occupied.
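A minimal sketch of rules a) to c), assuming the index layout of the embodiment (static indexes [0, 30000], dynamic indexes [30001, 30500]); the function name and signature are illustrative, not part of the patent.

```python
# Dynamic-vocabulary filling per rules a)-c) above.
DYNAMIC_HEAD = 30001
DYNAMIC_TAIL = 30500

def build_dynamic_vocab(context_oov, dependency_oov):
    """context_oov: OOV words ordered near-to-far from the cursor;
    dependency_oov: OOV words introduced by dependency files (any order)."""
    table = {}                     # word -> dynamic index
    head, tail = DYNAMIC_HEAD, DYNAMIC_TAIL
    for word in context_oov:       # rule a): fill from the head
        if head > tail:            # rule c): table fully occupied
            return table
        if word not in table:
            table[word] = head
            head += 1
    for word in dependency_oov:    # rule b): fill from the tail
        if head > tail:
            return table
        if word not in table:
            table[word] = tail
            tail -= 1
    return table                   # rule c): all OOV words placed

vocab = build_dynamic_vocab(["stringBuilder", "string"], ["MyHelper"])
# vocab == {"stringBuilder": 30001, "string": 30002, "MyHelper": 30500}
```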
通过设置动态词汇表,可以将所有的OOV词汇都得到相应表示。与此同时,对于动态词表头部的词汇,模型可通过其相对位置关系对其语义进行学习。例如,索引30001永远指向光标之前第一个出现的OOV词汇位置(此例中为"stringBuilder"所在位置),30002永远指向光标之前第二个出现的OOV词汇位置(此例中为"string"所在位置),从而模型可通过这些位置前后出现的词汇及其相对位置关系学习到针对这些位置而不是各个具体词汇(作用类似于"指针")的特征表示,从而无论用户将这些变量如何自定义命名,模型总能给出图中所示的准确预测。By setting the dynamic vocabulary, all OOV words obtain corresponding representations. At the same time, for the words at the head of the dynamic vocabulary, the model can learn their semantics through their relative positional relationships. For example, index 30001 always points to the position of the first OOV word appearing before the cursor (the position of "stringBuilder" in this example), and index 30002 always points to the position of the second OOV word appearing before the cursor (the position of "string" in this example). The model can thus learn, from the words appearing around these positions and their relative positional relationships, feature representations for the positions themselves rather than for specific words (acting somewhat like "pointers"), so that no matter how the user names these variables, the model can always give the accurate predictions shown in the figure.
需要说明的是,由于该方法仅在预处理阶段增加了动态词表构建操作,而在运行时推理阶段,其额外计算量仅来自于将词表由30000维扩充至30500维,这对于计算负担的增加几乎可以忽略。It should be noted that since this method adds only a dynamic vocabulary construction operation in the preprocessing stage, while in the runtime inference stage the additional computation comes only from expanding the vocabulary from 30000 to 30500 dimensions, the increase in computational burden is almost negligible.
根据本申请的技术方案,通过引入动态位置词汇表,简单、高效地解决了OOV问题,有助于快速、准确的训练得到目标代码补全模型。According to the technical solution of the present application, the OOV problem is solved simply and efficiently by introducing the dynamic location vocabulary, which is helpful for fast and accurate training to obtain the target code completion model.
可选地,上述基于Transformer的多层神经网络训练神经网络模型,包括:将表征Type和表征Token的整数序列嵌入为高维实数矢量序列,并通过编码后拼接为88维的输入矢量序列;使用第一变换网络对输入矢量序列进行特征编码得到特征图;使用第二变换网络对特征图进行语法解码,使用第三变换网络对特征图进行语义解码以训练神经网络;其中,第一变换网络的深度大于第二变换网络的深度和第三变换网络的深度,第三变换网络的深度大于第二变换网络的深度。Optionally, training the neural network model based on the Transformer multi-layer neural network includes: embedding the integer sequences representing Type and Token into high-dimensional real-number vector sequences, which after encoding are concatenated into an 88-dimensional input vector sequence; using a first transformation network to perform feature encoding on the input vector sequence to obtain a feature map; and using a second transformation network to perform syntax decoding on the feature map and a third transformation network to perform semantic decoding on the feature map, so as to train the neural network; where the depth of the first transformation network is greater than the depths of the second and third transformation networks, and the depth of the third transformation network is greater than the depth of the second transformation network.
具体地,本申请实施例使用的网络结构如图7所示,主要包括以下几个部分:Specifically, the network structure used in the embodiment of this application is shown in Figure 7, which mainly includes the following parts:
(1)嵌入(Embedding)层(1) Embedding layer
该部分可通过两个可训练的Embedding矩阵(本实施例中的维度分别为185x16、30500x72)分别将表征Type与Token的整数序列嵌入为高维实数矢量序列,分别通过相对位置编码后拼接为88维的输入矢量序列。This part embeds the integer sequences representing Type and Token into high-dimensional real-number vector sequences through two trainable Embedding matrices (with dimensions 185x16 and 30500x72, respectively, in this embodiment), which after relative position encoding are concatenated into an 88-dimensional input vector sequence.
(2)编码(Encoder)层(2) Encoder layer
该部分使用一个极深的第一变换网络Transformer1(本实施例中为深度为8的串行Transformer结构,其中8block和8heads并没有完全对应的中文含义,可以理解为8个块和8个头部)对输入矢量序列进行特征编码。越深的网络将带来越好的学习能力,同时带来越大的计算负担。其中,适用深度为8的网络作为编码层可以满足对大跨度的上下文信息进行特征提取,还可以保证推理效率。This part uses a very deep first transformation network Transformer1 (in this embodiment, a serial Transformer structure with a depth of 8, where "8 blocks" and "8 heads" can be understood as 8 blocks and 8 attention heads) to perform feature encoding on the input vector sequence. A deeper network brings better learning ability but also a greater computational burden. Using a network with a depth of 8 as the encoding layer satisfies feature extraction over long-span context information while still ensuring inference efficiency.
(3)译码(Decoder)层(3) Decoder layer
由于语法信息的表征复杂度远低于语义信息(具体反映为Type的词汇表维度为185,而Token的词汇表维度为30500),该部分可以首先使用一个极浅的第二变换网络Transformer2(本实施例中为深度为1的Transformer结构)对特征图进行语法解码,然后使用一个较深的第三变换网络Transformer3(本实施例中为深度为4的串行Transformer结构)进行语义解码。Since the representation complexity of syntax information is much lower than that of semantic information (reflected in the Type vocabulary dimension of 185 versus the Token vocabulary dimension of 30500), this part can first use a very shallow second transformation network Transformer2 (a Transformer structure with a depth of 1 in this embodiment) to perform syntax decoding on the feature map, and then use a deeper third transformation network Transformer3 (a serial Transformer structure with a depth of 4 in this embodiment) to perform semantic decoding.
(4)特定任务层(4) Specific task layer
该部分可以同时进行多个相关的子任务并产生多种任务相关的输出,如图所示,其中"N_preds"为对Type的预测,"T_preds"为对Token的预测,"Loss-T-non-blacklist"表示令牌词素Loss值,"Loss-T-identifier"表示标识符令牌词素Loss值,"Loss-N"表示类型词素Loss值。This part can perform multiple related subtasks at the same time and produce multiple task-related outputs. As shown in the figure, "N_preds" is the prediction of Type, "T_preds" is the prediction of Token, "Loss-T-non-blacklist" indicates the token morpheme loss value, "Loss-T-identifier" indicates the identifier token morpheme loss value, and "Loss-N" indicates the type morpheme loss value.
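To make the layer layout concrete, the following is a minimal PyTorch sketch under stated assumptions: standard `nn.TransformerEncoder` blocks stand in for the patent's customized Transformer structures, the relative position encoding is omitted, and both decoders are fed from the shared feature map (the actual wiring in Figure 7 may differ). Only the dimensions follow the embodiment (185x16 and 30500x72 embeddings concatenated to 88 dimensions; encoder depth 8, syntax decoder depth 1, semantic decoder depth 4).

```python
# A minimal sketch of the layer layout, not the patent's implementation.
import torch
import torch.nn as nn

def transformer(depth, d_model=88, heads=8):
    layer = nn.TransformerEncoderLayer(d_model, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class CodeCompletionNet(nn.Module):
    def __init__(self, n_types=185, n_tokens=30500):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, 16)    # 185 x 16
        self.token_emb = nn.Embedding(n_tokens, 72)  # 30500 x 72
        self.encoder = transformer(depth=8)          # deep feature encoder
        self.syntax_dec = transformer(depth=1)       # shallow syntax decoder
        self.semantic_dec = transformer(depth=4)     # deeper semantic decoder
        self.type_head = nn.Linear(88, n_types)      # produces N_preds
        self.token_head = nn.Linear(88, n_tokens)    # produces T_preds

    def forward(self, type_ids, token_ids):
        # Concatenate the two embeddings into the 88-dimensional input.
        x = torch.cat([self.type_emb(type_ids),
                       self.token_emb(token_ids)], dim=-1)  # (B, L, 88)
        feat = self.encoder(x)                               # feature map
        n_preds = self.type_head(self.syntax_dec(feat))      # syntax branch
        t_preds = self.token_head(self.semantic_dec(feat))   # semantic branch
        return n_preds, t_preds
```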
根据本申请的技术方案,基于Transformer的多层神经网络训练神经网络模型,具有更好的长距依赖学习能力,同时易于并行计算的特点也能够提升训练目标代码补全模型的速度。According to the technical solution of the present application, training the neural network model on a Transformer-based multi-layer neural network provides better long-range dependency learning ability, and its amenability to parallel computation also speeds up the training of the target code completion model.
同时,在训练目标模型时,还可以通过采用多任务联合学习进行训练,这样可以加快模型收敛速度,提高模型泛化能力。At the same time, when training the target model, multi-task joint learning can also be used for training, which can speed up the convergence speed of the model and improve the generalization ability of the model.
根据本申请的技术方案得到的目标代码补全模型能够端对端生成整行代码补全,预测准确,推理快速,有助于提升用户开发效率和体验。The target code completion model obtained according to the technical solution of the present application can generate end-to-end completion of the entire line of code, with accurate prediction and fast reasoning, which helps to improve user development efficiency and experience.
图8是本申请实施例的代码补全的方法的一例示意性流程图,下面对图8的各个步骤进行介绍。FIG. 8 is a schematic flowchart of an example of a code completion method according to an embodiment of the present application. The steps in FIG. 8 are described below.
S610,获取用户输入的程序代码。S610. Obtain the program code input by the user.
其中,用户输入的代码包括实时输入的代码和工程文件。Wherein, the codes input by the user include real-time input codes and project files.
S620,将程序代码输入到目标代码补全模型,得到代码补全结果。S620. Input the program code into the target code completion model to obtain a code completion result.
其中,目标代码补全模型是利用第一训练数据并基于变形Transformer的多层神经网络结构训练神经网络模型得到的,所述第一训练数据包括程序代码,所述程序代码包括使用双行平行序列表示的词素序列,其中所述双行平行序列包括类型行Type-line和令牌行Token-line,所述Type-line用于指示所述程序代码的结构化语法信息,所述Token-line用于指示所述程序代码的语义信息。Wherein, the target code completion model is obtained by training a neural network model with the first training data based on a multi-layer neural network structure of a variant Transformer; the first training data includes program code, and the program code includes a morpheme sequence represented by a double-line parallel sequence, where the double-line parallel sequence includes a type line (Type-line) and a token line (Token-line); the Type-line is used to indicate the structured syntax information of the program code, and the Token-line is used to indicate the semantic information of the program code.
可选地,上述将程序代码输入到目标代码补全模型,得到代码补全结果,包括:将程序代码转换为使用双行平行序列表示的词素序列;根据词汇表将词素序列映射为整数序列,词汇表包括类型Type词汇表和令牌Token词汇表,Type词汇表通过Type词汇全集构建,Token词汇表中各索引对应的键值不固定;根据整数序列,得到代码补全结果。Optionally, inputting the program code into the target code completion model to obtain the code completion result includes: converting the program code into a morpheme sequence represented by a double-line parallel sequence; mapping the morpheme sequence into an integer sequence according to a vocabulary, where the vocabulary includes a Type vocabulary and a Token vocabulary, the Type vocabulary is constructed from the complete set of Type vocabulary, and the key values corresponding to the indexes in the Token vocabulary are not fixed; and obtaining the code completion result according to the integer sequence.
具体地,将程序代码转换为词素序列,再将词素序列映射为整数序列,用于得到代码补全结果,具体请参照图6中的描述,在此不再赘述。Specifically, the program code is converted into a morpheme sequence, and the morpheme sequence is then mapped into an integer sequence to obtain the code completion result; for details, refer to the description of FIG. 6, which is not repeated here.
根据本申请的技术方案,采用双行平行序列对程序代码进行表示,从而具有更好的样本分布质量,同时能更好地对其进行推理,以得到代码补全结果。According to the technical solution of the present application, the program code is represented by a double-line parallel sequence, thereby having better sample distribution quality, and at the same time, it can be reasoned better to obtain a code completion result.
进一步地,根据本申请的技术方案,通过引入动态位置词汇表,简单、高效地解决了OOV问题,有助于快速、准确的得到代码补全结果。Furthermore, according to the technical solution of the present application, by introducing a dynamic location vocabulary, the OOV problem is solved simply and efficiently, which helps to obtain code completion results quickly and accurately.
可选地,上述根据整数序列,得到代码补全结果,包括:对表征Type的整数序列进行单步推理得到表征Type的单步推理结果;对表征Token的整数序列进行单步推理得到表征Token的单步推理结果;根据表征Type的单步推理结果和表征Token的单步推理结果得到代码补全结果。Optionally, obtaining the code completion result according to the integer sequence includes: performing single-step inference on the integer sequence representing Type to obtain a single-step inference result representing Type; performing single-step inference on the integer sequence representing Token to obtain a single-step inference result representing Token; and obtaining the code completion result according to the single-step inference result representing Type and the single-step inference result representing Token.
具体地,在推理过程中,不再采用同步执行Type-line与Token-line的推理,而是执行Type/Token分离式推理模式,具体流程如下:Specifically, in the reasoning process, instead of synchronously executing Type-line and Token-line reasoning, Type/Token separate reasoning mode is implemented. The specific process is as follows:
a)执行Type-line推理,计算N_preds。a) Perform Type-line reasoning and calculate N_preds.
b)当N_preds结果属于Identifier集合(Token为identifier的节点的Type类型集合,包括TYPE_ID、VAR_ID、METHOD_ID、CLASS_ID等)时,执行Token-line推理,计算对应T_preds。b) When the N_preds result belongs to the Identifier set (the Type type set of the node whose Token is the identifier, including TYPE_ID, VAR_ID, METHOD_ID, CLASS_ID, etc.), perform Token-line reasoning and calculate the corresponding T_preds.
c)当N_preds结果不属于标识Identifier集合,说明当前节点为关键字、运算符或结构节点,此时可根据N_preds与语法规则快速得出T_preds。c) When the N_preds result does not belong to the Identifier set, it means that the current node is a keyword, operator or structure node. At this time, T_preds can be quickly obtained according to N_preds and grammar rules.
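The steps a) to c) above can be summarized in the following sketch; `predict_type`, `predict_token`, and `token_from_grammar` are assumed stand-ins for the model's Type-line and Token-line single-step inference and the grammar-rule lookup, while `IDENTIFIER_TYPES` mirrors the Identifier set named in step b).

```python
# Type/Token separated inference per steps a)-c).
IDENTIFIER_TYPES = {"TYPE_ID", "VAR_ID", "METHOD_ID", "CLASS_ID"}

def infer_next(context, predict_type, predict_token, token_from_grammar):
    n_pred = predict_type(context)               # a) Type-line inference
    if n_pred in IDENTIFIER_TYPES:               # b) identifier node:
        t_pred = predict_token(context, n_pred)  #    run Token-line inference
    else:                                        # c) keyword/operator/structure:
        t_pred = token_from_grammar(n_pred)      #    derive Token from grammar
    return n_pred, t_pred
```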
根据本申请的技术方案,采用分离式推理的方式,大幅减少了计算次数,从而显著加快了实时推理的速度。According to the technical solution of the present application, the number of calculations is greatly reduced by adopting a separated reasoning method, thereby significantly speeding up the speed of real-time reasoning.
可选地,上述方法还包括:对整数序列进行静态分析得到静态分析结果,根据表征Type的单步推理结果和表征Token的单步推理结果得到代码补全结果,包括:使用静态分析结果对表征Type的单步推理结果和表征Token的单步推理结果进行条件剪枝处理得到代码补全结果。Optionally, the above method further includes: performing static analysis on the integer sequence to obtain a static analysis result; and obtaining the code completion result according to the single-step inference result representing Type and the single-step inference result representing Token includes: performing conditional pruning on the single-step inference result representing Type and the single-step inference result representing Token by using the static analysis result, to obtain the code completion result.
具体地,当预测序列"if(file."的下一节点时,首先计算得到N_preds为"METHOD_ID",于是触发Token-line推理。此时,并非Token词汇表中的全部候选词汇均为可行的,而只有满足静态分析的部分候选词汇(例如"exist"、"getPath"等变量"file"能够引用的方法名)可行。基于这些可行词汇的索引值,我们可对原始网络中的Embedding层与最终输出的分类器(Softmax)层进行剪枝,即相应矩阵仅保留可行词汇索引值所对应的维度。这种方法确保了模型只会产生满足静态分析语法约束的预测结果。另一方面,由于Softmax层的矩阵乘运算占用了整个模型计算开销的40%左右,而该方法将Softmax层的维度由30500降低到了数百,这将极大加快模型的推理速度。Specifically, when predicting the next node of the sequence "if(file.", N_preds is first computed as "METHOD_ID", which triggers Token-line inference. At this point, not all candidate words in the Token vocabulary are feasible; only the candidate words that satisfy the static analysis (for example, method names such as "exist" and "getPath" that the variable "file" can reference) are feasible. Based on the index values of these feasible words, the Embedding layer and the final output classifier (Softmax) layer of the original network can be pruned, that is, the corresponding matrices retain only the dimensions corresponding to the feasible vocabulary index values. This approach ensures that the model only produces prediction results satisfying the grammatical constraints of the static analysis. On the other hand, since the matrix multiplication of the Softmax layer accounts for about 40% of the computational cost of the entire model, and this method reduces the dimension of the Softmax layer from 30500 to a few hundred, it greatly speeds up model inference.
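As a rough illustration of this pruning (not the patent's implementation), the following sketch restricts the Token classifier to the rows whose indexes static analysis marks as feasible; the helper name, `hidden`, and the weight shapes are assumptions for illustration.

```python
# Conditional pruning: compute the softmax only over feasible candidates.
import torch

def pruned_token_distribution(hidden, token_head_weight, token_head_bias,
                              feasible_ids):
    """hidden: (88,); token_head_weight: (30500, 88);
    feasible_ids: LongTensor of vocabulary indexes allowed by static analysis."""
    w = token_head_weight[feasible_ids]   # keep only the feasible rows
    b = token_head_bias[feasible_ids]     # a few hundred instead of 30500
    logits = w @ hidden + b               # (K,) pruned logits
    probs = torch.softmax(logits, dim=-1)
    best = feasible_ids[probs.argmax()]   # map back to the full-vocabulary id
    return best, probs
```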
根据本申请的技术方案,采用静态分析结果对单步推理的结果进行条件剪枝处理,有助于保证推理结果的语法正确性,同时,进一步地,还可以提高模型推理速度。According to the technical solution of the present application, the static analysis result is used to perform conditional pruning on the result of single-step reasoning, which helps to ensure the grammatical correctness of the reasoning result, and at the same time, can further improve the speed of model reasoning.
可选地,在本申请实施例中,表征Type的单步推理结果和表征Token的单步推理结果包括多个分支组,多个分支组的输出结果相似度低于预设值,代码补全结果包括多个组的输出结果。这样,采用多个分支组的形式输出结果相似度低于预设值的结果,有助于提高代码补全结果的多样性。Optionally, in this embodiment of the present application, the single-step inference result representing Type and the single-step inference result representing Token include multiple branch groups whose output similarity is lower than a preset value, and the code completion result includes the output results of the multiple groups. In this way, outputting results whose similarity is lower than the preset value in the form of multiple branch groups helps to improve the diversity of the code completion results.
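One plausible way to realize such mutually dissimilar branch groups is sketched below; the Jaccard measure and the thresholding scheme are illustrative assumptions, as the patent does not specify the similarity measure.

```python
# Keep a candidate only if its similarity to every already-kept group
# stays below the preset threshold.
def select_diverse(candidates, max_groups=3, sim_threshold=0.5):
    """candidates: token sequences sorted by model score (best first)."""
    def jaccard(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)

    groups = []
    for cand in candidates:
        if all(jaccard(cand, kept) < sim_threshold for kept in groups):
            groups.append(cand)
        if len(groups) == max_groups:
            break
    return groups
```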
根据本申请的技术方案,采用目标代码补全模型能够端对端生成整行代码补全,预测准确,推理快速,有助于提升用户开发效率和体验。According to the technical solution of this application, the target code completion model can be used to generate end-to-end completion of the entire line of code, with accurate prediction and fast reasoning, which helps to improve user development efficiency and experience.
图9是本申请代码补全模型的训练装置的示意性框图。图9所示的代码补全模型的训练装置700包括获取单元701和训练单元702。Fig. 9 is a schematic block diagram of a training device for a code completion model of the present application. The code completion model training apparatus 700 shown in FIG. 9 includes an acquisition unit 701 and a training unit 702 .
获取单元701和训练单元702可以用于执行本申请实施例的代码补全模型的训练方法,具体地,获取单元701可以执行上述步骤S410,训练单元702可以执行上述步骤S420。The acquisition unit 701 and the training unit 702 can be used to implement the code completion model training method of the embodiment of the present application. Specifically, the acquisition unit 701 can perform the above step S410, and the training unit 702 can perform the above step S420.
应理解,上述装置700中的训练单元702可以相当于下文中的装置800中的处理器802。It should be understood that the training unit 702 in the above device 700 may be equivalent to the processor 802 in the device 800 hereinafter.
图10是本申请代码补全模型的训练装置的硬件结构示意图。图10所示的代码补全模型的训练装置800(该装置800具体可以是一种计算机设备)包括存储器801、处理器802、通信接口803以及总线804。其中,存储器801、处理器802、通信接口803通过总线804实现彼此之间的通信连接。FIG. 10 is a schematic diagram of the hardware structure of the training device for the code completion model of the present application. The code completion model training apparatus 800 shown in FIG. 10 (the apparatus 800 may specifically be a computer device) includes a memory 801 , a processor 802 , a communication interface 803 and a bus 804 . Wherein, the memory 801 , the processor 802 , and the communication interface 803 are connected to each other through a bus 804 .
存储器801可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器801可以存储程序,当存储器801中存储的程序被处理器802执行时,处理器802和通信接口803用于执行本申请实施例的代码补全模型的训练方法的各个步骤。The memory 801 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM). The memory 801 may store a program. When the program stored in the memory 801 is executed by the processor 802, the processor 802 and the communication interface 803 are used to execute each step of the method for training the code completion model of the embodiment of the present application.
处理器802可以采用CPU,微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的代码补全模型的训练装置中的单元所需执行的功能,或者执行本申请方法实施例的代码补全模型的训练方法。The processor 802 may be a CPU, a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, configured to execute related programs so as to implement the functions to be performed by the units in the code completion model training apparatus of the embodiments of the present application, or to execute the code completion model training method of the method embodiments of the present application.
处理器802还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的代码补全模型的训练方法的各个步骤可以通过处理器802中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器802,还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、ASIC、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器801,处理器802读取存储器801中的信息,结合其硬件完成本申请实施例的代码补全模型的训练装置中包括的单元所需执行的功能,或者执行本申请方法实施例的代码补全模型的训练方法。The processor 802 may also be an integrated circuit chip with signal processing capability. In an implementation process, each step of the code completion model training method of the present application may be completed by an integrated logic circuit of hardware in the processor 802 or by instructions in the form of software. The above processor 802 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 801; the processor 802 reads the information in the memory 801 and, in combination with its hardware, completes the functions to be performed by the units included in the code completion model training apparatus of the embodiments of the present application, or executes the code completion model training method of the method embodiments of the present application.
通信接口803使用例如但不限于收发器一类的收发装置,来实现装置800与其他设备或通信网络之间的通信。例如,可以通过通信接口803获取上述第一训练数据。The communication interface 803 implements communication between the apparatus 800 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver. For example, the above-mentioned first training data may be obtained through the communication interface 803 .
总线804可包括在装置800各个部件(例如,存储器801、处理器802、通信接口803)之间传送信息的通路。The bus 804 may include pathways for transferring information between various components of the device 800 (eg, memory 801 , processor 802 , communication interface 803 ).
图11是本申请代码补全装置的示意性框图。图11所示的代码补全装置900包括获取单元901和处理单元902。Fig. 11 is a schematic block diagram of the code completion device of the present application. The code completion device 900 shown in FIG. 11 includes an acquisition unit 901 and a processing unit 902 .
获取单元901和处理单元902可以用于执行本申请实施例的代码补全方法,具体地,获取单元901可以执行上述步骤S610,处理单元902可以执行上述步骤S620。The acquisition unit 901 and the processing unit 902 may be configured to execute the code completion method of the embodiments of the present application; specifically, the acquisition unit 901 may perform the above step S610, and the processing unit 902 may perform the above step S620.
处理单元902能够实现图8所示的代码补全神经网络的功能。The processing unit 902 can realize the function of the code completion neural network shown in FIG. 8 .
应理解,上述装置900中的处理单元902可以相当于下文中的装置1000中的处理器1002。It should be understood that the processing unit 902 in the above-mentioned device 900 may be equivalent to the processor 1002 in the device 1000 hereinafter.
图12是本申请实施例提供的代码补全装置的硬件结构示意图。图12所示的代码补全装置1000(该装置1000具体可以是一种计算机设备)包括存储器1001、处理器1002、通信接口1003以及总线1004。其中,存储器1001、处理器1002、通信接口1003通过总线1004实现彼此之间的通信连接。Fig. 12 is a schematic diagram of the hardware structure of the code completion device provided by the embodiment of the present application. The code completion apparatus 1000 shown in FIG. 12 (the apparatus 1000 may specifically be a computer device) includes a memory 1001 , a processor 1002 , a communication interface 1003 and a bus 1004 . Wherein, the memory 1001 , the processor 1002 , and the communication interface 1003 are connected to each other through a bus 1004 .
存储器1001可以是ROM,静态存储设备,动态存储设备或者RAM。存储器1001可以存储程序,当存储器1001中存储的程序被处理器1002执行时,处理器1002和通信接口1003用于执行本申请实施例的代码补全方法的各个步骤。The memory 1001 may be a ROM, a static storage device, a dynamic storage device or a RAM. The memory 1001 may store a program. When the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute each step of the code completion method of the embodiment of the present application.
处理器1002可以采用通用的CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请实施例的代码补全装置中的单元所需执行的功能,或者执行本申请方法实施例的代码补全方法。The processor 1002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, configured to execute related programs so as to implement the functions to be performed by the units in the code completion apparatus of the embodiments of the present application, or to execute the code completion method of the method embodiments of the present application.
处理器1002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的代码补全方法的各个步骤可以通过处理器1002中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1002还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1001,处理器1002读取存储器1001中的信息,结合其硬件完成本申请实施例的代码补全装置中包括的单元所需执行的功能,或者执行本申请方法实施例的代码补全方法。The processor 1002 may also be an integrated circuit chip with signal processing capability. In an implementation process, each step of the code completion method of the present application may be completed by an integrated logic circuit of hardware in the processor 1002 or by instructions in the form of software. The above processor 1002 may also be a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1001; the processor 1002 reads the information in the memory 1001 and, in combination with its hardware, completes the functions to be performed by the units included in the code completion apparatus of the embodiments of the present application, or executes the code completion method of the method embodiments of the present application.
通信接口1003使用例如但不限于收发器一类的收发装置,来实现装置1000与其他设备或通信网络之间的通信。例如,可以通过通信接口1003获取上述用户输入的程序代码。The communication interface 1003 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 1000 and other devices or communication networks. For example, the program code input by the user may be acquired through the communication interface 1003.
总线1004可包括在装置1000各个部件(例如,存储器1001、处理器1002、通信接口1003)之间传送信息的通路。The bus 1004 may include a pathway for transferring information between various components of the device 1000 (eg, memory 1001 , processor 1002 , communication interface 1003 ).
应注意,尽管图10所示的装置800、图12所示的装置1000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置800、装置1000还包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置800、装置1000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置800、装置1000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图10、图12中所示的全部器件。It should be noted that although the apparatus 800 shown in FIG. 10 and the apparatus 1000 shown in FIG. 12 show only a memory, a processor, and a communication interface, in a specific implementation process those skilled in the art should understand that the apparatus 800 and the apparatus 1000 also include other components necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 800 and the apparatus 1000 may further include hardware components implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 800 and the apparatus 1000 may alternatively include only the components necessary for implementing the embodiments of the present application, without necessarily including all the components shown in FIG. 10 and FIG. 12.
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同装置来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those skilled in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled artisan may use different means to implement the described functions for each particular application, but such implementation should not be considered as exceeding the scope of the present application.
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the above-described system, device and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in this application, it should be understood that the disclosed systems, methods, and apparatuses may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:通用串行总线闪存盘(USB flash disk,UFD),UFD也可以简称为U盘或者优盘、移动硬盘、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。If the described functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a universal serial bus flash disk (UFD, also referred to as a U disk or USB flash drive), a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。The above descriptions are merely specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.