CN112702659B - Video subtitle processing method and device, electronic equipment and readable storage medium - Google Patents
- Publication number: CN112702659B (application CN202011556224.2A)
- Authority: CN (China)
- Prior art keywords
- target
- voice data
- file
- video
- video file
- Prior art date: 2020-12-24
- Legal status: Active (an assumption, not a legal conclusion)
Landscapes
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Studio Circuits (AREA)
Abstract
The application provides a video subtitle processing method and device, an electronic device, and a readable storage medium, relating to the technical field of video processing. The method comprises the following steps: acquiring a target audio file corresponding to a video file; inputting the target audio file into an audio separation tool and acquiring human voice data from the target audio file; inputting the human voice data output by the audio separation tool into a speech recognition tool and determining the target subtitle text corresponding to the voice data; and adding the target subtitle text corresponding to the voice data into the video file at the corresponding relative time position, based on the relative time position of the voice data in the target audio file, to obtain the target video file. This scheme enables automatic generation and automatic addition of video file subtitles and can improve the efficiency of video subtitle production.
Description
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a method and an apparatus for processing video subtitles, an electronic device, and a readable storage medium.
Background
With the rise of short videos and various live-streaming platforms, demand for making and sharing videos keeps growing, and video subtitles help present video and audio content to the audience. Currently, video subtitles are usually produced and added manually: the audio content in a video must go through a series of text transcription, insertion, and uploading steps before the subtitles are synthesized, so the efficiency of producing and adding video subtitles is low.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, an electronic device and a readable storage medium for processing video subtitles, which can solve the problem of low efficiency of creating and adding video subtitles.
In order to achieve the above object, the embodiments of the present application are implemented as follows:
in a first aspect, an embodiment of the present application provides a method for processing video subtitles, where the method includes:
acquiring a target audio file corresponding to the video file;
inputting the target audio file into an audio separation tool, and acquiring human voice data from the target audio file;
inputting the voice data output by the audio separation tool into a speech recognition tool for determining a target subtitle text corresponding to the voice data;
and adding the target subtitle text corresponding to the voice data into the video file corresponding to the relative time position based on the relative time position of the voice data in the target audio file to obtain a target video file.
In the above embodiment, extracting the human voice data from the audio data reduces the interference of environmental sound with speech conversion, thereby improving the accuracy of converting speech into subtitle text. The obtained target subtitle text is then added into the video file, so that subtitles for video files can be generated and added automatically, which improves the efficiency of video subtitle production.
With reference to the first aspect, in some optional embodiments, inputting the target audio file into an audio separation tool for acquiring human voice data from the target audio file includes:
inputting the target audio file into the audio separation tool;
and extracting the voice data corresponding to each voice from the target audio file through the audio separation tool.
In the above embodiment, separating the human voice data from the audio file reduces the interference of environmental sound with the converted subtitle text and improves its accuracy.
With reference to the first aspect, in some optional embodiments, before inputting the human voice data output by the audio separation tool to a speech recognition tool, the method further comprises:
and selecting, from the human voice data, the voice data whose frequency is greater than a preset frequency and whose duration above the preset frequency is greater than or equal to a preset length, as the voice data to be input to the speech recognition tool.
In the above embodiment, the preset frequency and preset duration allow the obtained voice data to be filtered once more, which reduces unnecessary conversion of non-speech data and improves the efficiency and accuracy of subtitle text conversion.
With reference to the first aspect, in some optional implementations, the target subtitle text includes a first subtitle text and a second subtitle text, and the human voice data output by the audio separation tool is input to a speech recognition tool for determining the target subtitle text corresponding to the human voice data, including:
inputting the human voice data output by the audio separation tool to the speech recognition tool;
determining a first subtitle text of a first language corresponding to the human voice data through the speech recognition tool;
translating the first subtitle text into a second subtitle text in a second language through the speech recognition tool;
and determining the target subtitle text according to the first subtitle text and the second subtitle text.
In the above embodiment, the target subtitle text may include two types of text, which enriches the subtitle style so that corresponding subtitle text can be configured for different languages.
With reference to the first aspect, in some optional embodiments, obtaining a target audio file corresponding to a video file includes:
extracting an initial audio file from the video file;
and converting the initial audio file into the target audio file in a preset format.
With reference to the first aspect, in some optional embodiments, obtaining a target audio file corresponding to a video file includes:
selecting, from the processing list, any video file to which the target subtitle text has not been added, and loading it;
and acquiring a target audio file corresponding to the loaded video file.
In the above embodiment, based on the processing list, the corresponding subtitles can be automatically configured and generated for the batch of video files, which is beneficial to improving the efficiency of generating the subtitles.
With reference to the first aspect, in some optional embodiments, before inputting the target audio file into an audio separation tool, the method comprises:
and when the content of the target audio file is empty, reloading the video file, and acquiring a new target audio file corresponding to the video file.
In a second aspect, an embodiment of the present application further provides a video subtitle processing apparatus, where the apparatus includes:
an acquisition unit configured to acquire a target audio file corresponding to a video file;
the input unit is used for inputting the target audio file into an audio separation tool and acquiring human voice data from the target audio file;
the input unit is further used for inputting the voice data output by the audio separation tool into a voice recognition tool and determining a target subtitle text corresponding to the voice data;
and the subtitle adding unit is used for adding the target subtitle text corresponding to the voice data into the video file corresponding to the relative time position based on the relative time position of the voice data in the target audio file to obtain a target video file.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor and a memory coupled to each other, and a computer program is stored in the memory, and when the computer program is executed by the processor, the electronic device is caused to perform the method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the above method.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a schematic view of a communication connection between an electronic device and a server according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a video subtitle processing method according to an embodiment of the present application.
Fig. 4 is a block diagram of a video subtitle processing apparatus according to an embodiment of the present application.
Icon: 10-an electronic device; 11-a processing module; 12-a storage module; 20-a server; 100-video subtitle processing apparatus; 110-an obtaining unit; 120-an input unit; 130-subtitle adding unit.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that the terms "first," "second," and the like are used merely to distinguish one description from another, and are not intended to indicate or imply relative importance. The embodiments and features of the embodiments described below can be combined with each other without conflict.
Referring to fig. 1, an embodiment of the present application provides an electronic device 10, which can automatically generate and add a corresponding subtitle to a video file to improve the efficiency of producing the subtitle of the video file.
The electronic device 10 may include a processing module 11 and a storage module 12. The storage module 12 stores a computer program which, when executed by the processing module 11, enables the electronic device 10 to perform the steps of the method described below.
Of course, the electronic device 10 may also include other modules. For example, the electronic device 10 may also include a communication module for communicating with the server 20 or other network devices. In addition, the electronic device 10 may further include a software functional module of the video subtitle processing apparatus 100 solidified in the storage module 12.
The processing module 11, the storage module 12, and the communication module are electrically connected to each other directly or indirectly to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
In this embodiment, the electronic device 10 may be a personal computer, a network server, or the like, and is not limited herein.
Referring to fig. 2, in this embodiment, the electronic device 10 may establish a communication connection with the server 20 through the communication module. The server 20 may be used to store the target video file for which the subtitle addition is completed. In addition, the server 20 may also be a server 20 that provides a third party service. Third party services include, but are not limited to, speech to text, text translation processes, and the like.
Referring to fig. 3, an embodiment of the present application provides a video subtitle processing method, which can be applied to the electronic device 10 described above, and the electronic device 10 executes or implements the steps of the method. The method may comprise the steps of:
step S210, acquiring a target audio file corresponding to the video file;
step S220, inputting the target audio file into an audio separation tool for acquiring human voice data from the target audio file;
step S230, inputting the voice data output by the audio separation tool into a speech recognition tool, for determining a target subtitle text corresponding to the voice data;
step S240, based on the relative time position of the voice data in the target audio file, adding the target subtitle text corresponding to the voice data to the video file corresponding to the relative time position, so as to obtain a target video file.
In the above embodiment, the human voice data is extracted from the audio data, which is beneficial to reducing the interference of the environmental sound to the voice conversion, so as to improve the accuracy of the voice conversion into the caption text. And then, adding the obtained target subtitle text into the video file, so that automatic generation and automatic addition of the subtitle of the video file can be realized, and the production efficiency of the video subtitle can be improved.
The individual steps of the process are explained in detail below, as follows:
in step S210, the electronic device may load a video file to which subtitles need to be added, and then extract a corresponding audio file from the video file, that is, a target audio file. In other embodiments, the target audio file may be audio data separate from the video file, e.g., the target audio file may be an audio file recorded separately when the video file was captured, rather than an audio file extracted from the video file. Wherein the time axis of the target audio file is generally the same as the time axis of the video file.
In the present embodiment, the target audio file generally includes audio data such as an environmental sound and a human voice. The voice data is data which needs to be subjected to voice conversion by the electronic equipment, and the environmental sound is audio data which needs to be filtered.
Before extracting the audio file from the video file, the electronic device can set up a Node.js service through the koa2 framework. For example, an input interface named api/v1/input is written, which supports a string-type field videoUrl, allowing the user to pass in the paths of one or more video or audio files to be parsed so that the video or audio files can be read.
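A minimal sketch of such an input interface is shown below, assuming the koa, @koa/router, and koa-bodyparser packages; the route name api/v1/input and the videoUrl and language fields follow the text, while the handler body is illustrative only.

```js
const Koa = require('koa');
const Router = require('@koa/router');
const bodyParser = require('koa-bodyparser');

const app = new Koa();
const router = new Router();

// api/v1/input: videoUrl may carry one or more comma-separated file paths,
// and language selects the subtitle language to add.
router.post('/api/v1/input', async (ctx) => {
  const { videoUrl = '', language = 'zh' } = ctx.request.body || {};
  const paths = videoUrl.split(',').filter(Boolean);
  // ... hand the paths and target language to the subtitle pipeline here ...
  ctx.body = { accepted: paths.length, language };
});

app.use(bodyParser());
app.use(router.routes());
app.listen(3000);
```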
In step S220, the audio separation tool may be a software tool installed in the electronic device or a software tool installed in the server. The electronic device may input the target audio file to the audio separation tool, and then the audio separation tool separates the target audio file to obtain the human voice data and the environmental voice data.
The audio separation tool may be, but is not limited to, the Spleeter library, and may be chosen according to the actual situation. Understandably, the Spleeter library can separate sound files of different tracks from the target audio file according to track and vocal-range thresholds, and save them in wav format.
In this embodiment, step S220 may include:
inputting the target audio file into the audio separation tool;
and extracting the voice data corresponding to each voice from the target audio file through the audio separation tool.
Understandably, if the audio separation tool is installed on the server, the electronic device may send the target audio file to the server, and the audio separation tool in the server performs separation processing on the target audio file. If the audio separation tool is installed on the electronic device, the electronic device may input the obtained target audio file to the audio separation tool, and the audio separation tool performs separation processing on the target audio file.
In the processing process, the audio separation tool can extract the voice data corresponding to each voice from the target audio file. Therefore, when multiple persons speak in the same time period in the video file, the voice data corresponding to each person can be extracted, so that the subtitle can be converted according to the voice data corresponding to each person, and the accuracy of the generated subtitle can be improved.
In step S230, the speech recognition tool may be installed on the server or on the electronic device. The speech recognition tool and the audio separation tool may be installed on the same server or on different servers; their installation locations are not specifically limited.
In this embodiment, the electronic device may obtain the voice data obtained by the audio separation tool from the target audio file, and then input the obtained voice data to the voice recognition tool. The speech recognition tool may convert the human voice data into corresponding text content through a speech recognition algorithm.
For example, if the language corresponding to the human voice data is chinese, the voice recognition tool may convert the human voice data into text content of chinese through a chinese voice recognition algorithm. If the language corresponding to the voice data is english, the voice recognition tool can convert the voice data into text content of english through an english voice recognition algorithm. The voice recognition tool may automatically select a corresponding voice recognition algorithm according to the voice features of different languages, or manually designate a voice recognition algorithm corresponding to each piece of voice data.
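As a sketch of this dispatch, the snippet below assumes hypothetical wrapper functions around a third-party recognition service (the patent does not name a specific vendor or API):

```js
// Hypothetical wrappers around a third-party speech recognition service;
// each would send the audio file to the vendor's language-specific endpoint.
async function recognizeZh(vocalPath) { return `zh text for ${vocalPath}`; }
async function recognizeEn(vocalPath) { return `en text for ${vocalPath}`; }

const recognizers = { zh: recognizeZh, en: recognizeEn };

// Pick a recognizer by language (the language could also be auto-detected,
// as the text notes) and return the recognized subtitle text.
async function toSubtitleText(vocalPath, language = 'zh') {
  const recognize = recognizers[language];
  if (!recognize) throw new Error(`no recognizer for language: ${language}`);
  return recognize(vocalPath);
}
```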
In this embodiment, the target subtitle text may be text content corresponding to a single language; for example, the target subtitle text is the text content corresponding to the human voice data. Alternatively, the target subtitle text may include text content corresponding to two or more languages. For example, if the human voice data is in English, the target subtitle may include both English text content and Chinese text content.
In this embodiment, step S230 may include:
inputting the human voice data output by the audio separation tool to the speech recognition tool;
determining a first subtitle text of a first language corresponding to the human voice data through the speech recognition tool;
translating the first subtitle text into a second subtitle text in a second language through the speech recognition tool;
and determining the target subtitle text according to the first subtitle text and the second subtitle text.
Understandably, the first subtitle text may be text content in the language corresponding to the human voice data, the second subtitle text is subtitle content translated from the first subtitle text, and the language of the second subtitle text is different from that of the first subtitle text.
Understandably, the speech recognition tool may also include a translation tool that may be used to translate text in a first language to text in a second language. The first language and the second language are different and can be determined according to actual conditions. For example, if the first language is english and the second language is chinese, the speech recognition tool may automatically translate the first subtitle text in english into a subtitle text in chinese, and then merge the subtitle texts in english and chinese, where the target subtitle text includes subtitles of two types of texts corresponding to english and chinese.
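A minimal sketch of merging the two subtitle texts into one bilingual target subtitle is given below, assuming the recognized and translated lines are aligned one-to-one:

```js
// Merge first-language and second-language subtitle lines into two-line
// bilingual cues: original language on top, translation underneath.
function mergeSubtitles(firstTexts, secondTexts) {
  return firstTexts.map((line, i) => {
    const translated = secondTexts[i] || '';
    return translated ? `${line}\n${translated}` : line;
  });
}

// e.g. mergeSubtitles(['Hello world'], ['你好，世界'])
//   -> ['Hello world\n你好，世界']
```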
In step S240, the audio separation tool may record the relative time position of each human voice data in the target audio file when extracting the human voice data from the target audio file. In addition, the time axis of the target audio file is generally the same as the time axis of the video file. That is, the relative temporal position of each human voice data in the target audio file is the same as the relative temporal position of the human voice data in the video file. When the subtitles are added, the target subtitle text corresponding to the human voice data can be added to the corresponding relative time position in the video file, and therefore the target video file is obtained.
For example, a video file lasts 60 seconds and a segment of voice data occupies seconds 10 through 15 of the target audio file; when adding subtitles, the subtitle corresponding to that voice data is added to the video file at seconds 10 through 15, so that the subtitle text stays aligned with the voice in the video.
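As an illustration, such a cue could be rendered in the common SRT format (the patent does not fix a particular subtitle container); the helpers below are a sketch:

```js
// Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm).
function toSrtTime(ms) {
  const pad = (n, w = 2) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build one SRT cue at the voice data's relative time position.
function srtCue(index, startMs, endMs, text) {
  return `${index}\n${toSrtTime(startMs)} --> ${toSrtTime(endMs)}\n${text}\n`;
}

// The 10 s to 15 s example from the text:
console.log(srtCue(1, 10000, 15000, 'subtitle for this voice segment'));
```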
As an optional implementation manner, before step S230, the method may further include:
and selecting, from the human voice data, the voice data whose frequency is greater than a preset frequency and whose duration above the preset frequency is greater than or equal to a preset length, as the voice data to be input to the speech recognition tool.
In this embodiment, the preset frequency and the preset duration may be determined according to the actual situation. For example, the preset frequency may be 20 Hz, and the preset duration may be 100 milliseconds. Understandably, if the human voice data contains data whose frequency is less than or equal to the preset frequency, that data is generally interference; likewise, if the frequency exceeds the preset frequency but does so for less than the preset duration (for example, 100 milliseconds), the data is also interference rather than meaningful speech in the video file. Only the voice data that passes this filter is input to the speech recognition tool. Therefore, when the speech recognition tool later recognizes the voice data, the accuracy of converting speech into subtitle text is improved.
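A minimal sketch of this filter follows, assuming each candidate segment already carries its dominant frequency and the duration for which that frequency is sustained (both field names are illustrative):

```js
// Thresholds follow the examples in the text: 20 Hz and 100 ms.
const PRESET_FREQ_HZ = 20;
const PRESET_DURATION_MS = 100;

// Keep only segments loud and long enough to be meaningful speech.
function keepForRecognition(segments) {
  return segments.filter(
    (seg) =>
      seg.frequencyHz > PRESET_FREQ_HZ &&
      seg.aboveFreqDurationMs >= PRESET_DURATION_MS
  );
}

// e.g. keepForRecognition([{ frequencyHz: 180, aboveFreqDurationMs: 1200 }])
```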
As an alternative implementation, step S210 may include:
extracting an initial audio file from the video file;
and converting the initial audio file into the target audio file in a preset format.
In this embodiment, the audio separation tool generally supports only audio files in a preset format. The preset format may be determined according to the actual situation and may be, but is not limited to, wav, speex, or pcm. If the data format of the initial audio file is not the preset format, the electronic device can convert it into the preset format and use the converted audio file as the target audio file, so that different formats can be converted into a format the audio separation tool supports, which makes the target audio file convenient for the audio separation tool to process.
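One way to sketch this conversion is to shell out to the ffmpeg command-line tool (assumed to be installed), producing the 16 kHz mono 16-bit PCM wav mentioned later in the text:

```js
const { execFile } = require('child_process');

// Convert any input audio into 16 kHz mono 16-bit PCM wav.
function toWav(inputPath, outputPath, cb) {
  execFile(
    'ffmpeg',
    ['-y', '-i', inputPath, '-vn', '-ar', '16000', '-ac', '1',
     '-c:a', 'pcm_s16le', outputPath],
    cb
  );
}

toWav('initial.mp3', 'target.wav', (err) => {
  if (err) throw err;
  console.log('converted to target.wav');
});
```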
As an alternative implementation, step S210 may include:
selecting, from the processing list, any video file to which the target subtitle text has not been added, and loading it;
and acquiring a target audio file corresponding to the loaded video file.
In this embodiment, the processing list is a list created based on the preprocessed video file. In the processing list, the processing status of each video file may be included, for example, the processing status may include processed, in-process, unprocessed, and the like. The electronic device may add subtitles to all video files in the processing list one by one.
Understandably, the electronic device may perform steps S210 to S240 described above on the video files one by one, in the order of the processing list, to complete the addition of subtitles. For example, after subtitles have been added to the first video file, any video file without subtitles is selected from the processing list as the next one to process, until corresponding subtitles have been added to all video files. In this way, subtitles can be added to multiple video files in a single batch via the processing list, and the user does not need to trigger subtitle addition for each video file individually, which simplifies operation and improves the production efficiency of video file subtitles. A sketch of such a loop follows.
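The sketch below assumes each list entry tracks a status field; processOneVideo is a hypothetical stand-in for steps S210 to S240:

```js
// Walk the processing list until no unprocessed video remains.
async function processList(list, processOneVideo) {
  let next;
  while ((next = list.find((v) => v.status === 'unprocessed'))) {
    next.status = 'in-process';
    try {
      await processOneVideo(next.path); // steps S210 to S240 for one file
      next.status = 'processed';
    } catch (e) {
      next.status = 'failed'; // skip it so the loop can move on
    }
  }
}

// e.g. processList([{ path: 'a.mp4', status: 'unprocessed' }], doVideo)
```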
As an optional implementation manner, before step S220, the method may further include:
and when the content of the target audio file is empty, reloading the video file, and acquiring a new target audio file corresponding to the video file.
In this embodiment, if the content of the target audio file is empty, it usually indicates that there is a problem with the loaded video file; in that case, the video file needs to be reloaded and the audio file extracted again to serve as the new target audio file. This helps improve the production efficiency of video subtitles and avoids the manual restart that would otherwise be needed when the audio file is empty.
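A sketch of this guard is given below, with extractAudio as a hypothetical extraction helper:

```js
// Reload and re-extract until the audio content is non-empty, up to a limit.
async function getNonEmptyAudio(videoPath, extractAudio, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const audio = await extractAudio(videoPath); // reload + extract again
    if (audio && audio.length > 0) return audio;
  }
  throw new Error('audio content is still empty after reloading');
}
```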
In an embodiment, after obtaining the target video file, the electronic device may store the target video file in a server, and the server may send a link for viewing/downloading the target video file to the electronic device. In this manner, the user may view or download the target video file via the link.
To facilitate understanding of the implementation process of the method, the implementation process will be illustrated as follows:
First, the electronic device builds a Node.js service locally through the koa2 framework and writes an input interface named api/v1/input. The interface supports a string-type field videoUrl, allowing the user to pass in the paths of one or more video or audio files to be parsed (multiple paths can be separated by commas) so that the video files can be read; it also supports a string-type field language, which determines the language of the subtitles to be added.
Second, the specific paths of the video files to be parsed are obtained by parsing the path string uploaded by the user. The video files are loaded asynchronously in batch by calling the Promise functions provided by Node.js, read with the readStream function provided by the FS (file system) module of Node, and converted into data streams at the same time.
Third, the data stream of the video file is read based on the pre-installed Node version of the ffmpeg plug-in. The plug-in is imported via const ffmpeg = require('ffmpeg'), and the corresponding file stream is passed in through the new ffmpeg(...) constructor. When the ffmpeg plug-in has read the corresponding video file stream, a video object is produced in the callback function (for example, a callback of the form video => console.log(video)), from which a series of parsed video information such as metadata (binary data) and duration can be obtained.
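A sketch of this step with the node ffmpeg plug-in is shown below, following the calls the text describes; the exact fields of the video object depend on the plug-in version, so treat the details as assumptions:

```js
const ffmpeg = require('ffmpeg');

try {
  new ffmpeg('input.mp4').then(
    (video) => {
      // Parsed information such as metadata and duration lives on `video`.
      console.log(video.metadata);
    },
    (err) => console.error('ffmpeg error:', err)
  );
} catch (e) {
  console.error(e);
}
```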
Fourth, after the corresponding video object is obtained through parsing, the audio content in the video can be saved in mp3 format through the video object, and the resulting mp3 audio file is then transcoded into a preset format, for example pcm (pcm_s16le), wav, or speex. The sampling rate of the audio may be 16000 Hz or 8000 Hz, which is not limited here; for example, a standard bit depth of 16 bits can be set through the video object.
Fifth, the wav-format audio file is processed with the Spleeter library, which separates the sound files of the different tracks in the file according to track and vocal-range thresholds (the thresholds can be set according to the actual situation) and stores them in wav format. The Spleeter library can strip out a human voice audio file and an environmental sound file. After the Spleeter library runs, two files are generated in the audio_output directory: accompaniment.wav and vocals.wav; vocals.wav is the corresponding human voice audio file, and accompaniment.wav is the separated environmental sound. If multiple speakers are present, files such as vocals1.wav and vocals2.wav are generated, and by reading the number of files whose names begin with vocals, the individual human voice audio files of the different speakers can be obtained.
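A sketch of driving the Spleeter command line from the Node.js service and collecting the vocal stems follows; the CLI flags vary between Spleeter versions, so the invocation is an assumption:

```js
const { execFile } = require('child_process');
const fs = require('fs');
const path = require('path');

// Separate one wav into stems and return the paths of the vocal files
// (vocals.wav, or vocals1.wav, vocals2.wav... when several speakers exist).
function separateVocals(wavPath, outDir, cb) {
  execFile(
    'spleeter',
    ['separate', '-p', 'spleeter:2stems', '-o', outDir, wavPath],
    (err) => {
      if (err) return cb(err);
      const stemDir = path.join(outDir, path.parse(wavPath).name);
      const vocals = fs
        .readdirSync(stemDir)
        .filter((f) => f.startsWith('vocals'));
      cb(null, vocals.map((f) => path.join(stemDir, f)));
    }
  );
}
```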
Sixth, the generated wav-format human voice audio files, together with the configured sampling-rate field 16000, are uploaded in sequence to a third-party speech recognition service (the speech recognition tool) through the interface it provides. The corresponding audio file is recognized by calling the interface of the speech recognition vendor, the subtitle text of the recognized content returned by the interface is obtained, and the text is checked for emptiness; if the text is empty, the user is prompted that the uploaded video file is invalid and asked to upload it again, or the video file is reloaded automatically.
Seventh, after the server (where the speech recognition tool is installed) obtains the corresponding subtitle texts, the translation tool within the speech recognition tool can be called according to the language field passed by the caller, converting the text file into the subtitle version of the corresponding language.
Eighth, after the server obtains the corresponding multiple subtitle texts, a string of comma-separated subtitle texts is returned to the electronic device. The text is then converted into the array TextArry through the split(',') function provided by JS. Based on the multiple human voice audio files (the human voice data), each file is parsed in turn through the ffmpeg plug-in to analyze the frequency fluctuation in the file, and an array AudioArray is drawn with one element per 100 ms of the audio fluctuation range: when the frequency fluctuation in a unit exceeds the preset frequency, the array value is set to 1, indicating speech; below 20 Hz, no speech is present and the array value is 0. The continuous ranges of 1s in AudioArray are then taken out (for example, if the values from 1000 ms to 5000 ms are all 1, there is speech from the first second to the fifth second), and the elements in TextArry are mapped onto these ranges one by one in order, generating a new array: the correspondence array of the times at which subtitles appear in the video.
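A sketch of this timeline construction follows, assuming the per-100 ms frequency series has already been extracted from one vocal file; the names mirror TextArry and AudioArray from the text, and 20 Hz stands in for the preset frequency:

```js
// freqPer100ms: frequency fluctuation per 100 ms unit; textArry: subtitle
// texts in order. Returns [{ startMs, endMs, text }] cues.
function buildSubtitleTimeline(freqPer100ms, textArry, presetFreqHz = 20) {
  // 1 = speech present in this 100 ms unit, 0 = no speech.
  const audioArray = freqPer100ms.map((f) => (f > presetFreqHz ? 1 : 0));

  // Take out each continuous run of 1s as a speech range in milliseconds.
  const ranges = [];
  for (let i = 0; i < audioArray.length; i++) {
    if (audioArray[i] === 1 && (i === 0 || audioArray[i - 1] === 0)) {
      let j = i;
      while (j < audioArray.length && audioArray[j] === 1) j++;
      ranges.push({ startMs: i * 100, endMs: j * 100 });
    }
  }

  // Map the subtitle texts onto the speech ranges one by one, in order.
  return ranges.map((r, k) => ({ ...r, text: textArry[k] || '' }));
}

// e.g. units 10..49 all speech -> one range 1000 ms to 5000 ms, as in the text.
```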
Ninth, subtitles are added to the video file through the video object, according to the generated subtitles of the multiple audio tracks and the correspondence array of their occurrence times.
Tenth, the generated target video file with subtitles added is stored on the server, and the link is returned to the electronic device. The user can then download the target video file via the electronic device based on the link.
Referring to fig. 4, an embodiment of the present application further provides a video subtitle processing apparatus 100, which can be applied to the electronic device described above and is used to execute the steps of the method. The video subtitle processing apparatus 100 includes at least one software functional module that may be stored in the storage module in the form of software or firmware, or solidified in the operating system (OS) of the electronic device. The processing module is used to execute the executable modules stored in the storage module, such as the software functional modules and computer programs included in the video subtitle processing apparatus 100.
The video subtitle processing apparatus 100 may include an obtaining unit 110, an input unit 120, and a subtitle adding unit 130, and the operational contents that can be performed may be as follows:
an acquisition unit 110 configured to acquire a target audio file corresponding to a video file;
an input unit 120, configured to input the target audio file into an audio separation tool, and configured to obtain human voice data from the target audio file;
the input unit 120 is further configured to input the voice data output by the audio separation tool to a speech recognition tool, and is configured to determine a target subtitle text corresponding to the voice data;
a subtitle adding unit 130, configured to add, based on the relative time position of the vocal data in the target audio file, the target subtitle text corresponding to the vocal data in the video file corresponding to the relative time position, so as to obtain a target video file.
Optionally, the input unit 120 may be further configured to: input the target audio file into the audio separation tool; and extract, through the audio separation tool, the human voice data corresponding to each voice from the target audio file.
The video subtitle processing apparatus 100 may further include a filtering unit configured to, before the input unit 120 inputs the human voice data output by the audio separation tool to the speech recognition tool: select, from the human voice data, the voice data whose frequency is greater than the preset frequency and whose duration above the preset frequency is greater than or equal to the preset length, as the voice data to be input to the speech recognition tool.
Optionally, the input unit 120 may be further configured to: input the human voice data output by the audio separation tool to the speech recognition tool; determine a first subtitle text of a first language corresponding to the human voice data through the speech recognition tool; translate the first subtitle text into a second subtitle text in a second language through the speech recognition tool; and determine the target subtitle text according to the first subtitle text and the second subtitle text.
Optionally, the obtaining unit 110 may further be configured to: extract an initial audio file from the video file; and convert the initial audio file into the target audio file in a preset format.
Optionally, the obtaining unit 110 may further be configured to: select, from the processing list, any video file to which the target subtitle text has not been added, and load it; and acquire a target audio file corresponding to the loaded video file.
Optionally, before the input unit 120 inputs the target audio file into the audio separation tool, the obtaining unit 110 may be further configured to reload the video file when the content of the target audio file is empty, and obtain a new target audio file corresponding to the video file.
In this embodiment, the processing module may be an integrated circuit chip having signal processing capability. The processing module may be a general purpose processor. For example, the Processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Network Processor (NP), or the like; the method, the steps and the logic block diagram disclosed in the embodiments of the present Application may also be implemented or executed by a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.
The memory module may be, but is not limited to, a random access memory, a read only memory, a programmable read only memory, an erasable programmable read only memory, an electrically erasable programmable read only memory, and the like. In this embodiment, the storage module may be used to store a video file, a target audio file, an audio separation tool, a voice recognition tool, and the like. Of course, the storage module may also be used to store other programs, and the processing module executes the programs after receiving the execution instruction.
It is understood that the structure shown in fig. 1 is only a schematic structural diagram of an electronic device, and the electronic device may further include more components than those shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the electronic device and the video subtitle processing apparatus 100 described above may refer to the corresponding processes of the steps in the foregoing method, and are not described in detail herein.
The embodiment of the application also provides a computer readable storage medium. The computer-readable storage medium has stored therein a computer program that, when run on a computer, causes the computer to execute the video subtitle processing method as described in the above embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by hardware, or by software plus a necessary general hardware platform, and based on such understanding, the technical solution of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments of the present application.
In summary, the present application provides a video subtitle processing method, apparatus, electronic device, and readable storage medium. The method comprises the following steps: acquiring a target audio file corresponding to the video file; inputting the target audio file into an audio separation tool, and acquiring human voice data from the target audio file; inputting the voice data output by the audio separation tool into a voice recognition tool, and determining a target subtitle text corresponding to the voice data; and adding the target subtitle text corresponding to the voice data into the video file corresponding to the relative time position based on the relative time position of the voice data in the target audio file to obtain the target video file. In the above embodiment, the human voice data is extracted from the audio data, which is beneficial to reducing the interference of the environmental sound to the voice conversion, so as to improve the accuracy of the voice conversion into the caption text. And then, adding the obtained target subtitle text into the video file, so that automatic generation and automatic addition of the subtitle of the video file can be realized, and the production efficiency of the video subtitle can be improved.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system, and method may be implemented in other ways. The apparatus, system, and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method for processing video subtitles, the method comprising:
building a nodejs service through a koa2 framework, and providing an input interface to obtain a target audio file corresponding to a video file;
inputting the target audio file into an audio separation tool for acquiring human voice data from the target audio file;
inputting the voice data output by the audio separation tool into a voice recognition tool for determining a target caption text corresponding to the voice data;
adding the target subtitle text corresponding to the voice data into the video file corresponding to the relative time position based on the relative time position of the voice data in the target audio file to obtain a target video file;
adding the target subtitle text corresponding to the human voice data into the video file corresponding to the relative time position based on the relative time position of the human voice data in the target audio file to obtain a target video file, wherein the step of adding the target subtitle text corresponding to the human voice data into the video file corresponding to the relative time position comprises the following steps:
converting the text into an array TextArry through the split(',') function provided by JS; based on the multiple pieces of human voice data, parsing the voice data sequentially through the ffmpeg plug-in and analyzing the frequency fluctuation in each file, drawing an array AudioArray with one element per 100 ms of the audio fluctuation range, and setting the array value to 1 when the frequency fluctuation range is greater than a preset frequency, indicating the presence of speech; when the frequency is less than the preset frequency, no speech is present and the array value is 0; taking out the continuous ranges of 1 in the array AudioArray, and mapping the elements in the array TextArry one by one, in sequence, onto the array AudioArray to generate a relation array of the times at which subtitles appear in the video; and generating a new video file based on the relation array to serve as the target video file.
2. The method of claim 1, wherein inputting the target audio file into an audio separation tool for obtaining human voice data from the target audio file comprises:
inputting the target audio file into the audio separation tool;
and extracting the voice data corresponding to each voice from the target audio file through the audio separation tool.
3. The method of claim 1 or 2, wherein prior to inputting the vocal data output by the audio separation tool to a speech recognition tool, the method further comprises:
and selecting, from the human voice data, the voice data whose frequency is greater than the preset frequency and whose duration above the preset frequency is greater than or equal to the preset time, as the voice data input to the speech recognition tool.
4. The method of claim 1, wherein the target subtitle text comprises a first subtitle text and a second subtitle text, and wherein inputting the human voice data output by the audio separation tool to a speech recognition tool for determining the target subtitle text corresponding to the human voice data comprises:
inputting the human voice data output by the audio separation tool to the speech recognition tool;
determining a first subtitle text of a first language corresponding to the human voice data through the speech recognition tool;
translating the first subtitle text into a second subtitle text in a second language through the speech recognition tool;
and determining the target subtitle text according to the first subtitle text and the second subtitle text.
5. The method of claim 1, wherein obtaining a target audio file corresponding to a video file comprises:
extracting an initial audio file from the video file;
and converting the initial audio file into the target audio file in a preset format.
6. The method of claim 1, wherein obtaining a target audio file corresponding to a video file comprises:
selecting, from the processing list, any video file to which the target subtitle text has not been added, and loading it;
and acquiring a target audio file corresponding to the loaded video file.
7. The method of claim 1, wherein prior to inputting the target audio file into an audio separation tool, the method comprises:
and when the content of the target audio file is empty, reloading the video file, and acquiring a new target audio file corresponding to the video file.
8. A video subtitle processing apparatus, the apparatus comprising:
the acquisition unit is used for building a nodejs service through a koa2 framework and providing an input interface to acquire a target audio file corresponding to the video file;
the input unit is used for inputting the target audio file into an audio separation tool and acquiring human voice data from the target audio file;
the input unit is further used for inputting the voice data output by the audio separation tool into a voice recognition tool and determining a target subtitle text corresponding to the voice data;
the caption adding unit is used for adding the target caption text corresponding to the voice data into the video file corresponding to the relative time position based on the relative time position of the voice data in the target audio file to obtain a target video file;
the subtitle adding unit is specifically configured to, when adding the target subtitle text corresponding to the human voice data into the video file at the relative time position based on the relative time position of the human voice data in the target audio file to obtain the target video file:
converting the text into an array TextArry through the split(',') function provided by JS; based on the multiple pieces of human voice data, parsing the voice data sequentially through the ffmpeg plug-in and analyzing the frequency fluctuation in each file, drawing an array AudioArray with one element per 100 ms (milliseconds) of the audio fluctuation range, and setting the array value to 1 when the frequency fluctuation range is greater than the preset frequency; when the frequency is less than the preset frequency, no speech is present and the array value is 0; taking out the continuous ranges of 1 in the array AudioArray, and mapping the elements in the array TextArry one by one, in sequence, onto the array AudioArray to generate a relation array of the times at which subtitles appear in the video; and generating a new video file based on the relation array to serve as the target video file.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory coupled to each other, in which memory a computer program is stored which, when executed by the processor, causes the electronic device to carry out the method according to any one of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to carry out the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011556224.2A | 2020-12-24 | 2020-12-24 | Video subtitle processing method and device, electronic equipment and readable storage medium
Publications (2)
Publication Number | Publication Date |
---|---|
CN112702659A (en) | 2021-04-23
CN112702659B (en) | 2023-01-31
Family
ID=75510047
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113271418B (en) * | 2021-06-03 | 2023-02-10 | 重庆电子工程职业学院 | Method and system for manufacturing dynamic three-dimensional suspension subtitles |
CN113569700A (en) * | 2021-07-23 | 2021-10-29 | 杭州菲助科技有限公司 | Method and system for generating dubbing materials through foreign language videos |
CN114007145A (en) * | 2021-10-29 | 2022-02-01 | 青岛海信传媒网络技术有限公司 | Subtitle display method and display equipment |
CN114299950B (en) * | 2021-12-30 | 2023-07-14 | 北京字跳网络技术有限公司 | Subtitle generation method, device and equipment |
CN115150660B (en) * | 2022-06-09 | 2024-05-10 | 深圳市闪剪智能科技有限公司 | Video editing method based on subtitles and related equipment |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140373036A1 (en) * | 2013-06-14 | 2014-12-18 | Telefonaktiebolaget L M Ericsson (Publ) | Hybrid video recognition system based on audio and subtitle data |
CN106851401A (en) * | 2017-03-20 | 2017-06-13 | 惠州Tcl移动通信有限公司 | A kind of method and system of automatic addition captions |
CN110246512B (en) * | 2019-05-30 | 2023-05-26 | 平安科技(深圳)有限公司 | Sound separation method, device and computer readable storage medium |
CN111462742B (en) * | 2020-03-05 | 2023-10-20 | 北京声智科技有限公司 | Text display method and device based on voice, electronic equipment and storage medium |
CN111640450A (en) * | 2020-05-13 | 2020-09-08 | 广州国音智能科技有限公司 | Multi-person audio processing method, device, equipment and readable storage medium |
CN111836062A (en) * | 2020-06-30 | 2020-10-27 | 北京小米松果电子有限公司 | Video playing method and device and computer readable storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106847315A (en) * | 2017-01-24 | 2017-06-13 | 广州朗锐数字传媒科技有限公司 | A kind of talking book synchronous methods of exhibiting sentence by sentence |
CN111565330A (en) * | 2020-07-13 | 2020-08-21 | 北京美摄网络科技有限公司 | Synchronous subtitle adding method and device, electronic equipment and storage medium |
CN111966839A (en) * | 2020-08-17 | 2020-11-20 | 北京奇艺世纪科技有限公司 | Data processing method and device, electronic equipment and computer storage medium |
Non-Patent Citations (1)
Title |
---|
Design and Implementation of a Lightweight MOOC System; Yan Shoukang; China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》); 2020-03-15; full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||