CN110364154B - Method and device for converting voice into text in real time, computer equipment and storage medium
- Publication number: CN110364154B (application CN201910697228.3A)
- Authority: CN (China)
- Prior art keywords: voice, audio file, audio, uplink, time
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications (G10L: speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
- G10L15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26: Speech recognition; speech to text systems
- G10L15/30: Speech recognition; distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L2015/225: Feedback of the input speech
Abstract
The invention discloses a method, an apparatus, a computer device and a storage medium for converting voice into text in real time. The method comprises the following steps: establishing a connection with a background server during a call; acquiring the audio file of the call in real time; processing the audio file according to a preset rule and uploading it to the background server, the background server being used for converting the processed audio file into text information; and acquiring the text information and displaying it on the interface of the communication terminal. This achieves the goal that both the sender and the receiver can see the text information in real time during a voice call.
Description
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a method and an apparatus for converting speech into text in real time, a computer device, and a storage medium.
Background
With the development of communication technology, voice communication has become a main way for operators to provide information exchange. A user can conduct voice communication over a communication network or the Internet and immediately obtain the information expressed by the other party. Text messaging is another communication method provided by operators; it differs from voice communication in that characters are transmitted between the network and the mobile terminal through a service center.
At present, voice communication methods cannot display the text of the speech on the mobile terminal interface in real time while the voice is being received. During a call, text can only be obtained by recording the call with the recording function and converting the recording afterwards, so the text is available only after the call ends; neither the sender nor the receiver can see the text information in real time during voice communication.
Disclosure of Invention
The invention mainly aims to provide a method, an apparatus, a computer device and a storage medium for converting voice into text in real time, so as to solve the problem that the sender and the receiver cannot see text information in real time during voice communication.
To achieve the above object, an embodiment of the present invention provides a method for converting speech into text in real time, including the following steps:
in the conversation process, establishing connection with a background server;
acquiring an audio file in the call process in real time;
processing the audio file according to a preset rule, and uploading the audio file to the background server; the background server is used for converting the processed audio file into text information;
and acquiring the text information and displaying the text information on the interface of the communication terminal.
Further, the step of processing the audio file according to a preset rule and uploading the audio file to the background server includes:
detecting the format of the audio file;
judging whether the format of the audio file is a PCM format or not;
if not, converting the format of the initial audio file into a PCM format.
Further, the audio files comprise uplink audio files and/or downlink audio files;
the step of processing the audio file according to the preset rule comprises:
establishing a buffer area for uplink audio and/or downlink audio, the buffer area being used for correspondingly buffering the uplink audio files and/or the downlink audio files;
and writing the uplink audio files and/or the downlink audio files into the corresponding buffer area sequentially in time order.
Further, the audio file comprises a plurality of audio objects;
the step of processing the audio file according to the preset rule and uploading the audio file to the background server comprises the following steps:
detecting a speech signal of the audio object in the audio file;
acquiring the start point and end point of each audio object's speech, determining the audio between each start point and end point as a voice segment file, and preprocessing it;
and after preprocessing the voice segment files, uploading the voice segment files to the background server for conversion.
Further, after the step of determining the audio between the start point and end point of each audio object's speech as a voice segment file and preprocessing it, the method includes:
setting the starting time of each voice segment according to a preset rule;
and sequentially uploading the voice segment files to the background server according to each starting time.
Further, the step of obtaining the text information and displaying the text information on the interface of the communication terminal includes:
acquiring text information correspondingly converted by the background server;
correspondingly marking the starting time of each text message;
and sequentially displaying the text information on the communication terminal interface according to the starting time.
Further, after the step of acquiring the text information and displaying it on the interface of the communication terminal, the method includes:
displaying the designated information to the communication terminal interface in a form of a table; the specified information comprises the text information, starting time and a voice source which are correspondingly displayed, and the voice source comprises a calling source corresponding to the uplink audio file and a called source corresponding to the downlink audio file;
and receiving an external editing instruction, wherein the editing instruction is used for editing the text information displayed on the interface of the communication terminal.
The invention also provides a device for converting voice into text in real time, which comprises:
the establishing module is used for establishing connection with the background server in the conversation process;
the first acquisition module is used for acquiring audio files in real time, wherein the audio files comprise audio files acquired by the communication terminal in the call process;
the processing module is used for processing the audio file according to a preset rule and uploading the audio file to the background server; the background server is used for converting the processed audio file into text information;
and the first display module is used for acquiring the text information and displaying the text information on the interface of the communication terminal.
An embodiment of the present invention further provides a storage medium, which is a computer-readable storage medium, on which a computer program is stored, and the computer program is executed to implement the method for converting speech into text in real time according to any one of the above claims.
An embodiment of the present invention further provides a computer device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the method for converting speech into text in real time according to any one of the above claims is implemented.
In the method, apparatus, computer device and storage medium for converting voice into text in real time, a connection with a background server is established during the call; the uplink and downlink audio files are acquired in real time and uploaded to the connected background server, which performs speech-to-text conversion on them and sends the results to the terminal for display. This achieves the goal that both the sender and the receiver can see the text information in real time during voice communication, and overcomes the defect in the prior art that text can only be displayed by converting a recording after the call ends.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or in the description of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating an embodiment of a method for real-time speech conversion to text according to the present invention;
FIG. 2 is a block diagram of an apparatus for real-time speech conversion to text according to another embodiment of the present invention;
FIG. 3 is a block diagram of one embodiment of the processing module of FIG. 2;
FIG. 4 is a block diagram of one embodiment of the processing module of FIG. 2;
FIG. 5 is a block diagram of another embodiment of the processing module of FIG. 2;
FIG. 6 is a block diagram of one embodiment of the processing module of FIG. 2;
FIG. 7 is a block diagram of one embodiment of the first display module of FIG. 2;
FIG. 8 is a block diagram of an apparatus for real-time speech conversion to text according to another embodiment of the present invention;
FIG. 9 is a schematic diagram of an audio file with detected speech start and end points marked according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of converted text sent to the terminal interface for display according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a storage medium according to an embodiment of the present invention;
FIG. 12 is a block diagram of a computer device in accordance with one embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, a first embodiment of a method for converting voice into text in real time according to the present invention is provided, the method comprising the following steps:
s10, in the process of communication, establishing connection with a background server;
s20, acquiring the audio file in the call process in real time;
s30, processing the audio file according to a preset rule, and uploading the audio file to the background server; the background server is used for converting the processed audio file into text information;
and S40, acquiring the text information and displaying the text information on the interface of the communication terminal.
The connection established with the background server includes, but is not limited to, a socket long connection or a short connection. The communication terminal includes, but is not limited to, smart devices capable of chat, such as a smart phone or a computer; the smart phone includes a single-SIM-card phone that supports VoLTE, operates on a network supporting VoLTE, and can establish a VoLTE voice call connection. Processing the audio file according to the preset rule includes, but is not limited to, converting the format of the audio file, cutting out the voice segments in the audio file, and the like. For example, using the fact that VoLTE supports multiple concurrent connections, a VoLTE single-SIM-card phone can establish a socket long connection with the speech recognition engine in the background server while setting up the voice call, convert the format of the acquired call audio file into a specific format, and upload the audio file to the speech recognition engine for conversion. The smart phone also includes a multi-SIM-card phone supporting simultaneous connections: one SIM card establishes the voice call connection, such as a CS-domain voice connection, to carry the voice call between the two parties, while the other SIM card establishes the socket long connection with the speech recognition engine, so that the audio file to be converted can be uploaded to the engine.
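As a concrete illustration of this step, the sketch below opens a socket long connection to the recognition engine when the call is set up. It is a minimal Python sketch: the host, port, and session message are hypothetical, and a plain TCP socket with keep-alive stands in for whatever long-connection protocol an actual engine would use.

```python
import json
import socket

# Hypothetical recognition-engine endpoint; a real deployment would use the
# operator's own address and an authenticated protocol.
ENGINE_HOST = "asr.example.com"
ENGINE_PORT = 9000

def open_engine_connection(call_id: str) -> socket.socket:
    """Open a socket long connection to the speech recognition engine
    at the moment the voice call is set up (step S10)."""
    sock = socket.create_connection((ENGINE_HOST, ENGINE_PORT), timeout=5)
    # Keep-alive marks this as a long connection rather than a one-shot request.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Announce the session so the server can associate audio with this call.
    hello = json.dumps({"type": "session_start", "call_id": call_id})
    sock.sendall(hello.encode("utf-8") + b"\n")
    return sock
```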
In one embodiment of the present invention, in a call between two parties, calling user A is the caller and called user B is the callee. The calling phone captures calling user A's audio signal and sends it to the called phone over the call connection, so called user B hears calling user A's voice. Meanwhile, the calling phone converts the audio signal into an audio file through A/D conversion; the background server acquires this audio file, converts it into text, and sends the converted text to the interface of the calling phone and/or the called phone, so that called user B hears the voice and sees the text at the same time. Owing to the low latency of speech recognition and data transmission, the text can be displayed on both phone interfaces in real time as subtitles matching the calling user's voice. This solves the problem that, when the phone signal is poor and the voice cannot be heard clearly, the user can still learn what the other party said from the text without repeatedly asking the other party, ensuring that the conversation proceeds smoothly.
In another specific embodiment, calling user A is a reporter and called user B is an interviewee who, for some reason, cannot be interviewed in person. The interview can then be conducted by the above method through the phone, and the text of the call between calling user A and called user B is displayed on the phone in real time. This differs from the prior art, in which the call is recorded with a voice recorder and converted afterwards, lacking the real-time effect.
In an embodiment, the step S30 of processing the audio file according to the preset rule and uploading the audio file to the background server includes:
detecting the format of the audio file;
judging whether the format of the audio file is a PCM format or not;
if not, converting the format of the initial audio file into a PCM format.
The audio file includes audio generated between a calling terminal and a called terminal in a telephone call, and may also be generated over a voice connection established between accounts capable of voice communication, such as WeChat or QQ.
According to the above steps, when the format of the initial audio file is detected not to be PCM, the calling terminal and/or the called terminal converts it into PCM format; when the format is detected to be PCM, the background server acquires the audio file processed by the preset rule in real time. PCM (Pulse Code Modulation) records analog signals such as audio as a pulse train of symbols; specifically, a PCM signal is a digital signal composed of the symbols 1 and 0. Compared with an analog signal, it is less susceptible to noise and distortion in the transmission system, has a wide dynamic range, and yields quite good sound quality. Moreover, a PCM track, unlike a video track, can be used for post-recording. A PCM-format audio file is produced as follows: the analog audio signal is first converted into a binary sequence through analog-to-digital (A/D) conversion, i.e. the electrical signal of the ordinary analog audio signal is converted into the binary codes 0 and 1, which form a digital audio file; the digital audio file is then speech-encoded into a PCM-format audio file, which the speech recognition engine in the background server can decode accurately. The initial audio file may come in various formats, such as PCM, WMV, MP4, DAT, RM, etc.; in this embodiment, the format of the parsed audio file is preferably PCM.
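The format check and conversion described above might look like the following sketch. The use of the pydub library (backed by ffmpeg) is an assumption made for illustration; the patent does not name a tool, and any decoder producing 16-bit PCM samples would do.

```python
import wave
from pydub import AudioSegment  # assumption: pydub (with ffmpeg) is available

def ensure_pcm(path: str, out_path: str = "call_audio_pcm.wav") -> str:
    """Detect whether the audio file already holds plain PCM (checked here
    via the WAV container) and convert to 16 kHz 16-bit mono PCM if not."""
    try:
        with wave.open(path, "rb") as wav:
            if wav.getcomptype() == "NONE":  # uncompressed PCM frames
                return path                   # already PCM, nothing to do
    except (wave.Error, EOFError):
        pass  # not a WAV/PCM file; fall through and convert
    audio = AudioSegment.from_file(path)      # decodes WMV, MP4, etc. via ffmpeg
    pcm = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    pcm.export(out_path, format="wav")        # WAV body is raw PCM samples
    return out_path
```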
In an embodiment, the audio files include an uplink audio file and/or a downlink audio file;
the step S30 of processing the audio file according to the preset rule includes:
establishing a buffer area for uplink audio and/or downlink audio, the buffer area being used for correspondingly buffering the uplink audio files and/or the downlink audio files;
and writing the uplink audio files and/or the downlink audio files into the corresponding buffer area sequentially in time order.
An uplink audio buffer area is established for the calling terminal and used for buffering uplink audio files; correspondingly, a downlink audio buffer area is established for the called terminal and used for buffering downlink audio files. The audio files include the uplink audio file acquired directly from the calling terminal and the downlink audio file acquired directly from the called terminal; they are generated between the calling terminal and the called terminal in a telephone call, or over a voice connection established between accounts capable of voice communication, such as WeChat or QQ.
The uplink audio file and/or the downlink audio file are generated in time order. Specifically, the calling terminal acquires calling user A's voice together with the corresponding uplink time information, and the called terminal acquires called user B's voice together with the corresponding downlink time information. Since a call means that the voices of calling user A and called user B alternate, the calling terminal acquires multiple segments of voice, each with its own uplink time information, and writes them into the uplink audio buffer area sequentially in uplink-time order, so that the text information converted by the background server can be displayed on the communication terminal interface in the same order. The same applies to the voice acquired by the called terminal, which will not be described again.
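A sketch of this buffering step follows, under the assumption that each captured stretch of audio carries its uplink or downlink time information; the class and field names are illustrative only.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class AudioChunk:
    timestamp: float                   # uplink or downlink time information
    pcm: bytes = field(compare=False)  # raw PCM samples for this stretch of speech

class DirectionBuffer:
    """One buffer per direction: uplink for the calling side, downlink for the
    called side. Chunks are kept ordered by their time information so that the
    converted text can later be displayed in the order it was spoken."""

    def __init__(self) -> None:
        self._heap: list[AudioChunk] = []

    def write(self, chunk: AudioChunk) -> None:
        heapq.heappush(self._heap, chunk)  # insert in time order

    def drain_in_order(self) -> list[AudioChunk]:
        return [heapq.heappop(self._heap) for _ in range(len(self._heap))]

uplink_buffer, downlink_buffer = DirectionBuffer(), DirectionBuffer()
```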
In an embodiment, the audio file comprises a plurality of audio objects;
the step S30 of processing the audio file according to the preset rule and uploading the audio file to the background server includes:
detecting a speech signal of the audio object in the audio file;
acquiring the start point and end point of each audio object's speech, determining the audio between each start point and end point as a voice segment file, and preprocessing it;
and after preprocessing the voice segment files, uploading the voice segment files to the background server for conversion.
In this embodiment, the audio file may contain one or more audio objects, such as background noise, human speech, or sounds made by animals and plants; when the audio file is analyzed, only the human speech signal is detected, while background noise, gunshots, or sounds made by animals and plants are ignored. Technologies for detecting human speech include, but are not limited to, Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), noise reduction, and speech compression. For example, VAD detects the end points of human speech in an audio file; since an audio file does not contain continuous sound, the span from a detected speech start point to the corresponding speech end point is taken as a voice segment file, and only voice segment files carry substantive text information. Because the user's speech must be displayed in real time, the acquired audio file is not long, for example the audio corresponding to one sentence from the calling terminal. When this audio file is analyzed, the voice segments in it are detected by VAD; since different speakers have different timbres, if two voice segments are detected within the one sentence, the start point and end point of each are acquired separately, each span is taken as its own voice segment file, and the segments are then converted. An audio file may contain a stretch of the user's speech together with silent stretches or noise; in order to display quickly and reduce background bandwidth usage, the noise must be removed, only the start and end points of human speech are marked, and only human speech is converted. That is, the audio objects in this application refer only to human speech, so the speech recognition engine in the background server needs little time for conversion, which helps display the text content on the terminal interface in real time.
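The segmentation step could be sketched as below. A simple energy threshold stands in for a real VAD here, which is an assumption made for brevity; a production system would use a proper VAD implementation.

```python
import array

FRAME_MS = 30          # analysis window
THRESHOLD = 250_000.0  # assumed mean-square energy separating speech from silence

def frame_energy(frame: bytes) -> float:
    samples = array.array("h", frame)  # 16-bit signed PCM samples
    return sum(s * s for s in samples) / max(len(samples), 1)

def split_voice_segments(pcm: bytes, sample_rate: int = 16000) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of detected speech so that each span
    can be uploaded as its own voice segment file; non-speech is dropped."""
    frame_bytes = sample_rate * 2 * FRAME_MS // 1000  # 2 bytes per mono sample
    segments, start = [], None
    for off in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        speaking = frame_energy(pcm[off:off + frame_bytes]) > THRESHOLD
        if speaking and start is None:
            start = off                    # speech start point (TS)
        elif not speaking and start is not None:
            segments.append((start, off))  # speech end point (TE)
            start = None
    if start is not None:
        segments.append((start, len(pcm)))
    return segments
```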
In one embodiment, the uplink audio file includes the speech of two or more different people, i.e. contains multiple audio objects. Specifically, calling user A and calling user C speak at the calling terminal at the same time, and their voices have different timbres; therefore two voice signals are detected, giving two speech start points and two speech end points. In the audio file, the span from the start point to the end point corresponding to calling user A is taken as the calling-A voice segment file, and the span corresponding to calling user C as the calling-C voice segment file, and the two files are uploaded to the background server simultaneously for conversion. Processing and converting the audio file under multiple audio objects in this way lets the background server convert several voice segment files at the same time, avoiding the longer time it would take to convert one long voice segment file. If converting the calling-A voice segment file takes 2 s and converting the calling-C voice segment file takes 1 s, uploading both simultaneously requires only 2 s in total, whereas converting a single file in which the two are mixed together would take 3 s. The technical solution described in this embodiment therefore also reduces the conversion time, further enabling the text information to be displayed on the communication terminal interface in real time.
In an embodiment, after the step of determining the audio between the start point and end point of each audio object's speech as a voice segment file and preprocessing it, the method includes:
setting the starting time of each voice segment according to a preset rule;
and sequentially uploading the voice segment files to the background server according to each starting time.
The preset rule marks the starting time of each voice segment: it may take the start point of human speech detected by the communication terminal in the uplink audio file as the starting time, or the time at which the audio file is uploaded to the background server as the starting time; this is not limited here. The purpose is to display the text information converted by the background server on the communication terminal interface in the correct order, improving the user experience.
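A sketch of tagging each voice segment with its starting time and uploading the segments in order, reusing the socket from the earlier sketch; the one-line JSON header framing is an assumed wire format, not one specified by the patent.

```python
import json
import socket
from dataclasses import dataclass

@dataclass
class VoiceSegment:
    start_time: float  # per the preset rule: detected speech start, or upload time
    pcm: bytes

def upload_segments(sock: socket.socket, segments: list[VoiceSegment]) -> None:
    """Tag each voice segment with its starting time and upload the segments
    to the background server in starting-time order."""
    for seg in sorted(segments, key=lambda s: s.start_time):
        header = json.dumps({"type": "segment",
                             "start_time": seg.start_time,
                             "length": len(seg.pcm)})
        sock.sendall(header.encode("utf-8") + b"\n")  # metadata line first
        sock.sendall(seg.pcm)                         # then the raw PCM body
```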
In an embodiment, the step S40 of acquiring the text information and displaying it on the interface of the communication terminal includes:
acquiring text information correspondingly converted by the background server;
correspondingly marking the starting time of each text message;
and sequentially displaying the text information on the communication terminal interface according to the starting time.
The starting time of each text message may be the starting time of the corresponding voice segment set according to the preset rule; this is not limited here. In a specific embodiment, the start point of human speech detected in the audio file is defined as the uplink time. If the calling user hears the voice at 9:00:00, the called terminal detects the start point of the called user's voice segment at 9:00:01, and at 9:00:02 the background server completes the conversion and displays the result on the calling terminal interface; as described above, 9:00:01 is then the starting time of the text information, and it is very close to the moment the calling user hears the other party's voice, which improves the user experience. Owing to the low latency of speech recognition and data transmission, the gap between the time a user actually hears the voice and the time the text appears is 0 to 1 s, while human reaction time is about 0.1 to 0.5 s. The method therefore lets the starting time marked on the converted text, displayed in real time on the terminal interface, be close to the time the user sees the text during a voice call, achieving the technical effect of letting users see both parties' call content in real time and improving the user experience.
In other embodiments, the starting time may also be a time carried by the text information after the background server converts it, i.e. the conversion time is used as the starting time of the text information; this is not limited here.
In an embodiment, after the step S40 of acquiring the text information and displaying it on the interface of the communication terminal, the method includes:
displaying the specified information on the communication terminal interface in the form of a table; the specified information includes the correspondingly displayed text information, starting time and voice source, the voice source including a calling source corresponding to the uplink audio file and a called source corresponding to the downlink audio file.
And receiving an external editing instruction, wherein the editing instruction is used for editing the text information displayed on the interface of the communication terminal.
In this embodiment, the text information includes, but is not limited to, uplink text information and downlink text information. Displaying it on the communication terminal interface in the form of a table may specifically be: the N uplink text messages and M downlink text messages obtained are arranged in time order and displayed on the screen in real time, as shown in fig. 10, where T1 is the starting time of the first voice segment acquired by the calling terminal, XXX1 is the text of the calling terminal's first voice segment, T2 is the starting time of the first voice segment acquired by the called terminal, and YYY1 is the text of the called terminal's first voice segment. In the voice source column, voice segments with the same timbre are marked as the same user, for example calling user A, calling user C and called user B, indicating that two users on the calling side are talking with called user B, and so on.
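The table of fig. 10 could be assembled as in the sketch below; the row type and the plain-text layout are illustrative stand-ins for an actual terminal UI.

```python
from dataclasses import dataclass

@dataclass
class TextRow:
    start_time: str  # T1, T2, ... as uniform clock strings
    source: str      # speaker grouped by timbre: calling user A, called user B, ...
    text: str        # converted text: XXX1, YYY1, ...

def render_table(rows: list[TextRow]) -> str:
    """Merge uplink and downlink text rows by starting time and lay them out
    as the three-column table sketched in fig. 10."""
    ordered = sorted(rows, key=lambda r: r.start_time)
    lines = [f"{'Time':<10} {'Source':<16} Text"]
    lines += [f"{r.start_time:<10} {r.source:<16} {r.text}" for r in ordered]
    return "\n".join(lines)

print(render_table([
    TextRow("09:00:00", "calling user A", "XXX1"),
    TextRow("09:00:01", "called user B", "YYY1"),
]))
```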
In this embodiment, after the converted uplink and downlink texts are sent to the terminal interface for display, the calling terminal and the called terminal may also receive external editing instructions to edit them. Specifically, calling user A is a reporter and called user B is the interviewee; calling user A can edit the text while interviewing called user B by phone, and can immediately correct converted text information that is out of sync, saving later editing time and being convenient and fast. In another application scenario, calling user A is a patent agent and called user B is an inventor; when the two discuss a technical solution by phone, the text displayed in real time on the phone interface helps calling user A understand what called user B means. Because the text is stored on the phone, calling user A is prevented from forgetting part of what called user B said in the previous call due to a long interval or a large amount of content; and since calling user A does not record the call content manually during the call, omissions and transcription errors are avoided.
This differs from the prior art, which has the following defects: first, it cannot show which text information was spoken by which person; the words of the calling user and the called user cannot be distinguished, and the recording can only be converted into plain character information that later users must sort out themselves. Second, no times are recorded, which makes it inconvenient for a reporter to report accurately from the content afterwards. Third, the content cannot be displayed in real time; the text information can only be seen after the recording ends and is converted.
In one embodiment, the call between calling user A and called user B stays connected, so there are multiple uplink audio files, and the start and end points of their voice segments are marked. Specifically, referring to fig. 9, which is composed of multiple uplink audio files, the human speech signal is identified by VAD (voice activity detection) and the start and end points of speech are marked, forming N voice segment files: TS1 is the start point of the voice segment in the first uplink audio file and TE1 is its end point; TS2 is the start point of the voice segment in the second uplink audio file and TE2 is its end point; and so on, there are N start points and N end points, giving N voice segments and N-1 inter-segment intervals. TSN is the start time of the Nth voice segment and TEN is its end point. The playback timeline of the voice segment files starts when the call is answered; let the answer time be Toffset1. The playback times of the N voice segments relative to the phone are then Toffset1+TS1, Toffset1+TS2, ..., Toffset1+TSN, and these playback times relative to the phone are the uplink times. The N voice segments obtained are sent in time order over the socket long connection to the speech recognition engine for conversion, yielding N uplink texts; the corresponding uplink time is added to each uplink text, in one-to-one correspondence with the voice segments, and displayed on the interface of the calling terminal or the called terminal, as shown in fig. 10. Through the low latency of speech recognition and data transmission, the call content of the calling user and the called user is organized as text information and displayed as subtitles in real time.
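The timeline arithmetic above can be made concrete with a short sketch; the numeric values are invented for illustration.

```python
# Toffset1 is the moment the call is answered on the phone's clock, and
# (TS_i, TE_i) are segment boundaries within the uplink audio, in seconds.
# All numeric values below are made up for the example (N = 3).
toffset1 = 0.0
segment_bounds = [(0.8, 2.1), (3.5, 5.0), (6.2, 7.4)]  # (TS_i, TE_i)

# Uplink time of each segment = its playback time relative to the phone.
uplink_times = [toffset1 + ts for ts, _te in segment_bounds]
print(uplink_times)  # [0.8, 3.5, 6.2], attached one-to-one to the uplink texts

# N voice segments imply N - 1 inter-segment intervals, as noted above.
intervals = [(te, ts_next)
             for (_, te), (ts_next, _) in zip(segment_bounds, segment_bounds[1:])]
print(intervals)     # [(2.1, 3.5), (5.0, 6.2)]
```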
In the embodiment of the present invention, a connection with the background server is established during the call; the uplink and downlink audio files are acquired in real time and uploaded to the connected background server, which performs speech-to-text conversion on them and sends the results to the terminal for display. This achieves the goal that both the sender and the receiver can see the text information in real time during voice communication, and overcomes the defect in the prior art that text can only be displayed by converting a recording after the call ends.
Referring to fig. 2, an embodiment of an apparatus for converting voice into text in real time according to the present invention is provided, the apparatus comprising the following modules:
the establishing module 10 is used for establishing connection with a background server in the conversation process;
the first obtaining module 20 is configured to obtain an audio file in real time, where the audio file includes an audio file obtained by the communication terminal during a call;
the processing module 30 is configured to process the audio file according to a preset rule, and upload the audio file to the background server; the background server is used for converting the processed audio file into text information;
and the first display module 40 is used for acquiring the text information and displaying the text information on the interface of the communication terminal.
The connection established with the background server includes, but is not limited to, a socket long connection or a short connection. The communication terminal includes, but is not limited to, smart devices capable of chat, such as a smart phone or a computer; the smart phone includes a single-SIM-card phone that supports VoLTE, operates on a network supporting VoLTE, and can establish a VoLTE voice call connection. Processing the audio file according to the preset rule includes, but is not limited to, converting the format of the audio file, cutting out the voice segments in the audio file, and the like. For example, using the fact that VoLTE supports multiple concurrent connections, a VoLTE single-SIM-card phone can establish a socket long connection with the speech recognition engine in the background server while setting up the voice call, convert the format of the acquired call audio file into a specific format, and upload the audio file to the speech recognition engine for conversion. The smart phone also includes a multi-SIM-card phone supporting simultaneous connections: one SIM card establishes the voice call connection, such as a CS-domain voice connection, to carry the voice call between the two parties, while the other SIM card establishes the socket long connection with the speech recognition engine, so that the audio file to be converted can be uploaded to the engine.
In one embodiment of the present invention, in a call between two parties, calling user A is the caller and called user B is the callee. The calling phone captures calling user A's audio signal and sends it to the called phone over the call connection, so called user B hears calling user A's voice. Meanwhile, the calling phone converts the audio signal into an audio file through A/D conversion; the background server acquires this audio file, converts it into text, and sends the converted text to the interface of the calling phone and/or the called phone, so that called user B hears the voice and sees the text at the same time. Owing to the low latency of speech recognition and data transmission, the text can be displayed on both phone interfaces in real time as subtitles matching the calling user's voice. This solves the problem that, when the phone signal is poor and the voice cannot be heard clearly, the user can still learn what the other party said from the text without repeatedly asking the other party, ensuring that the conversation proceeds smoothly.
In another specific embodiment, calling user A is a reporter and called user B is an interviewee who, for some reason, cannot be interviewed in person. The interview can then be conducted by the above method through the phone, and the text of the call between calling user A and called user B is displayed on the phone in real time. This differs from the prior art, in which the call is recorded with a voice recorder and converted afterwards, lacking the real-time effect.
Referring to fig. 3, in one embodiment, the processing module 30 includes:
a first detection unit 301, configured to detect a format of the audio file;
a judging unit 302, configured to judge whether the format of the audio file is a PCM format;
a converting unit 303, configured to convert the format of the initial audio file into PCM format if it is not already PCM.
The audio file includes audio generated between a calling terminal and a called terminal in a telephone call, and may also be generated over a voice connection established between accounts capable of voice communication, such as WeChat or QQ.
According to the above steps, when the format of the initial audio file is detected not to be PCM, the calling terminal and/or the called terminal converts it into PCM format; when the format is detected to be PCM, the background server acquires the audio file processed by the preset rule in real time. PCM (Pulse Code Modulation) records analog signals such as audio as a pulse train of symbols; specifically, a PCM signal is a digital signal composed of the symbols 1 and 0. Compared with an analog signal, it is less susceptible to noise and distortion in the transmission system, has a wide dynamic range, and yields quite good sound quality. Moreover, a PCM track, unlike a video track, can be used for post-recording. A PCM-format audio file is produced as follows: the analog audio signal is first converted into a binary sequence through analog-to-digital (A/D) conversion, i.e. the electrical signal of the ordinary analog audio signal is converted into the binary codes 0 and 1, which form a digital audio file; the digital audio file is then speech-encoded into a PCM-format audio file, which the speech recognition engine in the background server can decode accurately. The initial audio file may come in various formats, such as PCM, WMV, MP4, DAT, RM, etc.; in this embodiment, the format of the parsed audio file is preferably PCM.
Referring to fig. 4, in an embodiment, the audio files include an uplink audio file and/or a downlink audio file;
the processing module 30 further includes:
an establishing unit 304, configured to establish a buffer for the uplink audio and/or the downlink audio; the buffer area is used for correspondingly buffering the uplink audio files and/or the downlink audio files;
and the writing unit 305 is configured to correspondingly and sequentially write the uplink audio file and/or the downlink audio file into the buffer area according to the time sequence.
An uplink audio buffer area is established for the calling terminal and used for buffering uplink audio files; correspondingly, a downlink audio buffer area is established for the called terminal and used for buffering downlink audio files. The audio files include the uplink audio file acquired directly from the calling terminal and the downlink audio file acquired directly from the called terminal; they are generated between the calling terminal and the called terminal in a telephone call, or over a voice connection established between accounts capable of voice communication, such as WeChat or QQ.
The uplink audio file and/or the downlink audio file are generated in time order. Specifically, the calling terminal acquires calling user A's voice together with the corresponding uplink time information, and the called terminal acquires called user B's voice together with the corresponding downlink time information. Since a call means that the voices of calling user A and called user B alternate, the calling terminal acquires multiple segments of voice, each with its own uplink time information, and writes them into the uplink audio buffer area sequentially in uplink-time order, so that the text information converted by the background server can be displayed on the communication terminal interface in the same order. The same applies to the voice acquired by the called terminal, which will not be described again.
Referring to fig. 5, in an embodiment, the audio file includes a plurality of audio objects;
the processing module 30 further includes:
a second detecting unit 306, configured to detect a voice signal of the audio object in the audio file;
a first obtaining unit 307, configured to acquire the start point and end point of each audio object's speech, determine the audio between each start point and end point as a voice segment file, and preprocess it;
the first uploading unit 308 is configured to upload the voice segment files to the background server for conversion after the voice segment files are preprocessed.
In this embodiment, the audio file may contain one or more audio objects, such as background noise, human speech, or sounds made by animals and plants; when the audio file is analyzed, only the human speech signal is detected, while background noise, gunshots, or sounds made by animals and plants are ignored. Technologies for detecting human speech include, but are not limited to, Voice Activity Detection (VAD), Automatic Speech Recognition (ASR), noise reduction, and speech compression. For example, VAD detects the end points of human speech in an audio file; since an audio file does not contain continuous sound, the span from a detected speech start point to the corresponding speech end point is taken as a voice segment file, and only voice segment files carry substantive text information. Because the user's speech must be displayed in real time, the acquired audio file is not long, for example the audio corresponding to one sentence from the calling terminal. When this audio file is analyzed, the voice segments in it are detected by VAD; since different speakers have different timbres, if two voice segments are detected within the one sentence, the start point and end point of each are acquired separately, each span is taken as its own voice segment file, and the segments are then converted. An audio file may contain a stretch of the user's speech together with silent stretches or noise; in order to display quickly and reduce background bandwidth usage, the noise must be removed, only the start and end points of human speech are marked, and only human speech is converted. That is, the audio objects in this application refer only to human speech, so the speech recognition engine in the background server needs little time for conversion, which helps display the text content on the terminal interface in real time.
In one embodiment, the uplink audio file includes the speech of two or more different people, i.e. contains multiple audio objects. Specifically, calling user A and calling user C speak at the calling terminal at the same time, and their voices have different timbres; therefore two voice signals are detected, giving two speech start points and two speech end points. In the audio file, the span from the start point to the end point corresponding to calling user A is taken as the calling-A voice segment file, and the span corresponding to calling user C as the calling-C voice segment file, and the two files are uploaded to the background server simultaneously for conversion. Processing and converting the audio file under multiple audio objects in this way lets the background server convert several voice segment files at the same time, avoiding the longer time it would take to convert one long voice segment file. If converting the calling-A voice segment file takes 2 s and converting the calling-C voice segment file takes 1 s, uploading both simultaneously requires only 2 s in total, whereas converting a single file in which the two are mixed together would take 3 s. The technical solution described in this embodiment therefore also reduces the conversion time, further enabling the text information to be displayed on the communication terminal interface in real time.
Referring to fig. 6, in an embodiment, the processing module 30 further includes:
a setting unit 309, configured to set a starting time of each speech segment according to a preset rule;
a second uploading unit 310, configured to sequentially upload the voice segment files to the background server according to each starting time.
The preset rule marks the starting time of each voice segment: it may take the start point of human speech detected by the communication terminal in the uplink audio file as the starting time, or the time at which the audio file is uploaded to the background server as the starting time; this is not limited here. The purpose is to display the text information converted by the background server on the communication terminal interface in the correct order, improving the user experience.
Referring to fig. 7, in an embodiment, the first display module 40 includes:
a second obtaining unit 401, configured to obtain text information that is correspondingly converted by the background server;
a marking unit 402 for marking a start time of each of the text information correspondingly;
a display unit 403, configured to sequentially display the text information on the communication terminal interface according to the starting time.
The starting time of each text message may be the starting time of the corresponding voice segment set according to the preset rule; this is not limited here. In a specific embodiment, the start point of human speech detected in the audio file is defined as the uplink time. If the calling user hears the voice at 9:00:00, the called terminal detects the start point of the called user's voice segment at 9:00:01, and at 9:00:02 the background server completes the conversion and displays the result on the calling terminal interface; as described above, 9:00:01 is then the starting time of the text information, and it is very close to the moment the calling user hears the other party's voice, which improves the user experience. Owing to the low latency of speech recognition and data transmission, the gap between the time a user actually hears the voice and the time the text appears is 0 to 1 s, while human reaction time is about 0.1 to 0.5 s. The method therefore lets the starting time marked on the converted text, displayed in real time on the terminal interface, be close to the time the user sees the text during a voice call, achieving the technical effect of letting users see both parties' call content in real time and improving the user experience.
In other embodiments, the starting time may also be a time carried by the text information after the background server converts it, i.e. the conversion time is used as the starting time of the text information; this is not limited here.
Referring to fig. 8, in an embodiment, the apparatus for converting speech into text in real time further includes:
a second display module 50, configured to display the specified information on the communication terminal interface in the form of a table; the specified information includes the correspondingly displayed text information, starting time and voice source, the voice source including a calling source corresponding to the uplink audio file and a called source corresponding to the downlink audio file.
A receiving module 60, configured to receive an external editing instruction, where the editing instruction is used to edit text information displayed on the interface of the communication terminal.
In this embodiment, the text information includes, but is not limited to, uplink text information and downlink text information. Displaying it on the communication terminal interface in the form of a table may specifically be: the N uplink text messages and M downlink text messages obtained are arranged in time order and displayed on the screen in real time, as shown in fig. 10, where T1 is the starting time of the first voice segment acquired by the calling terminal, XXX1 is the text of the calling terminal's first voice segment, T2 is the starting time of the first voice segment acquired by the called terminal, and YYY1 is the text of the called terminal's first voice segment. In the voice source column, voice segments with the same timbre are marked as the same user, for example calling user A, calling user C and called user B, indicating that two users on the calling side are talking with called user B, and so on.
In this embodiment, after the converted uplink text and downlink text are sent to the terminal interface for display, the calling terminal and the called terminal may also receive an external editing instruction to edit the text. Specifically, suppose the calling user A is a reporter and the called user B is an interviewee: calling user A can edit the text while interviewing called user B over the mobile phone, correcting the converted text information immediately when it is found to be inaccurate or out of sync with the voice, which saves later editing time and is convenient and fast. In another application scenario, the calling user A is a patent agent and the called user B is an inventor. When they discuss a technical solution over mobile phones, the text displayed in real time on the phone interface helps calling user A understand what called user B means, and because the text is retained on the phone, calling user A will not forget part of what called user B said in the previous call merely because the interval was long or the content extensive. The communication content also does not have to be recorded by hand during the call, avoiding omissions or errors in the record.
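A minimal sketch of this editing path follows, assuming the terminal keeps the displayed rows in a local list; the shape of the editing instruction is invented for illustration.

```python
def apply_edit(transcript: list, row_index: int, corrected_text: str) -> None:
    # An external editing instruction names a displayed row and supplies the
    # corrected text; the terminal patches its local copy and re-renders.
    transcript[row_index]["text"] = corrected_text

transcript = [{"time": "09:00:01", "source": "called user B", "text": "helo wrold"}]
apply_edit(transcript, 0, "hello world")  # fix a misrecognition on the spot
```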
The method thus avoids the following defects of the prior art: firstly, the prior art cannot show which person spoke which text information, cannot distinguish the content spoken by the calling user from that of the called user, and can only crudely convert a recording into text, leaving subsequent users to tell the speakers apart themselves; secondly, no time is recorded, so a subsequent reporter cannot conveniently and accurately report against the content; and thirdly, the content cannot be displayed in real time, and the text information can only be seen by converting the recording after it has finished.
In one embodiment, the call between the calling user A and the called user B remains connected, so there are multiple uplink audio files, and the start and end points of the voice segments in these files are marked. Specifically, referring to fig. 9, which is composed of multiple uplink audio files, the human voice signal is identified by VAD (voice activity detection) and the start and end points of the voice are marked, forming N voice segment files: TS1 is the start point of the voice segment in the first uplink audio file, TE1 is its end point, TS2 is the start point of the voice segment in the second uplink audio file, TE2 is its end point, and so on, giving N voice segment start points and end points, and hence N voice segments and N-1 voice interval segments. TSN is the start time of the Nth voice segment and TEN is its end point. The playing time of the voice segment files starts when the call is answered; with call answering set as the start time Toffset1, the playing times of the N voice segments relative to the mobile phone are respectively Toffset1 + TS1, Toffset1 + TS2, ..., Toffset1 + TSN, where the playing time relative to the mobile phone is the uplink time. The N voice segments obtained are sent in time order, over a long socket connection, to a speech recognition engine for conversion, yielding N uplink texts; the corresponding uplink time is added to each uplink text, in one-to-one correspondence with the voice segments, and displayed on the interface of the calling terminal or the called terminal, as shown in fig. 10. Thanks to the low latency of voice recognition and data transmission, the call content of the calling user and the called user is classified by its text information and displayed as subtitles in real time.
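The segmentation and timing logic of this embodiment can be sketched as follows. The threshold-based `is_speech` predicate is a simplified stand-in for a real VAD, and the names (`frames`, `toffset1`, the 20 ms frame size) are illustrative assumptions.

```python
def vad_segments(frames, is_speech, frame_ms=20):
    """Return [(TS_i, TE_i), ...]: start/end points (seconds, relative to the
    audio file) of each voice segment, given a per-frame VAD decision."""
    segments, start = [], None
    for i, frame in enumerate(frames):
        t = i * frame_ms / 1000.0
        if is_speech(frame) and start is None:
            start = t                           # TS_i: voice segment begins
        elif not is_speech(frame) and start is not None:
            segments.append((start, t))         # TE_i: voice segment ends
            start = None
    if start is not None:                       # speech ran to end of file
        segments.append((start, len(frames) * frame_ms / 1000.0))
    return segments

# Uplink times relative to the handset: call answering is Toffset1, so the
# i-th segment plays at Toffset1 + TS_i and is shown next to its text.
def uplink_times(segments, toffset1):
    return [toffset1 + ts for ts, _ in segments]

segs = vad_segments([0.0, 0.9, 0.8, 0.0], lambda f: f > 0.5)
print(segs, uplink_times(segs, toffset1=100.0))  # [(0.02, 0.06)] [100.02]
```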
In the embodiment of the invention, a connection with the background server is established during the call, the uplink audio file and the downlink audio file are acquired in real time and uploaded to the connected background server, and the background server converts the voice to text and sends the converted text to the terminal for display. Both the sender and the receiver can thus see the text information in real time during the voice call, overcoming the prior-art defect that text can only be displayed by recording first and converting after the call ends.
Referring to fig. 11, an embodiment of the present invention further provides a storage medium 110, which is a computer-readable storage medium 110 storing a computer program 120; when executed, the computer program 120 implements the method for converting speech into text in real time described above.
Referring to fig. 12, an embodiment of the present invention further provides a computer device 140, which includes a processor 130, a memory 150, and a computer program 120 stored in the memory 150 and executable on the processor 130; when the processor 130 executes the computer program 120, the method for converting speech into text in real time described above is implemented.
The above description is only a preferred embodiment of the present invention and is not intended to limit its scope. Any equivalent structural or process transformation made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present invention.
Claims (9)
1. A method for converting speech into text in real time, comprising:
in the conversation process, establishing connection with a background server;
acquiring an audio file in the call process in real time;
processing the audio file according to a preset rule, and uploading the audio file to the background server; the background server is used for converting the processed audio file into text information;
acquiring the text information and displaying the text information on a communication terminal interface;
the audio files comprise an uplink audio file and a downlink audio file;
the step of processing the audio file according to the preset rule comprises the following steps:
establishing an uplink audio buffer area for a calling terminal and establishing a downlink audio buffer area for a called terminal; the uplink audio buffer area is used for caching the uplink audio files, and the downlink audio buffer area is used for caching the downlink audio files;
according to the time sequence, the uplink audio files are written into the uplink audio buffer area in sequence, and the downlink audio files are written into the downlink audio buffer area;
the step of writing the uplink audio files into the uplink audio buffer area and the downlink audio files into the downlink audio buffer area in sequence according to the time sequence comprises the following steps:
identifying the human voice signal by VAD and marking the start point and end point of voice to form N voice segment files, TS1 being the start point of the voice segment in the first uplink audio file, TE1 being the end point of the voice segment in the first uplink audio file, TS2 being the start point of the voice segment in the second uplink audio file, TE2 being the end point of the voice segment in the second uplink audio file, and so on, there being N voice segment start points and end points, to obtain N voice segments and N-1 voice interval segments, TSN being the start time of the Nth voice segment and TEN being the end point of the Nth voice segment; the playing time of the voice segment files starts when the call is answered, call answering being set as the start time Toffset1, so that the playing times of the N voice segments relative to the mobile phone are respectively Toffset1 + TS1, Toffset1 + TS2, ..., Toffset1 + TSN; and sending the obtained N voice segments in time order to a speech recognition engine for conversion to obtain N uplink texts, adding the corresponding uplink time to each uplink text in one-to-one correspondence with the voice segments, and displaying on the interface of the calling terminal or the called terminal.
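Purely as an illustration of the buffering steps recited in claim 1, the following sketch keeps one uplink and one downlink buffer per call; the class and method names are invented, and a real implementation would also bound the buffers and drain them to the background server.

```python
from collections import deque

class CallAudioBuffers:
    """Per-call caches: uplink audio from the calling terminal, downlink
    audio from the called terminal, each written in time order."""
    def __init__(self):
        self.uplink = deque()
        self.downlink = deque()

    def write(self, direction: str, timestamp: float, chunk: bytes) -> None:
        # Chunks arrive in time order, so appending preserves the sequence.
        buf = self.uplink if direction == "uplink" else self.downlink
        buf.append((timestamp, chunk))

bufs = CallAudioBuffers()
bufs.write("uplink", 0.0, b"\x00\x01")
bufs.write("downlink", 0.1, b"\x02\x03")
```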
2. The method for converting speech into text in real time according to claim 1, wherein the step of processing the audio file according to the preset rule and uploading the audio file to the background server comprises:
detecting the format of the audio file;
judging whether the format of the audio file is a PCM format or not;
if not, converting the format of the initial audio file into a PCM format.
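The format check and conversion of claim 2 might look like the following sketch, which shells out to ffmpeg (assumed to be installed); detecting the format by file extension and the 16 kHz mono s16le target are illustrative simplifications.

```python
import subprocess

def ensure_pcm(path: str, out_path: str = "out.pcm") -> str:
    """If the recorded audio is not already raw PCM, transcode it.
    s16le = 16-bit little-endian PCM; 16 kHz mono is a common ASR input."""
    if path.endswith(".pcm"):
        return path  # already PCM, upload as-is
    subprocess.run(
        ["ffmpeg", "-y", "-i", path,
         "-f", "s16le", "-ar", "16000", "-ac", "1", out_path],
        check=True,
    )
    return out_path
```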
3. The method for converting speech into text in real time according to claim 1, wherein the audio file comprises a plurality of audio objects;
the step of processing the audio file according to the preset rule and uploading the audio file to the background server comprises the following steps:
detecting a speech signal of the audio object in the audio file;
acquiring a starting point and an end point of each audio object voice, correspondingly, determining an audio file between the starting point and the end point of each audio object voice as a voice segment file, and preprocessing the audio file;
and after preprocessing the voice segment files, uploading the voice segment files to the background server for conversion.
4. The method for converting speech into text in real time according to claim 3, wherein the step of determining the audio file between the starting point and the end point of each audio object voice as a voice segment file and preprocessing comprises:
setting the starting time of each voice segment according to a preset rule;
and sequentially uploading the voice segment files to the background server according to each starting point time.
5. The method for converting speech into text in real time according to claim 4, wherein the step of obtaining the text message and displaying the text message on the interface of the communication terminal comprises:
acquiring text information correspondingly converted by the background server;
correspondingly marking the starting time of each text message;
and sequentially displaying the text information on the communication terminal interface according to the starting time.
6. The method for converting speech into text in real time according to claim 1, wherein said obtaining and displaying said text message on said communication terminal interface comprises:
displaying the designated information to the communication terminal interface in a form of a table; the specified information comprises the text information, starting time and a voice source which are correspondingly displayed, and the voice source comprises a calling source corresponding to the uplink audio file and a called source corresponding to the downlink audio file;
and receiving an external editing instruction, wherein the editing instruction is used for editing the text information displayed on the interface of the communication terminal.
7. An apparatus for converting speech to text in real time, comprising:
the establishing module is used for establishing connection with the background server in the conversation process;
the first acquisition module is used for acquiring audio files in real time, wherein the audio files comprise audio files acquired by the communication terminal in the call process;
the processing module is used for processing the audio file according to a preset rule and uploading the audio file to the background server; the background server is used for converting the processed audio file into text information;
the first display module is used for acquiring the text information and displaying the text information on the interface of the communication terminal;
the audio files comprise an uplink audio file and a downlink audio file;
the processing module comprises:
the establishing unit is used for establishing an uplink audio buffer area for the calling terminal and establishing a downlink audio buffer area for the called terminal; the uplink audio buffer area is used for caching the uplink audio files, and the downlink audio buffer area is used for caching the downlink audio files;
the writing unit is used for sequentially writing the uplink audio files into the uplink audio buffer area and the downlink audio files into the downlink audio buffer area according to the time sequence;
the write unit includes:
a write-in subunit, configured to identify the human voice signal by VAD and mark the start point and end point of voice to form N voice segment files, where TS1 is the start point of the voice segment in the first uplink audio file, TE1 is the end point of the voice segment in the first uplink audio file, TS2 is the start point of the voice segment in the second uplink audio file, TE2 is the end point of the voice segment in the second uplink audio file, and so on, there being N voice segment start points and end points, to obtain N voice segments and N-1 voice interval segments; TSN is the start time of the Nth voice segment and TEN is the end point of the Nth voice segment; the playing time of the voice segment files starts when the call is answered, call answering being set as the start time Toffset1, so that the playing times of the N voice segments relative to the mobile phone are respectively Toffset1 + TS1, Toffset1 + TS2, ..., Toffset1 + TSN; the obtained N voice segments are then sent in time order, over a long socket connection, to a speech recognition engine for conversion to obtain N uplink texts, the corresponding uplink time is added to each uplink text in one-to-one correspondence with the voice segments, and the result is displayed on the interface of the calling terminal or the called terminal.
8. A storage medium, which is a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed, implements the method for converting speech into text in real time according to any one of claims 1 to 6.
9. A computer device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method for converting speech into text in real time according to any one of claims 1 to 6 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910697228.3A CN110364154B (en) | 2019-07-30 | 2019-07-30 | Method and device for converting voice into text in real time, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910697228.3A CN110364154B (en) | 2019-07-30 | 2019-07-30 | Method and device for converting voice into text in real time, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110364154A CN110364154A (en) | 2019-10-22 |
CN110364154B true CN110364154B (en) | 2022-04-22 |
Family
ID=68222673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910697228.3A Active CN110364154B (en) | 2019-07-30 | 2019-07-30 | Method and device for converting voice into text in real time, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110364154B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111161738A (en) * | 2019-12-27 | 2020-05-15 | 苏州欧孚网络科技股份有限公司 | Voice file retrieval system and retrieval method thereof |
CN113571061A (en) * | 2020-04-28 | 2021-10-29 | 阿里巴巴集团控股有限公司 | System, method, device and equipment for editing voice transcription text |
CN111787155A (en) * | 2020-06-30 | 2020-10-16 | 深圳传音控股股份有限公司 | Audio data processing method, terminal device and medium |
CN111818214A (en) * | 2020-06-30 | 2020-10-23 | 深圳传音控股股份有限公司 | Terminal device control method, terminal device, and medium |
CN113630503A (en) * | 2021-08-12 | 2021-11-09 | 上海华信长安网络科技有限公司 | Method for quickly reviewing call content |
CN113660446B (en) * | 2021-08-17 | 2024-09-13 | 深圳市唐为电子有限公司 | Smart phone intercom conversation optimizing system |
CN115019803B (en) * | 2021-09-30 | 2023-01-10 | 荣耀终端有限公司 | Audio processing method, electronic device, and storage medium |
CN114093366A (en) * | 2021-11-18 | 2022-02-25 | 中国平安人寿保险股份有限公司 | Speech recognition method, apparatus, device and storage medium based on artificial intelligence |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101625862A (en) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | Method for detecting voice interval in automatic caption generating system |
CN103685985A (en) * | 2012-09-17 | 2014-03-26 | 联想(北京)有限公司 | Communication method, transmitting device, receiving device, voice processing equipment and terminal equipment |
CN104217039A (en) * | 2014-10-10 | 2014-12-17 | 谭希韬 | Method and system for recording telephone conversations in real time and converting telephone conversations into declarative sentences |
CN106294327A (en) * | 2015-05-12 | 2017-01-04 | 中国移动通信集团公司 | The method of real time translation, device and network element device in a kind of mobile communications network |
CN108418791A (en) * | 2018-01-27 | 2018-08-17 | 惠州Tcl移动通信有限公司 | Communication means and mobile terminal with addition caption function |
WO2018176036A2 (en) * | 2017-03-24 | 2018-09-27 | Gutierrez Jose Rito | Mobile translation system and method |
CN109634700A (en) * | 2018-11-26 | 2019-04-16 | 维沃移动通信有限公司 | A kind of the content of text display methods and terminal device of audio |
US10306055B1 (en) * | 2016-03-16 | 2019-05-28 | Noble Systems Corporation | Reviewing portions of telephone call recordings in a contact center using topic meta-data records |
CN109891498A (en) * | 2016-11-08 | 2019-06-14 | 国立研究开发法人情报通信研究机构 | Speech dialogue system, voice dialogue device, user terminal and speech dialog method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101742569B (en) * | 2009-11-13 | 2013-01-16 | 中兴通讯股份有限公司 | Base station and data transmission switching method thereof |
Also Published As
Publication number | Publication date |
---|---|
CN110364154A (en) | 2019-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110364154B (en) | Method and device for converting voice into text in real time, computer equipment and storage medium | |
US7103548B2 (en) | Audio-form presentation of text messages | |
US10257361B1 (en) | Method and apparatus of processing user data of a multi-speaker conference call | |
US20020191757A1 (en) | Audio-form presentation of text messages | |
US11710488B2 (en) | Transcription of communications using multiple speech recognition systems | |
US20100048235A1 (en) | Method and Device for Data Capture for Push Over Cellular | |
CA2462919A1 (en) | System for sending text messages converted into speech through an internet connection to a telephone and method for running it | |
US20070116221A1 (en) | Mobile terminal and multimedia contents service providing system and method for call connection waiting using the same | |
US20170270948A1 (en) | Method and device for realizing voice message visualization service | |
US12132855B2 (en) | Presentation of communications | |
US10313502B2 (en) | Automatically delaying playback of a message | |
CN101277338A (en) | Method for recording downstream voice signal of communication terminal as well as the communication terminal | |
CN104580772A (en) | Teleconference system conference summary pushing method based on audio | |
CN111261139A (en) | Character personification broadcasting method and system | |
CN106803918A (en) | A kind of video call system and implementation method | |
CN108364638A (en) | A kind of voice data processing method, device, electronic equipment and storage medium | |
CN111768786A (en) | Deaf-mute conversation intelligent terminal platform and conversation method thereof | |
EP1928189A1 (en) | Signalling for push-to-translate-speech (PTTS) service | |
US20200075013A1 (en) | Transcription presentation | |
CN109147791A (en) | A kind of shorthand system and method | |
CN106302083B (en) | Instant messaging method and server | |
JP4352138B2 (en) | Broadcast call system on mobile phone | |
US20200184973A1 (en) | Transcription of communications | |
CN110708418B (en) | Method and device for identifying attributes of calling party | |
US11431767B2 (en) | Changing a communication session |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20220304. Address after: 518000 floor 1, building 3, Dexin Chang wisdom Park, No. 23 Heping Road, Qinghua community, Longhua street, Longhua District, Shenzhen, Guangdong. Applicant after: Shenzhen waterward Information Co.,Ltd. Address before: 518000, block B, huayuancheng digital building, 1079 Nanhai Avenue, Shekou, Nanshan District, Shenzhen City, Guangdong Province. Applicant before: SHENZHEN WATER WORLD Co.,Ltd.
GR01 | Patent grant | |