CN109917982B

CN109917982B - Voice input method, device, equipment and readable storage medium

Info

Publication number: CN109917982B
Application number: CN201910216815.6A
Authority: CN
Inventors: 王影; 乔玉平; 谢珍珠
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2019-03-21
Filing date: 2019-03-21
Publication date: 2021-04-02
Anticipated expiration: 2039-03-21
Also published as: CN109917982A

Abstract

The application discloses a voice input method, apparatus, device, and readable storage medium. The input method comprises: displaying a text popup in response to a voice input operation performed while the input focus is in a target text area; acquiring the transcribed text corresponding to the input voice and displaying it in the text popup; and transferring the transcribed text displayed in the text popup to the input focus in the target text area. Compared with existing voice input modes, the application adds the step of displaying the transcribed text of the input voice in a text popup, so that the user can conveniently confirm the text content entered by the current voice input, which improves overall input efficiency. Furthermore, because the transcribed text is displayed in a popup during voice input, the user can see the entered text content more intuitively, which improves the human-computer interaction experience of the input process.

Description

Voice input method, device, equipment and readable storage medium

Technical Field

The present application relates to the field of information recognition technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for voice input.

Background

With the development of speech recognition technology, entering text by voice has become increasingly popular. Voice input lets a user enter information in an application's input interface more quickly and conveniently: for example, entering text in a text editing page such as Word, entering a query in a browser's search box, or entering registration information in a text box of an application's registration interface.

In the existing voice input mode, the input voice is transcribed after it is received, and the transcribed text is displayed directly in the target text area where the input focus is located. Research has found that in some scenarios, because of certain characteristics of the target text area, the existing mode makes it hard for the user to confirm what text each voice input produced, which reduces overall input efficiency. For example, when editing text in Word or a similar editing page, a user may need to insert a passage at some position inside a large block of text. In the prior art, after the input focus is placed at the insertion position, the user speaks the passage to be inserted and the system inserts the transcribed text directly at the input focus. Because there is text both before and after the input focus, the user cannot quickly locate the inserted transcribed text within the whole text afterwards. As shown in fig. 1a, an English sentence needs to be inserted at the input focus in the third-to-last line of an English Word document; with the existing voice input mode the transcribed text is inserted directly at the input focus, and the result presented to the user is as shown in fig. 1b.

As another example, some target text areas display input text in masked form. If the target text area is a password input box, the existing voice input mode displays the transcribed text directly in masked form in the password input box. As shown in fig. 2, the input text is displayed as "+" in the password input box of an application registration page, so the user cannot see the actual text that was entered by voice and, because speech recognition has a certain error rate, cannot confirm that the entered text is correct.

Disclosure of Invention

In view of the above, the present application provides a voice input method, apparatus, device, and readable storage medium, to address the drawback that during voice input the user cannot see the actual text entered by voice and, because speech recognition has a certain error rate, cannot confirm that the entered text is correct.

In order to achieve the above object, the following solutions are proposed:

a voice input method comprising:

displaying a text popup in response to a voice input operation performed while the input focus is in the target text area;

acquiring and displaying a transcription text corresponding to the input voice in the text popup;

and transferring the transcription text displayed in the text popup to the input focus in the target text area.

In the above method, optionally, after transferring the transcribed text displayed in the text popup to the input focus in the target text region, the method further includes:

hiding or destroying the text popup.

Optionally, in the method, while the transcription text corresponding to the input speech is displayed in the text popup, the method further includes:

and displaying a voice signal graph in the text popup, wherein the voice signal graph changes along with the change of the input voice.

Optionally, in the above method, displaying the text popup in response to the voice input operation performed while the input focus is in the target text area includes:

determining the position of the input focus in response to the voice input operation performed while the input focus is in the target text area;

and displaying the text popup by taking the position of the input focus as a reference.

In the foregoing method, optionally, before the transferring the transcribed text displayed in the text popup to the input focus in the target text region, the method further includes:

determining whether a text transfer condition is satisfied;

and if so, executing the operation of transferring the transcribed text displayed in the text popup to the input focus in the target text area.

In the above method, optionally, determining whether the text transfer condition is satisfied includes:

determining whether a text transfer condition is met according to the semantic integrity of the transcribed text displayed in the text popup;

and/or,

determining whether the text transfer condition is met according to the correlation between the transcribed text displayed in the text popup and the transcribed text of the subsequent input voice;

and/or,

detecting whether a text transfer instruction is received; if so, determining that the text transfer condition is met, and otherwise determining that it is not met.

Optionally, the above method, wherein the transferring the transcribed text displayed in the text popup to the input focus in the target text region includes:

acquiring the format requirement of a target text area on an input text;

according to the format requirement, format editing is carried out on the transcribed text displayed in the text popup window to obtain the transcribed text after format editing;

and transferring the transcription text with the edited format to the input focus in the target text area.

In the foregoing method, optionally, before transferring the transcribed text displayed in the text popup to the input focus in the target text region, the method further includes:

and responding to the editing operation of the transcribed text in the text popup window, and displaying the transcribed text after editing.

Optionally, the above method, where the displaying the edited transcribed text in response to the editing operation on the transcribed text in the text popup includes:

in response to a global editing operation on a specified piece of transcribed text in the text popup, determining the editing range of this operation to be all of the transcribed text displayed in the text popup;

and editing, according to the global editing operation, every piece of transcribed text within the editing range that is identical to the specified transcribed text, and displaying the edited transcribed text.

Optionally, in the method, before displaying the edited transcribed text in response to the editing operation on the transcribed text in the text popup, the method further includes:

in response to an instruction to transfer selected transcribed text from the target text area into the text popup, transferring the selected transcribed text into the text popup.

A voice input apparatus comprising:

a text popup display unit, configured to display a text popup in response to a voice input operation performed while the input focus is in the target text area;

a text acquisition and display unit, configured to acquire the transcribed text corresponding to the input voice and display it in the text popup;

and a text transfer unit, configured to transfer the transcribed text displayed in the text popup to the input focus in the target text area.

The above apparatus, optionally, further comprises:

a text popup processing unit, configured to hide or destroy the text popup after the text transfer unit transfers the transcribed text displayed in the text popup to the input focus in the target text area.

The above apparatus, optionally, further comprises:

and the graphic display unit is used for displaying a voice signal graphic in the text popup while acquiring and displaying the transcription text corresponding to the input voice in the text popup by the text acquisition and display unit, wherein the voice signal graphic changes along with the change of the input voice.

Optionally, the above apparatus, wherein the text popup display unit includes:

an input focus position determination unit configured to determine a position of an input focus in response to a voice input operation in a state where the input focus is in a target text region;

and the input focus position reference unit is used for displaying the text popup by taking the position of the input focus as a reference.

The above apparatus, optionally, further comprises:

a transfer condition determining unit, configured to determine, before the text transfer unit transfers the transcribed text displayed in the text popup to the input focus in the target text area, whether a text transfer condition is met, and if so, to trigger the text transfer unit.

In the above apparatus, optionally, the transfer condition determining unit includes:

an integrity reference unit, configured to determine whether the text transfer condition is met according to the semantic integrity of the transcribed text displayed in the text popup;

and/or,

a correlation reference unit, configured to determine whether the text transfer condition is met according to the correlation between the transcribed text displayed in the text popup and the transcribed text of the subsequent input voice;

and/or,

an instruction detection unit, configured to detect whether a text transfer instruction is received, and if so, determine that the text transfer condition is met, and otherwise determine that it is not met.

The above apparatus, optionally, the text transfer unit includes:

a format requirement acquisition unit for acquiring a format requirement of the target text region for the input text;

the format editing unit is used for editing the format of the transcribed text displayed in the text popup according to the format requirement to obtain the transcribed text after format editing;

and a format-edited text transfer unit, configured to transfer the format-edited transcribed text to the input focus in the target text area.

The above apparatus, optionally, further comprises:

and the editing operation responding unit is used for responding to the editing operation on the transcribed text in the text popup and displaying the edited transcribed text before the text transferring unit transfers the transcribed text displayed in the text popup to the input focus in the target text area.

In the foregoing apparatus, optionally, the editing operation responding unit includes:

an editing range determining unit, configured to determine, in response to a global editing operation on a specified piece of transcribed text in the text popup, that the editing range of this operation is all of the transcribed text displayed in the text popup;

and a global editing unit, configured to edit, according to the global editing operation, every piece of transcribed text within the editing range that is identical to the specified transcribed text, and to display the edited transcribed text.

The above apparatus, optionally, further comprises:

a target text area text transfer unit, configured to transfer selected transcribed text from the target text area into the text popup in response to an instruction to do so.

A voice input device comprising a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the voice input method.

A readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech input method as described above.

As can be seen from the above technical solution, the voice input method provided by the embodiments of the present application displays a text popup in response to a voice input operation performed while the input focus is in the target text area, and then acquires the transcribed text corresponding to the input voice and displays it in the text popup. Displaying the transcribed text in the text popup lets the user confirm the text content entered by the current voice input more quickly and conveniently. Finally, the transcribed text displayed in the text popup is transferred to the input focus in the target text area, completing the input process. Compared with existing voice input modes, the present application adds the step of displaying the transcribed text of the input voice in a text popup, so that the user can conveniently confirm the text content entered by the current voice input, which improves overall input efficiency.

Furthermore, the transcribed text is displayed through a popup window in the voice input process, so that a user can more visually see the input text content, and the human-computer interaction experience in the input process is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIGS. 1a-1b illustrate a speech input process;

FIG. 2 illustrates a process diagram for voice entry of a password;

FIG. 3 is a flow chart of a speech input method disclosed herein;

FIG. 4 is a schematic diagram of a speech input process according to an example of the present application;

FIG. 5 is yet another flow chart of a method for speech input as disclosed herein;

FIG. 6 illustrates a diagram of a date entry box versus input format requirements;

FIG. 7 is a schematic diagram of an exemplary entry to a date entry box;

FIG. 8 is a diagram illustrating editing of text in a text popup according to an exemplary embodiment of the present application;

FIG. 9 is a diagram illustrating an example of text movement from a target text area into a text popup;

FIG. 10 is a block diagram of a voice input device according to the present disclosure;

FIG. 11 is a block diagram of a hardware structure of a speech input device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The voice input method of the present application can be applied in the field of speech recognition: input voice is recognized into transcribed text, the transcribed text is displayed in a text popup, and the transcribed text in the text popup is then transferred to a target text area, which is where the user ultimately needs the text to be entered. The voice input method is described in detail below with reference to fig. 3, and includes the following steps:

Step S110, displaying a text popup in response to a voice input operation performed while the input focus is in the target text area.

Specifically, a text area is an area in which text can be entered, such as one in Excel, Word, or Notepad, or a password input box. The input focus can be switched between different text areas; the text area in which the input focus is currently located is defined as the target text area, that is, the area in which the user needs to enter text.

In a state where the input focus is in the target text region, the user may input text to the target text region in the form of voice. The text popup window is displayed on the current terminal interface by responding to the voice input operation. The text popup window may be a rectangular text entry box, a bubble-shaped text entry box, or other preferred forms, and the form in which the text popup window exists is not strictly limited in this application.

Step S120, acquiring the transcribed text corresponding to the input voice and displaying it in the text popup.

Specifically, in this embodiment the transcribed text corresponding to the input voice may be acquired directly from a speech recognition engine after the engine has recognized the input voice. Alternatively, the present application may itself recognize the input voice to obtain the transcribed text.

After the transcribed text is acquired, it is displayed in the text popup. Because the transcribed text is shown in the text popup, the user can see it more directly and conveniently, viewing the transcribed text while speaking.

Step S130, transferring the transcribed text displayed in the text popup to the input focus in the target text region.

Specifically, the transcribed text in the text popup is easy for the user to read, and it can also be transferred to the input focus of the target text area, thereby entering the text content into the target text area. Transferring the transcribed text to the input focus of the target text area may specifically include inserting the transcribed text into the target text area with the input focus as the starting point of the insertion.

Furthermore, the transcribed text is displayed through the popup window in the voice input process, so that a user can more visually see the input text content, and the human-computer interaction experience in the input process is improved.
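Steps S110 to S130 can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the implementation of the present application: the class names `TargetTextArea` and `TextPopup`, the function `voice_input`, and the representation of the input focus as a character offset are all hypothetical, and the transcribed text is passed in directly rather than produced by a speech recognition engine.

```python
from dataclasses import dataclass


@dataclass
class TargetTextArea:
    """Hypothetical text area; the input focus is modeled as a character offset."""
    text: str = ""
    focus: int = 0

    def insert_at_focus(self, new_text: str) -> None:
        # Insert with the input focus as the starting point, then advance the focus.
        self.text = self.text[:self.focus] + new_text + self.text[self.focus:]
        self.focus += len(new_text)


@dataclass
class TextPopup:
    """Hypothetical popup that shows transcribed text before it is transferred."""
    visible: bool = False
    text: str = ""


def voice_input(area: TargetTextArea, popup: TextPopup, transcript: str) -> None:
    popup.visible = True               # step S110: display the text popup
    popup.text = transcript            # step S120: show the transcribed text in the popup
    area.insert_at_focus(popup.text)   # step S130: transfer the text to the input focus
    popup.text = ""                    # the popup is empty once the text has been transferred
```

For example, with the text "Hello world" and the focus at offset 5, calling `voice_input` with the transcript ", dear" leaves the area holding "Hello, dear world".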

Referring to fig. 4, a schematic diagram of a speech input process is illustrated.

In fig. 4, the user intends to enter text content into the target text area a1 by voice. During the user's voice input, the text popup a2 is displayed on the interface in response to the voice input operation, and the transcribed text recognized from the input voice is displayed in the text popup. The transcribed text in the text popup can then be transferred to the target text area a1. Throughout the input process the transcribed text of the input voice is displayed in a text popup, so the user can confirm the text content entered by the current voice input more conveniently and see it more intuitively; this improves overall input efficiency and the human-computer interaction experience, while still achieving the goal of entering the text into the target text area.

Further optionally, after the transcribed text displayed in the text popup has been transferred to the input focus in the target text area and before any new transcribed text has been acquired, the text popup is empty, so it can be hidden or destroyed. When new voice is input, display of the text popup is triggered again. Alternatively, the text popup may be kept permanently displayed at a fixed position on the terminal interface. The specific strategy can be set as needed.

Still further, to help the user keep track of the voice being input, a voice signal graphic can additionally be displayed in the text popup. The voice signal graphic may be a fixed graphic that merely indicates to the user that the text popup displays the transcribed text corresponding to the input voice. The voice signal graphic may also be variable; in particular, it may change as the input voice changes. The change in the voice can be shown as a waveform, a bar chart, or another suitable form, and may be tied to the volume, pitch, speed, and so on of the input voice. For example, the waveform length of the voice signal graphic a21 illustrated in fig. 4 can vary with the volume of the input voice: as the volume increases, the waveform becomes longer.
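One way a waveform length could follow the input volume is sketched below. The sketch assumes that audio arrives as frames of samples in the range [-1.0, 1.0] and that volume is measured as the root-mean-square level of a frame; the function names and the 120-pixel maximum are hypothetical choices, not values taken from the present application.

```python
import math
from typing import Sequence


def rms_volume(samples: Sequence[float]) -> float:
    """Root-mean-square level of one audio frame (samples assumed in [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def waveform_length(samples: Sequence[float], max_length_px: int = 120) -> int:
    """Map frame volume to a waveform length: louder input gives a longer waveform."""
    level = min(rms_volume(samples), 1.0)  # clamp so the graphic never exceeds its maximum
    return round(level * max_length_px)
```

Pitch or speaking speed could be mapped to other properties of the graphic in the same way.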

In another embodiment of the present application, a process of displaying a text popup in response to a voice input operation in which an input focus is in a target text area state in step S110 is described.

It can be understood that the text popup can be displayed at any position on the terminal interface, and the display position can be determined according to certain strategies. This embodiment describes a scheme of displaying a text popup associated with an input focus, which is described in detail below.

First, in response to a voice input operation in a state where an input focus is in a target text region, a position of the input focus is determined.

Specifically, the position of the input focus can be the position of the input focus on the current interface of the terminal, and the position of the input focus can be obtained by calling a corresponding interface of the system.

Further, a text popup is displayed by taking the position of the input focus as a reference.

Specifically, the position of the text popup is associated with the position of the input focus, and the position of the text popup may be a position having a preset positional relationship with the input focus. The preset positional relationship may define a lateral distance, a longitudinal distance, an angular relationship, and the like of the text popup from the input focus. Still taking fig. 4 as an example, it illustrates a positional relationship between a text popup and an input focus. In the example of fig. 4, the positional relationship of the text popup with the input focus includes: the upper left corner of the text popup is adjacent to the input focus in the transverse direction, and a standard line spacing is arranged in the longitudinal direction.
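A preset positional relationship of this kind can be sketched as a small coordinate calculation. The sketch assumes screen coordinates in pixels with y increasing downward, and assumes that the popup is placed one line spacing below the input focus; the direction, the `Point` type, and the `popup_origin` function are hypothetical and only illustrate the idea of deriving the popup position from the focus position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int


def popup_origin(focus: Point, line_spacing: int, dx: int = 0) -> Point:
    """Top-left corner of the popup, derived from the input focus position.

    dx is the preset lateral offset (0 means laterally adjacent to the focus);
    line_spacing is the preset longitudinal offset (assumed here to be downward).
    """
    return Point(focus.x + dx, focus.y + line_spacing)
```

In practice the focus position would come from a system interface, as noted above, and the result might also need to be clamped to the visible screen area.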

In another embodiment of the present application, in order to facilitate reading by a user and improve readability of a text in a target text region, before transferring a transcribed text displayed in a text popup into the target text region, a verification process of whether the transcribed text meets a transfer condition is further added, and the whole voice input process may refer to the flowchart illustrated in fig. 5, which includes:

step S210, responding to the voice input operation with the input focus in the target text region state, and displaying a text popup.

And S220, acquiring and displaying the transcription text corresponding to the input voice in the text popup.

Steps S210 to S220 correspond one to one to steps S110 to S120 described above, and are not described again here.

Step S230, determining whether a text transfer condition is satisfied, and if so, executing step S240.

Step S240, transferring the transcribed text displayed in the text popup to the input focus in the target text region.

Step S240 corresponds to step S130, and reference is made to the above description for details, which are not repeated herein.

It is to be understood that, if it is determined in step S230 that the text transfer condition is not satisfied, the transcribed text in the text popup is not transferred to the target text area for the time being; the text transfer operation is performed only once the text transfer condition is determined to be satisfied.
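The flow of fig. 5 can be sketched as a loop that accumulates transcribed text in the popup and transfers it only when a condition holds. This is a simplified illustration: the function `run_voice_input` is hypothetical, the incoming voice stream is assumed to have already been split into transcribed segments, and the transfer condition is passed in as an arbitrary callable.

```python
from typing import Callable, Iterable, List, Tuple


def run_voice_input(
    segments: Iterable[str],
    transfer_condition: Callable[[str], bool],
) -> Tuple[List[str], str]:
    """Return the chunks transferred to the target text area and the text left in the popup."""
    popup_text = ""
    transferred: List[str] = []
    for segment in segments:                 # S220: each new transcribed segment
        popup_text += segment                # is displayed in the popup
        if transfer_condition(popup_text):   # S230: check the text transfer condition
            transferred.append(popup_text)   # S240: transfer to the input focus
            popup_text = ""
    return transferred, popup_text           # unmet condition: text stays in the popup
```

For instance, with a condition that checks for a trailing period, the segments "Hello ", "world." and " Next" yield one transferred chunk, "Hello world.", while " Next" remains in the popup.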

This embodiment illustrates several optional ways of determining whether the text transfer condition is satisfied, which may specifically include the following:

(1) Determining whether the text transfer condition is met according to the semantic integrity of the transcribed text displayed in the text popup.

Specifically, this check verifies the semantic integrity of the transcribed text currently in the text popup. If the integrity of that text is high enough, transferring it to the target text area does not hinder the user's normal reading, so the text transfer condition can be determined to be satisfied. Conversely, if the integrity is not high enough, transferring the text to the target text area would hinder the user's normal reading, so the text transfer condition is not satisfied.

Specifically, an integrity threshold value can be preset, and when the semantic integrity of the transcribed text displayed in the text popup exceeds the integrity threshold value, the integrity is considered to be high enough to meet a text transfer condition; and when the semantic integrity of the transcribed text displayed in the text popup does not exceed the integrity threshold, the integrity is considered to be insufficient, and the text transfer condition is not met.
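The threshold comparison can be sketched as follows. The present application does not specify how semantic integrity is scored, so `toy_integrity_score` below is a purely hypothetical stand-in that only looks at sentence-final punctuation; a real system would presumably use a language model or similar. The threshold value 0.8 is likewise an assumption.

```python
from typing import Callable

SENTENCE_END = (".", "!", "?", "。", "！", "？")


def toy_integrity_score(text: str) -> float:
    """Hypothetical scorer: high if the text ends like a finished sentence."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    return 1.0 if stripped.endswith(SENTENCE_END) else 0.4


def meets_integrity_condition(
    text: str,
    threshold: float = 0.8,
    score: Callable[[str], float] = toy_integrity_score,
) -> bool:
    # The condition is met only when the score exceeds the preset threshold.
    return score(text) > threshold
```

Because the scorer is a parameter, the same comparison works unchanged with any scoring function that returns a number.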

(2) Determining whether the text transfer condition is met according to the correlation between the transcribed text displayed in the text popup and the transcribed text of the subsequent input voice.

Specifically, the user voice input process is a continuous process, and the input voice is also in a voice stream form. With the increase of the input voice stream, the corresponding transcription text can be continuously acquired, and the acquired transcription text is displayed in the text popup window. At a certain moment, if the transcribed text in the text popup has not been transferred to the target text region and the transcribed text corresponding to the subsequent input voice has not entered the text popup, then it can be determined whether the text transfer condition is satisfied through the correlation between the transcribed text displayed in the text popup and the transcribed text of the subsequent input voice.

This embodiment illustrates two optional ways of determining, through correlation verification, whether the text transfer condition is satisfied:

First way:

Verification is performed using the semantic relevance of the text, as follows:

A semantic relevance score is determined between the transcribed text displayed in the text popup and the transcribed text of the subsequent input voice. When the relevance score is smaller than a preset relevance threshold, the semantic relevance between the two is low; transferring the transcribed text displayed in the text popup to the target text area will then not hinder the user's understanding while reading, so the text transfer condition can be determined to be satisfied. If the relevance score reaches the preset relevance threshold, the relevance between the two is high; transferring the transcribed text displayed in the text popup to the target text area on its own would sever its connection with the subsequent input text and hinder the user's understanding, so the text transfer condition can be determined not to be satisfied.
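The comparison against a relevance threshold can be sketched as below. How the relevance score is computed is not specified here, so `toy_relevance_score` uses word-set overlap (Jaccard similarity) purely as a hypothetical placeholder for a real semantic model; the threshold 0.2 is also an assumed value.

```python
from typing import Callable


def toy_relevance_score(current: str, following: str) -> float:
    """Hypothetical scorer: Jaccard overlap of the lowercase word sets."""
    a = set(current.lower().split())
    b = set(following.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def meets_relevance_condition(
    current: str,
    following: str,
    threshold: float = 0.2,
    score: Callable[[str, str], float] = toy_relevance_score,
) -> bool:
    # Low relevance to what follows means the popup text can be transferred on its own.
    return score(current, following) < threshold
```

Note the direction of the comparison: the condition is met when the score is below the threshold, and not met once the score reaches it.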

Second way:

Verification is performed using the correlation between the input voices corresponding to the texts, as follows:

Whether the two are related is determined from the pause between the input voice corresponding to the transcribed text displayed in the text popup and the subsequent input voice. In general, when two pieces of text are not closely related, the interval between the corresponding input voices is long. Therefore, when the pause duration reaches a preset pause duration threshold, the relevance can be determined to be low; transferring the transcribed text displayed in the text popup to the target text area will not hinder the user's understanding, so the text transfer condition can be determined to be satisfied. Conversely, when the pause duration does not reach the preset pause duration threshold, the relevance is relatively high; transferring the transcribed text displayed in the text popup to the target text area on its own would sever its connection with the subsequent input text and hinder the user's understanding, so the text transfer condition can be determined not to be satisfied.
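The pause-based check reduces to a single comparison, sketched here. The sketch assumes that the end time of the earlier input voice and the start time of the subsequent input voice are available in seconds; the function name and the 1.5-second default threshold are hypothetical.

```python
def meets_pause_condition(
    prev_end_s: float,
    next_start_s: float,
    pause_threshold_s: float = 1.5,
) -> bool:
    """True when the pause between two input voices reaches the preset threshold."""
    pause_s = next_start_s - prev_end_s
    # A long pause suggests low relevance, so the popup text can be transferred.
    return pause_s >= pause_threshold_s
```

A pause exactly equal to the threshold counts as reaching it, matching the wording above.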

(3) Detecting whether a text transfer instruction is received; if so, determining that the text transfer condition is met, and otherwise determining that it is not met.

Specifically, the text transfer instruction may include a voice instruction or an instruction input by an external device.

When the text transfer instruction is a voice instruction, instruction texts corresponding to the transfer instruction, such as "text transfer", "pause entry" and the like, can be preset. When it is detected that such an instruction text is input by voice, the text transfer condition is satisfied; otherwise, when no transfer instruction is detected in the voice input, the text transfer condition is not satisfied.

When the text transfer instruction is an instruction input through an external device, such as a keyboard or a mouse, take as an example a keyboard instruction issued by pressing the space bar: when it is detected that the space bar is triggered, the text transfer condition is met; if the space bar is not detected to be triggered, the text transfer condition is not met.
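The instruction-based judgment can be illustrated with a short sketch. The event representation (a kind string plus a payload string) and the names `VOICE_COMMANDS`, `TRANSFER_KEYS` and `is_transfer_instruction` are assumptions for the example; only the sample instruction texts and the space bar come from the description above.

```python
# Preset instruction texts taken from the examples in the description.
VOICE_COMMANDS = {"text transfer", "pause entry"}

# Hypothetical key name for the space bar; a real system would use its own key codes.
TRANSFER_KEYS = {"space"}


def is_transfer_instruction(event_kind: str, payload: str) -> bool:
    """Return True when the event counts as a text transfer instruction.

    event_kind is "voice" (payload is the recognised text) or "key"
    (payload is the name of the key that was triggered).
    """
    if event_kind == "voice":
        return payload.strip().lower() in VOICE_COMMANDS
    if event_kind == "key":
        return payload in TRANSFER_KEYS
    return False
```

Any event that is neither a preset voice instruction nor a configured key leaves the text transfer condition unsatisfied.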

In another embodiment of the present application, when the transcribed text displayed in the text popup is transferred to the input focus in the target text region, the transcribed text in the text popup can, in the ordinary case, be transferred directly to the target text region.

In addition, in some scenarios, there may be some special requirements for the target text region, such as defining the format of the input text content, that is, only allowing text in a specific format to be input into the target text region. For example, there is a specific format requirement for the date of entry, or a specific requirement for the font of the text entered, etc. Based on this situation, this embodiment provides an implementation manner for transferring the transcribed text displayed in the text popup to the target text area, which is detailed as follows:

S1, acquiring the format requirement of the target text area on the input text.

Specifically, the format requirement of the target text region on the input text can be obtained by calling the background code of the target text region and reading the format requirement from that background code.

In addition, a target text area that has a requirement on the input text format may indicate that requirement through input prompt information on the interface, so as to prompt the user. For example, in the information retrieval interface illustrated in fig. 6, the date input box requires input in the format xxxx/xx/xx.

Based on this, the application can obtain the input prompt information related to the target text area, and then analyze the obtained input prompt information to determine the format requirement of the target text area on the input text.

S2, according to the format requirement, performing format editing on the transcribed text displayed in the text popup to obtain the transcribed text after format editing.

Specifically, the transcribed text displayed in the text popup is converted into the format required by the target text area, so that the transcribed text after format editing is obtained.

S3, transferring the transcribed text after format editing to the input focus in the target text area.

Specifically, since the transcribed text after format editing meets the requirement of the target text region, it can be transferred to the input focus in the target text region. As shown in fig. 7, at the time of voice input, the transcribed text obtained by speech recognition and displayed in the text popup A2 is "2018.9.10". Assume that the target text area is a date input box A1, which requires the entered date to be in the xxxx/xx/xx format. The application can therefore convert the transcribed text displayed in the text popup A2 according to the format requirement of the date input box A1, that is, convert "2018.9.10" into "2018/09/10", and then transfer the converted "2018/09/10" into the date input box A1.
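Steps S1 to S3 can be sketched for the date example of fig. 7. This is an illustrative sketch, not the implementation of the application: it assumes that the format requirement is available as a hint string such as "xxxx/xx/xx", that the hint has three fields joined by a single separator, and that the function names `parse_date_hint` and `format_for_target` are hypothetical.

```python
import re
from typing import List, Tuple


def parse_date_hint(hint: str) -> Tuple[List[int], str]:
    """S1: derive the field widths and separator from a hint such as 'xxxx/xx/xx'."""
    fields = re.findall(r"x+", hint)
    separators = re.findall(r"[^x]+", hint)
    if len(fields) != 3 or len(separators) != 2 or separators[0] != separators[1]:
        raise ValueError(f"unsupported format hint: {hint!r}")
    return [len(field) for field in fields], separators[0]


def format_for_target(transcribed: str, hint: str) -> str:
    """S2: rewrite the transcribed date so that it matches the hinted format."""
    widths, separator = parse_date_hint(hint)
    parts = re.findall(r"\d+", transcribed)
    if len(parts) != len(widths):
        raise ValueError(f"cannot read a date from {transcribed!r}")
    # Left-pad each numeric field with zeros up to the width required by the hint.
    return separator.join(part.zfill(width) for part, width in zip(parts, widths))


# S3 would then insert the formatted text at the input focus of the date box.
formatted = format_for_target("2018.9.10", "xxxx/xx/xx")
print(formatted)  # 2018/09/10
```

Other kinds of format requirement mentioned above, such as a required font, would need a different conversion step and are not covered by this sketch.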

In another embodiment of the present application, the transcribed text may need to be edited to correct errors, because the speech recognition may be wrong or the user's voice input may itself contain mistakes. Further, in order to avoid the inconvenience of editing text in the target text region, the present application may support the user in editing the transcribed text displayed in the text popup.

Specifically, before transferring the transcription text displayed in the text popup to the input focus in the target text region in the above embodiment, the voice input method of the present application may further include the following processing step:

That is, before the transcribed text is transferred to the target text region, the user may perform an editing operation, such as a modification, deletion, replacement, etc., on the transcribed text in the text popup, and the text popup may display the edited transcribed text. For the edited transcribed text, further transfer to the target text area is possible.

It can be understood that, in order to support the editing operation on the transcribed text in the text popup, in this embodiment, when it is detected that the user performs an editing operation on the transcribed text in the text popup, the text popup can be set to a continuous display state, that is, the text popup cannot be hidden or destroyed, so that the user can edit the text in the text popup more conveniently. Of course, when the user needs to hide or destroy the text popup, the user can do so by issuing a corresponding instruction.

The user can edit a designated transcription text within the text popup individually. In addition, global editing can be set. In the global editing mode, the editing range is all of the transcribed text displayed in the text popup: every transcribed text within the editing range that is the same as the transcribed text specified by the user is edited according to the global editing operation, and the edited transcribed text is displayed.

Whether the editing mode is individual editing or global editing can be set in advance, or the user can switch the editing mode at any time by instruction.

The present embodiment takes the global editing mode as an example for explanation.

The user may initiate an editing instruction for the designated transcribed text in the text popup. Specifically, the editing instruction may be initiated by voice or through an external input device. In the voice form, the user specifies by voice both the transcription text to be edited and the specific editing mode to apply to it. The application analyzes the voice instruction to determine the specified transcription text and the specific editing mode, searches the editing range for all occurrences of the specified transcription text, and executes the editing operation according to the analyzed editing mode, so that all identical occurrences of the specified transcription text in the text popup are edited at one time.

Further, the user may also initiate an editing instruction through an external input device, such as in a form of a keyboard cooperating with a mouse, to specify the transcribed text to be edited, and to implement the editing operation. Specifically, there may be two implementations, as follows:

First, the user selects a transcription text to be edited at a certain position in the text popup, and the selected transcription text is the specified transcription text. The user can then perform a specific editing operation, such as deletion or replacement, on the specified transcription text. After the user's editing operation, the application can further search the text popup for transcription text that is the same as the specified transcription text selected by the user, and apply the same editing operation to each occurrence found, so that the same editing operation is performed on every identical occurrence of the specified transcription text in the text popup.

Second, the user can call up a text editing page that provides an interface for setting the specified transcription text to be edited and an interface for setting the specific editing operation mode. The user can thus input the specified transcription text to be edited and the specific editing operation mode in the text editing page. The application can then search the text popup for each occurrence of the specified transcription text set in the text editing page, and edit each occurrence according to the editing operation mode set there, so that the same editing operation is performed on every identical occurrence of the specified transcription text in the text popup.

The text editing process within the text popup is explained next with reference to fig. 8.

Assume that the content displayed in the current text popup is: "Beijing to Dalian, Beijing to Tianjin", and that the user wants to change every "Beijing" in the displayed content to "Shanghai". The editing instruction can be issued by voice or through an external input device. Taking a keyboard used together with a mouse as an example, the user can manually change the first "Beijing" displayed in the text popup to "Shanghai"; provided that the user has preset the editing mode to global editing, the application then automatically changes the "Beijing" displayed elsewhere in the text popup to "Shanghai".
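The difference between individual editing and global editing in the example of fig. 8 can be sketched as follows. The function `apply_edit` and its parameters are hypothetical; the sketch assumes the user's selection is given as a character range within the popup text.

```python
def apply_edit(
    popup_text: str,
    start: int,
    end: int,
    replacement: str,
    global_edit: bool = True,
) -> str:
    """Replace the selected span, or every identical span in global editing mode."""
    selected = popup_text[start:end]
    if not selected:
        # Nothing selected, so there is nothing to edit.
        return popup_text
    if global_edit:
        # Global editing: every occurrence of the selected text is edited the same way.
        return popup_text.replace(selected, replacement)
    # Individual editing: only the selected occurrence is changed.
    return popup_text[:start] + replacement + popup_text[end:]


text = "Beijing to Dalian, Beijing to Tianjin"
print(apply_edit(text, 0, 7, "Shanghai"))
# Shanghai to Dalian, Shanghai to Tianjin
print(apply_edit(text, 0, 7, "Shanghai", global_edit=False))
# Shanghai to Dalian, Beijing to Tianjin
```

Passing an empty replacement string corresponds to the deletion operation mentioned in the description.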

Further, in some scenarios, for a part of the transcribed text that has been transferred into the target text region, the user may still have a need to edit the part of the transcribed text, and at this time, the user may drag the transcribed text that needs to be modified from the target text region into the text popup, thereby executing the editing process described above. Specifically, the application can transfer the selected transcription text to the text popup in response to an instruction to transfer the selected transcription text in the target text area to the text popup.

Referring to fig. 9, the content already in the target text region includes: "Beijing to Wuhan, Beijing to Qingdao,", and the content displayed in the text popup includes: "Beijing to Dalian, Beijing to Tianjin". At this time, the user finds that each "Beijing" in "Beijing to Qingdao, Beijing to Dalian, Beijing to Tianjin" needs to be changed to "Shanghai". To modify the text more quickly, the user can drag "Beijing to Qingdao," from the target text area into the text popup, and then change the "Beijing" at every position in the text popup to "Shanghai" at one time through the global editing mode of the text popup. This is faster than modifying each occurrence separately.
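The scenario of fig. 9 can be sketched as a pure state transformation. The `InputState` type and both function names are hypothetical, and placing the dragged text in front of the existing popup content, joined by a space, is an assumption made so that the result matches the example; the description does not specify where the moved text lands in the popup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputState:
    target_text: str  # text already in the target text region
    popup_text: str   # text currently displayed in the text popup


def move_selection_to_popup(
    state: InputState, start: int, end: int, joiner: str = " "
) -> InputState:
    """Move target_text[start:end] out of the target region and into the popup."""
    selected = state.target_text[start:end]
    if not selected:
        return state
    remaining = state.target_text[:start] + state.target_text[end:]
    # Assumption: the moved text is placed before the existing popup content.
    popup = selected + joiner + state.popup_text if state.popup_text else selected
    return InputState(target_text=remaining, popup_text=popup)


def global_replace(state: InputState, old: str, new: str) -> InputState:
    """Apply a global edit to the popup only; the target region is left untouched."""
    return InputState(state.target_text, state.popup_text.replace(old, new))


state = InputState(
    target_text="Beijing to Wuhan, Beijing to Qingdao,",
    popup_text="Beijing to Dalian, Beijing to Tianjin",
)
selection = "Beijing to Qingdao,"
begin = state.target_text.index(selection)
moved = move_selection_to_popup(state, begin, begin + len(selection))
edited = global_replace(moved, "Beijing", "Shanghai")
print(edited.popup_text)
# Shanghai to Qingdao, Shanghai to Dalian, Shanghai to Tianjin
```

Because the global edit is confined to the popup, "Beijing to Wuhan" in the target text region is not changed.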

The following describes a voice input device provided in an embodiment of the present application, and the voice input device described below and the voice input method described above may be referred to correspondingly.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a voice input device disclosed in the embodiment of the present application.

As shown in fig. 10, the apparatus may include:

a text popup display unit 110 for displaying a text popup in response to a voice input operation in a state where an input focus is in a target text region;

a text acquiring and displaying unit 120, configured to acquire and display a transcription text corresponding to the input voice in the text popup;

a text transferring unit 130, configured to transfer the transcribed text displayed in the text popup to the input focus in the target text region.
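The cooperation of units 110, 120 and 130 can be pictured with a small sketch. The class `VoiceInputDevice`, its attribute names and the injected `transcribe` callable are assumptions for illustration; the application does not prescribe any particular class layout or speech recognition interface.

```python
from typing import Callable


class VoiceInputDevice:
    """Toy model of the three units: popup display, text display, text transfer."""

    def __init__(self, target_text: str = "", focus: int = 0) -> None:
        self.target_text = target_text  # content of the target text region
        self.focus = focus              # position of the input focus
        self.popup_visible = False
        self.popup_text = ""

    def show_popup(self) -> None:
        """Unit 110: display the text popup in response to a voice input operation."""
        self.popup_visible = True

    def acquire_and_display(
        self, audio: bytes, transcribe: Callable[[bytes], str]
    ) -> None:
        """Unit 120: obtain the transcription of the audio and show it in the popup."""
        if not self.popup_visible:
            raise RuntimeError("the text popup must be displayed first")
        self.popup_text += transcribe(audio)

    def transfer(self) -> None:
        """Unit 130: move the popup text to the input focus in the target region."""
        text = self.popup_text
        self.target_text = (
            self.target_text[: self.focus] + text + self.target_text[self.focus :]
        )
        self.focus += len(text)
        self.popup_text = ""


device = VoiceInputDevice(target_text="ab", focus=1)
device.show_popup()
device.acquire_and_display(b"", lambda audio: "XY")  # stand-in for a real recogniser
device.transfer()
print(device.target_text)  # aXYb
```

The recogniser is passed in as a plain callable so that the sketch stays independent of any specific speech recognition engine.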

According to the above technical scheme, the voice input device provided by the embodiment of the application displays a text popup in response to a voice input operation performed while the input focus is in the target text region, and then acquires the transcription text corresponding to the input voice and displays it in the text popup. Displaying the transcription text in the text popup lets the user determine the text content of the current voice input more quickly and conveniently. Finally, the transcription text displayed in the text popup is transferred to the input focus in the target text region, completing the whole input process. Compared with existing voice input modes, the application adds the step of displaying the transcribed text of the input voice in a text popup, so that the user can conveniently determine the text content of the current voice input, which improves overall input efficiency.

Optionally, the speech input device of the present application may further include:

Optionally, the text popup display unit may include:

Optionally, the transition condition determining unit may include:

and/or,

Optionally, the transition condition determining unit may include:

Optionally, the text transfer unit may include:

Optionally, the editing operation response unit may include:

The voice input device provided by the embodiment of the application can be applied to voice input equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Optionally, fig. 11 shows a block diagram of a hardware structure of the voice input equipment, and referring to fig. 11, the hardware structure of the voice input equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory or the like, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

Optionally, for the detailed functions and extended functions of the program, reference may be made to the above description.

Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A speech input method, comprising:

responding to an instruction for transferring the selected transcribed text in the target text area into the text popup, and transferring the selected transcribed text into the text popup;

responding to the editing operation of the transcribed text in the text popup window, and displaying the transcribed text after editing;

2. The method of claim 1, wherein after transferring the transcribed text displayed in the text popup to the input focus in the target text region, the method further comprises:

hiding or destroying the text popup.

3. The method of claim 1, wherein while the transcribed text corresponding to the input speech is displayed in the text popup, the method further comprises:

4. The method of claim 1, wherein the presenting a text popup in response to a voice input operation with an input focus in a state of a target text region comprises:

5. The method of claim 1, wherein prior to said transferring the transcribed text displayed in the text popup to the input focus in the target text region, the method further comprises:

determining whether a text transfer condition is satisfied;

6. The method of claim 5, wherein the determining whether the text transfer condition is satisfied comprises:

and/or,

7. The method of claim 5, wherein the determining whether the text transfer condition is satisfied comprises:

8. The method of claim 1, wherein the transferring the transcribed text displayed in the text popup to the input focus in the target text region comprises:

acquiring the format requirement of a target text area on an input text;

9. The method of claim 1, wherein displaying the edited transcribed text in response to the editing operation on the transcribed text in the text popup comprises:

10. A speech input device, comprising:

a target text region text transfer unit for transferring a selected transcription text in a target text region into the text popup in response to an instruction to transfer the selected transcription text into the text popup;

an editing operation response unit for displaying the edited transcribed text in response to an editing operation on the transcribed text in the text popup;

11. The apparatus of claim 10, further comprising:

12. The apparatus of claim 10, wherein the text transfer unit comprises:

13. A voice input device comprising a memory and a processor;

the memory is used for storing programs;

the processor, which executes the program, implements the steps of the voice input method according to any one of claims 1 to 9.

14. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech input method according to any one of claims 1-9.