JP7542826B2

JP7542826B2 - Voice recognition program and voice recognition device

Info

Publication number: JP7542826B2
Application number: JP2021060936A
Authority: JP
Inventors: 正樹中村
Original assignee: Aisin Seiki Co Ltd; Aisin Corp
Current assignee: Aisin Corp
Priority date: 2021-01-29
Filing date: 2021-03-31
Publication date: 2024-09-02
Anticipated expiration: 2041-03-31
Also published as: JP2022117374A; JP2022117375A; JP2022117376A; JP7552481B2

Description

本発明は、音声認識プログラム及び音声認識装置に関するものである。 The present invention relates to a voice recognition program and a voice recognition device.

特許文献１には、入力された音声を音声認識し、その認識結果を用いて経路の案内や車両の運転等を行うシステム２００が開示されている。そのシステム２００には、ユーザが発話する音声を入力する音声入力装置２１と、音声入力装置２１から入力される一続きの音声を構成する始端から終端までの音声区間を検出する音声区間検出部１１と、その音声区間検出部１１で検出された音声区間内の音声を音声認識する音声認識部１２とが設けられる。 Patent Document 1 discloses a system 200 that recognizes input speech and uses the recognition results to provide route guidance, drive a vehicle, and the like. The system 200 is provided with a speech input device 21 that inputs speech uttered by a user, a speech interval detection unit 11 that detects speech intervals from the beginning to the end that constitute a continuous speech input from the speech input device 21, and a speech recognition unit 12 that recognizes speech within the speech interval detected by the speech interval detection unit 11.

先に第１音声（第１発話）が音声入力装置２１に入力され、その後に第２音声（第２発話）が音声入力装置２１に入力される場合、まず、音声区間検出部１１によって第１音声に対応する第１音声区間が検出され、その第１音声区間の始端から音声認識部１２による第１音声の音声認識が開始される。そして、第１音声区間の終端まで第１音声区間の音声認識が終了した後に、第２音声に対応する第２音声区間の検出および第２音声区間の始端からの音声認識が開始される。これによって、第１音声と第２音声とを区別して音声認識することができる。 When a first voice (first utterance) is input to the voice input device 21 first and then a second voice (second utterance) is input to the voice input device 21, first the voice section corresponding to the first voice is detected by the voice section detection unit 11, and voice recognition of the first voice by the voice recognition unit 12 is started from the start of the first voice section. Then, after voice recognition of the first voice section up to the end of the first voice section is completed, detection of a second voice section corresponding to the second voice and voice recognition from the start of the second voice section are started. This makes it possible to distinguish between the first voice and the second voice and perform voice recognition.

Patent Document 1: International Publication No. 2019/058453 (e.g., paragraphs 0013-0039, FIGS. 1 and 5)

第１音声と第２音声とが連続して発話された場合、第１音声と第２音声との間隔が短時間となる。かかる場合においては、第１音声区間の音声認識、第２音声区間の検出および第２音声区間の音声認識の開始も短時間に行う必要がある。よって、第１音声区間の音声認識に時間を要すると、その後に音声入力装置２１から入力される第２音声の第２音声区間の検出の開始が遅れ、検出された第２音声区間の始端が実際の第２音声の始端よりも遅れて検出される虞がある。これによって、第２音声において始端で発話された内容の音声認識が欠落し、第２音声が正確に音声認識できない虞があるという問題点があった。 When the first voice and the second voice are spoken consecutively, the interval between the first voice and the second voice is short. In such a case, it is necessary to perform voice recognition of the first voice section, detection of the second voice section, and start of voice recognition of the second voice section in a short time. Therefore, if voice recognition of the first voice section takes time, there is a risk that the start of detection of the second voice section of the second voice input from the voice input device 21 thereafter will be delayed, and the start of the detected second voice section will be detected later than the actual start of the second voice. This causes a problem in that voice recognition of the content spoken at the start of the second voice will be missed, and there is a risk that the second voice will not be accurately recognized.

本発明は、上述した問題点を解決するためになされたものであり、第１発話と第２発話とが連続して入力された場合でも、それぞれを正確に音声認識できる音声認識プログラム及び音声認識装置を提供することを目的としている。 The present invention has been made to solve the above-mentioned problems, and aims to provide a speech recognition program and a speech recognition device that can accurately recognize each of the first and second utterances even when they are input consecutively.

この目的を達成するために本発明の音声認識プログラムは、記憶部を備えたコンピュータに、音声認識処理を実行させるプログラムであって、前記記憶部を音声が記憶される音声記憶手段として機能させ、入力された音声を前記音声記憶手段に記憶する音声記憶ステップと、前記音声記憶手段に記憶される音声による発話の開始時刻を取得する開始時刻取得ステップと、前記音声記憶手段に記憶される音声による発話の終了時刻を取得する終了時刻取得ステップと、その終了時刻取得ステップで取得された第１発話の終了時刻と、前記開始時刻取得ステップで取得された開始時刻であって前記第１発話の後に入力される第２発話の開始時刻との時間差である発話間隔を取得する間隔取得ステップと、その間隔取得ステップで取得された発話間隔に基づいて、前記開始時刻取得ステップで取得された前記第２発話の開始時刻から遡る時間である遡及時間を取得する遡及時間取得ステップと、前記音声記憶手段に記憶される音声において、前記開始時刻取得ステップで取得された前記第２発話の開始時刻から前記遡及時間取得ステップで取得された遡及時間を遡った時刻から前記第２発話の音声認識を開始する音声認識ステップとを備えている。 In order to achieve this object, the voice recognition program of the present invention is a program for causing a computer having a storage unit to execute a voice recognition process, and includes a voice storage step for causing the storage unit to function as a voice storage means for storing voice, and storing input voice in the voice storage means, a start time acquisition step for acquiring a start time of an utterance by voice stored in the voice storage means, an end time acquisition step for acquiring an end time of an utterance by voice stored in the voice storage means, an interval acquisition step for acquiring an utterance interval that is a time difference between the end time of a first utterance acquired in the end time acquisition step and the start time acquired in the start time acquisition step, which is a start time of a second utterance input after the first utterance, a retroactive time acquisition step for acquiring a retroactive time that is a time going back from the start time of the second utterance acquired in the start time acquisition step based on the utterance interval acquired in the interval acquisition step, and a voice recognition step for starting voice recognition of the second utterance from a time going back from the start time of the second utterance acquired in the start time acquisition step by the retroactive time acquired in the retroactive time acquisition step, in the voice stored in the voice storage means.

The voice recognition device of the present invention includes: voice input means for inputting voice; voice storage means for storing the voice input by the voice input means; start time acquisition means for acquiring the start time of an utterance in the voice stored in the voice storage means; end time acquisition means for acquiring the end time of an utterance in the voice stored in the voice storage means; interval acquisition means for acquiring an utterance interval, which is the time difference between the end time of a first utterance acquired by the end time acquisition means and the start time, acquired by the start time acquisition means, of a second utterance input after the first utterance; retroactive time acquisition means for acquiring, based on the utterance interval acquired by the interval acquisition means, a retroactive time, which is the length of time to go back from the start time of the second utterance acquired by the start time acquisition means; and voice recognition means for starting voice recognition of the second utterance, in the voice stored in the voice storage means, from the time obtained by going back from the start time of the second utterance acquired by the start time acquisition means by the retroactive time acquired by the retroactive time acquisition means.

According to the voice recognition program of claim 1, the input voice is stored in the voice storage means, the end time of the first utterance and the start time of the second utterance are acquired from the stored voice, and a retroactive time based on the speech interval, which is the time difference between them, is acquired. Voice recognition of the second utterance is then started, in the stored voice, from the time obtained by going back from the start time of the second utterance by the retroactive time. Because recognition reliably covers the beginning of the second utterance as stored in the voice storage means, each utterance can be recognized accurately even when the first and second utterances are input consecutively. In addition, because the retroactive time is set according to the speech interval between the first and second utterances, recognition can start from the beginning of the second utterance while the influence of the first utterance on recognition of the second utterance is kept to a minimum.

The voice recognition program of claim 2 has the following effect in addition to the effects of the voice recognition program of claim 1. A speech interval equal to or less than the first predetermined time means that the interval between the first utterance and the second utterance is short and the two utterances are consecutive. In that case, the retroactive time is set to a first retroactive time that is equal to or greater than the first predetermined time, so that voice recognition can reliably start from the beginning of the second utterance that follows the first utterance.

The voice recognition program of claim 3 has the following effects in addition to the effects of the voice recognition program of claim 2. When the speech interval is equal to or greater than a second predetermined time that is longer than the first predetermined time, i.e., when the interval between the first utterance and the second utterance is long, a second retroactive time equal to or less than the second predetermined time is acquired as the retroactive time. This limits the time lag between the point at which voice recognition of the second utterance starts and the point at which the second utterance actually starts. As a result, ambient sounds that occur before the second utterance starts are less likely to be erroneously recognized, and the processing time the computer needs to recognize the second utterance is shortened, which reduces the processing load on the computer.

請求項４記載の音声認識プログラムによれば、請求項３記載の音声認識プログラムの奏する効果に加え、次の効果を奏する。第１遡及時間が第１所定時間以上かつ第２所定時間以下の時間に設定されるので、第２発話の音声認識を開始する時刻が第１発話の開始時刻まで遡ることを抑制できる。これにより、第２発話と共に第１発話の全体が音声認識されるのを抑制できるという効果がある。 The speech recognition program according to claim 4 has the following effect in addition to the effect of the speech recognition program according to claim 3. Since the first retroactive time is set to a time equal to or greater than the first predetermined time and equal to or less than the second predetermined time, it is possible to prevent the time at which speech recognition of the second utterance is started from going back to the start time of the first utterance. This has the effect of preventing the entire first utterance from being speech-recognized along with the second utterance.

請求項５記載の音声認識プログラムによれば、請求項３又は４に記載の音声認識プログラムの奏する効果に加え、次の効果を奏する。第２遡及時間が第１所定時間以上かつ第２所定時間以下の時間に設定される。これにより、第２発話の音声認識を開始する時刻が第１発話の開始時刻まで遡ることを抑制できる。これにより、第２発話と共に第１発話の全体が音声認識されるのを抑制できるという効果がある。 The speech recognition program according to claim 5 has the following effect in addition to the effect of the speech recognition program according to claim 3 or 4. The second retroactive time is set to a time equal to or greater than the first predetermined time and equal to or less than the second predetermined time. This makes it possible to prevent the time at which speech recognition of the second utterance starts from going back to the start time of the first utterance. This has the effect of preventing the entire first utterance from being speech-recognized along with the second utterance.

The voice recognition program of claim 6 has the following effect in addition to the effects of the voice recognition program of any one of claims 1 to 5. When the speech interval is between the first predetermined time and the second predetermined time, the speech interval itself is set as the retroactive time. Voice recognition of the second utterance therefore starts at the end time of the first utterance, so the retroactive time can be obtained easily and voice recognition can reliably start from the beginning of the second utterance.

請求項７記載の音声認識装置によれば、請求項１記載の音声認識プログラムと同様の効果を奏する。 The voice recognition device described in claim 7 achieves the same effect as the voice recognition program described in claim 1.

FIG. 1 is an external view of a mobile terminal.

FIG. 2 is a diagram showing the volume of a voice and the start and end times of a user's utterance.

FIG. 3(a) is a diagram showing the timing for starting voice recognition when the speech interval is equal to or less than a first predetermined time, FIG. 3(b) is a diagram showing the timing for starting voice recognition when the speech interval is equal to or greater than a second predetermined time, and FIG. 3(c) is a diagram showing the timing for starting voice recognition when the speech interval is between the first and second predetermined times.

FIG. 4 is a block diagram showing the electrical configuration of the mobile terminal.

FIG. 5(a) is a flowchart of voice processing, and FIG. 5(b) is a flowchart of recording processing.

FIG. 6 is a flowchart of voice recognition processing.

以下、本発明の好ましい実施形態について、添付図面を参照して説明する。まず、図１を参照して、本実施形態における携帯端末１の構成を説明する。図１は、携帯端末１の外観図である。携帯端末１は、ユーザＨが発する発話を音声認識する情報処理装置（コンピュータ）である。 A preferred embodiment of the present invention will now be described with reference to the accompanying drawings. First, the configuration of a mobile terminal 1 in this embodiment will be described with reference to FIG. 1. FIG. 1 is an external view of the mobile terminal 1. The mobile terminal 1 is an information processing device (computer) that performs voice recognition of speech uttered by a user H.

携帯端末１では、音声Ｖが入力可能に構成され、入力された音声Ｖの音量に基づいてユーザＨが発した発話かどうかが判断され、その発話毎に音声認識が実行される。なお、音声認識としては、公知の手法が採用されるが、例えば、音声Ｖを文字列に変換し、変換された文字列を該当する語句に置き換えるものが挙げられる。まず、図２を参照して、携帯端末１に入力された音声ＶからユーザＨの発話の開始および終了を判断する手法を説明する。 The mobile terminal 1 is configured to be able to input voice V, and judges whether the utterance is made by user H based on the volume of the input voice V, and executes voice recognition for each utterance. Note that a known method is used for voice recognition, such as converting voice V into a character string and replacing the converted character string with a corresponding word or phrase. First, referring to FIG. 2, a method for judging the start and end of user H's utterance from voice V input to the mobile terminal 1 will be described.

図２は、音声Ｖの音量と、ユーザＨの発話の開始時刻ＳｔＴ及び終了時刻ＥｄＴとを模式的に表した図である。図２においては横軸に時刻が、縦軸に音声Ｖの音量（ｄＢ）がそれぞれ設定され、その音量の最大値が「０ｄＢ」とされ、最小値が「－１２０ｄＢ」とされる。音量の範囲は０ｄＢから－１２０ｄＢまでに限られず、これ以外の範囲でも良い。 Figure 2 is a diagram that shows a schematic representation of the volume of voice V and the start time StT and end time EdT of user H's speech. In Figure 2, the horizontal axis represents time and the vertical axis represents the volume (dB) of voice V, with the maximum volume being "0 dB" and the minimum volume being "-120 dB." The range of volume is not limited to 0 dB to -120 dB, and may be any other range.

本実施形態の携帯端末１では、入力された音声Ｖの音量に基づいてユーザＨが発話しているかどうかが判断される。具体的には、発話が開始したかを判定する音量の閾値である開始判定値Ｓｔ＿Ａと、発話が終了したかどうかを判定する音量の閾値である終了判定値Ｅｄ＿Ａとがそれぞれ設定される。開始判定値Ｓｔ＿Ａには、終了判定値Ｅｄ＿Ａより大きな音量が設定され、開始判定値Ｓｔ＿Ａとしては「－２５ｄＢ」が、終了判定値Ｅｄ＿Ａとしては「－３０ｄＢ」がそれぞれ例示される。 In the mobile terminal 1 of this embodiment, it is determined whether the user H is speaking based on the volume of the input voice V. Specifically, a start judgment value St_A, which is a volume threshold for determining whether speech has started, and an end judgment value Ed_A, which is a volume threshold for determining whether speech has ended, are set. A volume greater than the end judgment value Ed_A is set for the start judgment value St_A, and examples of the start judgment value St_A and the end judgment value Ed_A are "-25 dB" and "-30 dB", respectively.

入力された音声Ｖの音量が開始判定値Ｓｔ＿Ａより小さい状態から開始判定値Ｓｔ＿Ａ以上となった場合に、ユーザＨの発話が開始したと判断され、その際の時刻が開始時刻ＳｔＴとされる。一方で、開始時刻ＳｔＴ以後に、終了判定値Ｅｄ＿Ａ以下となった場合にユーザＨの発話が終了したと判断され、その時刻が終了時刻ＥｄＴとされる。即ち開始時刻ＳｔＴから終了時刻ＥｄＴまでの間に、ユーザＨの発話がされていたと判断される。 When the volume of the input voice V goes from being lower than the start judgment value St_A to being equal to or higher than the start judgment value St_A, it is determined that user H has started speaking, and the time at which this occurs is designated as the start time StT. On the other hand, if the volume falls below the end judgment value Ed_A after the start time StT, it is determined that user H has ended speaking, and that time is designated as the end time EdT. In other words, it is determined that user H has been speaking between the start time StT and the end time EdT.

Setting the start judgment value St_A to a higher volume than the end judgment value Ed_A makes it possible to clearly distinguish the start of an utterance from ambient sounds, so that ambient sounds are less likely to be mistaken for an utterance by user H. Conversely, setting the end judgment value Ed_A to a lower volume than the start judgment value St_A means that, while user H is judged to be speaking, the utterance is still judged to be continuing even if its volume temporarily drops below the start judgment value St_A. Together, these settings allow the start and end of user H's utterance to be acquired appropriately.
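The two-threshold judgment described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes the volume has already been converted to one dB value per sample, and the function name `detect_utterances` and the list-of-indices return format are hypothetical. The threshold values are the examples given in the description.

```python
from typing import List, Tuple

START_THRESHOLD_DB = -25.0  # start judgment value St_A (example value)
END_THRESHOLD_DB = -30.0    # end judgment value Ed_A (example value)
MIN_VOLUME_DB = -120.0      # minimum volume, used as the initial "previous" value


def detect_utterances(volumes_db: List[float]) -> List[Tuple[int, int]]:
    """Return one (start_index, end_index) pair per detected utterance."""
    utterances = []
    speaking = False
    start_index = 0
    previous = MIN_VOLUME_DB
    for index, current in enumerate(volumes_db):
        if not speaking:
            # Start: the volume rises from below St_A to St_A or above.
            if previous < START_THRESHOLD_DB <= current:
                speaking = True
                start_index = index
        elif current <= END_THRESHOLD_DB:
            # End: after the start, the volume falls to Ed_A or below.
            speaking = False
            utterances.append((start_index, index))
        previous = current
    return utterances
```

Because the end threshold is lower than the start threshold, a dip to, say, -27 dB in the middle of an utterance does not end it; only a drop to -30 dB or below does.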

Voice recognition of an utterance is performed based on the start time StT and end time EdT of user H's utterance acquired in this way. In this embodiment, when user H makes consecutive utterances, the timing for starting voice recognition of the later utterance is set according to the speech interval ΔT, which is the time difference between the earlier utterance and the later utterance. The timing for starting voice recognition will be described with reference to FIG. 3.

図３（ａ）は、発話間隔ΔＴが第１所定時間ｘ１以下である場合の音声認識を開始するタイミングを表す図であり、図３（ｂ）は、発話間隔ΔＴが第２所定時間ｘ２以上である場合の音声認識を開始するタイミングを表す図であり、図３（ｃ）は、発話間隔ΔＴが第１所定時間ｘ１と第２所定時間ｘ２との間である場合の音声認識を開始するタイミングを表す図である。 Figure 3(a) is a diagram showing the timing for starting voice recognition when the speech interval ΔT is equal to or less than the first predetermined time x1, Figure 3(b) is a diagram showing the timing for starting voice recognition when the speech interval ΔT is equal to or more than the second predetermined time x2, and Figure 3(c) is a diagram showing the timing for starting voice recognition when the speech interval ΔT is between the first predetermined time x1 and the second predetermined time x2.

図３（ａ）～（ｃ）においては、ユーザＨが「おはようございます。」と発話したものが第１発話Ｖ１とされ、その第１発話の直後にユーザＨが「今日は晴れですね。」と発話したものが第２発話Ｖ２とされる。第１発話Ｖ１の終了時刻ＥｄＴと第２発話Ｖ２の開始時刻ＳｔＴとの時間差が第１発話Ｖ１と第２発話Ｖ２との発話間隔ΔＴとされ、その発話間隔ΔＴの大小に応じて遡及時間Ｔが算出される。 In Figures 3(a) to (c), the first utterance V1 is "Good morning" uttered by user H, and the second utterance V2 is "It's a sunny day today" uttered by user H immediately after the first utterance. The time difference between the end time EdT of the first utterance V1 and the start time StT of the second utterance V2 is the speech interval ΔT between the first utterance V1 and the second utterance V2, and the retroactive time T is calculated depending on the length of the speech interval ΔT.

ここで携帯端末１に入力される音声Ｖは、ユーザＨの発話の有無に依らず図４で後述の音声バッファ１１ｂに記憶される。その音声バッファ１１ｂの音声Ｖにおける、第２発話Ｖ２の開始時刻ＳｔＴから遡及時間Ｔを遡った時刻である認識開始時刻ＳｔＲより、第２発話Ｖ２の音声認識が開始される。 Here, the voice V input to the mobile terminal 1 is stored in the voice buffer 11b described later in FIG. 4, regardless of whether the user H has spoken. Voice recognition of the second utterance V2 starts at the recognition start time StR, which is the time going back the retroactive time T from the start time StT of the second utterance V2 in the voice V in the voice buffer 11b.

まず、図３（ａ）を参照して、第１発話Ｖ１の直後に第２発話Ｖ２が開始された場合の遡及時間Ｔを説明する。図３（ａ）は、第１発話Ｖ１の直後に第２発話Ｖ２が開始された場合、即ち上記の発話間隔ΔＴが第１所定時間ｘ１以下の場合を表している。第１所定時間ｘ１としては「０．１秒間」が例示される。 First, referring to FIG. 3(a), the retroactive time T when the second utterance V2 starts immediately after the first utterance V1 will be described. FIG. 3(a) shows the case where the second utterance V2 starts immediately after the first utterance V1, i.e., the above-mentioned utterance interval ΔT is equal to or shorter than the first predetermined time x1. An example of the first predetermined time x1 is "0.1 seconds."

このように、発話間隔ΔＴが第１所定時間ｘ１以下で短く、第１発話Ｖ１と第２発話Ｖ２とが連続している場合には、遡及時間Ｔとして第１所定時間ｘ１以上の第１遡及時間Ｔｘ１が設定される。第１遡及時間Ｔｘ１としては「０．５秒間」が例示される。これにより、第２発話の認識開始時刻ＳｔＲを第２発話の開始時刻ＳｔＴよりも以前のタイミングとできるので、第２発話の開始から確実に音声認識を開始できる。 In this way, when the speech interval ΔT is short and equal to or less than the first predetermined time x1, and the first utterance V1 and the second utterance V2 are consecutive, the first retroactive time Tx1, which is equal to or greater than the first predetermined time x1, is set as the retroactive time T. An example of the first retroactive time Tx1 is "0.5 seconds." This allows the recognition start time StR of the second utterance to be set to a timing earlier than the start time StT of the second utterance, so that speech recognition can be reliably started from the start of the second utterance.

In this case, speech near the end time EdT of the first utterance (for example, the final "su" of "Ohayou gozaimasu", i.e., "Good morning") may fall within the voice that is recognized from the recognition start time StR of the second utterance. In such a case, the recognition results that precede the start time StT of the second utterance may be excluded or removed from the results of the voice recognition started at the recognition start time StR, so that only the recognition results from the start time StT of the second utterance onward are output.
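The optional removal of results that precede the start time StT could look like the sketch below. The description does not say what form the recognition results take, so the `RecognizedToken` type, with a per-token start time in seconds, is purely an assumption for illustration, as is the function name.

```python
from typing import List, NamedTuple


class RecognizedToken(NamedTuple):
    """Hypothetical recognizer output: a piece of text and when it began."""
    text: str
    start_time: float  # seconds from the start of the voice buffer


def drop_tokens_before(tokens: List[RecognizedToken],
                       utterance_start: float) -> List[RecognizedToken]:
    """Keep only tokens that begin at or after the utterance start time StT."""
    return [token for token in tokens if token.start_time >= utterance_start]
```

Note that this filtering is a trade-off: it removes the tail of the first utterance, but it would also remove any quiet beginning of the second utterance that was spoken before the judged start time StT, which is presumably why the description presents it only as an option.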

Next, a case where the speech interval ΔT between the first utterance V1 and the second utterance V2 is long will be described with reference to FIG. 3(b). FIG. 3(b) shows a case where the speech interval ΔT is equal to or greater than the second predetermined time x2. An example of the second predetermined time x2 is "3 seconds". When the speech interval ΔT between the first utterance V1 and the second utterance V2 is long in this way, i.e., equal to or greater than the second predetermined time x2, a second retroactive time Tx2 that is equal to or less than the second predetermined time x2 is set as the retroactive time T. An example of the second retroactive time Tx2 is "2 seconds".

これにより、第２発話Ｖ２の音声認識が開始されてから実際に第２発話Ｖ２が開始されるまでのタイムラグが拡大するのを抑制できる。これにより、第２発話Ｖ２が開始されるまでの周囲の環境音が誤って音声認識されるのを抑制できると共に、第２発話Ｖ２を音声認識するための携帯端末１（具体的に図４で後述のＣＰＵ１０）の処理時間が低減されるので、携帯端末１の処理負荷を低減できる。 This makes it possible to prevent the time lag from increasing between when speech recognition of the second utterance V2 starts and when the second utterance V2 actually starts. This makes it possible to prevent erroneous speech recognition of ambient sounds until the second utterance V2 starts, and also reduces the processing time of the mobile terminal 1 (specifically, the CPU 10 described later in FIG. 4) for speech recognition of the second utterance V2, thereby reducing the processing load of the mobile terminal 1.

次に図３（ｃ）を参照して、第１発話Ｖ１と第２発話Ｖ２との発話間隔ΔＴが第１所定時間ｘ１と第２所定時間ｘ２との間である場合を説明する。かかる場合には、遡及時間Ｔとして発話間隔ΔＴが設定される。これにより、第２発話Ｖ２の認識開始時刻ＳｔＲが第１発話Ｖ１の終了時刻ＥｄＴとなるので、遡及時間Ｔを容易に取得できると共に、第２発話Ｖ２の開始から確実に音声認識を開始できる。 Next, referring to FIG. 3(c), a case will be described in which the speech interval ΔT between the first utterance V1 and the second utterance V2 is between the first predetermined time x1 and the second predetermined time x2. In such a case, the speech interval ΔT is set as the retroactive time T. As a result, the recognition start time StR of the second utterance V2 becomes the end time EdT of the first utterance V1, so that the retroactive time T can be easily obtained and speech recognition can be reliably started from the start of the second utterance V2.

ここで、第１遡及時間Ｔｘ１及び第２遡及時間Ｔｘ２は、第１所定時間ｘ１以上かつ第２所定時間ｘ２以下の時間に設定される。これにより、第２発話の認識開始時刻ＳｔＲが第１発話Ｖ１の開始時刻ＳｔＴまで遡ることを抑制できるので、第２発話Ｖ２と共に第１発話Ｖ１の全体が音声認識されるのを抑制できる。 Here, the first retroactive time Tx1 and the second retroactive time Tx2 are set to a time equal to or greater than the first predetermined time x1 and equal to or less than the second predetermined time x2. This prevents the recognition start time StR of the second utterance from going back to the start time StT of the first utterance V1, thereby preventing the entire first utterance V1 from being voice recognized together with the second utterance V2.
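The three cases of FIG. 3 can be summarized in a short sketch. The constants are the example values from the description (x1 = 0.1 s, x2 = 3 s, Tx1 = 0.5 s, Tx2 = 2 s); the function names are hypothetical and times are assumed to be in seconds.

```python
X1 = 0.1   # first predetermined time x1 (example value, seconds)
X2 = 3.0   # second predetermined time x2 (example value, seconds)
TX1 = 0.5  # first retroactive time Tx1 (example value, seconds)
TX2 = 2.0  # second retroactive time Tx2 (example value, seconds)


def retroactive_time(interval: float) -> float:
    """Return the retroactive time T for a given speech interval ΔT."""
    if interval <= X1:
        return TX1       # FIG. 3(a): consecutive utterances
    if interval >= X2:
        return TX2       # FIG. 3(b): long gap between utterances
    return interval      # FIG. 3(c): go back exactly to the end of V1


def recognition_start_time(first_end: float, second_start: float) -> float:
    """Return StR = StT of the second utterance minus T."""
    interval = second_start - first_end  # speech interval ΔT
    return second_start - retroactive_time(interval)
```

For example, with the first utterance ending at 1.0 s and the second starting at 2.0 s, ΔT is 1.0 s, T equals ΔT, and recognition starts at 1.0 s, which is the end time of the first utterance.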

以上の通り、第２発話Ｖ２の開始時刻ＳｔＴから、その直前の第１発話Ｖ１と第２発話Ｖ２との発話間隔ΔＴに応じた遡及時間Ｔを遡った認識開始時刻ＳｔＲより音声認識を開始することで、第２発話Ｖ２の開始から確実に第２発話の音声認識を開始できる。これにより、第１発話と第２発話とが連続して入力された場合でも、第２発話の開始から確実に音声認識を開始できるので、第１発話と第２発話とを正確に音声認識できる。 As described above, by starting speech recognition from the recognition start time StR, which is calculated by going back from the start time StT of the second utterance V2 by a retroactive time T corresponding to the speech interval ΔT between the immediately preceding first utterance V1 and second utterance V2, speech recognition of the second utterance can be reliably started from the start of the second utterance V2. This ensures that speech recognition can be started reliably from the start of the second utterance even when the first utterance and the second utterance are input consecutively, thereby enabling accurate speech recognition of the first utterance and the second utterance.

In addition, the volume of the voice V may be low when user H begins the second utterance, so that user H is in fact already speaking before the time judged to be the start time StT of the second utterance V2. Even in such a case, starting voice recognition from the time obtained by going back from the start time StT of the second utterance V2 by the retroactive time T reduces the chance of missing the voice V that user H actually uttered before the judged start time StT.

The first predetermined time x1 is not limited to 0.1 seconds and may be longer or shorter than 0.1 seconds as long as it is equal to or less than the second predetermined time x2. The second predetermined time x2 is not limited to 3 seconds and may be longer or shorter than 3 seconds as long as it is equal to or greater than the first predetermined time x1. Likewise, the first retroactive time Tx1 is not limited to 0.5 seconds and may be longer or shorter than 0.5 seconds as long as it is equal to or greater than the first predetermined time x1 and equal to or less than the second predetermined time x2. Similarly, the second retroactive time Tx2 is not limited to 2 seconds and may be longer or shorter than 2 seconds as long as it is equal to or greater than the first predetermined time x1 and equal to or less than the second predetermined time x2. Furthermore, although the first retroactive time Tx1 is shorter than the second retroactive time Tx2 in this embodiment, this is not a limitation. The first retroactive time Tx1 and the second retroactive time Tx2 may be the same, or the first retroactive time Tx1 may be longer than the second retroactive time Tx2.
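The constraints on the four values stated above (x1 ≤ x2, and both Tx1 and Tx2 within the range from x1 to x2) can be checked mechanically. The following is only a sketch; the class name `RetroactiveSettings` and its field names are hypothetical, and the defaults are the example values from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetroactiveSettings:
    """Example parameter set; all values are in seconds."""
    x1: float = 0.1   # first predetermined time
    x2: float = 3.0   # second predetermined time
    tx1: float = 0.5  # first retroactive time
    tx2: float = 2.0  # second retroactive time

    def __post_init__(self) -> None:
        if not self.x1 <= self.x2:
            raise ValueError("x1 must be equal to or less than x2")
        for name in ("tx1", "tx2"):
            value = getattr(self, name)
            # Both retroactive times must lie between x1 and x2 inclusive.
            if not self.x1 <= value <= self.x2:
                raise ValueError(f"{name} must be between x1 and x2")
```

No ordering between `tx1` and `tx2` is enforced, since the description allows Tx1 to be shorter than, equal to, or longer than Tx2.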

次に、図４を参照して、携帯端末１の電気的構成を説明する。図４は、携帯端末１の電気的構成を示すブロック図である。図４に示す通り、携帯端末１は、ＣＰＵ１０と、フラッシュＲＯＭ１１と、ＲＡＭ１２とを有し、これらはバスライン１３を介して入出力ポート１４にそれぞれ接続されている。入出力ポート１４には更に、音声Ｖを入力するマイク１５と、音声認識の認識結果等が表示されるＬＣＤ１６と、ユーザＨからの指示が入力されるタッチパネル１７とが接続される。 Next, the electrical configuration of the mobile terminal 1 will be described with reference to FIG. 4. FIG. 4 is a block diagram showing the electrical configuration of the mobile terminal 1. As shown in FIG. 4, the mobile terminal 1 has a CPU 10, a flash ROM 11, and a RAM 12, which are each connected to an input/output port 14 via a bus line 13. The input/output port 14 is further connected to a microphone 15 for inputting voice V, an LCD 16 for displaying the results of voice recognition, and a touch panel 17 for inputting instructions from the user H.

ＣＰＵ１０は、バスライン１３により接続された各部を制御する演算装置である。フラッシュＲＯＭ１１は、書き換え可能な不揮発性のメモリであり、音声認識プログラム１１ａと、音声Ｖが記憶される音声バッファ１１ｂとが保存される。ＣＰＵ１０によって音声認識プログラム１１ａが実行されると、図５の音声処理が実行される。ＲＡＭ１２は、ＣＰＵ１０の音声認識プログラム１１ａの実行時に各種のワークデータやフラグ等を書き換え可能に記憶するためのメモリであり、上記した遡及時間Ｔが記憶される遡及時間メモリ１２ａが設けられる。 The CPU 10 is a calculation device that controls each part connected by the bus line 13. The flash ROM 11 is a rewritable non-volatile memory that stores a voice recognition program 11a and a voice buffer 11b in which voice V is stored. When the voice recognition program 11a is executed by the CPU 10, the voice processing of FIG. 5 is executed. The RAM 12 is a memory for rewritably storing various work data, flags, etc. when the CPU 10 executes the voice recognition program 11a, and is provided with a retroactive time memory 12a in which the retroactive time T described above is stored.

次に、図５，６を参照して、携帯端末１のＣＰＵ１０で実行される処理を説明する。図５（ａ）は、音声処理のフローチャートである。音声処理は、タッチパネル１７等を介してユーザＨから音声認識プログラム１１ａを実行する指示が入力された場合に実行される処理である。 Next, the processing executed by the CPU 10 of the mobile terminal 1 will be described with reference to Figures 5 and 6. Figure 5(a) is a flowchart of the voice processing. The voice processing is a process that is executed when an instruction to execute the voice recognition program 11a is input from the user H via the touch panel 17 or the like.

The voice processing first clears the contents of the voice buffer 11b (S1) and sets the voice acquisition time, the start time StT, and the end time EdT described above to 0 (S2). The voice acquisition time is a time value whose unit is the sampling period of voice V (for example, 1/44100 second). It is used as time information for reading the voice V stored in the voice buffer 11b at sampling-period intervals, in order from 0 seconds, that is, from the time at which storage of voice V in the voice buffer 11b began.

After S2, the current volume and the previous volume are each set to -120 dB, the minimum volume value (S3). The current volume holds the volume used to determine the start time StT and the end time EdT of an utterance, and the previous volume holds the value that the current volume had immediately before.

After S3, the recording process is started (S4). The recording process is executed every sampling period of voice V and stores the voice V input from the microphone 15 in the voice buffer 11b each period. S4 starts the periodic execution of the recording process. The recording process is described here with reference to FIG. 5(b).

FIG. 5(b) is a flowchart of the recording process. As described above, the recording process is an interrupt process executed every sampling period of voice V. It appends the voice V acquired from the microphone 15 to the voice buffer 11b (S20) and then ends. As a result, the voice buffer 11b holds the voice V acquired at each sampling period.

Returning to FIG. 5(a): after S4, the volume of voice V at the voice acquisition time is read from the voice buffer 11b and set as the current volume (S5). After S5, the voice recognition process (S6) is executed. The voice recognition process is described here with reference to FIG. 6.
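As a minimal sketch of S20 and S5, the buffer can be modelled as a list indexed by the voice acquisition time. The names `VoiceBuffer` and `volume_db`, the representation of samples as floats in the range -1.0 to 1.0, and the decibel formula are all assumptions made for illustration; the description only states that the minimum volume is -120 dB and does not say how the volume is computed.

```python
import math

MIN_VOLUME_DB = -120.0  # minimum volume named in S3


class VoiceBuffer:
    """Hypothetical stand-in for the voice buffer 11b."""

    def __init__(self):
        self.samples = []  # one entry per sampling period

    def append(self, sample):
        # S20: called once per sampling period by the recording process.
        self.samples.append(sample)

    def volume_at(self, acquisition_time):
        # S5: volume of the sample at the given voice acquisition time,
        # counted in sampling periods from the start of recording.
        return volume_db(self.samples[acquisition_time])


def volume_db(sample):
    # Assumed conversion: full-scale amplitude 1.0 maps to 0 dB, and
    # anything quieter than the floor is reported as MIN_VOLUME_DB.
    magnitude = abs(sample)
    if magnitude == 0.0:
        return MIN_VOLUME_DB
    return max(MIN_VOLUME_DB, 20.0 * math.log10(magnitude))
```

Because the acquisition time is counted in sampling periods, it can be used directly as a list index in this sketch.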

FIG. 6 is a flowchart of the voice recognition process. The voice recognition process first checks whether the previous volume is smaller than the start judgment value St_A described above with reference to FIG. 2 and the current volume at the voice acquisition time is equal to or greater than the start judgment value St_A (S30). In other words, it checks whether the voice V in the voice buffer 11b has risen from below the start judgment value St_A to the start judgment value St_A or above, which marks the start time StT at which an utterance began.

If, in S30, the previous volume is smaller than the start judgment value St_A and the current volume at the voice acquisition time is equal to or greater than the start judgment value St_A (S30: Yes), the start time StT is set to the voice acquisition time (S31). After S31, the utterance interval ΔT is calculated by subtracting from the start time StT the end time EdT described above with reference to FIG. 3, which is set in the processes of S39 and S40 described later (S32). After S32, the calculated utterance interval ΔT is checked (S33).

In S33, if the utterance interval ΔT is equal to or less than the first predetermined time x1 (ΔT ≤ x1), the first retroactive time Tx1 is set in the retroactive time memory 12a (S34); if ΔT is equal to or greater than the second predetermined time x2 (ΔT ≥ x2), the second retroactive time Tx2 is set in the retroactive time memory 12a (S35); and if ΔT lies between x1 and x2 (x1 < ΔT < x2), the utterance interval ΔT itself is set in the retroactive time memory 12a (S36).
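The three-way branch of S33 to S36 can be sketched as a single function. The function name and the numeric values below are hypothetical; the values are chosen only so that Tx1 is not shorter than x1 and Tx2 is not longer than x2.

```python
def select_retroactive_time(interval, x1, x2, tx1, tx2):
    """Return the retroactive time T for an utterance interval (S33 to S36)."""
    if interval <= x1:
        return tx1       # S34: short gap, use the first retroactive time
    if interval >= x2:
        return tx2       # S35: long gap, use the second retroactive time
    return interval      # S36: otherwise go back by the gap itself


# Hypothetical values, all in seconds.
X1, X2 = 0.5, 2.0
TX1, TX2 = 0.5, 2.0
```

With these illustrative values the rule behaves like a clamp of ΔT to the range from x1 to x2; other choices of Tx1 and Tx2 would make the result step at the two boundaries.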

After S34 to S36, voice recognition is performed on the voice buffer 11b starting from the time reached by going back from the start time StT set in S31 by the retroactive time T held in the retroactive time memory 12a (that is, from the recognition start time StR) (S37). In this way, the retroactive time T corresponding to the utterance interval ΔT, as described above with reference to FIGS. 3(a) to 3(c), is set in the retroactive time memory 12a, and voice recognition starts from the time that precedes the start time StT by that retroactive time T. The result of the voice recognition in S37 may be displayed on the LCD 16 or transmitted, via a communication device (not shown), to an information processing device such as another mobile terminal 1.
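The recognition start time StR of S37 is the start time StT moved back by the retroactive time T. The sketch below assumes that StT is counted in sampling periods while T is held in seconds, and that the result is clamped so that it never falls before the first stored sample; the function names, the unit conversion, and the clamp are illustrative assumptions rather than details given in the description.

```python
def recognition_start_index(start_time, retroactive_seconds, sample_rate=44100):
    """Return StR as a sample index: StT moved back by the retroactive time T."""
    back = round(retroactive_seconds * sample_rate)
    # Assumption: never reach back before the first stored sample.
    return max(0, start_time - back)


def samples_for_recognition(buffer, start_time, retroactive_seconds, sample_rate=44100):
    """Return the part of the buffer handed to the recogniser, beginning at StR."""
    start = recognition_start_index(start_time, retroactive_seconds, sample_rate)
    return buffer[start:]
```

For example, with a start time of 44100 sampling periods and a retroactive time of 0.5 seconds at 44100 samples per second, recognition would begin at sample 22050.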

If, in S30, the previous volume is equal to or greater than the start judgment value St_A or the current volume is smaller than the start judgment value St_A (S30: No), S31 to S37 are skipped. After S30 or S37, it is checked whether the previous volume is greater than the end judgment value Ed_A and the current volume is equal to or less than the end judgment value Ed_A (S38).

If, in S38, the previous volume is greater than the end judgment value Ed_A and the current volume is equal to or less than the end judgment value Ed_A (S38: Yes), the voice acquisition time corresponds to the end time EdT at which the utterance ended, as described above with reference to FIG. 2, so the end time EdT is set to the voice acquisition time (S39). If, on the other hand, the previous volume is equal to or less than the end judgment value Ed_A or the current volume is greater than the end judgment value Ed_A (S38: No), S39 is skipped. After S38 or S39, the voice recognition process ends.
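The two threshold-crossing checks of S30 and S38 can be written as a pair of predicates. The function names and the threshold values are hypothetical; the description does not give numeric values for St_A or Ed_A in this passage.

```python
def is_utterance_start(previous_db, current_db, st_a):
    # S30: the volume rises from below St_A to St_A or above.
    return previous_db < st_a and current_db >= st_a


def is_utterance_end(previous_db, current_db, ed_a):
    # S38: the volume falls from above Ed_A to Ed_A or below.
    return previous_db > ed_a and current_db <= ed_a


# Hypothetical threshold values in dB.
ST_A = -30.0
ED_A = -40.0
```

Each predicate fires only on the sample where the crossing happens, so a sustained loud or quiet stretch produces a single start or end event rather than one per sample.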

Returning to FIG. 5(a): after the voice recognition process of S6, the sampling period is added to the voice acquisition time, advancing it to the next timing at which the volume is read from the voice buffer 11b (S7). After S7, it is checked whether an instruction to end the voice processing has been received from the user H via the touch panel 17 (S8). If no such instruction has been received (S8: No), the processing from S5 onward is repeated; if it has been received (S8: Yes), the voice processing ends.
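The overall loop of FIG. 5(a) and FIG. 6 can be sketched over a pre-recorded sequence of volumes. This is an offline approximation under stated assumptions: the function name and parameters are hypothetical, the end of the input list stands in for the user's end instruction (S8), the previous volume is assumed to be refreshed once per iteration, the recognition start is clamped at sample 0, and the recogniser itself is not modelled, so the function only reports where recognition would begin.

```python
MIN_VOLUME_DB = -120.0  # minimum volume used in S3


def run_voice_processing(volumes_db, sample_rate, st_a, ed_a, select_retro):
    """Walk a recorded volume sequence and return (StT, StR) pairs.

    volumes_db:   volume per sampling period, standing in for the S5 reads
    select_retro: callable mapping an utterance interval in seconds to a
                  retroactive time in seconds (S33 to S36)
    """
    start_time = 0               # StT, in sampling periods (S2)
    end_time = 0                 # EdT, in sampling periods (S2)
    previous_db = MIN_VOLUME_DB  # S3
    detected = []

    # The loop index plays the role of the voice acquisition time (S7).
    for acquisition_time, current_db in enumerate(volumes_db):
        if previous_db < st_a and current_db >= st_a:           # S30
            start_time = acquisition_time                       # S31
            interval = (start_time - end_time) / sample_rate    # S32
            retro = select_retro(interval)                      # S33 to S36
            back = round(retro * sample_rate)
            recognition_start = max(0, start_time - back)       # StR
            detected.append((start_time, recognition_start))    # S37 would run here
        if previous_db > ed_a and current_db <= ed_a:           # S38
            end_time = acquisition_time                         # S39
        # Assumption: the previous volume is refreshed once per iteration.
        previous_db = current_db
    return detected
```

Because the end time EdT is initialised to 0 in S2, the interval computed for the first utterance is simply the time elapsed since recording began.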

The present invention has been described above based on an embodiment, but it is in no way limited to that embodiment, and it can readily be appreciated that various improvements and modifications are possible without departing from the spirit of the invention.

In the above embodiment, the retroactive time T is set to the first retroactive time Tx1 when the utterance interval ΔT is equal to or less than the first predetermined time x1, to the second retroactive time Tx2 when ΔT is equal to or greater than the second predetermined time x2, and to the utterance interval ΔT itself when ΔT lies between x1 and x2, but the invention is not limited to this. Regardless of the length of the utterance interval ΔT, the retroactive time T may be set to the utterance interval ΔT itself, to the utterance interval ΔT multiplied by a predetermined coefficient (for example, 0.8), or to the utterance interval ΔT plus a predetermined time (for example, 0.5 seconds). Alternatively, regardless of the utterance interval ΔT, the retroactive time T may be set to the first retroactive time Tx1 or to the second retroactive time Tx2.
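These alternative rules for the retroactive time T can each be sketched as a small function of the utterance interval. The function names are hypothetical; the default coefficient 0.8 and the default offset of 0.5 seconds are the example values given in the text.

```python
def retro_equal_to_interval(interval):
    # T is the utterance interval itself, whatever its length.
    return interval


def retro_scaled(interval, coefficient=0.8):
    # T is the utterance interval multiplied by a predetermined coefficient.
    return interval * coefficient


def retro_extended(interval, extra_seconds=0.5):
    # T is the utterance interval plus a predetermined time.
    return interval + extra_seconds


def retro_fixed(fixed_seconds):
    # Build a rule that ignores the interval, e.g. always Tx1 or always Tx2.
    def rule(interval):
        return fixed_seconds
    return rule
```

Each rule takes the interval in seconds and returns T in seconds, so any of them could replace the three-way branch of S33 to S36 without changing the rest of the flow.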

In the above embodiment, the start time StT and the end time EdT of an utterance are determined from the volume of voice V, but the invention is not limited to this. For example, the time at which a frequency band of the human voice (for example, 100 Hz to 1000 Hz) begins to be observed in voice V may be taken as the start time StT, and the time at which that band, having been observed in voice V, ceases to be observed may be taken as the end time EdT.
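One possible way to decide whether the 100 Hz to 1000 Hz band is "observed" in a short frame of samples is to compare the spectral energy inside the band with the total energy. The text does not specify a detection method, so the plain discrete Fourier transform, the function names, and the 0.5 energy-ratio threshold below are all assumptions made for illustration.

```python
import cmath
import math


def band_energy_ratio(frame, sample_rate, low_hz=100.0, high_hz=1000.0):
    """Fraction of the frame's non-DC spectral energy that lies in the band."""
    n = len(frame)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin, stop at Nyquist
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(frame)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += power
    return in_band / total if total > 0.0 else 0.0


def voice_band_observed(frame, sample_rate, min_ratio=0.5):
    # Assumed criterion: the band counts as observed when it carries at
    # least min_ratio of the frame's energy.
    return band_energy_ratio(frame, sample_rate) >= min_ratio
```

Under this criterion, the start time StT would be the first frame for which `voice_band_observed` turns true, and the end time EdT the first later frame for which it turns false again.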

In the above embodiment, voice V is the sound input from the microphone 15, but the invention is not limited to this. For example, voice V may be voice data stored in advance in the flash ROM 11, or voice data transmitted from another mobile terminal 1 or the like via a communication device (not shown).

In the above embodiment, the unit of the voice acquisition time is the sampling period, and the volume is read from the voice buffer 11b at sampling-period intervals, but the invention is not limited to this. For example, the unit of the voice acquisition time may be one second, with the volume read from the voice buffer 11b at one-second intervals.

In the above embodiment, the mobile terminal 1 incorporating the voice recognition program 11a is given as an example, but the invention is not limited to this; the voice recognition program 11a may instead be executed by another information processing device (computer) such as a personal computer or a tablet terminal. The invention may also be applied to a dedicated device in which the voice recognition program 11a is stored in a ROM, an IC chip, or the like and which executes only the voice recognition program 11a.

1 Mobile terminal (computer)
11 Flash ROM (storage unit)
11b Voice buffer (voice storage means)
11a Voice recognition program
V Voice
V1 First utterance
V2 Second utterance
S20 Voice storage step
StT Start time
EdT End time
ΔT Utterance interval
x1 First predetermined time
x2 Second predetermined time
T Retroactive time
Tx1 First retroactive time
Tx2 Second retroactive time
S31 Start time acquisition step, start time acquisition means
S39 End time acquisition step, end time acquisition means
S32 Interval acquisition step, interval acquisition means
S34 to S36 Retroactive time acquisition step, retroactive time acquisition means
S37 Voice recognition step, voice recognition means

Claims

1. A voice recognition program for causing a computer having a storage unit to execute a voice recognition process, the program causing the storage unit to function as a voice storage means for storing voice and causing the computer to execute:
a voice storage step of storing input voice in the voice storage means;
a start time acquisition step of acquiring a start time of an utterance in the voice stored in the voice storage means;
an end time acquisition step of acquiring an end time of an utterance in the voice stored in the voice storage means;
an interval acquisition step of acquiring an utterance interval, which is the time difference between the end time of a first utterance acquired in the end time acquisition step and the start time of a second utterance that is input after the first utterance and acquired in the start time acquisition step;
a retroactive time acquisition step of acquiring, based on the utterance interval acquired in the interval acquisition step, a retroactive time that is a length of time to go back from the start time of the second utterance acquired in the start time acquisition step; and
a voice recognition step of starting, for the voice stored in the voice storage means, voice recognition of the second utterance from a time that precedes the start time of the second utterance acquired in the start time acquisition step by the retroactive time acquired in the retroactive time acquisition step.

2. The voice recognition program according to claim 1, wherein, in the retroactive time acquisition step, when the utterance interval acquired in the interval acquisition step is equal to or shorter than a first predetermined time, a first retroactive time that is equal to or longer than the first predetermined time is acquired as the retroactive time.

3. The voice recognition program according to claim 2, wherein, in the retroactive time acquisition step, when the utterance interval acquired in the interval acquisition step is equal to or longer than a second predetermined time that is longer than the first predetermined time, a second retroactive time that is equal to or shorter than the second predetermined time is acquired as the retroactive time.

4. The voice recognition program according to claim 3, wherein the first retroactive time is equal to or longer than the first predetermined time and equal to or shorter than the second predetermined time.

5. The voice recognition program according to claim 3 or 4, wherein the second retroactive time is equal to or longer than the first predetermined time and equal to or shorter than the second predetermined time.

6. The voice recognition program according to claim 1, wherein, in the retroactive time acquisition step, when the utterance interval acquired in the interval acquisition step is between a first predetermined time and a second predetermined time, the utterance interval is acquired as the retroactive time.

7. A voice recognition device comprising:
a voice input means for inputting voice;
a voice storage means for storing the voice input by the voice input means;
a start time acquisition means for acquiring a start time of an utterance in the voice stored in the voice storage means;
an end time acquisition means for acquiring an end time of an utterance in the voice stored in the voice storage means;
an interval acquisition means for acquiring an utterance interval, which is the time difference between the end time of a first utterance acquired by the end time acquisition means and the start time of a second utterance that is input after the first utterance and acquired by the start time acquisition means;
a retroactive time acquisition means for acquiring, based on the utterance interval acquired by the interval acquisition means, a retroactive time that is a length of time to go back from the start time of the second utterance acquired by the start time acquisition means; and
a voice recognition means for starting, for the voice stored in the voice storage means, voice recognition of the second utterance from a time that precedes the start time of the second utterance acquired by the start time acquisition means by the retroactive time acquired by the retroactive time acquisition means.