JP6513869B1

JP6513869B1 - Dialogue summary generation apparatus, dialogue summary generation method and program

Info

Publication number: JP6513869B1
Application number: JP2018205371A
Authority: JP
Inventors: 一仁横内; 鈴木 茂
Original assignee: Evoice
Current assignee: Evoice
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-05-15
Anticipated expiration: 2038-10-31
Also published as: JP2020071676A

Abstract

[Problem] To generate, from dialogue speech, a highly accurate summary that is sufficiently shortened and that sufficiently reflects the emotions in the speakers' utterances during the dialogue. [Solution] A dialogue speech summary generation apparatus includes: a speaker identification unit that identifies the speakers of a dialogue from dialogue speech data; a speech separation unit that separates the dialogue speech data into utterance units for each speaker identified by the speaker identification unit; a speech recognition unit that performs speech recognition on the dialogue speech data in the utterance units separated by the speech separation unit to generate dialogue speech text; a summary generation unit that summarizes the dialogue speech text generated by the speech recognition unit to generate summary text; and an emotion analysis unit that analyzes the dialogue speech data in utterance units to derive an emotional expression for each speaker and either adds the derived emotional expression to the summary text, replaces part of the summary text with the emotional expression, or outputs the emotional expression in association with the summary text. [Selected figure] FIG. 3

Description

The present invention relates to a dialogue summary generation apparatus, a dialogue summary generation method, and a program. More specifically, the present invention relates to a technique for creating a summary from recorded dialogue speech and outputting the generated summary, usable in, for example, a Customer Relationship Management (CRM) system that records, stores, and manages dialogues conducted by telephone or face to face between customers and the staff who serve them.

Various techniques have been proposed for recording and managing, on the business operator's side, dialogue speech exchanged between a customer and the business operator. In recent years, for purposes such as the business operator's regulatory compliance, the handling of customer complaints, and the evaluation and training of the business operator's agents, there has been a demand to record and store dialogue content in every setting, including not only telephone calls but also face-to-face dialogues.

As an example, consider a call recording system that digitizes, records, and searches the calls of operators in a call center, the department that handles telephone calls from customers. In general, a private branch exchange (PBX) on which outgoing and incoming calls to and from the Public Switched Telephone Network (PSTN) converge is installed on the premises of a call center or the like run by the business operator, and this exchange distributes voice calls to fixed-line telephones on the call center premises.

Accordingly, providing a call recording server that branches off this exchange makes it possible to record and store calls in audio data files. On the operator side, a terminal device such as a PC (Personal Computer) may be provided together with an extension telephone for handling calls. This operator terminal device may have, for example, a function for searching customer information using the customer name given by the caller as a key, and a function for displaying that customer's past call history.

For voice calls between customers and operators recorded and stored in audio data files in this way, there is a demand to record and keep an overview of each call as a response history, and to make this response history available for viewing and for output as a report after the call ends. To allow the content of this response history to be checked and reviewed quickly, it is desirable to generate a text summary from the recorded voice call.
Among techniques for creating summary text from such audio data is one that converts the speech in an audio data file into character codes by speech recognition processing and generates summary text from the resulting speech text data. Generating a text summary makes the content of the response history easier to grasp, allows it to be surveyed at a glance, and enables flexible integration with computers, for example by running searches that use words in the text as keywords.

For example, Patent Document 1 discloses a technique in which speech recorded on a recording medium by a video tape recorder (VTR) is speech-recognized and converted into a character code string. The importance of each sentence constituent in the recognized character code string, typically an importance assigned by part of speech (noun, verb, particle, adjective, and so on) and by phrase type (subject, object, predicate, and so on), is determined by referring to a pre-registered importance table, and a summary sentence is automatically generated by combining the constituents judged to be of high importance.

Patent Document 2 discloses a technique that extracts important segments from speech, detects topic boundaries using the occurrence distribution of the extracted important segments, semantically classifies the important segments contained in each topic segment, and generates from the speech of the important segments a text summary divided by topic.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H8-212228
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2000-284793

However, it is difficult to apply the techniques disclosed in the above patent documents directly to, for example, telephone response work in a call center. A voice call between a customer and an operator usually passes through many stages, such as obtaining and confirming customer information, obtaining and confirming the content of the inquiry, obtaining and confirming the answer to the inquiry, and presenting and confirming the customer's level of understanding and any disclaimers, and is therefore unavoidably redundant. Moreover, the same utterance content is often repeated, so the dialogue frequently runs long. In addition, the call recording data recorded and stored all day for a large number of operators is enormous, which makes quick checking and review of the response history difficult.

Consequently, even if a known summarization technique is applied to the voice call text obtained by speech-recognizing a voice call as it is, the generated summary is itself unavoidably redundant and long, which makes it of little practical use.

Meanwhile, a speaker's emotions during a dialogue are not uniform. For example, when the utterance "yes" is recognized during a dialogue, it may be a "yes" spoken in willing agreement or a "yes" given reluctantly under pressure to agree. The same word may rest on different emotions.
With conventional techniques, however, the emotion in a speaker's utterances during a dialogue could not be reflected in the summary.

The present invention has been made in view of the above problems, and its object is to provide a dialogue summary generation apparatus, a dialogue summary generation method, and a program capable of generating, from dialogue speech, a highly accurate summary that is sufficiently shortened and that sufficiently reflects the emotions in the speakers' utterances during the dialogue.

To solve the above problems, according to one aspect of the present invention, there is provided a dialogue summary generation apparatus including: a speaker identification unit that identifies the speakers of a dialogue from dialogue speech data; a speech separation unit that separates the dialogue speech data into utterance units for each speaker identified by the speaker identification unit; a speech recognition unit that performs speech recognition on the dialogue speech data in the utterance units separated by the speech separation unit to generate dialogue speech text; a summary generation unit that summarizes the dialogue speech text generated by the speech recognition unit to generate summary text; and an emotion analysis unit that analyzes the dialogue speech data in the utterance units to derive an emotional expression for each speaker and adds the derived emotional expression to the summary text, replaces part of the summary text with the emotional expression, or outputs the emotional expression in association with the summary text.
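The three output options of the emotion analysis unit (add, replace, or associate) can be illustrated with a minimal Python sketch. The class `EmotionResult`, the function `apply_emotion`, the mode names, and the bracketed output format are all hypothetical and chosen only for illustration; the source does not specify how the emotional expression is derived or formatted.

```python
from dataclasses import dataclass


@dataclass
class EmotionResult:
    speaker: str      # hypothetical label, e.g. "customer" or "operator"
    phrase: str       # wording in the summary that the emotion applies to
    expression: str   # emotional expression assumed to be derived from the audio


def apply_emotion(summary: str, result: EmotionResult, mode: str):
    """Combine summary text with an emotional expression in one of three ways."""
    if mode == "append":
        # Add the expression after the summary text.
        return f"{summary} [{result.speaker}: {result.expression}]"
    if mode == "replace":
        # Replace part of the summary text with the expression.
        return summary.replace(result.phrase, result.expression)
    if mode == "associate":
        # Leave the summary unchanged and return the expression alongside it.
        return {"summary": summary, "emotions": {result.speaker: result.expression}}
    raise ValueError(f"unknown mode: {mode}")


result = EmotionResult("customer", "said yes", "reluctantly agreed")
print(apply_emotion("The customer said yes to the renewal.", result, "replace"))
```

In this sketch the "associate" mode keeps the summary text untouched, which may suit a display that shows emotion results next to the summary rather than inside it.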

The emotion analysis unit may further analyze the dialogue speech data in the utterance units to derive, for each speaker, the transition of emotion over time within one dialogue, and may output the emotion transition derived for each speaker in association with the summary text.
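As a rough illustration of a per-speaker emotion transition, the following Python sketch orders utterance-level emotion labels by time and collapses consecutive repeats for each speaker. The input tuple layout, the labels, and the function name `emotion_transitions` are assumptions made for this example; the source does not define a data format.

```python
from itertools import groupby


def emotion_transitions(utterances):
    """Return, per speaker, the time-ordered sequence of emotion labels.

    utterances: iterable of (start_seconds, speaker, emotion_label) tuples,
    one per utterance unit (a hypothetical format).
    """
    by_speaker = {}
    for _start, speaker, emotion in sorted(utterances):
        by_speaker.setdefault(speaker, []).append(emotion)
    # Collapse runs of the same label so only the transitions remain.
    return {
        speaker: [label for label, _run in groupby(labels)]
        for speaker, labels in by_speaker.items()
    }


call = [
    (0.0, "customer", "neutral"),
    (2.5, "operator", "calm"),
    (5.0, "customer", "angry"),
    (7.0, "customer", "angry"),
    (9.0, "operator", "calm"),
    (12.0, "customer", "satisfied"),
]
print(emotion_transitions(call))
```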

The dialogue summary generation apparatus may further include a second emotion analysis unit that extracts, from the dialogue speech text generated by the speech recognition unit, emotion words indicating each speaker's emotion, converts the extracted emotion words into corresponding emotional expressions, and replaces at least part of the summary text with the converted emotional expressions.
The dialogue summary generation apparatus may further include a redundancy elimination unit that determines whether identical or similar text appears multiple times in the dialogue speech text of one dialogue and, if so, deletes the text that appears earlier in the time series.
The redundancy elimination unit may further refer to an important-word table in which important words are defined in advance, extract from the dialogue speech text the text defined in the important-word table, search for second text that is located immediately before the extracted text and whose reading at least partially matches that of the extracted text, and delete the text found from the dialogue speech text.
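The first redundancy rule above (when identical or similar text appears more than once, delete the earlier occurrence) can be sketched in Python as follows. The similarity measure (`difflib.SequenceMatcher`), the 0.8 threshold, and the function name are assumptions for illustration; the source does not say how similarity is judged. The important-word rule based on readings is not covered here.

```python
from difflib import SequenceMatcher


def remove_earlier_duplicates(units, threshold=0.8):
    """Drop a text unit if the same or similar text appears again later.

    units: list of text units in time order.
    threshold: similarity ratio at or above which two units count as similar
    (an assumed value).
    """
    kept = []
    for i, text in enumerate(units):
        repeated_later = any(
            SequenceMatcher(None, text, later).ratio() >= threshold
            for later in units[i + 1:]
        )
        if not repeated_later:
            kept.append(text)
    return kept


dialogue = ["my number is 1235", "please hold", "my number is 1234"]
print(remove_earlier_duplicates(dialogue))
```

Keeping the later occurrence reflects the idea that a repeated statement in a call is often the corrected or confirmed version.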

The dialogue summary generation apparatus may further include a text correction unit that analyzes the dialogue speech text generated by the speech recognition unit to extract numerals, assigns different units and weights according to the type of each extracted numeral, and supplies the result to the summary generation unit.
The dialogue summary generation apparatus may further include a speech acquisition unit that records call speech or face-to-face dialogue speech to acquire the dialogue speech data.

According to another aspect of the present invention, there is provided a dialogue summary generation method including the steps of: identifying the speakers of a dialogue from dialogue speech data; separating the dialogue speech data into utterance units for each identified speaker; performing speech recognition on the dialogue speech data in the separated utterance units to generate dialogue speech text; summarizing the generated dialogue speech text to generate summary text; and analyzing the dialogue speech data in the utterance units to derive an emotional expression for each speaker and adding the derived emotional expression to the summary text, replacing part of the summary text with the emotional expression, or outputting the emotional expression in association with the summary text.

According to still another aspect of the present invention, there is provided a dialogue summary generation program for causing a computer to execute dialogue summary generation processing, the program causing the computer to execute processing including: speaker identification processing that identifies the speakers of a dialogue from dialogue speech data; speech separation processing that separates the dialogue speech data into utterance units for each identified speaker; speech recognition processing that performs speech recognition on the dialogue speech data in the separated utterance units to generate dialogue speech text; summary generation processing that summarizes the generated dialogue speech text to generate summary text; and emotion analysis processing that analyzes the dialogue speech data in the utterance units to derive an emotional expression for each speaker and adds the derived emotional expression to the summary text, replaces part of the summary text with the emotional expression, or outputs the emotional expression in association with the summary text.

According to the dialogue summary generation apparatus, dialogue summary generation method, and program of the present invention, a highly accurate summary that is sufficiently shortened and that sufficiently reflects the emotions in the speakers' utterances during the dialogue can be generated from dialogue speech. This helps make summaries of dialogue speech more useful.

FIG. 1 is a diagram showing an example of the network configuration of a speech processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the functional configuration of the speech recognition server included in the speech processing system of FIG. 1.
FIG. 3 is a block diagram showing an example of the functional configuration of the summary generation server included in the speech processing system of FIG. 1.
FIG. 4 is a flowchart showing an example of the flow of the speech recognition processing executed by the speech recognition server of FIG. 2.
FIG. 5 is a diagram illustrating the speaker identification (S2) and the separation of speech into utterance units (S3) of FIG. 4 as applied to audio data.
FIG. 6 is a flowchart showing an example of the flow of the summary generation processing executed by the summary generation server of FIG. 3.
FIG. 7 is a flowchart showing an example of the detailed flow of the conversion to natural utterances and separation into summary units (S5) of FIG. 4.
FIG. 8 is a flowchart showing an example of the detailed flow of the separation into summary units (S53) of FIG. 7.
FIG. 9 is a diagram showing an example of recognition result text obtained by speech recognition in utterance units through execution of S4 of FIG. 4.
FIG. 10 is a diagram showing an example of the result of syntactic analysis of the recognition result text of FIG. 9.
FIG. 11 is a diagram showing an example of the result of morphological analysis of the recognition result text of FIG. 9.
FIG. 12 is a diagram showing an example of the recognition result text of FIG. 9 after separation into summary units through execution of S5 of FIG. 4.
FIG. 13 is a flowchart showing an example of the detailed flow of the backchannel analysis (S6) of FIG. 4.
FIG. 14 is a diagram illustrating the backchannel analysis (S6) of FIG. 4 as applied to the audio data of FIG. 5.
FIG. 15 is a diagram showing an example of a punctuation table applied to recognition result text.
FIG. 16 is a diagram showing an example of a unit weight table applied to recognition result text.
FIG. 17 is a diagram showing an example of an unnecessary-word table applied to recognition result text.
FIG. 18 is a diagram showing an example of a character replacement table applied to recognition result text.
FIG. 19 is a diagram showing an example of an important-word table applied to recognition result text.
FIG. 20 is a diagram showing an example of an affirmative-word table applied to recognition result text.
FIG. 21 is a diagram showing an example of a negative-word table applied to recognition result text.
FIG. 22 is a diagram showing an example of the output of emotion analysis results for a customer's dialogue speech.
FIG. 23 is a diagram showing an example of the output of emotion analysis results for an operator's dialogue speech.
FIG. 24 is a diagram showing an example of the output of emotion analysis results for multiple dialogues of an operator.
FIG. 25 is a diagram showing an example of an emotion-word table applied to recognition result text.
FIG. 26 is a diagram showing an example of summary text to which emotion analysis results have been added.
FIG. 27 is a diagram showing an example of applying the emotion-word table of FIG. 25 to summary text.
FIG. 28 is a diagram showing an example in which the recognition result text of a spoken dialogue is separated into summary units for each speaker.
FIG. 29 is a diagram showing another example in which the recognition result text of a spoken dialogue is separated into summary units for each speaker.
FIG. 30 is a diagram showing an example of a summary generated from the recognition result text of FIG. 29.
FIG. 31 is a diagram showing an example of a user interface for displaying a summary of dialogue speech.
FIG. 32 is a diagram showing a display example of emotion analysis results that can be displayed together with a summary of dialogue speech.
FIG. 33 is a diagram showing a display example of the speech recognition result, the natural language processing result, and the corresponding summary result for dialogue speech.
FIG. 34 is a diagram showing an example of the hardware configuration of each device in this embodiment.

Embodiments for carrying out the present invention are described in detail below with reference to the accompanying drawings. The embodiments described below are examples of means for realizing the present invention and should be modified or changed as appropriate depending on the configuration of the apparatus to which the invention is applied and on various conditions; the present invention is not necessarily limited to the following embodiments. Not all combinations of the features described in the embodiments are essential to the solution provided by the invention. Identical components are denoted by the same reference numerals.

<Network Configuration of the Speech Processing System of This Embodiment>
The following describes an example of recording calls made over a telephone network between a customer and a call center operator, but this embodiment is not limited to that case. For example, the embodiment can likewise generate a summary for dialogue speech obtained by picking up and recording a face-to-face dialogue with a sound collection device such as a microphone, instead of a telephone call.
FIG. 1 shows a non-limiting example of the network configuration of the speech processing system according to this embodiment. Referring to FIG. 1, the speech processing system includes a PBX (private branch exchange) 1, a speech acquisition server 2, a call recording server 3, a control server 4, a speech recognition server 5, an emotion analysis server 6, a summary generation server 7, and a PC (Personal Computer) 9 usable for querying dialogue summaries. All or some of the PBX 1, the speech acquisition server 2, the call recording server 3, the control server 4, the speech recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PC 9 may be installed on the call center premises and interconnected by an IP (Internet Protocol) network such as an intranet 8, for example a LAN (Local Area Network) or WAN (Wide Area Network).

Alternatively, all or some of the speech acquisition server 2, the call recording server 3, the control server 4, the speech recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PC 9 may be installed outside the call center as appropriate, connected through a remote IP connection such as the Internet.
In particular, when an administrator or other person who is not a call center operator operates the dialogue summary query PC 9 to query or update the dialogue speech summaries that form the response history in the summary database, the dialogue summary query PC 9 need not be installed near the operators and is preferably installed outside the call center as appropriate, connected through a remote IP connection.

The speech processing system may further include another PC 10, to which a microphone is connected or in which one is built, connected to the speech processing system via the intranet 8 or the Internet. With this configuration, face-to-face dialogue speech picked up by the microphone of the PC 10 can be input to the speech processing system of this embodiment, and a summary of the face-to-face dialogue speech can be generated.

The PBX 1 accommodates the extension telephones in the call center and connects them to one another. It also connects each operator's telephone terminal 12 to the PSTN (public switched telephone network) 13 by circuit switching via premises lines 11a, 11b, 11c, and so on, thereby enabling calls between each operator's telephone terminal 12 and a customer's telephone terminal 14 connected to the PSTN 13.

In FIG. 1 the PBX 1 is connected to the customer's telephone terminal 14 via a public switched telephone network such as the PSTN 13. Instead of or in addition to this, the PBX 1 may have an IP network connection function and be connected, via a voice packet communication network such as a VoIP (Voice over Internet Protocol) network, to a customer's IP call terminal having an IP telephone function. In that case, the speech acquisition server 2 described below can acquire the voice call between the customer's IP call terminal and the operator's telephone terminal 12. The customer's telephone terminal 14 may be a fixed-line telephone, a mobile phone, or a smartphone.
<Functional Configuration of Each Server Device>

The speech acquisition server 2 is connected to the PBX 1 by a branch connection. It acquires the call speech between each operator's telephone terminal 12 and the customer's telephone terminal 14 and supplies the acquired call speech to each server in association with an identifier of the operator's telephone terminal 12 (for example, an extension number). Alternatively, the speech acquisition server 2 may be branch-connected to the line between the termination device (DSU) of the PSTN 13 and the PBX 1.

Under the control of the control server 4, the call recording server 3 compresses as necessary the call speech supplied from the speech acquisition server 2 after a call arrives, and stores the resulting audio data in a database of dialogue speech files (dialogue speech file 31 in FIG. 2) held on a large-scale external storage device such as NAS (Network Attached Storage).
Preferably, when analog speech is supplied from the speech acquisition server 2, the call recording server 3 converts it into digital speech by sampling the analog speech waveform, expressed as a voltage, at a predetermined bit depth and a predetermined sampling frequency, and stores the result in the dialogue speech file 31.
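Sampling a voltage waveform at a given sampling frequency and bit depth can be sketched in Python as below. This is a generic illustration of uniform sampling and quantization, not the conversion method of the call recording server 3; the function name, the assumed voltage range of -1.0 to 1.0, and the 8 kHz / 16-bit values in the usage lines are assumptions.

```python
import math


def sample_and_quantize(waveform, n_samples, sample_rate_hz, bit_depth):
    """Sample a voltage waveform and quantize it to signed integers.

    waveform: function mapping time in seconds to a voltage, assumed to lie
    in the range -1.0 to 1.0 (values outside are clipped).
    """
    max_code = 2 ** (bit_depth - 1) - 1   # e.g. 32767 for 16 bits
    samples = []
    for i in range(n_samples):
        voltage = waveform(i / sample_rate_hz)
        voltage = max(-1.0, min(1.0, voltage))   # clip to the assumed range
        samples.append(round(voltage * max_code))
    return samples


# A 1 kHz tone sampled at 8 kHz with 16-bit depth (illustrative values).
tone = lambda t: math.sin(2 * math.pi * 1000 * t)
pcm = sample_and_quantize(tone, n_samples=8, sample_rate_hz=8000, bit_depth=16)
print(pcm)
```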

This digital audio data may be compressed before being stored in the dialogue speech file 31. Various known methods can be used at various compression ratios to compress the recorded speech; as non-limiting examples, the recording may be stored as monaural audio compressed to one fifth, monaural audio compressed to one tenth, or uncompressed stereo audio. Alternatively, the call recording server 3 may store the audio data supplied from the speech acquisition server 2 in the dialogue speech file 31 without converting or compressing it.

The call recording server 3 also writes call information, acquired as control information for the call, to a call information file (not shown) in association with the dialogue speech data for each call stored in the dialogue speech file 31. This call information is supplied by the PBX 1.
The call information acquired by the call recording server 3 includes, for example, call control information such as incoming-call start information (including an incoming-call start timestamp), outgoing-call start information (including an outgoing-call start timestamp), call start information (including a call start timestamp), and call end information (including a call end timestamp), and call identification information such as the source telephone number, destination telephone number, source channel number, caller number, incoming channel number, and incoming telephone number (for example, the destination extension number).
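A call information record of this kind could be modeled as in the following Python sketch. The class name, the field names, the subset of fields shown, and the file name are hypothetical; the source lists the kinds of information but does not define a record layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CallInfo:
    # Call control information (timestamps)
    call_start: datetime
    call_end: datetime
    # Call identification information
    source_number: str
    destination_number: str
    destination_extension: Optional[str] = None
    # Hypothetical link to the recorded audio for this call
    audio_file: Optional[str] = None

    @property
    def duration_seconds(self) -> float:
        """Length of the call derived from the start and end timestamps."""
        return (self.call_end - self.call_start).total_seconds()


info = CallInfo(
    call_start=datetime(2018, 10, 31, 10, 0, 0),
    call_end=datetime(2018, 10, 31, 10, 5, 30),
    source_number="0312345678",
    destination_number="0120000000",
    destination_extension="101",
    audio_file="call_0001.wav",
)
print(info.duration_seconds)
```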

This call information further includes speaker identification information that identifies the polarity of each utterance in the recorded call: inbound, that is, an utterance from the customer side, or outbound, that is, an utterance from the operator side. This speaker identification information can be acquired by the PBX 1. In the case of SIP (Session Initiation Protocol), for example, it can be determined when the session is set up during call generation. Specifically, in the INVITE command sent from the calling side to the called side at session setup, the SDP (Session Description Protocol) body that describes the information needed to start the session specifies the IP address and port number that the calling side uses for reception. In the 200 OK message sent in response from the called side to the calling side, the SDP specifies the IP address and port number that the called side uses for reception. Voice data is then sent and received over RTP (Real-time Transport Protocol) using these specified IP addresses and port numbers. Therefore, by acquiring the IP address and port number that the calling side and the called side each use for reception, speaker identification information can be obtained for each utterance in a call, and the customer's utterances and the operator's utterances within one call can be distinguished or separated as needed.
In the case of ISDN, the speaker identification information can be acquired as the physical pin position of the line termination unit (Digital Service Unit: DSU).
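
The SIP case can be sketched as follows. This is a minimal illustration, not the PBX's actual logic: it assumes each SDP body carries one IPv4 `c=` line and one `m=audio` line, the function names `receive_endpoint` and `speaker_of` are hypothetical, and the addresses are documentation-range placeholders.

```python
import re


def receive_endpoint(sdp: str) -> tuple[str, int]:
    """Extract the (IP address, port) a party advertises for receiving audio."""
    ip = re.search(r"^c=IN IP4 (\S+)", sdp, re.MULTILINE).group(1)
    port = int(re.search(r"^m=audio (\d+) ", sdp, re.MULTILINE).group(1))
    return ip, port


def speaker_of(packet_dst, caller_rx, callee_rx, caller_is_customer=True):
    """Label an RTP packet by its destination endpoint.

    A packet addressed to the called side's receive endpoint carries the
    calling side's voice, and vice versa.
    """
    caller, callee = (("customer", "operator") if caller_is_customer
                      else ("operator", "customer"))
    if packet_dst == callee_rx:
        return caller
    if packet_dst == caller_rx:
        return callee
    return None  # not part of this call


# SDP from the INVITE (calling side) and from the 200 OK (called side).
INVITE_SDP = "v=0\r\nc=IN IP4 192.0.2.10\r\nm=audio 49170 RTP/AVP 0\r\n"
OK_SDP = "v=0\r\nc=IN IP4 198.51.100.20\r\nm=audio 30000 RTP/AVP 0\r\n"

caller_rx = receive_endpoint(INVITE_SDP)
callee_rx = receive_endpoint(OK_SDP)
```

For an inbound call the calling side is the customer, so `caller_is_customer` is left at its default; for an outbound call it would be set to `False`.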

Preferably, this call information may be displayed in real time on the display devices of the control server 4, which implements a CTI (Computer Telephony Integration) protocol, or of the operator's PC 9, in cooperation with a CTI program running on them.

The call recording server 3 also includes a customer information database (not shown) in which customer information is registered in advance, mainly for customers who already have a response history. This customer information includes personal information that identifies the customer, for example the customer's name, address, registered telephone number, date of birth, age group, gender, other customer attributes, product purchase history, and response history, and can be output as appropriate to a terminal device operable by the operator in response to the operator's instruction input.

Note that, instead of being connected to the private line 8, the call recording server 3 may be connected, for example, between the PSTN 13 and the PBX 1; with this configuration, the call recording server 3 can acquire the speaker identification information described above directly. As a further alternative, without a separately installed voice acquisition server 2, the call recording server 3 may be connected to the private line 8 and directly acquire the call voice supplied to the private line 8.

Based on the data and control information supplied from the voice acquisition server 2, the call recording server 3, the speech recognition server 5, the emotion analysis server 6, and the summary generation server 7, the control server 4 controls the processing these servers execute and the exchange of data traffic and control information among them. Alternatively, the speech recognition server 5 and the summary generation server 7 may directly provide access to the call speech file 31 and the call information file held by the call recording server 3, and the interface to the dialogue summary inquiry PC 9, without going through the control server 4. In this case, the voice processing system need not include a separate control server 4.

Under the control of the control server 4, the speech recognition server 5 reads the dialogue speech data stored in the dialogue speech file 31 one call at a time, from off-hook to on-hook, and separates the dialogue speech of each call into a plurality of utterance units. This separation identifies silent sections and splits the dialogue speech at them; it is described later with reference to FIG. 5.
In the present embodiment, the speech recognition server 5 analyzes the dialogue speech data for each separated utterance unit to extract feature quantities, refers to various recognition dictionaries such as a speech recognition dictionary (the speech recognition dictionary 32 in FIG. 2), applies known speech recognition technology to convert the dialogue speech data into a character code string, and outputs the converted character code string to a file as dialogue speech text. In the present embodiment, the dialogue speech text output by the speech recognition server 5 includes text divided into summary units (the summary unit text in FIG. 2). The process of dividing the dialogue speech text into summary units is described later with reference to FIGS. 4, 7, and 8.

The emotion analysis server 6 takes the dialogue speech data supplied from the call recording server 3 as input and outputs, for each speaker, quantitative indices of the speaker's emotions, such as joy or anger, satisfaction, stress level, and trust, as the speaker's emotion analysis result. This result can be output as the change in each emotion index over a predetermined period, such as within one call or over a whole day. Details of the emotion analysis processing executed by the emotion analysis server 6 are described later with reference to FIG. 6 and FIGS. 22 to 24.

The summary generation server 7 reads, one call at a time, the dialogue speech text divided into summary units that is stored in the dialogue speech text file 33, executes summary generation processing, and outputs the generated dialogue summary as summary sentence text (the summary sentence text 38 in FIG. 3). Details of this summary generation processing are described later with reference to FIG. 6.

The summary generation server 7 may read the dialogue speech text of the utterances of one speaker in a call, for example the operator, and generate a summary from it; it may add to the summary the answer portions (described later) extracted from the utterances of the other speaker, for example the customer; or it may create the summary from the dialogue speech text of both speakers. In the last case, it is preferable to associate speaker identification information with the dialogue speech text.

The summary generated for each call can be output, in response to an inquiry input, to an output device such as the display of the dialogue summary inquiry PC 9 or a printer. Preferably, it may be output in association with the call start time, call end time, caller identification information of the call (information identifying whether the call was received from a customer or placed by an operator), and similar items decoded from the call information.
Preferably, the summary displayed on the PC 9 or the like can be updated as appropriate through correction input by the user operating it. By learning from these updates and updating the important word table, unnecessary word table, various conversion tables, and other tables referenced during summary generation, more accurate and concise summaries can be generated.
In the present embodiment, the summary generation server 7 further takes the dialogue speech text supplied from the speech recognition server 5 as input, refers to an emotion word table (the emotion word table 37 in FIG. 3) and the like, extracts the emotion expression portions in the dialogue speech text, and converts them into emotion expression words to be included in the summary.

The network and hardware configuration shown in FIG. 1 is merely a non-limiting example; the servers and databases may be integrated as needed, or individual components may be installed at an external facility such as an ASP (Application Service Provider).

<Example of Functional Configuration of Speech Recognition Server 5>
FIG. 2 is a diagram showing a non-limiting example of the functional configuration of the speech recognition server 5 according to the present embodiment.
Among the functional modules of the speech recognition server 5 shown in FIG. 2, functions realized by software are implemented by storing a program that provides the function of each module in a memory such as a ROM, loading it into RAM, and executing it on the CPU. Functions realized by hardware may be implemented, for example, by using a predetermined compiler to automatically generate a dedicated circuit on an FPGA (Field Programmable Gate Array) from a program that realizes the function of each module. A gate array circuit may also be formed in the same manner as the FPGA and realized as hardware, or the functions may be realized by an ASIC (Application Specific Integrated Circuit). The functional block configuration shown in FIG. 2 is an example: a plurality of functional blocks may be combined into one, or any functional block may be divided into blocks that perform a plurality of functions. The same applies to the functional configurations of the summary generation server 7 shown in FIG. 3 and of the other server devices.
Referring to FIG. 2, the speech recognition server 5 includes a speech recognition pre-processing unit 51, a speech recognition unit 52, a speech recognition post-processing unit 53, and a backchannel analysis unit 54.

The speech recognition pre-processing unit 51 reads the dialogue speech file for each call from the dialogue speech file 31 stored by the call recording server 3, detects silent sections in the dialogue speech file of that call, and divides the dialogue into utterance units using the detected silent sections as boundaries. The speech recognition pre-processing unit 51 also supplies the utterance units divided from the dialogue speech file of one call to the speech recognition unit 52 one unit at a time, causing the speech recognition unit 52 to execute speech recognition processing per utterance unit.

The speech recognition unit 52 takes the dialogue speech of each utterance unit supplied from the speech recognition pre-processing unit 51 as input, executes speech recognition processing, and supplies the dialogue speech text of each utterance unit to the speech recognition post-processing unit 53. The speech recognition unit 52 can convert the speech data of the dialogue into dialogue speech text by referring, for example, to the speech recognition dictionary 32, in which important words and important sentences that must be recognized accurately can be defined. Note that the speech recognition unit 52 may be implemented in a known speech recognition engine, while the speech recognition pre-processing unit 51, the speech recognition post-processing unit 53, and the backchannel analysis unit 54 may be implemented, for example, in the control server 4.

The speech recognition post-processing unit 53 performs syntactic analysis, morphological analysis, and the like on the dialogue speech text of each utterance unit output by the speech recognition unit 52, divides the dialogue speech text into summary units, and outputs it as the dialogue speech text 33 divided into summary units. The syntactic and morphological analysis results may be associated with the call speech text divided into summary units. A summary unit is a unit of division, finer than the utterance unit, that serves as the processing unit of summary generation so that a summary can be generated easily and accurately from the call speech text of each utterance unit; its details are described later with reference to FIG. 8.

The speech recognition post-processing unit 53 may also assign a weight to each extracted summary unit by referring to the speech recognition dictionary 32, which defines a weight for each important word. For example, dates, times, addresses, and telephone numbers are often important words that should remain in the summary, and weighting these words in the speech recognition post-processing unit 53 reduces erroneous conversion.

The backchannel analysis unit 54 detects, in the dialogue speech text divided into summary units supplied by the speech recognition post-processing unit 53, text presumed to be a short reply such as "はい" ("yes") or "いいえ" ("no"), and determines whether the detected text is a backchannel (aizuchi, a brief interjection that merely signals listening) or a genuine answer. Based on this determination, the backchannel analysis unit 54 deletes text determined to be a backchannel from the dialogue speech text 33 divided into summary units that the speech recognition post-processing unit 53 outputs.
Text determined to be an answer, on the other hand, is retained in the dialogue speech text 33 so that it is included in the summary generated by the summary generation server 7, and the backchannel analysis unit 54 tags that text in the dialogue speech text as an "answer". Details of this backchannel analysis processing are described later with reference to FIGS. 13 and 14.
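
The delete-or-tag behavior of the 相槌 (aizuchi, backchannel) analysis unit 54 can be sketched as below. The actual decision criteria are deferred to FIGS. 13 and 14 and are not reproduced here, so the decision is passed in as a function; `follows_question` is a purely hypothetical stand-in rule, and the dictionary keys and function names are assumptions made for this illustration.

```python
# Short replies that are candidates for being either a backchannel or an answer.
CANDIDATES = {"はい", "いいえ"}


def filter_backchannels(units, is_answer):
    """Drop backchannels and tag answers in a list of summary units.

    units: dicts with "speaker" and "text" keys, in dialogue order.
    is_answer(prev_unit, unit) -> bool decides answer vs. backchannel.
    """
    kept = []
    for i, unit in enumerate(units):
        if unit["text"] not in CANDIDATES:
            kept.append(unit)
            continue
        prev = units[i - 1] if i > 0 else None
        if is_answer(prev, unit):
            kept.append({**unit, "type": "answer"})  # keep and tag
        # otherwise it is treated as a backchannel and deleted
    return kept


def follows_question(prev, unit):
    """Hypothetical rule: a reply right after the other speaker's question."""
    return (prev is not None
            and prev["speaker"] != unit["speaker"]
            and prev["text"].endswith("か"))


units = [
    {"speaker": "operator", "text": "ご住所は変わりましたか"},
    {"speaker": "customer", "text": "はい"},
    {"speaker": "operator", "text": "確認いたします"},
    {"speaker": "customer", "text": "はい"},
]
filtered = filter_backchannels(units, follows_question)
```

Here the first "はい" answers a question and is kept with an `"answer"` tag, while the second merely acknowledges the operator and is removed.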

<Example of Functional Configuration of Summary Generation Server 7>
FIG. 3 is a diagram showing a non-limiting example of the functional configuration of the summary generation server 7 according to the present embodiment.
Referring to FIG. 3, the summary generation server 7 includes a text correction unit 71, a redundancy elimination unit 72, a summary sentence generation unit 73, an emotion analysis unit 74, and a summary sentence shortening unit 75.

The text correction unit 71 reads the dialogue speech text 33 divided into summary units, corrects the dialogue speech text based on the syntactic and morphological analysis results so that summary generation becomes easier, and outputs the corrected dialogue speech text to the redundancy elimination unit 72.

The redundancy elimination unit 72 eliminates redundancy from the corrected dialogue speech text supplied by the text correction unit 71. Specifically, the redundancy elimination unit 72 refers, for example, to the unnecessary word table 35 and deletes unnecessary words, duplicated sentences, and the like from the dialogue speech text, thereby shortening the dialogue speech text to be supplied to the summary sentence generation unit 73. The redundancy elimination unit 72 outputs the shortened dialogue speech text, with redundancy removed, to the summary sentence generation unit 73.
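
A minimal sketch of this redundancy elimination, assuming the unnecessary word table 35 can be modeled as a plain list of filler strings and that duplicates are exact matches after filler removal. The function name and the table contents are hypothetical.

```python
def remove_redundancy(units, unnecessary_words):
    """Delete unnecessary words, then drop empty and duplicated units."""
    seen = set()
    result = []
    for unit in units:
        for word in unnecessary_words:
            unit = unit.replace(word, "")
        unit = unit.strip()
        if unit and unit not in seen:  # skip empty and already-seen text
            seen.add(unit)
            result.append(unit)
    return result


# Hypothetical contents standing in for the unnecessary word table 35.
FILLERS = ["えーと", "あのー"]

units = ["えーと契約を更新します", "契約を更新します", "あのー", "住所は東京です"]
cleaned = remove_redundancy(units, FILLERS)
```

The order of the surviving units is preserved, so the shortened text can still be read as a dialogue when it reaches the summary sentence generation unit 73.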

The summary sentence generation unit 73 reads the shortened dialogue speech text supplied from the redundancy elimination unit 72 and generates summary sentence text by referring to the important word table 34, the unnecessary word table 35, and the various conversion tables 36. The summary sentence generation unit 73 may generate one summary sentence text per call. The summary output by the summary sentence generation unit 73 may be written in a concise, report-like style obtained by converting the spoken language of the call speech text, for example a style in which sentences end in nouns (taigen-dome).

In the present embodiment, the summary sentence generation unit 73 acquires from the emotion analysis server 6, as the speaker's emotion analysis result, quantitative indices of the speaker's emotions during the dialogue, and can include the acquired result in the summary sentence text to be generated or have it displayed on a display device together with, or in association with, the summary sentence text. The emotion analysis result supplied from the emotion analysis server 6 includes, for each speaker, quantitative indices such as joy or anger, satisfaction, stress level, and trust.

The emotion analysis unit 74 refers to the emotion word table 37, extracts the emotion expression portions from the summary sentence text generated by the summary sentence generation unit 73, converts each into a concise emotion expression word to be included in the summary, and replaces the extracted portion of the summary sentence text with the converted word.
When the summary supplied from the summary sentence generation unit 73 exceeds a predetermined length, for example a threshold number of characters, the summary sentence shortening unit 75 shortens it so that its length falls within the threshold, and outputs the shortened summary as the summary sentence text 38.
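
The replacement performed by the emotion analysis unit 74 can be illustrated with a simple table lookup. This sketch assumes the emotion word table 37 maps a verbose expression to a concise emotion word; the table entries, the parenthesized output form, and the function name are all hypothetical.

```python
def replace_emotion_expressions(summary, emotion_words):
    """Replace each emotion expression in the summary with its concise word."""
    # Longer phrases first, so a short entry cannot split a longer match.
    for phrase in sorted(emotion_words, key=len, reverse=True):
        summary = summary.replace(phrase, emotion_words[phrase])
    return summary


# Hypothetical entries standing in for the emotion word table 37.
EMOTION_WORDS = {
    "とても助かりました": "（感謝）",
    "いい加減にしてください": "（怒り）",
}

result = replace_emotion_expressions("顧客はとても助かりましたと発言", EMOTION_WORDS)
```

Because the replacement is usually shorter than the phrase it replaces, this step also helps keep the summary within the length threshold applied by the summary sentence shortening unit 75.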

<Processing Procedure of Speech Recognition Processing in Speech Recognition Server 5>
FIG. 4 is a flowchart showing a non-limiting example of the processing procedure of the speech recognition processing executed by each unit of the speech recognition server 5.
In S1, the speech recognition pre-processing unit 51 of the speech recognition server 5 reads, from the dialogue speech file 31, the dialogue speech data stored as one file per call.
In S2, the speech recognition pre-processing unit 51 of the speech recognition server 5 identifies the speakers in the dialogue speech read in S1. Specifically, the speech recognition pre-processing unit 51 can identify the speakers in the dialogue speech, for example the customer and the operator, by referring to the speaker identification information in the call information associated with the dialogue speech file.

More specifically, the speech recognition pre-processing unit 51 refers to the call information database (not shown) and determines the speaker identification information within one call, thereby identifying whether the speaker of each utterance in the call is the customer or the operator.
In the subsequent speech recognition unit 52, the dialogue speech data is recognized separately for each identified speaker. The summary sentence generation unit 73 of the summary generation server 7, which generates a summary from the recognized dialogue speech text, can then associate the recognition result texts of the two speakers with each other by referring to the timestamps of the dialogue recording.
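
One simple way to associate the two speakers' recognition results by timestamp is to interleave them in start-time order. The sketch below assumes each recognized utterance carries its start time in seconds; the `RecognizedUtterance` type, its field names, and the sample texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognizedUtterance:
    start_sec: float  # offset of the utterance from the start of the recording
    speaker: str
    text: str


def interleave_by_timestamp(customer, operator):
    """Merge per-speaker recognition results into one dialogue, by start time."""
    return sorted([*customer, *operator], key=lambda u: u.start_sec)


customer = [
    RecognizedUtterance(3.2, "customer", "I would like to change my address."),
    RecognizedUtterance(9.5, "customer", "Yes."),
]
operator = [
    RecognizedUtterance(0.0, "operator", "Thank you for calling."),
    RecognizedUtterance(6.8, "operator", "Is this the registered phone number?"),
]
dialogue = interleave_by_timestamp(customer, operator)
```

With the utterances in a single ordered list, a short reply from one speaker can be related to the utterance of the other speaker that immediately precedes it.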

The speech recognition pre-processing unit 51 may supply the dialogue speech data of utterances identified as those of one speaker, for example the operator, to the summary generation server 7 in preference to the dialogue speech data of utterances identified as those of the other speaker, for example the customer. This is based on the finding that, as a source for summary generation, the utterances of one speaker, typically the operator, yield the information needed to summarize the response history more efficiently.
Alternatively, the speech recognition pre-processing unit 51 may have only the dialogue speech data of one speaker, for example only utterances identified as the operator's, recognized and converted into dialogue speech text. Limiting the recognition targets reduces the hardware resources required in the speech recognition server 5, which performs high-load speech recognition, improves the real-time performance of speech recognition and summary generation, and reduces the storage needed for resources such as the dialogue speech text file.

In S3, the speech recognition pre-processing unit 51 of the speech recognition server 5 separates the speaker-separated dialogue speech data read for each call into utterance units and supplies the resulting utterance-unit dialogue speech to the speech recognition unit 52.
Specifically, the speech recognition pre-processing unit 51 detects silent sections of at least a certain length in the dialogue speech data and splits the speech at the detected silent sections, thereby cutting out the voiced sections and separating each as the dialogue speech of one utterance unit.

As shown in FIG. 5, the dialogue speech file for one call consists of two channels, CH1 and CH2. Here CH1 is assumed to carry, for example, the customer's utterances and CH2 the operator's utterances.
The speech recognition pre-processing unit 51 detects silent sections of a certain length. A silent section to be detected may be, for example, one lasting 1.5 seconds or longer, and this lower limit may be adjusted, for example, between 1 second and 2 seconds. The lower limit of the silent section is referred to as the first threshold. It can be set in consideration of, for example, the time needed to take a breath. It is also preferably set so that the brief closure of the small "っ" within an utterance such as "言ったよね" ("itta yo ne") is not mistakenly detected as a silent section.

Referring to FIG. 5, the speech recognition pre-processing unit 51 detects, in the customer's speech on CH1, silent sections (SL11, SL12, ..., SL16) whose length is at least the first threshold, and extracts the voiced sections (SP11, SP12, ..., SP17) that lie between them. Each extracted voiced section is one utterance unit in the speech identified as the customer's and, in the present embodiment, is the speech recognition unit supplied to the speech recognition unit 52. Each voiced section can be regarded as a section uttered without taking a breath.
Similarly, referring to FIG. 5, the speech recognition pre-processing unit 51 detects, in the operator's speech on CH2, silent sections (SL21, SL22, ..., SL26) whose length is at least the first threshold, and extracts the voiced sections (SP21, SP22, ..., SP27) that lie between them. Each extracted voiced section is one utterance unit in the speech identified as the operator's.
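
The segmentation of one channel into utterance units can be sketched as follows. This is a simplified illustration: it assumes the channel has already been reduced to one energy value per fixed-length frame, and the energy floor, the frame length, and the function name are assumptions. Only the 1.5-second default for the first threshold comes from the description above.

```python
import math


def split_utterances(frame_energies, frame_ms=10, energy_floor=0.01,
                     min_silence_ms=1500):
    """Split one channel into utterance units at sufficiently long silences.

    Returns (start_ms, end_ms) pairs. A silence shorter than min_silence_ms
    (the first threshold) stays inside the surrounding utterance unit.
    """
    min_gap = math.ceil(min_silence_ms / frame_ms)  # threshold in frames
    units = []
    start = None        # first voiced frame of the current unit
    last_voiced = None  # most recent voiced frame
    for i, energy in enumerate(frame_energies):
        if energy <= energy_floor:
            continue  # silent frame
        if start is None:
            start = i
        elif i - last_voiced - 1 >= min_gap:
            units.append((start, last_voiced + 1))  # close the previous unit
            start = i
        last_voiced = i
    if start is not None:
        units.append((start, last_voiced + 1))
    return [(s * frame_ms, e * frame_ms) for s, e in units]


# 500 ms frames: the 1 s pause is kept inside a unit, the 1.5 s pause splits.
energies = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
units = split_utterances(energies, frame_ms=500, energy_floor=0.5)
```

Running the same function once on CH1 and once on CH2 yields the customer's and the operator's utterance units separately, matching the per-channel processing shown in FIG. 5.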

Returning to FIG. 4, in S4 the speech recognition unit 52 of the speech recognition server 5 executes speech recognition processing, for each identified speaker, on the dialogue speech data input from the speech recognition pre-processing unit 51 in utterance units, and outputs dialogue speech text, that is, the dialogue speech converted to text.
In the present embodiment, speech recognition is thus executed on the dialogue speech data one utterance unit at a time. A silent section suggests that the speaker changed during it, or that the same speaker changed the topic or content. The continuity of content across a silent section can therefore be presumed to be weak, and recognizing the dialogue speech per utterance unit can be expected to improve recognition accuracy.

A known speech recognition engine can be applied to this speech recognition processing.
As one example of the conversion into a character code string in the speech recognition processing executed by the speech recognition unit 52, feature quantities extracted from the speech waveform in the dialogue speech data, after any necessary conversion processing, are compared with predefined reference acoustic patterns for each phoneme, whereby the speech waveform data can be converted into a character code string.

The speech recognition dictionary 32 referenced by the speech recognition unit 52 and the speech recognition post-processing unit 53 defines in advance the data of important words (or important sentences) that are expected to be targets of speech recognition and that contain important information to be included in the summary. Accordingly, only the phoneme strings of the dialogue speech data that correspond to the important words defined in the speech recognition dictionary 32 may be extracted and given meaning. The important words (or important sentences) defined in the speech recognition dictionary 32 may also be weighted. Of the dialogue speech data read by the speech recognition unit 52, the portions corresponding to these defined important words may be converted into dialogue speech text and output as the speech recognition result.

FIG. 9 shows a non-limiting example of dialogue speech text, that is, the utterance-unit speech recognition result generated from the dialogue speech data and output by the speech recognition unit 52 in S4. In the example of FIG. 9, the utterance unit extracted between two silent sections is 「対話要約処理は不要な発言や表現の削除のほか話し言葉から書き言葉への変換などで構成されますなお処理対象データの特性に応じて選択することができます」 ("Dialogue summary processing consists of deleting unnecessary remarks and expressions, converting spoken language into written language, and so on; these can be selected according to the characteristics of the data to be processed"). As FIG. 9 shows, the utterance-unit speech recognition result output in S4 may contain several sentences as a single block, with no punctuation separating them.

Returning to FIG. 4, in S5 the speech recognition post-processing unit 53 of the speech recognition server 5 converts the speech recognition result output by the speech recognition unit 52 in S4 into natural utterances and divides it into summary units. The speech recognition post-processing unit 53 can also assign a type and a weight to the dialogue speech text of each summary unit obtained in S5, based on the results of syntactic analysis and morphological analysis.
The conversion process in S5 is described in detail later with reference to FIGS. 7 and 8.

In S6, the backchannel analysis unit 54 of the speech recognition server 5 detects, in the dialogue speech text divided into summary units, text presumed to be a reply such as 「はい」 ("yes") or 「いいえ」 ("no"), and determines whether the detected text is a backchannel (相槌, aizuchi: a short interjection that merely signals listening) or a genuine answer.
Based on this determination, the backchannel analysis unit 54 deletes text determined to be a backchannel from the dialogue speech text 33 divided into summary units that the speech recognition post-processing unit 53 outputs. Text determined to be an answer, on the other hand, is kept in the dialogue speech text 33 so that it is included in the summary sentence generated by the summary generation server 7, and the corresponding text element in the dialogue speech text is tagged (assigned a type) as an "answer". The backchannel analysis in S6 is described in detail later with reference to FIGS. 13 and 14.
In S7, the backchannel analysis unit 54 outputs the dialogue speech text divided into summary units, with the text determined to be answers added.

<Detailed Processing Procedure of Speech Recognition Post-Processing in the Speech Recognition Post-Processing Unit 53>
FIG. 7 is a flowchart showing an example of the detailed procedure of the speech recognition post-processing executed by the speech recognition post-processing unit 53 in S5 of FIG. 4.
Referring to FIG. 7, in S51 the speech recognition post-processing unit 53 of the speech recognition server 5 refers to the speech recognition dictionary 32 and performs syntactic analysis of the utterance-unit dialogue speech text, which is the speech recognition result output by the speech recognition unit 52 in S4.
In S52, the speech recognition post-processing unit 53 refers to the speech recognition dictionary 32 and performs morphological analysis of the utterance-unit dialogue speech text. The syntactic analysis of S51 and the morphological analysis of S52 may be performed in either order, or concurrently.

FIG. 10 shows a non-limiting example of the syntactic analysis result obtained by applying the syntactic analysis of S51 to the utterance-unit dialogue speech text shown in FIG. 9. As FIG. 10 shows, the syntactic analysis result output in S51 structures the relationships between the morphemes in the text.
FIG. 11 shows a non-limiting example of the morphological analysis result obtained by applying the morphological analysis of S52 to the utterance-unit dialogue speech text shown in FIG. 9. As FIG. 11 shows, the morphological analysis result may include, for each extracted morpheme, its written form, its reading, and the part-of-speech type obtained (major, middle, and minor classification).

Returning to FIG. 7, in S53 the speech recognition post-processing unit 53 subdivides the utterance-unit dialogue speech text into summary units, based on the results of the syntactic analysis of S51 and the morphological analysis of S52.
FIG. 8 is a flowchart showing an example of the detailed procedure of the separation into summary units executed by the speech recognition post-processing unit 53 in S53 of FIG. 7.
In S531, the speech recognition post-processing unit 53 determines whether the part-of-speech type of a delimiter unit obtained from the morphological and syntactic analysis is a noun. If it is a noun (S531: Y), the process proceeds to S532. If it is anything other than a noun, the processing from S532 onward is skipped, the process ends, and processing proceeds to S6.

In S532, the speech recognition post-processing unit 53 determines whether the head of the group of delimiter units obtained from the morphological and syntactic analysis is something other than a noun. If the head of the group is not a noun (S532: Y), the processing from S533 onward is skipped, the process ends, and processing proceeds to S6. If the head of the group is a noun (S532: N), the process proceeds to S533.

In S533, the speech recognition post-processing unit 53 determines whether the delimiter unit obtained from the morphological and syntactic analysis is of the form noun + α. If it is, that is, if it ends with something other than a noun, such as a particle (S533: Y), then in S534 the speech recognition post-processing unit 53 joins the delimiter unit to the immediately preceding delimiter unit, the process ends, and processing proceeds to S6. If the delimiter unit is not noun + α, that is, if it consists of a noun only (S533: N), then in S535 the speech recognition post-processing unit 53 joins the delimiter unit to the immediately preceding delimiter unit and returns to S532, repeating the determinations of S532 and S533.
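The joining rule of S531 to S535 can be illustrated with a short Python sketch. It is only one possible reading of the flowchart: the `Unit` record, its two boolean flags, and the function name `merge_units` are hypothetical, and the part-of-speech flags are assumed to come from the morphological and syntactic analysis rather than being computed here.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """One delimiter unit (hypothetical record, not defined in the source)."""
    text: str
    starts_with_noun: bool  # head of the unit is a noun (S531/S532)
    noun_only: bool         # unit is a bare noun with no trailing particle (S533: N)


def merge_units(units):
    """Join runs of noun-only units with the noun + alpha unit that closes them."""
    merged = []
    pending = ""  # run of bare nouns waiting for a closing noun + alpha unit
    for unit in units:
        if not unit.starts_with_noun:
            # Not a noun: nothing is joined, any open run is flushed as-is.
            if pending:
                merged.append(pending)
                pending = ""
            merged.append(unit.text)
        elif unit.noun_only:
            # Bare noun: keep accumulating (corresponds to S535, loop to S532).
            pending += unit.text
        else:
            # Noun + alpha: join with the accumulated nouns and close (S534).
            merged.append(pending + unit.text)
            pending = ""
    if pending:
        merged.append(pending)
    return merged


units = [
    Unit("対話", True, True),
    Unit("要約", True, True),
    Unit("処理は", True, False),
    Unit("選択する", False, False),
    Unit("処理", True, True),
    Unit("対象", True, True),
    Unit("データの", True, False),
]
print(merge_units(units))
```

With the flags assumed above, the three leading units collapse into 「対話要約処理は」 and the three trailing units into 「処理対象データの」, which matches the grouping the description gives for FIG. 12.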

FIG. 12 is an example of the dialogue speech text that the speech recognition post-processing unit 53 outputs in S5 of FIG. 4, taking the utterance-unit dialogue speech text of FIG. 9 as input and passing through the syntactic analysis result of FIG. 10 and the morphological analysis result of FIG. 11.
The square symbols in FIG. 12 each mark a boundary between summary units. As FIG. 12 shows, the conversion to natural utterances and the separation into summary units in S5 join the consecutive units 「対話」 ("dialogue"), 「要約」 ("summary"), and 「処理は」 ("processing", with topic particle) into one summary unit, and the consecutive units 「処理」 ("processing"), 「対象」 ("target"), and 「データの」 ("of the data") into another.
The speech recognition post-processing unit 53 of the speech recognition server 5 may further add a type and a weight to each separated summary unit of dialogue speech text by referring to the speech recognition dictionary 32. In FIG. 12, the summary units 「対話要約処理は」 and 「処理対象データの」 are each weighted as important summary units that should be included in the summary sentence.

<Detailed Processing Procedure of Backchannel Analysis in the Backchannel Analysis Unit 54>
FIG. 13 is a flowchart showing a non-limiting example of the detailed procedure of the backchannel analysis executed by the backchannel analysis unit 54 of the speech recognition server 5 in S6 of FIG. 4.
Referring to FIG. 13, in S61 the backchannel analysis unit 54 of the speech recognition server 5 acquires the dialogue speech of both speakers, for example a customer and an operator, from the dialogue speech file 31. Because the dialogue speech file 31 carries time stamps that allow both speakers to be associated for each call, the backchannel analysis unit 54 can acquire the dialogue speech of both speakers making up one call unit. Alternatively, the dialogue speech of the two speakers may be associated by assigning a common identifier, per call unit, to the dialogue speech of each speaker in that call unit. In S61, the dialogue speech text obtained by speech recognition of the dialogue speech is input together with the acquired dialogue speech of both speakers.

In S62, the backchannel analysis unit 54 compares the dialogue speech of the customer and the operator and determines whether a short utterance can be detected while the other party is speaking.
Referring to FIG. 14(a), the short utterance (SP14) in the dialogue speech of the customer on CH1 was made during the utterance (SP24) of the operator on CH2, the other party, and is therefore detected in S62. A short utterance to be detected in S62 may be, for example, one shorter than 2 seconds.
If no short utterance is detected while the other party is speaking (S62: N), S63 to S68 are skipped, the process ends, and processing proceeds to S7. If one is detected (S62: Y), the process proceeds to S63.

In S63, the backchannel analysis unit 54 searches for the dialogue speech text of the speech recognition result that has the same time stamp as the short utterance detected in S62, and determines whether that recognition result can be presumed to be an answer, that is, whether it is an answer candidate. For example, a short utterance whose text is 「はい」 ("yes"), 「ええ」 ("yeah"), 「いいえ」 ("no"), or 「いや」 ("nope") can be judged an answer candidate. These answer candidates may, for example, be set in the backchannel analysis unit 54 in advance.

If the speech recognition result of the short utterance is not an answer candidate (S63: N), the process proceeds to S64, where the short utterance is determined to be a backchannel and is deleted from the dialogue speech text to be input to summary generation. A short utterance determined to be a backchannel in S64 carries no meaning for summarization and is therefore not used as source material for the summary sentence. If the recognition result is an answer candidate (S63: Y), the process proceeds to S65.

In S65, the backchannel analysis unit 54 further determines whether the other party's speech contains a short silent period during the short utterance identified as an answer candidate in S63.
Referring to FIG. 14(a), the operator's utterance on CH2 that corresponds to the short utterance (SP14) in the customer's speech on CH1 contains no silent section at least as long as the first threshold, and was therefore detected as a single utterance unit SP24 by the speech recognition pre-processing unit 41 in S3 of FIG. 4. In S65, a second threshold smaller than the first is used to determine whether a short silent section can be detected in the other party's speech. The second threshold is, for example, 1 second, and may be adjusted between 0.5 and 1.5 seconds.

If, in S65, a short silent section at least as long as the second threshold is detected within the other party's utterance unit (voiced section) during the short utterance that is an answer candidate (S65: Y), the short utterance is determined in S66 to be an answer and the process proceeds to S67. If no such silent section is detected (S65: N), the process proceeds to S64, where the short utterance, although it was an answer candidate, is determined to be a backchannel and is deleted from the dialogue speech text to be input to summary generation.
In S67, the backchannel analysis unit 54 splits the other party's speech into two utterance units, before and after the short utterance determined to be an answer in S66.

Referring to FIG. 14(b), suppose the speech recognition result of the customer's short utterance section (SP14) on CH1 has been determined to be an answer candidate. During this utterance (SP14), a silent section (SL24a) that is at least as long as the second threshold but shorter than the first threshold can be detected in the operator's utterance section (SP24) on CH2. In this case, the backchannel analysis unit 54 splits the operator's utterance section (SP24) at the detected silent section (SL24a), obtaining the utterance section (SP24a) immediately before the silent section and the utterance section (SP24b) immediately after it.

In S68, the backchannel analysis unit 54 determines that the dialogue speech text obtained by speech recognition of the utterance section (SP24a) immediately before the short silent section (SL24a), separated in S67, is the text to be paired with the speech text determined to be an answer in S66. It associates the two texts with each other, one as the answer and the other as the text that prompted the answer and identifies what is being answered, assigns the pair the type "answer", and outputs it to the dialogue speech text file 33 in summary units.
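The decision made in S62 to S66, whether a short utterance is a backchannel (相槌) or an answer, can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the `Segment` record, the function name, and the return labels are hypothetical, the 2-second and 1-second values are the examples given in the text, and the silent sections of the other party are assumed to have been detected beforehand.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A time span in seconds on one channel (hypothetical record)."""
    start: float
    end: float
    text: str = ""


ANSWER_CANDIDATES = {"はい", "ええ", "いいえ", "いや"}  # examples from the text
SHORT_UTTERANCE_SEC = 2.0    # "less than 2 seconds" (example value, S62)
SECOND_THRESHOLD_SEC = 1.0   # second threshold (example value, S65)


def overlaps(a, b):
    """True if the two time spans share any interval."""
    return a.start < b.end and b.start < a.end


def classify_short_utterance(short, partner, partner_silences):
    """Return 'not_applicable', 'backchannel', or 'answer'."""
    # S62: a short utterance made while the other party is speaking.
    if (short.end - short.start) >= SHORT_UTTERANCE_SEC or not overlaps(short, partner):
        return "not_applicable"
    # S63: is the recognized text an answer candidate?
    if short.text not in ANSWER_CANDIDATES:
        return "backchannel"  # S64
    # S65: does the other party pause long enough during the short utterance?
    for silence in partner_silences:
        long_enough = (silence.end - silence.start) >= SECOND_THRESHOLD_SEC
        if long_enough and overlaps(silence, short):
            return "answer"  # S66
    return "backchannel"  # S64


operator = Segment(5.0, 20.0)
customer = Segment(10.0, 10.6, "はい")
print(classify_short_utterance(customer, operator, [Segment(9.8, 11.0)]))
```

The underlying idea is that a pause by the other party around the short utterance suggests that the speaker was waiting for a reply, whereas an interjection made while the other party keeps talking is more likely to be a mere listening signal.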

<Processing Procedure of Summary Generation in the Summary Generation Server 7>
FIG. 6 is a flowchart showing a non-limiting example of the procedure of the summary generation process executed by the units of the summary generation server 7.
Referring to FIG. 6, in S10 the text correction unit 71 of the summary generation server 7 reads the dialogue speech text for one call unit from the dialogue speech text 33 divided into summary units.

In S11, the text correction unit 71 corrects the dialogue speech text read in S10. Specifically, it inserts punctuation into the dialogue speech text of one utterance unit that has been divided into summary units (the processing units of summary generation) as shown in FIG. 12, and then inserts a line break at each period.
FIG. 15 shows a non-limiting example of the punctuation table that the text correction unit 71 refers to. The punctuation table of FIG. 15 defines terms immediately after which a period (句点, 「。」) or a comma (読点, 「、」) should be inserted. In FIG. 15, "1" indicates insertion of a comma and "0" insertion of a period. Referring to the punctuation table of FIG. 15, the text correction unit 71 searches backward from each summary-unit delimiter, by suffix match, for terms defined in the table such as 「ますが」, 「ますか」, 「ます」, and 「はい」, and inserts a period or comma immediately after each term found, as the table specifies. The text correction unit 71 may search for the terms defined in the punctuation table of FIG. 15 in descending order of character count.
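A minimal Python sketch of this table-driven punctuation step is given below. The four terms are the ones named in the description, but the flag assigned to each term, the table name `PUNCT_TABLE`, and the function name `punctuate` are assumptions made for illustration; the actual contents of FIG. 15 are not reproduced here.

```python
# term -> flag; 1 = insert a comma (読点), 0 = insert a period (句点).
# The flag values below are illustrative assumptions.
PUNCT_TABLE = {"ますが": 1, "ますか": 0, "ます": 0, "はい": 1}


def punctuate(summary_units):
    """Append punctuation to each summary unit that ends with a table term."""
    # Try longer terms first, as the description suggests.
    terms = sorted(PUNCT_TABLE, key=len, reverse=True)
    pieces = []
    for unit in summary_units:
        for term in terms:
            if unit.endswith(term):  # suffix match at the unit boundary
                if PUNCT_TABLE[term] == 1:
                    unit += "、"
                else:
                    unit += "。\n"  # line break after each period
                break
        pieces.append(unit)
    return "".join(pieces)


print(punctuate(["対話要約処理は", "構成されます", "なお", "選択することができます"]))
```

Searching longer terms first keeps a short entry from shadowing a longer one that ends in the same characters, which is presumably why the description allows the search to proceed in descending order of character count.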

The text correction unit 71 further searches for numerals extracted by the morphological analysis and analyzes the meaning of each numeral found. In generating summary sentences for a response history, numerals are often important words that serve as keywords of the summary. The text correction unit 71 therefore analyzes the meaning of each numeral found, obtains a type corresponding to that meaning, and assigns a unit and a weight according to the type.
The meaning of a numeral may be, for example, "date", "time", "amount of money", "telephone number", or "count", but is not limited to these.

FIG. 16 is a numeral type table that the text correction unit 71 refers to in order to assign a type, a unit (notation), and a weight to each analyzed numeral element. Referring to FIG. 16, dates, times, and amounts of money (yen), for example, are given higher weights than counts (個) and temperatures (度).
If, on the other hand, a numeral unrelated to the words before and after it is found in the dialogue speech text, the text correction unit 71 may judge it a misrecognition and delete the numeral from the dialogue speech text. The text correction unit 71 may also convert the numerals found into half-width digits to improve legibility and clarity in the summary sentence.
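The numeral handling can be pictured with the following Python sketch. It is a simplification under several assumptions: the table `NUMERAL_TYPE_TABLE`, its three rows, and its weights are invented for illustration (the source states only that dates, times, and amounts weigh more than counts and temperatures), and "related to the surrounding words" is approximated by checking whether the word following the numeral is a unit listed in the table.

```python
# type -> (unit, weight); rows and weights are illustrative assumptions.
NUMERAL_TYPE_TABLE = {
    "amount": ("円", 3),
    "count": ("個", 1),
    "temperature": ("度", 1),
}

# Full-width digits to half-width digits.
FULL_TO_HALF = str.maketrans("０１２３４５６７８９", "0123456789")


def annotate_numeral(numeral, following_word):
    """Return type, unit and weight for a numeral, or None if no unit matches."""
    for type_name, (unit, weight) in NUMERAL_TYPE_TABLE.items():
        if following_word == unit:
            return {
                "value": numeral.translate(FULL_TO_HALF),
                "type": type_name,
                "unit": unit,
                "weight": weight,
            }
    # No related unit next to the numeral: a candidate for deletion
    # as a possible misrecognition.
    return None


print(annotate_numeral("１０００", "円"))
print(annotate_numeral("７", "ぬ"))
```

A real implementation would need richer context than the single following word, for example to tell a telephone number from a date, which the table in FIG. 16 presumably supports through additional types.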

Returning to FIG. 6, in S12 the redundancy exclusion unit 72 of the summary generation server 7 removes redundancy from the recognized dialogue speech text and outputs dialogue speech text that is more concise and simpler.
Specifically, the redundancy exclusion unit 72 refers to the unnecessary word table 35 and deletes unnecessary words from the dialogue speech text.
FIG. 17 shows a non-limiting example of the unnecessary word table 35 that the redundancy exclusion unit 72 refers to. Referring to FIG. 17, the unnecessary word table 35 defines as unnecessary words interjections such as 「えー」 ("er") and stock greetings such as 「いつもお世話になっております。」 ("Thank you for your continued support.").

From the dialogue speech text for one call, the redundancy exclusion unit 72 may further delete duplicates as appropriate when a sentence (or a phrase, word, or other meaningful unit) describing the same or similar content appears more than once. Preferably, when sentences describing the same or similar content appear more than once in the dialogue speech text for one call, the redundancy exclusion unit 72 deletes those that appear earlier in the time series from the start to the end of the call and keeps the one that appears last. This is because a sentence closer to the end of the call is more likely to state the final conclusion of the exchange. In addition, the sentence that appears last can be presumed to be the operator repeating back what was said, in which case the repeated sentence can be expected to state more accurately what should remain in the summary as the response history.
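The two redundancy steps, deleting unnecessary words and keeping only the last occurrence of a repeated sentence, can be sketched in Python as below. The word list holds only the two examples quoted for FIG. 17, the function names are hypothetical, and duplicates are detected by exact string equality; detecting sentences with merely similar content would need a similarity measure that the source does not specify.

```python
# Examples quoted for the unnecessary word table; a real table would be longer.
UNNECESSARY_WORDS = ["いつもお世話になっております。", "えー"]


def remove_unnecessary_words(text):
    """Delete every table entry from the text (naive substring removal)."""
    for word in sorted(UNNECESSARY_WORDS, key=len, reverse=True):
        text = text.replace(word, "")
    return text


def keep_last_occurrence(sentences):
    """Drop earlier copies of a repeated sentence and keep the final one."""
    last_index = {sentence: i for i, sentence in enumerate(sentences)}
    return [s for i, s in enumerate(sentences) if last_index[s] == i]


print(remove_unnecessary_words("いつもお世話になっております。えー明日伺います。"))
print(keep_last_occurrence(["10時に伺います。", "住所を確認します。", "10時に伺います。"]))
```

Keeping the last copy rather than the first follows the rationale in the text: the final statement is the one most likely to reflect the conclusion of the call or the operator's read-back.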

The redundancy exclusion unit 72 may further refer to the important word table 34 and delete hesitations over, and repetitions of, keywords registered in it.
Suppose, for example, that the important word table 34 has a keyword registered with the written form 「ｅＶｏｉｃｅ」 and the reading 「イーボイス」 ("ii-boisu").
If the recognition result is then 「明日の１０時にいいｅＶｏｉｃｅへ伺います。」 ("I will visit ii... eVoice at 10 o'clock tomorrow."), the redundancy exclusion unit 72 searches immediately before the registered keyword for a word whose reading partially matches the keyword's reading from the beginning, and deletes the word found. The hesitation can thus be deleted from the dialogue speech text.
Similarly, if the recognition result is 「明日の１０時にｅＶｏｉｃｅへｅＶｏｉｃｅにお伺いします。」 ("I will visit, to eVoice, eVoice at 10 o'clock tomorrow."), the redundancy exclusion unit 72 deletes the earlier occurrence of the repeated registered keyword, as described above. The repetition can thus be deleted from the dialogue speech text.
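The two eVoice examples can be reproduced with the following Python sketch. Everything beyond the keyword and its reading is an assumption: tokens are given as hypothetical (surface, reading) pairs such as a morphological analyzer might produce, a repetition is recognized only when the same keyword recurs within `max_gap` intervening tokens, and the earlier occurrence is removed together with the tokens up to the later one.

```python
# surface form -> reading, as registered in the keyword table (example from the text)
KEYWORDS = {"eVoice": "イーボイス"}


def remove_disfluencies(tokens, max_gap=1):
    """tokens: list of (surface, reading) pairs. Returns the cleaned text."""
    drop = set()
    for i, (surface, _) in enumerate(tokens):
        if surface not in KEYWORDS:
            continue
        key_reading = KEYWORDS[surface]
        # Hesitation: the preceding token reads like the start of the keyword.
        if i > 0:
            prev_reading = tokens[i - 1][1]
            if prev_reading and prev_reading != key_reading and key_reading.startswith(prev_reading):
                drop.add(i - 1)
        # Repetition: the same keyword appears again shortly afterwards;
        # drop the earlier occurrence and the tokens up to the later one.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j][0] == surface:
                drop.update(range(i, j))
                break
    return "".join(s for k, (s, _) in enumerate(tokens) if k not in drop)


hesitation = [("明日", "アシタ"), ("の", "ノ"), ("10", "ジュウ"), ("時", "ジ"), ("に", "ニ"),
              ("いい", "イー"), ("eVoice", "イーボイス"), ("へ", "エ"),
              ("伺い", "ウカガイ"), ("ます", "マス"), ("。", "")]
print(remove_disfluencies(hesitation))
```

The sketch matches on readings rather than written forms because a hesitation is typically transcribed as an unrelated word (here 「いい」) whose pronunciation merely coincides with the beginning of the keyword.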

Returning to FIG. 6, in S13 the summary sentence generation unit 73 of the summary generation server 7 generates a summary sentence of the response history from the dialogue speech text output by the redundancy exclusion unit 72. Specifically, the summary sentence generation unit 73 reshapes dialogue speech text written in conversational style into written style. Preferably, it reshapes the text into a written style whose sentences end in a noun or noun phrase (体言止め, taigen-dome).

FIG. 18 shows a non-limiting example of the style conversion table 36 that the summary sentence generation unit 73 refers to. Referring to FIG. 18, the style conversion table 36 defines conversational source expressions in the left column (「ございますね」, 「と申します」, 「おっしゃっていました」, and so on) and the corresponding written-style target expressions in the right column (「ですね」, 「です」, 「言っていた」, and so on). The summary sentence generation unit 73 searches the dialogue speech text for the conversational source expressions defined in the style conversion table 36 and converts each one found into the corresponding written-style expression defined in the table. Polite language in the dialogue speech text is thereby converted into a concise, report-like written style.
In the style conversion table 36 of FIG. 18, no written-style target is defined for the source word 「ちょっと」 ("a little"). In such a case, the summary sentence generation unit 73 may simply delete the source word from the dialogue speech text.
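The style conversion amounts to a table lookup with deletion as the fallback, which a few lines of Python can illustrate. The four entries are the ones quoted in the description; the name `STYLE_TABLE`, the use of `None` for a missing target, and the longest-first replacement order are assumptions of this sketch rather than details stated in the source.

```python
# conversational expression -> written-style expression (None = no target defined)
STYLE_TABLE = {
    "ございますね": "ですね",
    "と申します": "です",
    "おっしゃっていました": "言っていた",
    "ちょっと": None,
}


def to_written_style(text):
    """Replace each source expression by its target, or delete it if none is defined."""
    # Longest-first order is an assumption, so that a short entry
    # cannot break up a longer one.
    for source in sorted(STYLE_TABLE, key=len, reverse=True):
        target = STYLE_TABLE[source]
        text = text.replace(source, target if target is not None else "")
    return text


print(to_written_style("山田と申します"))
print(to_written_style("ちょっと確認したいと申します"))
```

Plain substring replacement is the simplest reading of the table; a production system would more likely match on morpheme boundaries to avoid rewriting text that merely contains a table entry by coincidence.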

Returning to FIG. 6, in S13 the summary sentence generation unit 73 further searches the dialogue speech text for predefined important words and includes the important words found in the summary sentence to be output.
FIGS. 19, 20, and 21 each show a non-limiting example of the important word table 34 that the summary sentence generation unit 73 refers to. Referring to FIG. 19, the important word table 34 defines the words 「連絡」 ("contact") and 「確認」 ("confirmation") as important words. Important words may be defined in the important word table 34 together with variable weights (points). In FIG. 19, a weight of "1" is defined for both 「連絡」 and 「確認」. A further important word table 34 that the user can edit, for example by adding or deleting entries, may also be provided so that proper nouns and the like can be defined as needed.
The summary sentence generation unit 73 searches the dialogue speech text for the important words defined in the important word table 34, weights each important word found according to its corresponding weight, and includes it in the summary sentence to be generated.

FIG. 20 shows a non-limiting example of an important word table 34 that defines important words that are affirmative expressions (「はい」, 「わかった」, 「いいよ」, 「了解」, and so on), and FIG. 21 shows a non-limiting example of an important word table 34 that defines important words that are negative expressions (「いいえ」, 「やだよ」, 「断る」, 「承認しない」, and so on). The summary sentence generation unit 73 also refers to these important word tables 34 to search the dialogue speech text for important words, weights each important word found according to its corresponding weight, and includes it in the summary sentence to be generated. The affirmative and negative important words in FIGS. 20 and 21 may be converted into written-style forms (「承諾」 "consent", 「拒否」 "refusal", and so on) as appropriate.
Preferably, the summary sentence generation unit 73 generates one summary sentence per call unit, whether the redundancy exclusion unit 72 supplies multiple sentences or a single sentence.

Returning to FIG. 6, in S14, when the summary sentence generated by the summary sentence generation unit 73 exceeds a predetermined length, for example a threshold number of characters, the summary sentence shortening unit 75 of the summary generation server 7 shortens the summary sentence so that its length falls within the threshold.
Preferably, the summary sentence shortening unit 75 sets as the threshold the number of characters that allows the full summary sentence to be read at a glance, without scrolling, in the output field provided for displaying the summary sentence of one call unit on the query result display screen that lists dialogue summary sentences. No additional operation is then needed to check a summary sentence, and the whole summary can be taken in quickly.

More specifically, the summary sentence shortening unit 75 may refer to the various key word tables 34 and shorten the summary sentence based on the weights (importance points) assigned to the key words that appear in it.
As one example, the summary sentence shortening unit 75 may split the dialogue speech text supplied from the redundancy exclusion unit 72 at each full stop (「。」), add up, for each resulting sentence, the importance points of the key words appearing in that sentence, and preferentially select the sentences for which a high importance was calculated.
The summary sentence shortening unit 75 outputs the shortened summary sentence to the file of the summary sentence text 38.
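
The selection step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `shorten_summary`, the sample importance points, and the greedy strategy of taking the highest-scoring sentences that still fit within the character threshold are all assumptions made for the example.

```python
def shorten_summary(text, keyword_weights, max_chars):
    """Keep the highest-scoring sentences whose total length fits in max_chars."""
    # Split at the Japanese full stop and restore it on each sentence.
    sentences = [s + "。" for s in text.split("。") if s.strip()]
    # Importance of a sentence = sum of the points of the key words it contains.
    scored = [
        (sum(w for kw, w in keyword_weights.items() if kw in s), i, s)
        for i, s in enumerate(sentences)
    ]
    kept, used = [], 0
    # Visit sentences from highest to lowest score; ties keep original order.
    for _score, i, s in sorted(scored, key=lambda t: (-t[0], t[1])):
        if used + len(s) <= max_chars:
            kept.append((i, s))
            used += len(s)
    # Emit the selected sentences in their original (chronological) order.
    return "".join(s for _i, s in sorted(kept))


# Hypothetical importance points, not taken from the key word table 34.
weights = {"返品": 5, "発送": 3}
call_text = "こんにちは。返品を希望します。天気がいいですね。発送は明日です。"
print(shorten_summary(call_text, weights, max_chars=18))
```

With these sample values, the two sentences that contain weighted key words are kept and the greeting and small talk are dropped, because adding either of them would exceed the 18-character threshold.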

In S15 of FIG. 6, in this embodiment, the summary sentence generation unit 73 adds to the summary sentence to be output the pair of texts, assigned the type "response", that was generated by the backchannel (aizuchi) analysis unit 54 of the speech recognition server 5.
Through the backchannel analysis process of FIG. 13 executed by the backchannel (aizuchi) analysis unit 54 of the speech recognition server 5, the dialogue speech text uttered by one speaker (for example, the customer) and judged to be a response is paired with the dialogue speech text uttered immediately before it by the other speaker (for example, the operator), which prompted the response and identifies what the response refers to. The pair is assigned the type "response" and is included in the dialogue speech text as a one-question, one-answer exchange.

The summary sentence generation unit 73 treats a pair of dialogue speech texts assigned the type "response" as a key word, refers to the various conversion tables 36 to convert it into the writing style used for summary sentences, and adds it to the summary sentence to be output. For example, suppose the dialogue speech text assigned the type "response" is the pair 「発送は二三日後でよろしかったでしょうか」 ("Is it all right if we ship in two or three days?", the operator's question) and 「はい」 ("Yes", the customer's response). In this case, the summary sentence generation unit 73 converts this pair into 「二三日後の発送を了承」 ("agreed to shipment in two or three days") or the like, and includes the converted text in the summary sentence to be output as a key word (key sentence) of the response history.

As another example, suppose the dialogue speech text assigned the type "response" is the pair 「ご注文の品は対話要約ｅＶ−Ｏｕｔｌｉｎｅでよろしいでしょうか」 ("Is the item you are ordering the dialogue summary eV-Outline?", the operator's question) and 「はい、お願いします」 ("Yes, please", the customer's response). In this case, the summary sentence generation unit 73 converts this pair into 「注文の品は対話要約ｅＶ−Ｏｕｔｌｉｎｅを確認」 ("confirmed that the ordered item is the dialogue summary eV-Outline") or the like, and includes the converted text in the summary sentence to be output as a key word (key sentence) of the response history.
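
One way to picture the conversion of a question-and-response pair is a small rule table, sketched below. The regular expressions, the templates, and the list of affirmative words are hypothetical stand-ins for the conversion tables 36 and key word table 34, whose actual contents the text does not give; only the two input/output examples come from the description.

```python
import re

# Assumed subset of affirmative key words (cf. the key word table of FIG. 20).
AFFIRMATIVE = ("はい", "お願いします")

# Hypothetical conversion rules: (pattern for the operator's question,
# template for the summary-style phrase).
RULES = [
    (re.compile(r"発送は(.+?)でよろし(?:かった|い)でしょうか"), r"\1の発送を了承"),
    (re.compile(r"ご注文の品は(.+?)でよろし(?:かった|い)でしょうか"), r"注文の品は\1を確認"),
]


def summarize_response_pair(question, answer):
    """Return a summary phrase for an affirmed question, or None if no rule applies."""
    # Only affirmative responses are handled in this sketch.
    if not any(answer.startswith(word) for word in AFFIRMATIVE):
        return None
    for pattern, template in RULES:
        match = pattern.search(question)
        if match:
            # Fill the captured topic into the summary template.
            return match.expand(template)
    return None


print(summarize_response_pair("発送は二三日後でよろしかったでしょうか", "はい"))
```

Negative responses would need their own templates (for example, phrases ending in "reject"); they are left out here because the text gives no worked example for them.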

In S16, the emotion analysis unit 74 of the summary generation server 7 executes emotion analysis processing for the speakers of the dialogue based on the dialogue speech text. The emotion analysis unit 74 may also provide an interface from the summary sentence generation unit 73 to the emotion analysis server 6, cause the emotion analysis server 6 to execute the emotion analysis processing, and supply the result to the summary sentence generation unit 73. Alternatively, without separately providing the emotion analysis server 6, the emotion analysis unit 74 may itself execute the emotion analysis processing for the speakers of the dialogue to be summarized. The following describes the former case, in which the emotion analysis server 6 is used.

The emotion analysis processing includes non-verbal emotion analysis processing that uses the dialogue speech data and verbal emotion analysis processing that uses the dialogue speech text, i.e., the speech recognition result.
In the former, speech-data-based processing, the emotion analysis server 6, called from the emotion analysis unit 74, takes the dialogue speech data supplied from the call recording server 3 as input and outputs, as the emotion analysis result for each speaker, quantitative indices that express the speaker's emotions numerically, such as joy/anger, satisfaction, stress, and trust.

This emotion analysis processing provided by the emotion analysis server 6 is based on the finding that a speaker's brain-wave activity and vocal-cord movement are linked, so that a person cannot control emotion in the process of speaking and the emotion appears in the voice. The emotion analysis server 6 can therefore quantify a speaker's emotions from the dialogue speech data regardless of the language spoken.
In the latter, text-based processing, the emotion analysis unit 74 of the summary generation server 7 takes the dialogue speech text supplied from the speech recognition server 5 as input, extracts emotion words from it, and, referring to the emotion word table 37, converts them into emotional expressions to be included in the summary sentence.

FIG. 22 shows a non-limiting example of the output obtained when the emotion analysis server 6 executes emotion analysis processing on the dialogue speech data of one speaker (the customer) in one call unit. In FIG. 22, the transition of the customer's (CS) emotions during one call is output as a time series. FIG. 22 shows a case of complaint handling in which the operator won the customer over during the call. The "joy/anger" and "satisfaction" indices both rise from the middle to the latter part of the call, while the "stress" index falls over the same span. One can thus read an emotional transition in which, from the middle to the latter part of the call unit, the customer's anger and stress subside and dissatisfaction turns into satisfaction.

The customer emotion analysis result illustrated in FIG. 22 also yields an index for evaluating the quality of the response given by the other speaker, the operator.
For example, if the "joy/anger" index is negative from the start of the call, indicating strong anger, but by the end of the call it has turned to zero or positive, indicating a tendency toward joy, and the "satisfaction" index has likewise turned to zero or positive, indicating a tendency toward satisfaction, the operator's response history may be rated "excellent response".
However, if at the end of the call the customer's "trust" index, for example, is negative and indicates a tendency toward distrust, the reliability of what the customer said can be judged to be low, so a note of "customer caution" may be added to indicate that the customer's statements call for caution.

Conversely, if partway through the call the "joy/anger" index suddenly turns sharply negative, the "satisfaction" index also turns sharply negative, and the tendency toward anger and dissatisfaction persists thereafter, it can be judged that the operator's statement immediately before the negative turn provoked the customer's anger or dissatisfaction. The call may therefore be rated "response caution", indicating that the operator's response needs to be reviewed.
In this case too, if at the end of the call the customer's "trust" index, for example, is negative and indicates a tendency toward distrust, the reliability of what the customer said can be judged to be low, so a note of "customer caution" may be added to indicate that the customer's statements call for caution.
If none of the above tendencies is observed, the call may be rated "normal response", indicating that the response was adequate.
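
The rating rules above can be sketched as a small classifier over a time series of emotion indices. The data layout, the assumption that indices are signed values around zero, and the `drop` threshold used to decide what counts as a "sharp" negative turn are all illustrative choices; the text does not specify the scale of the indices or any thresholds.

```python
from dataclasses import dataclass


@dataclass
class EmotionSample:
    joy_anger: float     # negative = anger, positive = joy (assumed signed scale)
    satisfaction: float  # negative = dissatisfaction
    trust: float         # negative = distrust


def sudden_sustained_drop(samples, drop):
    """True if both indices fall by at least |drop| in one step and stay negative."""
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        fell = (cur.joy_anger - prev.joy_anger <= drop
                and cur.satisfaction - prev.satisfaction <= drop)
        stays_negative = all(s.joy_anger < 0 and s.satisfaction < 0
                             for s in samples[i:])
        if fell and stays_negative:
            return True
    return False


def evaluate_call(samples, drop=-0.5):
    """Return the rating for the operator plus an optional customer note."""
    first, last = samples[0], samples[-1]
    if first.joy_anger < 0 and last.joy_anger >= 0 and last.satisfaction >= 0:
        labels = ["excellent response"]
    elif sudden_sustained_drop(samples, drop):
        labels = ["response caution"]
    else:
        labels = ["normal response"]
    # The customer note is independent of the operator rating.
    if last.trust < 0:
        labels.append("customer caution")
    return labels


calmed = [EmotionSample(-0.8, -0.5, 0.2), EmotionSample(-0.2, 0.0, 0.2),
          EmotionSample(0.3, 0.4, 0.3)]
print(evaluate_call(calmed))
```

The sample call starts with an angry customer and ends with non-negative joy/anger and satisfaction, so it is rated "excellent response" with no customer note.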

FIG. 23 shows a non-limiting example of the output obtained when the emotion analysis server 6 executes emotion analysis processing on the dialogue speech data of the other speaker (the operator) in one call unit. FIG. 23 shows a case in which the operator feels stress during a call with a customer. In FIG. 23, the "stress" index rises from the start to the end of the call, from which one can read an emotional transition in which the operator's stress is increasing.
In this case, for example, if the stress in the current call has increased compared with the transition of the stress index values up to the previous call, the operator's evaluation index may be set to "response caution", indicating that the operator's stress state should continue to be monitored.

FIG. 24 shows the transition of emotions across multiple calls (15 calls in FIG. 24) within a given period (one day, one week, etc.). In FIG. 24, the average value of the operator's "stress" index rises gradually as the number of calls increases, from which one can read an emotional transition in which the operator's stress grows with the number of calls.
In this case, the operator's evaluation index may be set to "suspend response", indicating that the operator should be taken off calls and interviewed immediately.
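
A rising stress trend across calls could be detected, for instance, from the least-squares slope of the per-call average stress values. The sketch below assumes this approach; the slope threshold, the function names, and the "no action" label are hypothetical, since the text only states that a gradual rise leads to "suspend response".

```python
def stress_slope(avg_stress_per_call):
    """Least-squares slope of average stress against call number (0, 1, 2, ...)."""
    n = len(avg_stress_per_call)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(avg_stress_per_call) / n
    numerator = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(avg_stress_per_call))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def evaluate_operator(avg_stress_per_call, slope_threshold=0.02):
    """Flag the operator when stress rises steadily over the period."""
    if stress_slope(avg_stress_per_call) > slope_threshold:
        return "suspend response"
    return "no action"  # hypothetical label for the non-flagged case


# Fifteen calls with steadily increasing average stress (made-up values).
rising = [0.1 * call for call in range(15)]
print(evaluate_operator(rising))
```

A slope is used rather than a simple first-versus-last comparison so that a single unusually stressful call does not by itself trigger the flag.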

FIG. 25 shows a non-limiting example of the emotion word table 37 referred to by the emotion analysis unit 74 of the summary generation server 7. In the emotion word table 37 of FIG. 25, the left column defines the source emotion words (「まあいいか」 "Oh well, fine", 「それでいいよ。ありがとう」 "That's fine. Thank you", 「がっかりしたよ」 "I'm disappointed", 「大丈夫だよな」 "It'll be all right, won't it?", 「なんとかしろよ」 "Do something about it", 「いい加減にしろよ」 "Enough already", etc.), and the right column defines the target emotional expressions (「渋々承諾」 "reluctant consent", 「快諾」 "ready consent", 「落胆」 "disappointment", 「不安」 "anxiety", 「不快」 "displeasure", etc.). The emotion analysis unit 74 of the summary generation server 7 searches the dialogue speech text for the source emotion words defined in the emotion word table 37 and converts each one found into the corresponding emotional expression defined in the table. In this way, emotion words in the dialogue speech text are converted into concise emotional expressions.

FIG. 27 shows a non-limiting example in which the emotion analysis unit 74, referring to the emotion word table 37 of FIG. 25, generates a summary sentence that incorporates an emotional expression from the dialogue speech text, i.e., the speech recognition result. Referring to FIG. 27, the emotion analysis unit 74 converts the dialogue speech text in the upper part of FIG. 27, 「機器を交換したけど、また壊れて、がっかりだよ」 ("I had the device replaced, but it broke again. I'm so disappointed."), into the summary sentence in the lower part of FIG. 27, 「機器交換したが故障し落胆」 ("device replaced but failed; disappointment"). The summary sentence to be output can thus include an emotional expression identified from the dialogue speech text. The converted word "disappointment" (落胆) expresses the emotion of the speaker (the customer) and is included in the output summary sentence.
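
The table lookup itself can be sketched as a dictionary replacement. The source words and target expressions below are those listed for FIG. 25, but the pairing between the two columns is an assumption (the text lists six source words and five expressions without stating which maps to which), and the function name is hypothetical. The sketch covers only the emotion word substitution, not the condensing of the rest of the sentence.

```python
# Assumed pairings between the columns of the emotion word table 37.
EMOTION_WORD_TABLE = {
    "まあいいか": "渋々承諾",                # "Oh well, fine" -> reluctant consent
    "それでいいよ。ありがとう": "快諾",      # "That's fine. Thank you" -> ready consent
    "がっかりしたよ": "落胆",                # "I'm disappointed" -> disappointment
    "大丈夫だよな": "不安",                  # "It'll be all right, won't it?" -> anxiety
    "なんとかしろよ": "不快",                # "Do something about it" -> displeasure
    "いい加減にしろよ": "不快",              # "Enough already" -> displeasure
}


def convert_emotion_words(text, table=EMOTION_WORD_TABLE):
    """Replace emotion words with their expressions; also return the expressions found."""
    found = []
    # Try longer phrases first so a short entry cannot split a longer one.
    for word in sorted(table, key=len, reverse=True):
        if word in text:
            text = text.replace(word, table[word])
            found.append(table[word])
    return text, found


converted, expressions = convert_emotion_words(
    "機器を交換したけど、また壊れて、がっかりしたよ")
print(converted, expressions)
```

Exact substring matching is the simplest possible strategy; a real system would likely need to tolerate variants of a phrase (the FIG. 27 example says 「がっかりだよ」 rather than the table entry 「がっかりしたよ」).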

FIG. 26, on the other hand, shows a non-limiting example in which an emotional expression obtained by the emotion analysis server 6 through emotion analysis processing of the dialogue speech data (tone of voice) is appended in parentheses to the summary sentence text. Referring to FIG. 26, the emotion analysis server 6 executes emotion analysis processing on the dialogue speech data underlying the dialogue speech text in the upper part of FIG. 26, "There's a bug in my food." If, for example, the "trust" index of that speech data is negative and indicates a tendency toward distrust, the server generates the emotional expression "customer caution", indicating that the customer's statement calls for caution, and supplies it to the summary sentence generation unit 73 via the emotion analysis unit 74 of the summary generation server 7. The summary sentence generation unit 73 of the summary generation server 7 appends "customer caution", supplied from the emotion analysis server 6, in parentheses to the summary sentence in the lower part of FIG. 26, "insect found in food", which was generated from the dialogue speech text in the upper part of FIG. 26.
By reflecting the speakers' emotional expressions in the generated summary sentences in this way, it becomes easy to grasp how a speaker's emotions changed and to automatically extract problem calls that require action.
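
The parenthetical annotation of FIG. 26 can be sketched in a few lines. The dictionary of emotion indices, the key name `"trust"`, and the zero threshold are assumptions for illustration; only the rule "negative trust leads to a 'customer caution' note in parentheses" comes from the description.

```python
def annotate_summary(summary, emotion_indices):
    """Append emotion-derived notes in parentheses to a summary sentence."""
    notes = []
    # A negative trust index is read as a tendency toward distrust.
    if emotion_indices.get("trust", 0.0) < 0:
        notes.append("customer caution")
    if not notes:
        return summary
    return f"{summary} ({', '.join(notes)})"


print(annotate_summary("insect found in food", {"trust": -0.4}))
```

Collecting the notes in a list leaves room for further index-based rules to be added next to the trust check without changing the output format.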

Returning to FIG. 6, in S17 the summary sentence generation unit 73 of the summary generation server 7 uses the emotion analysis results described above to replace emotion words in the summary sentence with more concise, categorized emotional expressions, as shown in FIG. 27, and to append emotional expressions to the summary sentence to be output, as shown in FIG. 26.
In S18, the summary sentence generation unit 73 or the summary sentence shortening unit 75 outputs the finally generated summary sentence to the file of the summary sentence text 38.

With reference to FIGS. 28 to 30, an example is described of the extraction and conversion processing that leads from the dialogue speech text output by the speech recognition server 5, divided into summary units, to the summary sentence that is finally output.
FIG. 28 shows a non-limiting example of the dialogue speech text of one call unit that is output by the speech recognition server 5 and input to the summary generation server 7. In the dialogue speech text of FIG. 28, each line shows the text of one utterance unit for an identified speaker (operator (OP) or customer (CS)), and summary-unit delimiters, shown as squares, are inserted into the text of each line.
FIG. 29 shows a non-limiting example of the intermediate summary sentence text that the summary sentence generation unit 73 of the summary generation server 7 outputs from the dialogue speech text shown in FIG. 28. As shown in FIG. 29, six utterance-unit texts (the 3rd, 6th, 9th, 11th, 14th, and 15th) are extracted from the 20 utterance-unit texts of FIG. 28, and each extracted text is converted into more concise text suitable for a summary sentence. The summary sentence generation unit 73 converts the dialogue speech text of the whole call in FIG. 28 into the intermediate summary sentence text of FIG. 29 by referring to the key word table 34, the unnecessary word table 35, and the various conversion tables 36.

FIG. 30 shows a non-limiting example of the summary sentence text that the summary sentence generation unit 73 or the summary sentence shortening unit 75 finally outputs from the intermediate summary sentence text of FIG. 29. As shown in FIG. 30, five lines of summary sentences are generated from the six utterance-unit texts extracted and converted in FIG. 29, and each summary sentence has been converted to end in a noun such as "request" (希望) or "confirmation" (確認). In particular, the pair consisting of the operator's utterance (question) on line 5 of FIG. 29 and the customer's utterance (response) on line 6 is consolidated in FIG. 30 into the single summary sentence 「作成し郵送するので二三日待つ事を快諾」 ("readily agreed to wait two or three days while it is prepared and mailed"). The summary sentence generation unit 73 generates the finally output summary sentence text of FIG. 30, which serves as the response history, by referring to the key word table 34 and the various conversion tables 36. The ending of the summary sentence on line 5 of FIG. 30 has been converted, by applying the emotion analysis processing described above, into "ready consent" (快諾), which reflects the emotional expression of the speaker (the customer).

FIG. 31 shows a non-limiting example of a user interface that is output to a display device or the like as the result of querying the dialogue speech text of FIG. 28. Referring to FIG. 31, the user interface may include the identified speaker 311, the response content 312 of each utterance unit, a playback button 313, and an icon 314 showing the speaker's emotion analysis result. Selecting the playback button 313 for a desired utterance plays back the audio file of that utterance.
FIG. 32 shows, as the emotion analysis result for the call unit queried in FIG. 31, the results derived from the numerical values of each speaker's emotion indices: "joy/anger" is shown as "normal", satisfaction as "average" to "somewhat high", and stress as "none", "slight", and so on. FIG. 31 and FIG. 32 may be displayed on the display device so that both are visible at the same time.

FIG. 33 is a non-limiting display example that lists, for one call unit (recording time 1.25.716), the speech recognition results of the dialogue speech for each speaker-identified utterance unit, the corresponding natural language processing results obtained by referring to a user dictionary and the like, and the links, start times, and end times of the audio files. As shown at the lower left of FIG. 33, the summary sentence generated for the call unit is also displayed, which makes it easy to cross-reference each processing result with the summary sentence. The user interface of FIG. 33 may display the speech recognition results and natural language processing results in editable form so that the user can correct errors after playing back the audio file.
The generated summary sentence at the lower left of FIG. 33 indicates that the dialogue ultimately resulted in "agreement to view the Saxa Fund prospectus on the Internet". For the word "agreement" (了承) in this summary sentence, the emotion analysis result obtained from the values of multiple emotion indices may be appended in parentheses or the like, as in "agreement (ready consent)" or "agreement (reluctant consent)", or "agreement" may be replaced with an expression that contains the emotion analysis result, such as "ready consent" or "reluctant consent".
According to this embodiment, the dialogue recording data, the speech recognition results of the dialogue speech, the natural language processing results, the emotion analysis results, and the generated summary sentence can thus be integrated and output together.

(Example of the hardware configuration of each device)
FIG. 34 is a diagram showing an example of the hardware configuration of each device in the speech processing system. The voice acquisition server 2, the call recording server 3, the control server 4, the speech recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PCs 9 and 10 each include all or some of the hardware components shown in FIG. 34. Each device 100 shown in FIG. 34 may include a CPU 101, a ROM 102, a RAM 103, an external memory 104, an input unit 105, a display unit 106, a communication I/F 107, and a system bus 108.

The CPU 101 performs overall control of the operation of the device and controls each component (102 to 107) via the system bus 108. The CPU 101 functions as a processing unit that executes processing such as speech recognition, summary generation, or emotion analysis. The ROM 102 is a non-volatile memory that stores the control programs and other data the CPU 101 needs to execute processing. These programs may instead be stored in the external memory 104 or on a removable storage medium (not shown). The RAM 103 functions as the main memory, work area, and the like of the CPU 101. When executing processing, the CPU 101 loads the necessary programs from the ROM 102 into the RAM 103 and executes them, thereby realizing the various functional operations.

The external memory 104 stores, for example, the various data and information needed when the CPU 101 performs processing using a program. The external memory 104 also stores, for example, the various data and information obtained as a result of the CPU 101 performing processing using a program. The input unit 105 consists of various input devices such as a keyboard and a tablet. The display unit 106 consists of, for example, a liquid crystal display. The communication I/F 107 is an interface for communicating with external devices and includes, for example, a wireless LAN (Wi-Fi) interface or a Bluetooth (registered trademark) interface. The system bus 108 communicably connects the CPU 101, the ROM 102, the RAM 103, the external memory 104, the input unit 105, the display unit 106, and the communication I/F 107.

As described above, this embodiment makes it possible to generate, from dialogue speech, a highly accurate summary sentence that is sufficiently shortened and that sufficiently reflects the emotions in the speakers' utterances during the dialogue. It thus helps make summaries of dialogue speech more useful.
The embodiments described above may be implemented in combination with one another.
The present invention can also be realized by a program that implements part of the above embodiments or one or more of their functions. That is, it can be realized by supplying the program to a system or device via a network or a storage medium and having one or more processors in a computer (or a CPU, MPU, or the like) of that system or device read and execute the program. The program may also be provided recorded on a computer-readable recording medium.
Furthermore, the invention is not limited to configurations in which the functions of the embodiments are realized by a computer executing a program it has read. For example, an operating system (OS) or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above may be realized by that processing.

Although embodiments of the present invention have been described in detail above, these embodiments merely show specific examples of carrying out the invention. The technical scope of the present invention is not limited to the embodiments described. Various modifications can be made without departing from the spirit of the invention, and such modifications are also included in its technical scope.

Reference Signs List
1 PBX
2 voice acquisition server
3 call recording server
4 control server
5 speech recognition server
6 emotion analysis server
7 summary generation server
8 private line
9, 10 PC
31 dialogue speech
32 speech recognition dictionary
33 summary unit text
34 key word table
35 unnecessary word table
36 conversion table
37 emotion word table
51 speech recognition pre-processing unit
52 speech recognition unit
53 speech recognition post-processing unit
54 backchannel (aizuchi) analysis unit
71 text correction unit
72 redundancy exclusion unit
73 summary sentence generation unit
74 emotion analysis unit
75 summary sentence shortening unit

Claims

A dialogue summary generation apparatus comprising:
a speaker identification unit that identifies the speakers of a dialogue from dialogue speech data;
a speech separation unit that separates the dialogue speech data into utterance units for each speaker identified by the speaker identification unit;
a speech recognition unit that generates dialogue speech text by performing speech recognition on the dialogue speech data in the utterance units separated by the speech separation unit;
a summary generation unit that summarizes the dialogue speech text generated by the speech recognition unit to generate summary text;
a first emotion analysis unit that analyzes the dialogue speech data in the utterance units to derive an emotional expression for each speaker, and either adds the derived emotional expression to the summary text or outputs it in association with the summary text; and
a second emotion analysis unit that extracts, from the dialogue speech text generated by the speech recognition unit, an emotion word indicating an emotion of each speaker, converts the extracted emotion word into a corresponding emotional expression, and replaces at least a part of the summary text with the converted emotional expression.
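
The replacement performed by the second emotion analysis unit can be illustrated with a minimal Python sketch. The function name, the table contents, and the "(speaker: expression)" output format are assumptions made for illustration only; the claim does not specify them.

```python
def second_emotion_analysis(transcript, summary_text, emotion_word_table):
    """Replace emotion words found in the transcript with emotional expressions.

    transcript: list of (speaker, recognized text) pairs (dialogue speech text).
    emotion_word_table: hypothetical mapping of emotion word -> emotional expression.
    """
    for speaker, text in transcript:
        for word, expression in emotion_word_table.items():
            # Only words actually uttered by this speaker are converted.
            if word in text:
                # Replace the matching part of the summary with the expression.
                summary_text = summary_text.replace(
                    word, f"({speaker}: {expression})"
                )
    return summary_text


# Hypothetical data for illustration.
table = {"this is unacceptable": "angry", "I am terribly sorry": "apologetic"}
transcript = [
    ("customer", "this is unacceptable, I want a refund"),
    ("agent", "I am terribly sorry"),
]
summary = (
    "Customer said this is unacceptable and wants a refund. "
    "Agent said I am terribly sorry."
)
shortened = second_emotion_analysis(transcript, summary, table)
```

In this sketch the emotion word is looked up in the speaker's own utterance and only then replaced in the summary, so the expression stays attributed to the speaker who uttered it.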

The dialogue summary generation apparatus according to claim 1, wherein the first emotion analysis unit further derives, for each speaker, the time-series transition of emotion within one dialogue by analyzing the dialogue speech data in the utterance units, and outputs the emotion transition derived for each speaker in association with the summary text.
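
As a rough illustration of a per-speaker emotion transition, the following Python sketch assumes that each utterance unit has already been given an emotion label by some acoustic analysis (not shown). The record layout (start time, speaker, emotion label) and the collapsing of consecutive identical labels are assumptions, not details taken from the claim.

```python
from collections import defaultdict


def emotion_transitions(utterances):
    """Return, per speaker, the sequence of emotions in time order.

    utterances: list of (start_seconds, speaker, emotion_label) tuples.
    Consecutive identical labels are collapsed so only changes remain.
    """
    transitions = defaultdict(list)
    for _, speaker, emotion in sorted(utterances, key=lambda u: u[0]):
        sequence = transitions[speaker]
        if not sequence or sequence[-1] != emotion:
            sequence.append(emotion)
    return dict(transitions)


# Hypothetical labels for one dialogue.
labels = [
    (12.0, "customer", "calm"),
    (0.0, "customer", "angry"),
    (5.0, "agent", "neutral"),
    (8.0, "customer", "angry"),
]
result = emotion_transitions(labels)
```

The resulting dictionary could then be output alongside the summary text, for example as "customer: angry, then calm".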

The dialogue summary generation apparatus according to claim 1 or 2, further comprising a redundancy exclusion unit that determines whether identical or similar text appears multiple times in the dialogue speech text of one dialogue unit and, when identical or similar text appears multiple times, deletes the text that appears earlier in the time series.
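
The deletion of earlier duplicates can be sketched as follows. The similarity measure (Python's standard `difflib.SequenceMatcher`) and the 0.8 threshold are assumptions chosen for illustration; the claim does not say how similarity is judged.

```python
from difflib import SequenceMatcher


def remove_redundant(utterances, threshold=0.8):
    """Drop an utterance when an identical or similar one appears later.

    utterances: list of recognized texts in time order.
    threshold: hypothetical similarity ratio above which texts count as similar.
    """
    kept = []
    for i, text in enumerate(utterances):
        later = utterances[i + 1:]
        repeated = any(
            SequenceMatcher(None, text, other).ratio() >= threshold
            for other in later
        )
        if not repeated:
            kept.append(text)  # keep only the last occurrence
    return kept


cleaned = remove_redundant([
    "the order arrives on monday",
    "please confirm my address",
    "the order arrives on Monday",
])
```

Keeping the later occurrence reflects the idea that a speaker who repeats or corrects a statement usually intends the final version.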

The dialogue summary generation apparatus according to claim 3, wherein the redundancy exclusion unit further refers to a keyword table in which keywords are defined in advance, extracts from the dialogue speech text a text defined in the keyword table, searches for a second text that is positioned immediately before the extracted text and whose reading at least partially matches the reading of the extracted text, and deletes the found second text from the dialogue speech text.
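
A minimal Python sketch of this reading-based deletion is given below. It assumes that the dialogue speech text is available as (surface, reading) tokens, that "at least partially matches" means the preceding reading is contained in the keyword's reading, and that very short readings are ignored; the function name and the `min_len` parameter are hypothetical.

```python
def drop_partial_reading(tokens, keyword_table, min_len=2):
    """Delete a token immediately before a keyword when its reading
    is contained in the keyword's reading.

    tokens: list of (surface, reading) pairs in time order.
    keyword_table: set of surfaces defined in advance as keywords.
    min_len: hypothetical minimum reading length to avoid accidental matches.
    """
    cleaned = []
    for surface, reading in tokens:
        if surface in keyword_table and cleaned:
            _, prev_reading = cleaned[-1]
            if len(prev_reading) >= min_len and prev_reading in reading:
                cleaned.pop()  # remove the partially matching second text
        cleaned.append((surface, reading))
    return cleaned


# Romanized example: a false start "kei" before the keyword "keiyaku" (contract).
tokens = [("kei", "kei"), ("keiyaku", "keiyaku"), ("o", "o"), ("kakunin", "kakunin")]
result = drop_partial_reading(tokens, {"keiyaku", "kakunin"})
```

Here the false start "kei" is removed because its reading is part of the reading of the keyword that follows it, while the particle "o" is left in place.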

The dialogue summary generation apparatus according to any one of claims 1 to 4, further comprising a text correction unit that analyzes the dialogue speech text generated by the speech recognition unit, extracts numeric utterances, assigns a unit and a weight that differ according to the type of each extracted numeric utterance, and supplies the result to the summary generation unit.
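
The assignment of units and weights by number type might look like the following Python sketch. The three types, the digit-count rules used to tell them apart, the unit strings, and the weight values are all hypothetical; the claim only states that units and weights differ by type.

```python
import re

# Hypothetical rules: (type, pattern, unit, weight), checked in order.
NUMBER_RULES = [
    ("phone", re.compile(r"\d{10,11}"), "tel", 1.0),
    ("date", re.compile(r"\d{8}"), "date", 0.8),
    ("amount", re.compile(r"\d+"), "yen", 0.6),
]


def annotate_numbers(text):
    """Extract digit runs and attach a type, unit, and weight to each."""
    annotated = []
    for number in re.findall(r"\d+", text):
        for kind, pattern, unit, weight in NUMBER_RULES:
            if pattern.fullmatch(number):
                annotated.append((number, kind, unit, weight))
                break  # first matching rule wins
    return annotated


annotated = annotate_numbers("call 0312345678 about 5000 due 20181031")
```

A summarizer could then use the weight to decide which numbers must survive shortening and the unit to render them unambiguously.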

The dialogue summary generation apparatus according to any one of claims 1 to 5, further comprising a voice acquisition unit that records telephone call speech or face-to-face dialogue speech to acquire the dialogue speech data.

A dialogue summary generation method executed by a dialogue summary generation apparatus comprising a speaker identification unit, a speech separation unit, a speech recognition unit, a summary generation unit, a first emotion analysis unit, and a second emotion analysis unit, the method comprising:
a step in which the speaker identification unit identifies the speakers of a dialogue from dialogue speech data;
a step in which the speech separation unit separates the dialogue speech data into utterance units for each identified speaker;
a step in which the speech recognition unit performs speech recognition on the dialogue speech data in the separated utterance units to generate dialogue speech text;
a step in which the summary generation unit summarizes the generated dialogue speech text to generate summary text;
a step in which the first emotion analysis unit analyzes the dialogue speech data in the utterance units to derive an emotional expression for each speaker, and either adds the derived emotional expression to the summary text or outputs it in association with the summary text; and
a step in which the second emotion analysis unit extracts, from the generated dialogue speech text, an emotion word indicating an emotion of each speaker, converts the extracted emotion word into a corresponding emotional expression, and replaces at least a part of the summary text with the converted emotional expression.
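
The order of the steps in the method can be sketched as a small Python orchestration function. Every processing stage is passed in as a callable, because the claim does not fix how each stage is implemented; all parameter names and the shape of the returned dictionary are assumptions for illustration.

```python
def generate_dialogue_summary(
    audio,
    identify_speakers,      # audio -> list of speaker ids
    separate,               # (audio, speakers) -> list of (speaker, segment)
    recognize,              # segment -> recognized text
    summarize,              # list of (speaker, text) -> summary text
    acoustic_emotion,       # segment -> emotional expression
    replace_emotion_words,  # (transcript, summary) -> summary
):
    """Run the claimed steps in order and return the summary with emotions."""
    speakers = identify_speakers(audio)
    utterances = separate(audio, speakers)
    transcript = [(spk, recognize(seg)) for spk, seg in utterances]
    summary = summarize(transcript)
    # First emotion analysis: derived from the speech data, output in
    # association with the summary text.
    emotions = [(spk, acoustic_emotion(seg)) for spk, seg in utterances]
    # Second emotion analysis: derived from the text, replaces part of the summary.
    summary = replace_emotion_words(transcript, summary)
    return {"summary": summary, "emotions": emotions}
```

In this sketch the first emotion analysis works on the audio segments while the second works on the recognized text, which mirrors the distinction between the two units drawn in the claim.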

A dialogue summary generation program for causing a computer to execute dialogue summary generation processing, the program causing the computer to execute processing including:
speaker identification processing for identifying the speakers of a dialogue from dialogue speech data;
speech separation processing for separating the dialogue speech data into utterance units for each identified speaker;
speech recognition processing for performing speech recognition on the dialogue speech data in the separated utterance units to generate dialogue speech text;
summarizing processing for summarizing the generated dialogue speech text to generate summary text;
first emotion analysis processing for analyzing the dialogue speech data in the utterance units to derive an emotional expression for each speaker, and either adding the derived emotional expression to the summary text or outputting it in association with the summary text; and
second emotion analysis processing for extracting, from the generated dialogue speech text, an emotion word indicating an emotion of each speaker, converting the extracted emotion word into a corresponding emotional expression, and replacing at least a part of the summary text with the converted emotional expression.