JP2020071676A

JP2020071676A - Speech summary generation apparatus, speech summary generation method, and program

Info

Publication number: JP2020071676A
Application number: JP2018205371A
Authority: JP
Inventors: Kazuhito Yokouchi (横内 一仁); Shigeru Suzuki (鈴木 茂)
Original assignee: Evoice; EVOICE CO Ltd; Saxa Inc
Current assignee: Evoice; EVOICE CO Ltd; Saxa Inc
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2020-05-07
Anticipated expiration: 2038-10-31
Also published as: JP6513869B1

Abstract

PROBLEM TO BE SOLVED: To generate, from dialogue speech, a sufficiently shortened and highly accurate summary that sufficiently reflects the emotions of the speakers during the dialogue. SOLUTION: A dialogue speech summary generation apparatus includes: a speaker identification unit that identifies the speakers of a dialogue from dialogue voice data; a voice separation unit that separates the dialogue voice data into utterance units for each speaker identified by the speaker identification unit; a voice recognition unit that performs speech recognition on the dialogue voice data in the utterance units separated by the voice separation unit to generate dialogue voice text; a summary generation unit that summarizes the dialogue voice text generated by the voice recognition unit to generate summary text; and an emotion analysis unit that analyzes the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputs the derived emotional expression by adding it to the summary text, replacing part of the summary text with it, or associating it with the summary text. SELECTED DRAWING: Figure 3

Description

The present invention relates to a dialogue summary generation apparatus, a dialogue summary generation method, and a program. More specifically, the present invention relates to a technique for creating a summary from recorded dialogue speech and outputting the generated summary, usable in, for example, a Customer Relationship Management (CRM) system that records, stores, and manages dialogues conducted by telephone or face to face between customers and the staff who serve them.

Various techniques have been proposed for recording and managing, on the business side, the dialogue speech exchanged between a customer and a business. In recent years, for purposes such as regulatory compliance by the business, handling of customer complaints, and the evaluation and training of the business's operators, there has been demand to record and store the content of dialogues in every setting, including face-to-face dialogues as well as telephone calls.

As an example, consider a call recording system that converts the content of operators' calls in a call center, the department that answers customers' telephone calls, into data for recording and retrieval. In general, the premises of a call center run by a business are equipped with a private branch exchange (PBX) on which outgoing and incoming calls to and from the public switched telephone network (PSTN) are concentrated, and this exchange distributes voice calls to the fixed-line telephones on the call center premises.

Accordingly, providing a call recording server that branches off this exchange makes it possible to record and store calls in voice data files. On the operator side, a terminal device such as a PC (personal computer) may be provided alongside the extension telephone used for answering calls. This operator terminal device may have, for example, a function for retrieving customer information using the customer name given by the speaker as a key, and a function for displaying the customer's past call history.

For the voice calls between customers and operators recorded and stored in voice data files in this way, there is demand to record and retain an outline of each telephone response as a response history, and to allow this response history to be viewed and output as a report after the call ends. To check and verify the content of the response history quickly, it is desirable to generate a text summary from the recorded voice call.
Among techniques for creating summary text from such voice data, there is a technique that converts the speech in a voice data file into character codes by speech recognition and generates summary text from the resulting speech text data. Generating a text summary makes the content of the response history easier to grasp, allows it to be viewed at a glance, and enables flexible integration with computers, for example by running searches that use words in the text as keywords.

For example, Patent Document 1 discloses a technique in which speech recorded on a recording medium by a video tape recorder (VTR) is recognized and converted into a character code string; the importance of the sentence constituents in the recognized character code string is determined by referring to a pre-registered importance table, which typically assigns importance by part of speech (noun, verb, particle, adjective, and so on) and by phrase type (subject, object, predicate, and so on); and a summary sentence is generated automatically by combining the constituents determined to be of high importance.

Patent Document 2 discloses a technique that extracts important segments from speech, detects topic boundaries using the occurrence distribution of the extracted important segments, semantically classifies the important segments contained in each topic segment, and generates, from the speech of the important segments, a text summary divided by topic.

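Patent Document 1: Japanese Unexamined Patent Application Publication No. H08-212228
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2000-284793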

However, it is difficult to apply the techniques disclosed in the above patent documents directly to, for example, telephone response work in a call center. A voice call between a customer and an operator usually passes through many stages, such as obtaining and confirming customer information, obtaining and confirming the content of the inquiry, obtaining and confirming the answer to the inquiry, and presenting and confirming the customer's level of understanding and any disclaimers. Such a call is therefore unavoidably redundant, and because the same utterances are repeated, the dialogue often runs long. In addition, the call recording data recorded and stored all day for a large number of operators is enormous, which makes quick checking and verification of the response history difficult.

Consequently, even if a known summarization technique is applied to the call text obtained by recognizing a voice call as it is, the generated summary is itself unavoidably redundant and long, which makes it of little practical use.

Meanwhile, a speaker's emotions during a dialogue are not uniform. For example, when the utterance "yes" is recognized in a dialogue, the utterance may rest on different emotions: it may be a "yes" spoken in ready agreement, or a "yes" given by a speaker who was reluctantly forced to agree.
However, conventional techniques cannot reflect in the summary the emotion in the utterances of the speakers in the dialogue.

The present invention has been made in view of the above problems, and its object is to provide a dialogue summary generation apparatus, a dialogue summary generation method, and a program capable of generating, from dialogue speech, a highly accurate summary that is sufficiently shortened and sufficiently reflects the emotion in the utterances of the speakers in the dialogue.

To solve the above problems, one aspect of the present invention provides a dialogue summary generation apparatus comprising: a speaker identification unit that identifies the speakers of a dialogue from dialogue voice data; a voice separation unit that separates the dialogue voice data into utterance units for each speaker identified by the speaker identification unit; a voice recognition unit that performs speech recognition on the dialogue voice data in the utterance units separated by the voice separation unit to generate dialogue voice text; a summary generation unit that summarizes the dialogue voice text generated by the voice recognition unit to generate summary text; and an emotion analysis unit that analyzes the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputs the derived emotional expression by adding it to the summary text, replacing part of the summary text with it, or associating it with the summary text.

The emotion analysis unit may further derive, by analyzing the dialogue voice data in the utterance units, the transition over time of each speaker's emotion within one dialogue, and output the emotion transition derived for each speaker in association with the summary text.

The dialogue summary generation apparatus may further comprise a second emotion analysis unit that extracts, from the dialogue voice text generated by the voice recognition unit, emotion words indicating the emotion of each speaker, converts the extracted emotion words into corresponding emotional expressions, and replaces at least part of the summary text with the converted emotional expressions.
The dialogue summary generation apparatus may further comprise a redundancy elimination unit that determines whether identical or similar text appears more than once in the dialogue voice text of one dialogue and, if so, deletes the text that appears earlier in the time series.
The redundancy elimination unit may further refer to an important word table in which important words are defined in advance, extract from the dialogue voice text the text defined in the important word table, search for second text that is located immediately before the extracted text and whose reading at least partially matches that of the extracted text, and delete the retrieved second text from the dialogue voice text.
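The first behaviour of the redundancy elimination unit, deleting the earlier of two identical or similar texts, can be sketched as follows. This is an illustrative sketch only: the function name `drop_earlier_repeats`, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the threshold of 0.8 are assumptions made for the example and are not taken from the specification.

```python
from difflib import SequenceMatcher
from typing import List


def drop_earlier_repeats(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep only the last occurrence of each identical or similar text.

    `texts` is the dialogue voice text of one dialogue, in time order.
    Two texts count as "similar" when their SequenceMatcher ratio reaches
    `threshold` (an assumed criterion, chosen only for illustration).
    """
    kept: List[str] = []
    for i, text in enumerate(texts):
        # A text is redundant if something identical or similar appears later.
        repeated_later = any(
            SequenceMatcher(None, text, later).ratio() >= threshold
            for later in texts[i + 1:]
        )
        if not repeated_later:
            kept.append(text)
    return kept
```

For a dialogue in which a customer states a number and then repeats it after the operator's reply, only the later statement survives, so the text passed on for summarization carries each piece of information once.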

The dialogue summary generation apparatus may further comprise a text correction unit that analyzes the dialogue voice text generated by the voice recognition unit to extract numerals, assigns different units and weights according to the type of each extracted numeral, and supplies the result to the summary generation unit.
The dialogue summary generation apparatus may further comprise a voice acquisition unit that records call speech or face-to-face dialogue speech to acquire the dialogue voice data.

Another aspect of the present invention provides a dialogue summary generation method comprising the steps of: identifying the speakers of a dialogue from dialogue voice data; separating the dialogue voice data into utterance units for each identified speaker; performing speech recognition on the dialogue voice data in the separated utterance units to generate dialogue voice text; summarizing the generated dialogue voice text to generate summary text; and analyzing the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputting the derived emotional expression by adding it to the summary text, replacing part of the summary text with it, or associating it with the summary text.
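The order of the method steps can be sketched in code. The sketch below is a hypothetical outline, not the implementation described in the specification: it assumes that speaker identification and separation into utterance units have already produced a list of `Utterance` records, and it takes the recognizer, emotion analyzer, and summarizer as caller-supplied functions. All names (`Utterance`, `summarize_dialogue`, and the parameters) are invented for illustration, and only the "associate with the summary text" output option is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Utterance:
    speaker: str       # e.g. "customer" or "operator"
    start: float       # seconds from the start of the dialogue
    end: float
    text: str = ""     # filled in by speech recognition
    emotion: str = ""  # filled in by emotion analysis


def summarize_dialogue(
    utterances: List[Utterance],
    recognize: Callable[[Utterance], str],
    analyze_emotion: Callable[[Utterance], str],
    summarize: Callable[[List[Utterance]], str],
) -> str:
    """Recognize and analyze each utterance unit, then summarize the dialogue."""
    for utt in utterances:
        utt.text = recognize(utt)           # speech recognition per utterance unit
        utt.emotion = analyze_emotion(utt)  # emotion derived per utterance unit
    summary = summarize(utterances)

    # Associate each speaker's emotions, in time order, with the summary text.
    per_speaker: Dict[str, List[str]] = {}
    for utt in utterances:
        per_speaker.setdefault(utt.speaker, []).append(utt.emotion)
    notes = "; ".join(
        f"{speaker}: {' -> '.join(emotions)}"
        for speaker, emotions in per_speaker.items()
    )
    return f"{summary} [{notes}]"
```

Passing the three processing stages in as functions keeps the sketch independent of any particular recognition or emotion analysis engine, none of which is named in this part of the text.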

Yet another aspect of the present invention provides a dialogue summary generation program for causing a computer to execute dialogue summary generation processing, the program causing the computer to execute processing including: speaker identification processing that identifies the speakers of a dialogue from dialogue voice data; voice separation processing that separates the dialogue voice data into utterance units for each identified speaker; voice recognition processing that performs speech recognition on the dialogue voice data in the separated utterance units to generate dialogue voice text; summary generation processing that summarizes the generated dialogue voice text to generate summary text; and emotion analysis processing that analyzes the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputs the derived emotional expression by adding it to the summary text, replacing part of the summary text with it, or associating it with the summary text.

The dialogue summary generation apparatus, dialogue summary generation method, and program according to the present invention can generate, from dialogue speech, a highly accurate summary that is sufficiently shortened and sufficiently reflects the emotion in the utterances of the speakers in the dialogue. They thereby help make summaries of dialogue speech more useful.

FIG. 1 is a diagram showing an example of the network configuration of a voice processing system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing an example of the functional configuration of the voice recognition server included in the voice processing system of FIG. 1.
FIG. 3 is a block diagram showing an example of the functional configuration of the summary generation server included in the voice processing system of FIG. 1.
FIG. 4 is a flowchart showing an example of the voice recognition processing executed by the voice recognition server of FIG. 2.
FIG. 5 is a diagram explaining the speaker identification (S2) and the separation of speech into utterance units (S3) of FIG. 4 as applied to voice data.
FIG. 6 is a flowchart showing an example of the summary generation processing executed by the summary generation server of FIG. 3.
FIG. 7 is a flowchart showing an example of the detailed processing flow of the conversion into natural utterances and separation into summary units (S5) of FIG. 4.
FIG. 8 is a flowchart showing an example of the detailed processing flow of the separation into summary units (S53) of FIG. 7.
FIG. 9 is a diagram showing an example of recognition result text recognized utterance by utterance through execution of S4 of FIG. 4.
FIG. 10 is a diagram showing an example of the syntactic analysis result for the recognition result text of FIG. 9.
FIG. 11 is a diagram showing an example of the morphological analysis result for the recognition result text of FIG. 9.
FIG. 12 is a diagram showing an example of the recognition result text of FIG. 9 after separation into summary units through execution of S5 of FIG. 4.
FIG. 13 is a flowchart showing an example of the detailed processing flow of the backchannel (aizuchi) analysis (S6) of FIG. 4.
FIG. 14 is a diagram explaining the backchannel analysis (S6) of FIG. 4 as applied to the voice data of FIG. 5.
FIG. 15 is a diagram showing an example of a punctuation table applied to recognition result text.
FIG. 16 is a diagram showing an example of a unit weight table applied to recognition result text.
FIG. 17 is a diagram showing an example of an unnecessary word table applied to recognition result text.
FIG. 18 is a diagram showing an example of a character substitution table applied to recognition result text.
FIG. 19 is a diagram showing an example of an important word table applied to recognition result text.
FIG. 20 is a diagram showing an example of an affirmative word table applied to recognition result text.
FIG. 21 is a diagram showing an example of a negative word table applied to recognition result text.
FIG. 22 is a diagram showing an example of the output of emotion analysis results for a customer's dialogue speech.
FIG. 23 is a diagram showing an example of the output of emotion analysis results for an operator's dialogue speech.
FIG. 24 is a diagram showing an example of the output of emotion analysis results for a plurality of dialogue speech recordings of an operator.
FIG. 25 is a diagram showing an example of an emotion word table applied to recognition result text.
FIG. 26 is a diagram showing an example of summary text to which emotion analysis results have been added.
FIG. 27 is a diagram showing an example of applying the emotion word table of FIG. 25 to summary text.
FIG. 28 is a diagram showing an example in which the recognition result text of a spoken dialogue is separated into summary units for each speaker.
FIG. 29 is a diagram showing another example in which the recognition result text of a spoken dialogue is separated into summary units for each speaker.
FIG. 30 is a diagram showing an example of a summary generated from the recognition result text of FIG. 29.
FIG. 31 is a diagram showing an example of a user interface for displaying a summary of dialogue speech.
FIG. 32 is a diagram showing a display example of emotion analysis results that can be displayed together with a summary of dialogue speech.
FIG. 33 is a diagram showing a display example of the speech recognition results, natural language processing results, and corresponding summarization results for dialogue speech.
FIG. 34 is a diagram showing an example of the hardware configuration of each device in the present embodiment.

Embodiments for carrying out the present invention are described in detail below with reference to the accompanying drawings. The embodiments described below are examples of means for realizing the present invention and should be modified or changed as appropriate according to the configuration of the apparatus to which the present invention is applied and to various conditions; the present invention is not necessarily limited to the following embodiments. Not all combinations of the features described in the embodiments are necessarily essential to the solution provided by the present invention. Identical components are denoted by the same reference numerals.

<Network configuration of the voice processing system of this embodiment>
The following describes an example of recording a call made over the telephone network between a customer and a call center operator, but the present embodiment is not limited to this. For example, the present embodiment can likewise generate a summary for dialogue speech recorded by capturing a face-to-face dialogue with a sound collection device such as a microphone, instead of a call.
FIG. 1 is a diagram showing a non-limiting example of the network configuration of the voice processing system according to the present embodiment. Referring to FIG. 1, the voice processing system includes a PBX (private branch exchange) 1, a voice acquisition server 2, a call recording server 3, a control server 4, a voice recognition server 5, an emotion analysis server 6, a summary generation server 7, and a PC (personal computer) 9 usable for dialogue summary inquiries. All or some of the PBX 1, the voice acquisition server 2, the call recording server 3, the control server 4, the voice recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PC 9 may be installed on the call center premises and interconnected by an IP (Internet Protocol) network such as an intranet 8, for example a LAN (Local Area Network) or WAN (Wide Area Network).

Alternatively, all or some of the voice acquisition server 2, the call recording server 3, the control server 4, the voice recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PC 9 may be installed outside the call center as appropriate and reached over a remote IP connection such as the Internet.
In particular, when an administrator or other person who is not a call center operator uses the dialogue summary inquiry PC 9 to query or update the dialogue speech summaries, which form the response history in the summary database, the PC 9 need not be installed near the operators and is preferably installed outside the call center as appropriate, connected over a remote IP connection.

The voice processing system may further include another PC 10, with a connected or built-in microphone, that is connected to the voice processing system via the intranet 8 or the Internet. With this configuration, face-to-face dialogue speech captured by the microphone of the PC 10 can be input to the voice processing system according to the present embodiment, and a summary of the face-to-face dialogue speech can be generated.

The PBX 1 accommodates the extension telephones in the call center and connects them to one another. It also connects each operator's telephone terminal 12 to the PSTN (public switched telephone network) 13 by circuit switching over the private lines 11a, 11b, 11c, and so on, thereby establishing calls between each operator's telephone terminal 12 and a customer's telephone terminal 14 connected to the PSTN 13.

Although the PBX 1 in FIG. 1 is connected to the customer's telephone terminal 14 via a public switched telephone network such as the PSTN 13, it may instead, or in addition, have an IP network connection function and be connected, via a voice packet communication network such as a VoIP (Voice over Internet Protocol) network, to a customer's IP call terminal that has an IP telephone function. In this case, the voice acquisition server 2 described later can acquire the voice call between the customer's IP call terminal and the operator's telephone terminal 12. The customer's telephone terminal 14 may be a fixed-line telephone, a mobile telephone, or a smartphone.
<Functional configuration of each server device>

The voice acquisition server 2 is connected to the PBX 1 by a branch connection, acquires the call speech between each operator's telephone terminal 12 and the customer's telephone terminal 14, and supplies the acquired call speech to each server in association with an identifier of the operator's telephone terminal 12 (for example, an extension number). Alternatively, the voice acquisition server 2 may be connected by a branch connection to the line between the PBX 1 and the line termination device (DSU) of the PSTN 13.

Under the control of the control server 4, the call recording server 3 compresses, as necessary, the call speech supplied from the voice acquisition server 2 after a call arrives, and accumulates and stores the resulting voice data in a database of dialogue voice files (dialogue voice file 31 in FIG. 2) held on a large-scale external storage device such as a NAS (Network Attached Storage).
Preferably, when analog speech is supplied from the voice acquisition server 2, the call recording server 3 converts it into digital speech by sampling the voltage representation of the analog speech waveform at a predetermined bit depth and a predetermined sampling frequency, and accumulates and stores the result in the dialogue voice file 31.
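Sampling at a given bit depth and sampling frequency can be illustrated with a short sketch. The specification says only that both values are "predetermined"; the defaults below (8000 Hz, 16-bit signed samples), the function name `sample_pcm`, and the convention that the input voltage is normalised to the range -1.0 to 1.0 are assumptions made for this example.

```python
import math
from typing import Callable, List


def sample_pcm(
    signal: Callable[[float], float],
    duration_s: float,
    sample_rate_hz: int = 8000,  # assumed value, not from the specification
    bit_depth: int = 16,         # assumed value, not from the specification
) -> List[int]:
    """Sample a normalised voltage waveform into signed PCM integers.

    `signal(t)` returns the voltage at time t (seconds), nominally within
    [-1.0, 1.0]. Values outside that range are clipped.
    """
    max_level = 2 ** (bit_depth - 1) - 1       # 32767 for 16-bit samples
    n_samples = int(duration_s * sample_rate_hz)
    samples: List[int] = []
    for n in range(n_samples):
        v = signal(n / sample_rate_hz)         # read the waveform at the sample instant
        v = max(-1.0, min(1.0, v))             # clip out-of-range voltage
        samples.append(round(v * max_level))   # quantise to the integer grid
    return samples


# Usage: half a second of a 1 kHz tone.
tone = sample_pcm(lambda t: math.sin(2 * math.pi * 1000 * t), 0.5)
```

The sampling frequency fixes how many values are stored per second, and the bit depth fixes how finely each value is quantised, so together they determine the size of the uncompressed recording.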

This digital voice data may be compressed before being accumulated and stored in the dialogue voice file 31. Various known methods can be used at various compression ratios to compress the recorded speech; as non-limiting examples, the recorded speech may be stored with monaural one-fifth compression, with monaural one-tenth compression, or in stereo without compression. Alternatively, the call recording server 3 may accumulate and store the voice data supplied from the voice acquisition server 2 in the dialogue voice file 31 without converting or compressing it.

The call recording server 3 also writes call information, acquired as call control information, to a call information file (not shown) in association with the dialogue voice data for each call accumulated and stored in the dialogue voice file 31. This call information is supplied by the PBX 1.
The call information acquired by the call recording server 3 includes, for example, call control information such as incoming call start information (including an incoming call start timestamp), outgoing call start information (including an outgoing call start timestamp), call start information (including a call start timestamp), and call end information (including a call end timestamp), and call identification information such as the source telephone number, destination telephone number, source channel number, caller number, incoming channel number, and incoming telephone number (for example, the destination extension number).
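One way to picture a call information record is as a small data structure. The sketch below is hypothetical: the class name `CallInfo`, its field names, and the choice of which fields are optional are illustrative assumptions, and only a subset of the items listed above is modelled.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CallInfo:
    # Call control information (timestamps)
    call_start: datetime
    call_end: datetime
    # Call identification information
    caller_number: str
    callee_number: str
    incoming_start: Optional[datetime] = None  # set for inbound calls
    outgoing_start: Optional[datetime] = None  # set for outbound calls
    caller_channel: Optional[int] = None
    callee_channel: Optional[int] = None
    callee_extension: Optional[str] = None     # e.g. the operator's extension

    def duration_seconds(self) -> float:
        """Length of the call, from call start to call end."""
        return (self.call_end - self.call_start).total_seconds()


# Usage with made-up values.
info = CallInfo(
    call_start=datetime(2018, 10, 31, 9, 0, 5),
    call_end=datetime(2018, 10, 31, 9, 3, 5),
    caller_number="0312345678",
    callee_number="0120000000",
    incoming_start=datetime(2018, 10, 31, 9, 0, 0),
    callee_extension="101",
)
```

Keeping the timestamps and identifiers together in one record is what lets each recording in the dialogue voice file be tied back to a specific call.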

The call information further includes speaker identification information that identifies the polarity of each utterance in the recorded call, that is, whether the utterance is inbound (from the customer side) or outbound (from the operator side). This speaker identification information can be acquired by the PBX 1. In the case of SIP (Session Initiation Protocol), for example, it can be determined when the session is set up at call creation. Specifically, in the INVITE request sent from the calling side to the called side at session setup, the SDP (Session Description Protocol) body, which describes the information needed to start the session, specifies the IP address and port number that the calling side uses for reception. In the 200 OK message sent in response from the called side to the calling side, the SDP body specifies the IP address and port number that the called side uses for reception. Voice data is then transmitted and received over RTP (Real-time Transport Protocol) using the IP addresses and port numbers so specified. Therefore, by obtaining the IP address and port number that the calling side and the called side each use for reception, the speaker identification information of each utterance within one call can be obtained, and the customer's utterances and the operator's utterances within one call can be distinguished or separated as necessary.
In the case of ISDN, the speaker identification information can be acquired as the physical pin position of the line terminating device (Digital Service Unit: DSU).
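
The SIP case described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the embodiment's implementation: the SDP bodies are hypothetical, the addresses come from documentation-only ranges, only the first `c=` line and the first `m=audio` line are read, and the function names are invented for the example.

```python
def parse_sdp_receive_endpoint(sdp_text):
    """Return the (ip, port) on which the sender of this SDP body receives audio."""
    ip = None
    port = None
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("c=") and ip is None:
            # c=<nettype> <addrtype> <connection-address>
            ip = line[2:].split()[2].split("/")[0]
        elif line.startswith("m=audio") and port is None:
            # m=<media> <port> <proto> <fmt> ...
            port = int(line[2:].split()[1].split("/")[0])
    if ip is None or port is None:
        raise ValueError("SDP body lacks a c= line or an m=audio line")
    return ip, port


def speaker_of_rtp_packet(destination, caller_rx, callee_rx):
    """A packet addressed to one party's receive endpoint was spoken by the other party."""
    if destination == caller_rx:
        return "callee"
    if destination == callee_rx:
        return "caller"
    return "unknown"


# Hypothetical SDP carried in the INVITE request (offer) and the 200 OK (answer).
OFFER_SDP = """v=0
o=caller 2890844526 2890844526 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
"""
ANSWER_SDP = """v=0
o=callee 2890844527 2890844527 IN IP4 198.51.100.20
s=-
c=IN IP4 198.51.100.20
t=0 0
m=audio 3456 RTP/AVP 0
"""

caller_rx = parse_sdp_receive_endpoint(OFFER_SDP)
callee_rx = parse_sdp_receive_endpoint(ANSWER_SDP)
```

For an inbound call the caller is the customer, so in this sketch a packet addressed to `caller_rx` would be attributed to the operator.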

Preferably, this call information may be displayed in real time on the display device of the control server 4, which implements a CTI (Computer Telephony Integration) protocol, or of the operator's PC 9, in cooperation with a CTI program running on the control server 4 or on the PC 9.

The call recording server 3 also includes a customer information database (not shown) in which customer information, mainly for customers who already have a contact history, is registered in advance. This customer information includes personal information that identifies the customer, such as the customer's name, address, registered telephone number, date of birth, age group, gender, other customer attributes, product purchase history, and contact history, and can be output as appropriate to a terminal device operable by the operator in response to the operator's instruction input.

Instead of being connected to the premises line 8, the call recording server 3 may be connected, for example, between the PSTN 13 and the PBX 1. With this configuration, the call recording server 3 can directly acquire the speaker identification information described above. As a further alternative, the call recording server 3 may be connected to the premises line 8 and directly acquire the call voice supplied to the premises line 8, without a separate voice acquisition server 2 being installed.

The control server 4 controls the processing executed by the voice acquisition server 2, the call recording server 3, the voice recognition server 5, the emotion analysis server 6, and the summary generation server 7, as well as the exchange of data traffic and control information among these servers, based on the data and control information that they supply. Alternatively, the voice recognition server 5 and the summary generation server 7 may directly provide access to the call voice file 31 and the call information file held by the call recording server 3, and the interface to the dialogue summary inquiry PC 9, without going through the control server 4. In this case, the voice processing system need not include a separate control server 4.

Under the control of the control server 4, the voice recognition server 5 reads the dialogue voice data stored in the dialogue voice file 31 one call at a time, from off-hook to on-hook, and separates the dialogue voice of each call into a plurality of utterance units. This separation identifies silent sections and divides the dialogue voice at those silent sections; it is described later with reference to FIG. 5.
In the present embodiment, the voice recognition server 5 analyzes the dialogue voice data of each separated utterance unit to extract feature amounts, refers to various recognition dictionaries such as a voice recognition dictionary (the voice recognition dictionary 32 in FIG. 2), applies a known voice recognition technique to convert the dialogue voice data into a character code string, and outputs the converted character code string to a file as dialogue voice text. In the present embodiment, the dialogue voice text output by the voice recognition server 5 includes text divided into summary units (the summary unit text in FIG. 2). The process of dividing the dialogue voice text into summary units is described later with reference to FIGS. 4, 7, and 8.

The emotion analysis server 6 takes the dialogue voice data supplied from the call recording server 3 as input and outputs, for each speaker, quantitative indices of the speaker's emotion, such as joy or anger, satisfaction, stress level, and reliability, as the speaker's emotion analysis result. This emotion analysis result can be output as the change in each emotion index over a predetermined period, such as within one call or over a whole day. Details of the emotion analysis processing executed by the emotion analysis server 6 are described later with reference to FIGS. 6 and 22 to 24.

The summary generation server 7 reads the dialogue voice text stored in the dialogue voice text file 33, divided into summary units, one call at a time, executes the summary generation process, and outputs the generated dialogue summary as summary sentence text (the summary sentence text 38 in FIG. 3). Details of this summary generation process are described later with reference to FIG. 6.

The summary generation server 7 may generate a summary by reading the dialogue voice text of the utterances of one speaker in a call, for example the operator; it may add to the summary the answer portions (described later) extracted from the utterances of the other speaker, for example the customer; or it may create the summary from the dialogue voice texts of both speakers. In the last case, it is preferable to associate the speaker identification information with the dialogue voice text.

The summary generated for each call can be output, in response to an inquiry input, to an output device such as the display of the dialogue summary inquiry PC 9 or a printer. Preferably, it may be output in association with the call start time, the call end time, the caller identification information of the call (information identifying whether the call was received from the customer or placed by the operator), and the like, decoded from the call information.
Preferably, the summary displayed on the PC 9 or the like can be updated as appropriate by correction input from the operator. By learning from these updates and updating the important word table, the unnecessary word table, the various conversion tables, and other tables referred to during summary generation, more accurate and more concise summaries can be generated.
In the present embodiment, the summary generation server 7 further takes the dialogue voice text supplied from the voice recognition server 5 as input, refers to an emotion word table (the emotion word table 37 in FIG. 3) and the like, extracts the emotional expression portions of the dialogue voice text, and converts them into emotion expression words to be included in the summary.

The network and hardware configuration shown in FIG. 1 is merely a non-limiting example. The servers and databases may be integrated as necessary, or individual components may be installed in external facilities such as those of an ASP (Application Service Provider).

<Functional configuration example of the voice recognition server 5>
FIG. 2 is a diagram showing a non-limiting example of the functional configuration of the voice recognition server 5 according to the present embodiment.
Of the functional modules of the voice recognition server 5 shown in FIG. 2, functions realized by software are implemented by storing a program that provides the function of each module in a memory such as a ROM, reading it into a RAM, and executing it on a CPU. Functions realized by hardware may be implemented, for example, by using a predetermined compiler to automatically generate a dedicated circuit on an FPGA from a program that realizes the function of each functional module. FPGA stands for Field Programmable Gate Array. A gate array circuit may also be formed in the same manner as the FPGA and realized as hardware, or the functions may be realized by an ASIC (Application Specific Integrated Circuit). The configuration of functional blocks shown in FIG. 2 is only an example: a plurality of functional blocks may be combined into one functional block, or any one functional block may be divided into blocks that perform a plurality of functions. The same applies to the functional configurations of the summary generation server 7 shown in FIG. 3 and of the other server devices.
Referring to FIG. 2, the voice recognition server 5 includes a voice recognition pre-processing unit 51, a voice recognition unit 52, a voice recognition post-processing unit 53, and a backchannel (aizuchi) analysis unit 54.

The voice recognition pre-processing unit 51 reads the dialogue voice file of each call from the dialogue voice file 31 stored by the call recording server 3, detects silent sections in the dialogue voice file of that call, and divides the dialogue into utterance units using the detected silent sections as boundaries. The voice recognition pre-processing unit 51 also supplies the utterance units obtained from the dialogue voice file of one call to the voice recognition unit 52 one utterance unit at a time, causing the voice recognition unit 52 to execute voice recognition processing in utterance units.

The voice recognition unit 52 takes the dialogue voice of each utterance unit supplied from the voice recognition pre-processing unit 51 as input, executes voice recognition processing, and supplies the dialogue voice text of each utterance unit to the voice recognition post-processing unit 53. The voice recognition unit 52 can convert the voice data of the dialogue voice into dialogue voice text by referring to the voice recognition dictionary 32, in which, for example, important words and important sentences that must be recognized accurately can be defined. The voice recognition unit 52 may be implemented in a known voice recognition engine, while the voice recognition pre-processing unit 51, the voice recognition post-processing unit 53, and the backchannel analysis unit 54 may be implemented, for example, in the control server 4.

The voice recognition post-processing unit 53 performs syntactic analysis, morphological analysis, and the like on the dialogue voice text of each utterance unit output by the voice recognition unit 52, divides the dialogue voice text into summary units, and outputs the result as the dialogue voice text 33 divided into summary units. The syntactic analysis results and morphological analysis results may be associated with the call voice text divided into summary units. A summary unit is a unit of division, finer than the utterance unit, that serves as the processing unit of the summary generation process so that summaries can be generated easily and accurately from the call voice text of each utterance unit. Its details are described later with reference to FIG. 8.

The voice recognition post-processing unit 53 may also assign a weight to each extracted summary unit by referring to the voice recognition dictionary 32, which defines a weight for each important word. For example, dates, times, addresses, telephone numbers, and the like are often important words that should remain in the summary, and having the voice recognition post-processing unit 53 weight these words can reduce conversion errors.
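
One way to picture this weighting is the sketch below. It is an assumption-laden illustration, not the dictionary format of the embodiment: the pattern list `WEIGHTED_PATTERNS`, its regular expressions, and the weight values are all hypothetical, and English text with ISO-style dates is used only for readability.

```python
import re

# Hypothetical weighted entries standing in for the voice recognition dictionary 32.
WEIGHTED_PATTERNS = [
    ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), 3.0),
    ("time", re.compile(r"\b\d{1,2}:\d{2}\b"), 2.0),
    ("phone", re.compile(r"\b\d{2,4}-\d{2,4}-\d{4}\b"), 3.0),
]


def weight_summary_unit(text):
    """Return (total weight, matched labels) for one summary unit."""
    total = 0.0
    labels = []
    for label, pattern, weight in WEIGHTED_PATTERNS:
        hits = pattern.findall(text)
        if hits:
            total += weight * len(hits)
            labels.append(label)
    return total, labels
```

A summary unit such as "Delivery on 2018-10-31 at 10:30" would receive a higher weight than a unit containing no important words, so a later stage could prefer to keep it.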

The backchannel (aizuchi) analysis unit 54 detects, in the dialogue voice text divided into summary units supplied by the voice recognition post-processing unit 53, text presumed to be a response, such as "yes" or "no", and determines whether the detected text is a backchannel (a mere acknowledgment) or an answer. Based on this determination, the backchannel analysis unit 54 deletes text determined to be a backchannel from the dialogue voice text 33 divided into summary units that is output by the voice recognition post-processing unit 53.
Text determined to be an answer, on the other hand, is kept in the dialogue voice text 33 so that it is included in the summary generated by the summary generation server 7, and is tagged as an "answer" in the dialogue voice text. Details of this backchannel analysis processing are described later with reference to FIGS. 13 and 14.

<Functional configuration example of the summary generation server 7>
FIG. 3 is a diagram showing a non-limiting example of the functional configuration of the summary generation server 7 according to the present embodiment.
Referring to FIG. 3, the summary generation server 7 includes a text correction unit 71, a redundancy elimination unit 72, a summary sentence generation unit 73, an emotion analysis unit 74, and a summary sentence shortening unit 75.

The text correction unit 71 reads the dialogue voice text 33 divided into summary units, corrects the dialogue voice text based on the syntactic analysis results and morphological analysis results so as to facilitate summary generation, and outputs the corrected dialogue voice text to the redundancy elimination unit 72.

The redundancy elimination unit 72 eliminates redundancy from the corrected dialogue voice text supplied from the text correction unit 71. Specifically, the redundancy elimination unit 72 refers to, for example, the unnecessary word table 35 and deletes unnecessary words, duplicate sentences, and the like from the dialogue voice text, thereby shortening the dialogue voice text to be supplied to the summary sentence generation unit 73. The redundancy elimination unit 72 outputs the shortened dialogue voice text, from which redundancy has been eliminated, to the summary sentence generation unit 73.
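
The redundancy elimination described above might look like the following minimal sketch. The contents of `UNNECESSARY_WORDS` are hypothetical stand-ins for the unnecessary word table 35, whitespace tokenization is an assumption that suits English rather than Japanese text, and duplicate detection here is a plain case-insensitive comparison.

```python
# Hypothetical stand-in for the unnecessary word table 35.
UNNECESSARY_WORDS = {"um", "uh", "er", "well"}


def eliminate_redundancy(sentences, unnecessary_words=UNNECESSARY_WORDS):
    """Drop unnecessary words, then drop empty and duplicate sentences, keeping order."""
    seen = set()
    result = []
    for sentence in sentences:
        kept = [
            word
            for word in sentence.split()
            if word.lower().strip(",.") not in unnecessary_words
        ]
        cleaned = " ".join(kept)
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


shortened = eliminate_redundancy(
    ["Um, I will check the order", "I will check the order", "uh", "Thank you"]
)
```

A Japanese implementation would more plausibly operate on morphological analysis output than on whitespace tokens; the sketch only shows the table lookup and deduplication idea.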

The summary sentence generation unit 73 reads the shortened dialogue voice text supplied from the redundancy elimination unit 72 and generates summary sentence text by referring to the important word table 34, the unnecessary word table 35, and the various conversion tables 36. The summary sentence generation unit 73 may generate one summary sentence text per call. The summary output by the summary sentence generation unit 73 may be written, for example, in a concise report style obtained by converting the spoken language of the call voice text, such as a style in which sentences end with a noun (taigen-dome).

In the present embodiment, the summary sentence generation unit 73 acquires from the emotion analysis server 6 quantitative indices of the emotion of each speaker during the dialogue as the speaker's emotion analysis result. It can include the acquired emotion analysis result in the summary sentence text to be generated, or cause it to be displayed on a display device together with, or in association with, the summary sentence text. The emotion analysis result supplied from the emotion analysis server 6 includes, for each speaker, quantitative indices such as joy or anger, satisfaction, stress level, and reliability.

The emotion analysis unit 74 refers to the emotion word table 37 to extract emotional expression portions from the summary sentence text generated by the summary sentence generation unit 73, converts each into a concise emotion expression word to be included in the summary, and replaces the extracted emotional expression portion in the summary sentence text with the converted emotion expression word.
When the summary supplied from the summary sentence generation unit 73 exceeds a predetermined length, for example a threshold number of characters, the summary sentence shortening unit 75 shortens the summary so that its length falls within the threshold, and outputs the shortened summary as the summary sentence text 38.
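
The two steps above can be sketched as follows. The text does not specify how the shortening is performed, so dropping trailing sentences is purely an assumption made for this illustration; the entries of `EMOTION_WORD_TABLE`, the bracketed emotion words, and the function names are likewise hypothetical.

```python
# Hypothetical stand-in for the emotion word table 37:
# emotional expression portion -> concise emotion expression word.
EMOTION_WORD_TABLE = {
    "I am really fed up with this": "[anger]",
    "that is a great help, thank you so much": "[gratitude]",
}


def replace_emotion_expressions(summary, table=EMOTION_WORD_TABLE):
    """Replace each emotional expression portion with its emotion expression word."""
    for phrase, emotion_word in table.items():
        summary = summary.replace(phrase, emotion_word)
    return summary


def shorten_summary(sentences, max_chars):
    """Drop trailing sentences until the joined summary fits within max_chars.

    Assumption: earlier sentences matter more. If a single sentence is still
    too long, it is cut at max_chars as a last resort.
    """
    kept = list(sentences)
    while len(kept) > 1 and len(" ".join(kept)) > max_chars:
        kept.pop()
    return " ".join(kept)[:max_chars]


replaced = replace_emotion_expressions("Customer said I am really fed up with this.")
short = shorten_summary(
    ["Order delayed.", "Customer [anger].", "Refund offered."], max_chars=35
)
```

Other shortening policies, such as removing the lowest-weighted summary units first, would fit the description equally well.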

<Processing procedure of the voice recognition processing in the voice recognition server 5>
FIG. 4 is a flowchart showing a non-limiting example of the procedure of the voice recognition processing executed by the units of the voice recognition server 5.
In S1, the voice recognition pre-processing unit 51 of the voice recognition server 5 reads, from the dialogue voice file 31, the dialogue voice data stored as a file for each call.
In S2, the voice recognition pre-processing unit 51 of the voice recognition server 5 identifies the speakers in the dialogue voice read in S1. Specifically, the voice recognition pre-processing unit 51 can identify the speakers in the dialogue voice, for example the customer and the operator, by referring to the speaker identification information in the call information associated with the dialogue voice file.

More specifically, the voice recognition pre-processing unit 51 refers to a call information database (not shown) and examines the speaker identification information within one call, thereby identifying whether the speaker of each utterance in the call is the customer or the operator.
In the subsequent voice recognition unit 52, the dialogue voice data is recognized separately for each identified speaker. The summary sentence generation unit 73 of the summary generation server 7, which generates a summary from the recognized dialogue voice text, can then associate the recognition result texts of the two speakers with each other by referring to the timestamps of the dialogue recording.
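
Associating the two speakers' recognition results by timestamp can be illustrated with the sketch below. The `RecognizedUtterance` record, its field names, and the sample utterances are hypothetical; the only idea taken from the text is that per-speaker results are put into correspondence using recording timestamps.

```python
from dataclasses import dataclass


@dataclass
class RecognizedUtterance:
    speaker: str    # "customer" or "operator"
    start_s: float  # offset of the utterance from the start of the recording
    text: str


def interleave_by_timestamp(customer_results, operator_results):
    """Merge both speakers' recognition results into one time-ordered dialogue."""
    return sorted(customer_results + operator_results, key=lambda u: u.start_s)


customer = [
    RecognizedUtterance("customer", 2.0, "My order has not arrived"),
    RecognizedUtterance("customer", 9.5, "Yes please"),
]
operator = [
    RecognizedUtterance("operator", 0.0, "Thank you for calling"),
    RecognizedUtterance("operator", 5.0, "Shall I check the delivery status"),
]
dialogue = interleave_by_timestamp(customer, operator)
```

With the utterances in time order, a short reply such as "Yes please" can be read against the operator utterance that immediately precedes it.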

The voice recognition pre-processing unit 51 may supply the dialogue voice data of utterances identified as those of one speaker, for example the operator, to the summary generation server 7 with priority over the dialogue voice data of utterances identified as those of the other speaker, for example the customer. This is based on the finding that the utterances of one speaker, typically the operator, yield the information needed to summarize the contact history more efficiently as a source for summary generation.
Alternatively, the voice recognition pre-processing unit 51 may have only the dialogue voice data of utterances identified as those of one speaker, for example the operator, recognized and converted into dialogue voice text. Restricting the target of voice recognition reduces the hardware resources needed in the voice recognition server 5, which performs computationally heavy voice recognition, improves the real-time performance of the voice recognition processing and the summary generation processing, and also reduces the storage required for resources such as the dialogue voice text file.

In S3, the voice recognition pre-processing unit 51 of the voice recognition server 5 separates the dialogue voice data, read for each call and already separated by speaker, into utterance units, and supplies the dialogue voice of each utterance unit to the voice recognition unit 52.
Specifically, the voice recognition pre-processing unit 51 detects silent sections of a certain length in the dialogue voice data and divides the voice at the detected silent sections, thereby cutting out the voiced sections and separating them as the dialogue voice of individual utterance units.

As shown in FIG. 5, the dialogue voice file for one call consists of two channels, CH1 and CH2. Here the voice on CH1 is assumed to be, for example, the customer's utterances, and the voice on CH2 to be, for example, the operator's utterances.
The voice recognition pre-processing unit 51 detects silent sections of at least a certain length. The silent sections to be detected may be, for example, those lasting 1.5 seconds or longer, and this lower limit may be adjusted, for example, between 1 second and 2 seconds. The lower limit of the silent section length is referred to as the first threshold. It can be set in consideration of, for example, the time required to take a breath. It is also preferable to set the lower limit so that, for example, the geminate consonant "っ" (small tsu) within an utterance such as "言ったよね" ("I told you, didn't I") is not mistakenly detected as a silent section.

Referring to FIG. 5, the voice recognition pre-processing unit 51 detects, in the customer's voice on CH1, the silent sections whose length is equal to or greater than the first threshold (SL11, SL12, ..., SL16), and extracts the voiced sections (SP11, SP12, ..., SP17) delimited by the detected silent sections. Each of the extracted voiced sections (SP11, SP12, ..., SP17) becomes one utterance unit in the voice identified as the customer's and, in the present embodiment, serves as the unit of voice recognition supplied to the voice recognition unit 52. Each voiced section can be regarded as a section spoken without taking a breath.
Similarly, referring to FIG. 5, the voice recognition pre-processing unit 51 detects, in the operator's voice on CH2, the silent sections whose length is equal to or greater than the first threshold (SL21, SL22, ..., SL26), and extracts the voiced sections (SP21, SP22, ..., SP27) delimited by the detected silent sections. Each of the extracted voiced sections (SP21, SP22, ..., SP27) becomes one utterance unit in the voice identified as the operator's.
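
The separation into utterance units for one channel can be sketched as below. This is an illustrative sketch under assumptions: the voice is assumed to have been cut into fixed-length frames whose peak amplitude is already known, the silence level and frame length are hypothetical parameters, and only the 1.5-second first threshold comes from the text.

```python
def mark_silent(frame_peaks, silence_level):
    """Flag each frame as silent when its peak amplitude is below silence_level."""
    return [peak < silence_level for peak in frame_peaks]


def split_into_utterances(frame_is_silent, frame_ms, min_silence_ms=1500):
    """Return (start_ms, end_ms) voiced sections of one channel.

    Only silent runs of at least min_silence_ms (the first threshold) end an
    utterance unit; shorter pauses stay inside the current unit.
    """
    min_frames = -(-min_silence_ms // frame_ms)  # ceiling division
    sections = []
    start = None        # first frame of the current voiced section
    last_voiced = None  # most recent voiced frame
    silent_run = 0
    for i, silent in enumerate(frame_is_silent):
        if silent:
            silent_run += 1
            if start is not None and silent_run >= min_frames:
                sections.append((start * frame_ms, (last_voiced + 1) * frame_ms))
                start = None
        else:
            if start is None:
                start = i
            last_voiced = i
            silent_run = 0
    if start is not None:
        sections.append((start * frame_ms, (last_voiced + 1) * frame_ms))
    return sections


# 500 ms frames: the single silent frame at index 2 is too short to split on,
# while the three silent frames at indices 4 to 6 (1500 ms) end the first unit.
peaks = [0.4, 0.5, 0.0, 0.3, 0.0, 0.0, 0.0, 0.6, 0.2]
units = split_into_utterances(mark_silent(peaks, silence_level=0.05), frame_ms=500)
```

Running the same function on the second channel would give the operator's utterance units independently of the customer's.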

Returning to FIG. 4, in S4 the voice recognition unit 52 of the voice recognition server 5 executes voice recognition processing, for each identified speaker, on the dialogue voice data input in utterance units from the voice recognition pre-processing unit 51, and outputs dialogue voice text, that is, the dialogue voice converted into text.
In the present embodiment, voice recognition processing is thus executed on the dialogue voice data in utterance units. A silent section suggests either that the speaker changed during that section or that the same speaker changed the topic or content. Continuity of content across a silent section can therefore be presumed to be weak, and performing voice recognition on the dialogue voice in utterance units can be expected to improve recognition accuracy.

A known voice recognition engine can be applied to this voice recognition processing.
As one example of the conversion into a character code string in the voice recognition processing executed by the voice recognition unit 52, the voice waveform data can be converted into a character code string by comparing feature amounts, extracted from the voice waveform in the dialogue voice data after any necessary conversion processing, against predefined reference acoustic patterns for each phoneme.
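
The comparison against per-phoneme reference patterns can be pictured with the toy sketch below. It is not how a production voice recognition engine works: the two-dimensional feature vectors and the contents of `REFERENCE_PATTERNS` are invented for illustration, and nearest-neighbor matching by Euclidean distance is a deliberately simple substitute for the statistical models that real engines use.

```python
import math
from itertools import groupby

# Hypothetical reference acoustic patterns: phoneme -> feature vector.
REFERENCE_PATTERNS = {
    "a": (1.0, 0.0),
    "i": (0.0, 1.0),
    "u": (-1.0, 0.0),
}


def nearest_phoneme(feature, references=REFERENCE_PATTERNS):
    """Return the phoneme whose reference pattern is closest to the feature vector."""
    return min(references, key=lambda p: math.dist(feature, references[p]))


def features_to_string(features):
    """Label each frame with its nearest phoneme and collapse consecutive repeats."""
    labels = [nearest_phoneme(f) for f in features]
    return "".join(label for label, _ in groupby(labels))


frames = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (-0.7, 0.1)]
decoded = features_to_string(frames)
```

The first two frames both lie nearest to the pattern for "a", so they collapse into a single symbol before the later frames are appended.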

The voice recognition dictionary 32, which is referred to by the voice recognition unit 52 and the voice recognition post-processing unit 53, defines in advance data for important words (or important sentences) that are expected to be targets of voice recognition and that contain important information to be included in the summary. Accordingly, only the phoneme strings of the dialogue voice data that correspond to the important words defined in the voice recognition dictionary 32 may be extracted and given meaning. Weights may also be assigned to the important words (or important sentences) defined in the voice recognition dictionary 32. Of the dialogue voice data read by the voice recognition unit 52, the portions corresponding to the defined important words may be converted into dialogue voice text and output as the voice recognition result.

FIG. 9 shows a non-limiting example of the dialogue voice text that the voice recognition unit 52 outputs in S4, that is, the voice recognition result for one utterance unit generated from the dialogue voice data. In the example of FIG. 9, the utterance unit extracted between two silent sections is "Dialogue summarization processing consists of deleting unnecessary remarks and expressions as well as converting spoken language into written language note that it can be selected according to the characteristics of the data to be processed". As FIG. 9 shows, the voice recognition result for an utterance unit output in S4 may contain, as a single block, a plurality of sentences that are not separated by punctuation marks or the like.

Returning to FIG. 4, in S5 the voice recognition post-processing unit 53 of the voice recognition server 5 converts the voice recognition result output by the voice recognition unit 52 in S4 into natural utterances and divides it into summary units. The voice recognition post-processing unit 53 can also assign a type and a weight to each summary unit of dialogue voice text divided in S5, based on the results of the syntactic analysis and the morphological analysis.
The details of the conversion processing in S5 will be described later with reference to FIGS. 7 and 8.

In S6, the backchannel (aizuchi) analysis unit 54 of the voice recognition server 5 detects, in the dialogue voice text divided into summary units, text that is presumed to be an answer, such as "yes" or "no", and determines whether the detected text is a backchannel or an answer.
Based on this determination, the backchannel analysis unit 54 deletes text determined to be a backchannel from the dialogue voice text 33 divided into summary units that is output by the voice recognition post-processing unit 53. Text determined to be an answer, on the other hand, is kept in the dialogue voice text 33 so that it is included in the summary sentence generated by the summary generation server 7, and the corresponding text element in the dialogue voice text is tagged (given a type) as an "answer". The details of the backchannel analysis processing in S6 will be described later with reference to FIGS. 13 and 14.
In S7, the backchannel analysis unit 54 outputs the dialogue voice text divided into summary units, to which the text determined to be an answer has been added.

<Detailed Processing Procedure of Voice Recognition Post-Processing in Voice Recognition Post-Processing Unit 53>
FIG. 7 is a flowchart showing an example of the detailed processing procedure of the voice recognition post-processing executed by the voice recognition post-processing unit 53 in S5 of FIG. 4.
Referring to FIG. 7, in S51 the voice recognition post-processing unit 53 of the voice recognition server 5 refers to the voice recognition dictionary 32 and performs syntactic analysis of the dialogue voice text in utterance units, which is the voice recognition result output by the voice recognition unit 52 in S4.
In S52, the voice recognition post-processing unit 53 refers to the voice recognition dictionary 32 and performs morphological analysis of the dialogue voice text in utterance units. The syntactic analysis in S51 and the morphological analysis in S52 may be executed in either order, or concurrently.

FIG. 10 shows a non-limiting example of the syntactic analysis result obtained by executing the syntactic analysis processing of S51 on the utterance-unit dialogue voice text shown in FIG. 9. As shown in FIG. 10, the syntactic analysis result output in S51 structures the relationships between the morphemes in the text.
FIG. 11 shows a non-limiting example of the morphological analysis result obtained by executing the morphological analysis processing of S52 on the utterance-unit dialogue voice text shown in FIG. 9. As shown in FIG. 11, the morphological analysis result may include, for each extracted morpheme, its written form, its reading, and the part-of-speech type obtained (major, middle, and minor classification).

Returning to FIG. 7, in S53 the voice recognition post-processing unit 53 subdivides the dialogue voice text in utterance units into summary units, based on the results of the syntactic analysis and the morphological analysis in S51 and S52.
FIG. 8 is a flowchart showing an example of the detailed processing procedure of the separation into summary units that the voice recognition post-processing unit 53 executes in S53 of FIG. 7.
In S531, the voice recognition post-processing unit 53 determines whether the part-of-speech type of a delimiter unit obtained from the morphological analysis and the syntactic analysis is a noun. If the part-of-speech type of the delimiter unit is a noun (S531: Y), the processing proceeds to S532. If it is anything other than a noun, the processing from S532 onward is skipped, the processing ends, and the flow proceeds to S6.

In S532, the voice recognition post-processing unit 53 determines whether the head of the group of delimiter units obtained from the morphological analysis and the syntactic analysis is something other than a noun. If the head of the group is not a noun (S532: Y), the processing from S533 onward is skipped, the processing ends, and the flow proceeds to S6. If the head of the group is a noun (S532: N), the processing proceeds to S533.

In S533, the voice recognition post-processing unit 53 determines whether the delimiter unit obtained from the morphological analysis and the syntactic analysis has the form "noun + α". If the delimiter unit is noun + α, that is, if it ends with a non-noun element such as a particle (S533: Y), then in S534 the voice recognition post-processing unit 53 joins the delimiter unit to the immediately preceding delimiter unit, ends the processing, and proceeds to S6. If the delimiter unit is not noun + α, that is, if it consists of a noun only (S533: N), then in S535 the voice recognition post-processing unit 53 joins the delimiter unit to the immediately preceding delimiter unit, returns to S532, and repeats the determinations of S532 and S533.
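The joining rule of S531 to S535 can be illustrated with a short Python sketch. This is a simplified reading of the flowchart, not the implementation of the embodiment: the function name `merge_into_summary_units`, the representation of a delimiter unit as a `(text, pos_tags)` pair, and the tag names `"noun"`, `"particle"`, `"verb"` and `"auxiliary"` are all assumptions introduced for illustration.

```python
def merge_into_summary_units(units):
    """Join consecutive noun units until a "noun + alpha" unit closes the run.

    units: list of (text, pos_tags) pairs, where pos_tags is the list of
    part-of-speech tags of the morphemes in that delimiter unit
    (hypothetical representation and tag names).
    """
    merged = []
    pending = ""  # run of noun-only units waiting to be closed
    for text, pos_tags in units:
        starts_with_noun = bool(pos_tags) and pos_tags[0] == "noun"
        noun_only = starts_with_noun and all(t == "noun" for t in pos_tags)
        if not starts_with_noun:
            # S531 / S532: units that do not begin with a noun are not joined
            if pending:
                merged.append(pending)
                pending = ""
            merged.append(text)
        elif noun_only:
            # S535: a bare noun is joined and the check is repeated
            pending += text
        else:
            # S534: "noun + alpha" (e.g. noun + particle) closes the summary unit
            merged.append(pending + text)
            pending = ""
    if pending:
        merged.append(pending)
    return merged


units = [
    ("対話", ["noun"]),
    ("要約", ["noun"]),
    ("処理は", ["noun", "particle"]),
    ("構成されます", ["verb", "auxiliary"]),
    ("処理", ["noun"]),
    ("対象", ["noun"]),
    ("データの", ["noun", "particle"]),
]
print(merge_into_summary_units(units))
# ['対話要約処理は', '構成されます', '処理対象データの']
```

With this input the sketch reproduces the two joined summary units described for FIG. 12.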

FIG. 12 shows an example of the dialogue voice text that the voice recognition post-processing unit 53 outputs in S5 of FIG. 4, taking the utterance-unit dialogue voice text of FIG. 9 as input and passing through the syntactic analysis result of FIG. 10 and the morphological analysis result of FIG. 11.
The square symbols in FIG. 12 each mark a boundary between summary units. As shown in FIG. 12, executing the conversion into natural utterances and the separation into summary units in S5 joins the consecutive units "dialogue" (対話), "summary" (要約), and "processing" with its particle (処理は) into one summary unit, and joins the consecutive units "processing" (処理), "target" (対象), and "data" with its particle (データの) into another summary unit.
The voice recognition post-processing unit 53 of the voice recognition server 5 may further assign a type and a weight to each separated summary unit of dialogue voice text by referring to the voice recognition dictionary 32. In FIG. 12, the summary unit 対話要約処理は ("dialogue summarization processing") and the summary unit 処理対象データの ("of the data to be processed") are each weighted as important summary units that should be included in the summary sentence.

<Detailed Processing Procedure of Backchannel Analysis Processing in Backchannel Analysis Unit 54>
FIG. 13 is a flowchart showing a non-limiting example of the detailed processing procedure of the backchannel (aizuchi) analysis processing executed by the backchannel analysis unit 54 of the voice recognition server 5 in S6 of FIG. 4.
Referring to FIG. 13, in S61 the backchannel analysis unit 54 of the voice recognition server 5 acquires the dialogue voices of both speakers, for example the customer and the operator, from the dialogue voice file 31. Because the dialogue voice file 31 carries time stamps that allow both speakers to be associated with each other for each call, the backchannel analysis unit 54 can acquire the dialogue voices of both speakers that make up one call unit. Alternatively, the dialogue voices of both speakers may be associated by assigning, for each call unit, a common identifier to the dialogue voice of each speaker in that call unit. In S61, the dialogue voice text obtained by voice recognition of the dialogue voices is input together with the acquired dialogue voices of both speakers.

In S62, the backchannel analysis unit 54 compares the dialogue voices of the customer and the operator and determines whether a short utterance can be detected while the other party is speaking.
Referring to FIG. 14(a), the short utterance (SP14) in the dialogue voice of the customer on CH1 is made during the utterance (SP24) of the operator on CH2, who is the other party, and is therefore detected in S62. A short utterance to be detected in S62 may be, for example, one shorter than 2 seconds.
If no short utterance is detected while the other party is speaking (S62: N), the processing of S63 to S68 is skipped, the processing ends, and the flow proceeds to S7. If a short utterance is detected while the other party is speaking (S62: Y), the processing proceeds to S63.
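The detection in S62 can be sketched in Python as an interval comparison between the two channels. This is an illustrative sketch only: the function name `find_short_overlaps`, the representation of utterances as `(start, end)` pairs in seconds, and the choice to require the short utterance to lie entirely inside the other party's utterance are assumptions; only the 2-second example limit comes from the text.

```python
def find_short_overlaps(own_utterances, partner_utterances, max_len=2.0):
    """Return (short_utterance, partner_utterance) pairs for S62.

    Both arguments are lists of (start, end) times in seconds.
    A short utterance is one shorter than max_len that falls inside
    an utterance of the other party.
    """
    hits = []
    for start, end in own_utterances:
        if end - start >= max_len:
            continue  # not a short utterance
        for p_start, p_end in partner_utterances:
            if p_start <= start and end <= p_end:
                hits.append(((start, end), (p_start, p_end)))
                break
    return hits


customer = [(0.0, 5.0), (12.0, 12.8)]   # CH1
operator = [(6.0, 20.0)]                # CH2
print(find_short_overlaps(customer, operator))
# [((12.0, 12.8), (6.0, 20.0))]
```

An empty result corresponds to the S62: N branch, in which S63 to S68 are skipped.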

In S63, the backchannel analysis unit 54 searches for the dialogue voice text of the voice recognition result that has the same time stamp as the short utterance detected in S62, and determines whether the recognition result of that short utterance can be presumed to be an answer, that is, whether it is an answer candidate. For example, if the short utterance is text such as "hai" (はい, yes), "ee" (ええ, yes), "iie" (いいえ, no), or "iya" (いや, no), it can be determined to be an answer candidate. These answer candidates may, for example, be set in the backchannel analysis unit 54 in advance.

If the voice recognition result of the short utterance is not an answer candidate (S63: N), the processing proceeds to S64, where the short utterance is determined to be a backchannel and is deleted from the dialogue voice text to be input to summary generation. In other words, a short utterance determined in S64 to be a backchannel carries no meaning for summary creation and is therefore not used as a source for the summary sentence. If the voice recognition result of the short utterance is an answer candidate (S63: Y), the processing proceeds to S65.

In S65, the backchannel analysis unit 54 further determines whether the other party's voice contains a short silent period during the short utterance that was identified as an answer candidate in S63.
Referring to FIG. 14(a), the utterance of the operator on CH2 that corresponds to the short utterance (SP14) in the voice of the customer on CH1 contains no silent section at least as long as the first threshold used by the voice recognition preprocessing unit 41 in S3 of FIG. 4, and has therefore been detected as a single utterance unit SP24. In S65, a second threshold smaller than the first threshold is used to determine whether a short silent section can be detected in the other party's voice. The second threshold is, for example, 1 second, and may be adjusted between 0.5 seconds and 1.5 seconds.

If, in S65, a short silent section at least as long as the second threshold is detected in the other party's utterance unit (voiced section) during the short utterance that is an answer candidate (S65: Y), the short utterance is determined in S66 to be an answer and the processing proceeds to S67. If no such silent section is detected in the other party's utterance unit during the short utterance (S65: N), the processing proceeds to S64, where the short utterance that had been an answer candidate is determined to be a backchannel and is deleted from the dialogue voice text to be input to summary generation.
In S67, the backchannel analysis unit 54 separates the other party's voice into two utterance units, one before and one after the short utterance determined in S66 to be an answer.

Referring to FIG. 14(b), suppose that the voice recognition result of the customer's short utterance section (SP14) on CH1 has been determined to be an answer candidate. During this utterance (SP14), a silent section (SL24a) that is at least as long as the second threshold and shorter than the first threshold can be detected in the utterance section (SP24) of the operator on CH2. In this case, the backchannel analysis unit 54 splits the operator's utterance section (SP24) at the detected silent section (SL24a), obtaining the utterance section (SP24a) immediately before the silent section and the utterance section (SP24b) immediately after it.

In S68, the backchannel analysis unit 54 determines that the dialogue voice text obtained by voice recognition of the utterance section (SP24a) immediately before the short silent section (SL24a), as separated in S67, is the text to be paired with the voice text determined in S66 to be an answer. It associates the two texts with each other, one as the text of the answer and the other as the text that prompted the answer and identifies what is being answered, assigns the pair the "answer" type, and outputs the pair to the dialogue voice text file 33 in summary units.
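The decision path of S63 to S67 can be summarized in a small Python sketch. It is a sketch under stated assumptions rather than the embodiment's implementation: the names `ANSWER_CANDIDATES`, `classify_short_utterance`, and `split_partner_utterance` are hypothetical, spans are `(start, end)` pairs in seconds, and the silent sections inside the other party's utterance unit are assumed to have been measured already. The candidate words and the 1-second example threshold are taken from the text.

```python
ANSWER_CANDIDATES = {"はい", "ええ", "いいえ", "いや"}  # preset candidates (S63)


def classify_short_utterance(text, short_span, partner_silences, second_threshold=1.0):
    """Classify a short utterance as "backchannel" or "answer" (S63 to S66).

    partner_silences: (start, end) silent sections found inside the other
    party's utterance unit. Returns (label, silence), where silence is the
    section that justified an "answer" decision, or None.
    """
    if text not in ANSWER_CANDIDATES:
        return "backchannel", None          # S63: N -> S64
    start, end = short_span
    for s_start, s_end in partner_silences:
        overlaps = s_start < end and start < s_end
        if overlaps and (s_end - s_start) >= second_threshold:
            return "answer", (s_start, s_end)   # S65: Y -> S66
    return "backchannel", None              # S65: N -> S64


def split_partner_utterance(partner_span, silence):
    """S67: split the other party's utterance around the silent section."""
    p_start, p_end = partner_span
    s_start, s_end = silence
    return (p_start, s_start), (s_end, p_end)


label, silence = classify_short_utterance("はい", (12.0, 12.8), [(11.9, 13.0)])
if label == "answer":
    before, after = split_partner_utterance((6.0, 20.0), silence)
    print(label, before, after)
```

In terms of FIG. 14(b), `before` plays the role of SP24a, whose recognized text would be paired with the answer in S68, and `after` plays the role of SP24b.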

<Processing Procedure of Summary Generation Processing in Summary Generation Server 7>
FIG. 6 is a flowchart showing a non-limiting example of the processing procedure of the summary generation processing executed by each unit of the summary generation server 7.
Referring to FIG. 6, in S10 the text correction unit 71 of the summary generation server 7 reads the dialogue voice text for one call unit from the dialogue voice text 33 divided into summary units.

In S11, the text correction unit 71 corrects the dialogue voice text read in S10. Specifically, the text correction unit 71 inserts punctuation marks into the dialogue voice text of one utterance unit that has been divided into summary units (the processing units of summary generation) as shown in FIG. 12, and then inserts a line break at the position of each period.
FIG. 15 shows a non-limiting example of the punctuation table referred to by the text correction unit 71. The punctuation table of FIG. 15 defines the terms immediately after which a period or a comma should be inserted. In FIG. 15, "1" indicates insertion of a comma (読点) and "0" indicates insertion of a period (句点). Referring to the punctuation table of FIG. 15, the text correction unit 71 searches backward from each summary unit delimiter, by suffix match, for terms defined in the punctuation table such as "masu ga" (ますが), "masu ka" (ますか), "masu" (ます), and "hai" (はい), and inserts a period or a comma immediately after each term found, according to the definition in the punctuation table. The text correction unit 71 may search the terms defined in the punctuation table of FIG. 15 in descending order of character count.
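The suffix-match punctuation step of S11 could be sketched in Python as follows. The table contents are hypothetical: the four terms are the ones named in the text, but which flag each term carries in FIG. 15 is not stated, so the `1` (comma) and `0` (period) values below are assumptions, as are the names `PUNCTUATION_TABLE` and `punctuate`.

```python
# term -> flag; 1 = insert a comma (読点), 0 = insert a period (句点).
# The flag values are assumed for illustration.
PUNCTUATION_TABLE = {"ますが": 1, "ますか": 0, "ます": 0, "はい": 1}


def punctuate(summary_units):
    """Append punctuation to summary units by suffix match, longest term first."""
    terms = sorted(PUNCTUATION_TABLE, key=len, reverse=True)
    out = []
    for unit in summary_units:
        for term in terms:
            if unit.endswith(term):
                if PUNCTUATION_TABLE[term] == 1:
                    unit += "、"
                else:
                    unit += "。\n"  # a line break follows each period
                break
        out.append(unit)
    return "".join(out)


print(punctuate(["確認し", "ますが", "明日", "伺います"]))
# 確認しますが、明日伺います。
```

Checking longer terms first keeps a short term from being applied when a longer table entry also matches the end of the same summary unit.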

The text correction unit 71 further searches for numerals extracted by the morphological analysis and analyzes the meaning of each numeral found. In generating a summary sentence for a response history, numerals are often important words that serve as keywords in the summary. The text correction unit 71 therefore analyzes the meaning of each numeral found, obtains a type corresponding to that meaning, and assigns a unit and a weight corresponding to the type.
The meaning assigned to a numeral may be, for example, "date", "time", "amount of money", "telephone number", or "quantity", but is not limited to these.

FIG. 16 is a numeral type table that the text correction unit 71 refers to in order to assign a type, a unit (notation), and a weight to each analyzed numeral element. Referring to FIG. 16, a date, a time, or an amount of money (yen), for example, is given a higher weight than a quantity (pieces) or a temperature (degrees).
If the text correction unit 71 finds in the dialogue voice text a numeral that is unrelated to the words before and after it, it may judge the numeral to be a misrecognition and delete it from the dialogue voice text. The text correction unit 71 may also convert the numerals found into half-width digits to improve visibility and clarity in the summary sentence.
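A minimal Python sketch of the numeral handling is given below. It assumes that the type of a numeral can be read off the counter character that follows it, which is a simplification of the meaning analysis described above. The table `NUMERAL_TYPES`, its weight values, and the function name `tag_numerals` are hypothetical; only the ordering (date, time, and amount weighted above quantity and temperature) follows the description of FIG. 16.

```python
import re

# Full-width digits -> half-width digits
FULL_TO_HALF = str.maketrans("０１２３４５６７８９", "0123456789")

# counter character -> (type, weight); values assumed for illustration
NUMERAL_TYPES = {
    "日": ("date", 3),
    "時": ("time", 3),
    "円": ("amount", 3),
    "個": ("quantity", 1),
    "度": ("temperature", 1),
}


def tag_numerals(text):
    """Convert digits to half-width and list (numeral, type, weight) found."""
    text = text.translate(FULL_TO_HALF)
    counters = "".join(NUMERAL_TYPES)
    found = []
    for m in re.finditer(rf"(\d+)([{counters}])", text):
        kind, weight = NUMERAL_TYPES[m.group(2)]
        found.append((m.group(0), kind, weight))
    return text, found


print(tag_numerals("明日の１０時に３個届けます"))
# ('明日の10時に3個届けます', [('10時', 'time', 3), ('3個', 'quantity', 1)])
```

Note that the 日 in 明日 is not tagged, because no digit precedes it.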

Returning to FIG. 6, in S12 the redundancy elimination unit 72 of the summary generation server 7 removes redundancy from the recognized dialogue voice text and outputs a more concise, simplified dialogue voice text.
Specifically, the redundancy elimination unit 72 refers to the unnecessary word table 35 and deletes unnecessary words from the dialogue voice text.
FIG. 17 shows a non-limiting example of the unnecessary word table 35 referred to by the redundancy elimination unit 72. Referring to FIG. 17, the unnecessary word table 35 defines as unnecessary words interjections such as "eh" (えー) and stock greetings such as "Thank you for your continued support." (いつもお世話になっております。).
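Deleting entries of an unnecessary word table can be sketched in a few lines of Python. The two entries are the examples named in the text; the names `UNNECESSARY_WORDS` and `remove_unnecessary_words` are hypothetical, and plain substring replacement is an assumption, since a real system would more likely match on morpheme boundaries.

```python
UNNECESSARY_WORDS = ["えー", "いつもお世話になっております。"]


def remove_unnecessary_words(text, words=UNNECESSARY_WORDS):
    """Delete every occurrence of each unnecessary word, longest entry first."""
    for word in sorted(words, key=len, reverse=True):
        text = text.replace(word, "")
    return text


print(remove_unnecessary_words("いつもお世話になっております。えー明日伺います。"))
# 明日伺います。
```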

Furthermore, when a sentence (or a phrase, word, or other meaningful unit) describing the same or similar content appears more than once in the dialogue voice text of one call, the redundancy elimination unit 72 may delete the duplicates from the dialogue voice text as appropriate. Preferably, in that case the redundancy elimination unit 72 deletes the sentences that appear earlier in the time series from the start to the end of the call and keeps the one that appears last. This is because a sentence close to the end of the call is more likely to state the final conclusion of the exchange. In addition, the sentence that appears last can be presumed to be a read-back by the operator, and a read-back sentence can be expected to state more accurately the content that should remain in the summary as the response history.
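The keep-the-last-occurrence rule can be sketched in Python as below. The function name `drop_earlier_duplicates` is hypothetical, and the sketch treats only exactly identical sentences as duplicates; detecting "similar" content, which the text also allows, would need a similarity measure that is not specified here.

```python
def drop_earlier_duplicates(sentences):
    """Keep only the last occurrence of each repeated sentence, in call order."""
    last_index = {sentence: i for i, sentence in enumerate(sentences)}
    return [s for i, s in enumerate(sentences) if last_index[s] == i]


call = ["明日10時に伺う。", "住所を確認する。", "明日10時に伺う。"]
print(drop_earlier_duplicates(call))
# ['住所を確認する。', '明日10時に伺う。']
```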

The redundancy elimination unit 72 may further refer to the important word table 34 and delete hesitations (false starts) and repetitions of keywords registered in the important word table 34.
For example, suppose that the important word table 34 has a keyword registered with the notation "eVoice" and the reading "i-boisu" (イーボイス).
In this case, if the recognition result is 明日の１０時にいいｅＶｏｉｃｅへ伺います。 ("I will visit ii eVoice at 10 o'clock tomorrow.", where いい is a false start of "eVoice"), the redundancy elimination unit 72 searches immediately before the registered keyword for a word whose reading partially matches the keyword's reading from its beginning, and deletes the word found. The hesitation can thereby be deleted from the dialogue voice text.
Similarly, if the recognition result is 明日の１０時にｅＶｏｉｃｅへｅＶｏｉｃｅにお伺いします。 ("I will visit to eVoice, eVoice at 10 o'clock tomorrow."), the redundancy elimination unit 72 deletes the earlier occurrence of the repeated registered keyword. The repetition can thereby be deleted from the dialogue voice text.
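The two deletions can be illustrated with a Python sketch that works on morpheme tokens. Everything beyond the keyword and its reading is an assumption: the name `remove_disfluencies`, the `(surface, reading)` token format, the reading イー attached to the false start いい, and the heuristic that a repetition takes the form keyword, one intervening token (such as a particle), then the same keyword.

```python
KEYWORDS = {"eVoice": "イーボイス"}  # notation -> reading


def remove_disfluencies(tokens, keywords=KEYWORDS):
    """tokens: list of (surface, reading). Returns the cleaned text."""
    out = []
    i = 0
    while i < len(tokens):
        surface, reading = tokens[i]
        next_surface = tokens[i + 1][0] if i + 1 < len(tokens) else None
        # False start: a non-keyword token directly before a keyword whose
        # reading is a proper prefix of the keyword's reading.
        if (next_surface in keywords and surface not in keywords and reading
                and keywords[next_surface].startswith(reading)
                and reading != keywords[next_surface]):
            i += 1
            continue
        # Repetition: keyword, one token, same keyword -> drop the earlier
        # keyword together with the token that follows it.
        if surface in keywords and i + 2 < len(tokens) and tokens[i + 2][0] == surface:
            i += 2
            continue
        out.append(surface)
        i += 1
    return "".join(out)


hesitation = [("明日", "アシタ"), ("の", "ノ"), ("10", "ジュウ"), ("時", "ジ"),
              ("に", "ニ"), ("いい", "イー"), ("eVoice", "イーボイス"),
              ("へ", "ヘ"), ("伺い", "ウカガイ"), ("ます", "マス"), ("。", "")]
print(remove_disfluencies(hesitation))
# 明日の10時にeVoiceへ伺います。
```

A prefix test on readings can also remove a legitimate short word that happens to precede a keyword, so a practical system would probably add further conditions.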

Returning to FIG. 6, in S13 the summary sentence generation unit 73 of the summary generation server 7 generates a summary sentence of the response history from the dialogue voice text output by the redundancy elimination unit 72. Specifically, the summary sentence generation unit 73 reshapes the dialogue voice text, which is written in a conversational style, into a written style. Preferably, the summary sentence generation unit 73 reshapes the conversational dialogue voice text into a written style in which sentences end with a noun (taigen-dome).

FIG. 18 shows a non-limiting example of the style conversion table 36 referred to by the summary sentence generation unit 73. Referring to FIG. 18, the left column of the style conversion table 36 defines the conversational source expressions (ございますね, と申します, おっしゃっていました, and so on), and the right column defines the written-style target expressions (ですね, です, 言っていた, and so on). The summary sentence generation unit 73 searches the dialogue voice text for the conversational source expressions defined in the style conversion table 36 and converts each expression found into the corresponding written-style expression defined in the table. Polite expressions in the dialogue voice text are thereby converted into a concise, report-like written style.
In the style conversion table 36 of FIG. 18, no written-style target expression is defined for the source expression ちょっと ("a little"). In such a case, the summary sentence generation unit 73 simply deletes the source expression from the dialogue voice text.
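A table-driven style conversion of this kind can be sketched in Python as follows. The pairs are the examples named in the text, paired in the order in which they are listed, which is an assumption about the figure; the names `STYLE_TABLE` and `to_written_style` are hypothetical. An empty target string stands for an entry with no defined target, which results in deletion.

```python
# conversational source -> written-style target ("" means delete)
STYLE_TABLE = {
    "ございますね": "ですね",
    "と申します": "です",
    "おっしゃっていました": "言っていた",
    "ちょっと": "",
}


def to_written_style(text, table=STYLE_TABLE):
    """Replace conversational expressions with written-style ones, longest first."""
    for source in sorted(table, key=len, reverse=True):
        text = text.replace(source, table[source])
    return text


print(to_written_style("鈴木と申します。ちょっと確認します。"))
# 鈴木です。確認します。
```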

Returning to FIG. 6, in S13 the summary sentence generation unit 73 further searches the dialogue voice text for predefined important words and includes the important words found in the summary sentence to be output.
FIGS. 19, 20, and 21 each show a non-limiting example of the important word table 34 referred to by the summary sentence generation unit 73. Referring to FIG. 19, the important word table 34 defines the words "contact" (連絡) and "confirm" (確認) as important words. The important word table 34 may define important words together with variable weights (points). In FIG. 19, a weight of "1" is defined for both "contact" and "confirm". A further important word table 34 that the user can edit, for example by adding or deleting entries, may also be provided so that proper nouns and the like can be defined as needed.
The summary sentence generation unit 73 searches the dialogue voice text for the important words defined in the important word table 34, weights each important word found according to its corresponding weight, and includes it in the summary sentence to be generated.

FIG. 20 shows a non-limiting example of the important word table 34 that defines important words that are affirmative expressions (はい "yes", わかった "understood", いいよ "sure", 了解 "roger", and so on), and FIG. 21 shows a non-limiting example of the important word table 34 that defines important words that are negative expressions (いいえ "no", やだよ "no way", 断る "refuse", 承認しない "do not approve", and so on). The summary sentence generation unit 73 also refers to these important word tables 34 to search the dialogue voice text for important words, weights each important word found according to its corresponding weight, and includes it in the summary sentence to be generated. The important words in FIGS. 20 and 21, as affirmative or negative expressions, may be converted into written-style words (承諾 "acceptance", 拒否 "refusal", and so on) as appropriate.
Preferably, the summary sentence generation unit 73 generates one summary sentence per call unit, regardless of whether the redundancy elimination unit 72 supplies a plurality of sentences or a single sentence.

Returning to FIG. 6, in S14, when the summary sentence generated by the summary sentence generation unit 73 exceeds a predetermined length, for example a threshold number of characters, the summary sentence shortening unit 75 of the summary generation server 7 shortens the summary sentence so that its length falls within the threshold.
Preferably, the summary sentence shortening unit 75 sets as the threshold the number of characters that can be read at a glance, without scrolling, in the output field provided for displaying the summary sentence of one call unit on the inquiry result display screen where dialogue summary sentences are listed. This removes the need for any additional operation to check the summary sentence and allows the whole summary sentence to be viewed quickly.

More specifically, the summary sentence shortening unit 75 may refer to the various important word tables 34 and shorten the summary sentence based on the weights (importance points) assigned to the important words that appear in it.
As an example, the summary sentence shortening unit 75 may split the dialogue voice text supplied from the redundancy elimination unit 72 at each full stop ("。"), add up, for each resulting sentence, the importance points of the important words that appear in it, and preferentially select the sentences for which a high importance is calculated.
The summary sentence shortening unit 75 outputs the shortened summary sentence to the summary sentence text 38 file.
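The sentence scoring described above might look like the following sketch. The function name `shorten_summary`, the `weights` dictionary, and the greedy fitting strategy are assumptions for illustration; the source only states that sentences are split at full stops, scored by summed importance points, and selected preferentially by score.

```python
def shorten_summary(text, weights, max_chars):
    """Keep the highest-scoring sentences that fit within max_chars, in original order."""
    # Split at the Japanese full stop and restore it on each sentence.
    sentences = [s + "。" for s in text.split("。") if s]

    scored = []
    for idx, sentence in enumerate(sentences):
        # Importance of a sentence = sum of the weights of the important words it contains.
        score = sum(w for word, w in weights.items() if word in sentence)
        scored.append((score, idx, sentence))

    chosen, used = [], 0
    # Assumed strategy: greedily take sentences by descending score, earlier sentences first on ties.
    for score, idx, sentence in sorted(scored, key=lambda t: (-t[0], t[1])):
        if used + len(sentence) <= max_chars:
            chosen.append((idx, sentence))
            used += len(sentence)

    # Re-emit the selected sentences in their original order.
    return "".join(sentence for _, sentence in sorted(chosen))
```

With a 14-character threshold and weights `{"発送": 5, "了承": 3}`, the input 「天気の話をした。発送は二三日後。了承した。」 is reduced to the two weighted sentences, and the unweighted first sentence is dropped.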

In S15 of FIG. 6, in the present embodiment, the summary sentence generation unit 73 adds to the summary sentence to be output the pairs of texts, tagged with the type "response", that were generated by the back-channel (aizuchi) analysis unit 54 of the voice recognition server 5.
Through the back-channel analysis process of FIG. 13 executed by the back-channel analysis unit 54 of the voice recognition server 5, a dialogue voice text uttered by one speaker (for example, the customer) and judged to be a response is paired with the dialogue voice text uttered immediately before it by the other speaker (for example, the operator), which prompted the response and identifies what is being responded to. The pair is tagged with the type "response" and is included in the dialogue voice text as a question-and-answer exchange.

The summary sentence generation unit 73 treats each pair of dialogue voice texts tagged with the type "response" as an important word, refers to the various conversion tables 36 to convert it into the style used for summary sentences, and adds it to the summary sentence to be output. For example, suppose the tagged pair is "Is shipping in two or three days all right?" (「発送は二三日後でよろしかったでしょうか」, the operator's question) and "Yes" (「はい」, the customer's response). In this case, the summary sentence generation unit 73 converts the pair into, for example, "Agreed to shipping in two or three days" (「二三日後の発送を了承」) and includes the converted text in the summary sentence to be output as an important word (important sentence) in the response history.

As another example, suppose the pair of dialogue voice texts tagged with the type "response" is "Is the item you are ordering the dialogue summary eV-Outline?" (the operator's question) and "Yes, please" (the customer's response). In this case, the summary sentence generation unit 73 converts the pair into, for example, "Confirmed that the ordered item is the dialogue summary eV-Outline" and includes the converted text in the summary sentence to be output as an important word (important sentence) in the response history.
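As a rough illustration of how a question-and-answer pair could be collapsed into one summary-style phrase, the sketch below applies a single pattern rule to the shipping example. The regular expression `QUESTION_RULE`, the `ANSWER_FORMS` mapping, and the function `convert_pair` are hypothetical stand-ins for the conversion tables 36, whose actual contents the source does not give.

```python
import re

# Hypothetical rule: "<topic>は<detail>でよろしかったでしょうか" (polite confirmation question).
QUESTION_RULE = re.compile(r"^(?P<topic>.+?)は(?P<detail>.+?)でよろしかったでしょうか$")

# Hypothetical mapping from a spoken response to its written-style form.
ANSWER_FORMS = {"はい": "了承", "いいえ": "拒否"}


def convert_pair(question, answer):
    """Convert a (question, response) pair into a summary-style phrase, or None if no rule applies."""
    match = QUESTION_RULE.match(question)
    form = ANSWER_FORMS.get(answer)
    if match is None or form is None:
        return None
    return f"{match['detail']}の{match['topic']}を{form}"
```

For the pair 「発送は二三日後でよろしかったでしょうか」 and 「はい」 this yields 「二三日後の発送を了承」, the converted text given in the description. A single regular expression would not generalize; it only makes the pairing step concrete.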

In S16, the emotion analysis unit 74 of the summary generation server 7 executes emotion analysis of the speakers of the dialogue based on the dialogue voice text. The emotion analysis unit 74 may also provide an interface from the summary sentence generation unit 73 to the emotion analysis server 6, cause the emotion analysis server 6 to execute the emotion analysis process, and supply the result to the summary sentence generation unit 73. Alternatively, without a separate emotion analysis server 6, the emotion analysis unit 74 itself may execute the emotion analysis of the speakers of the dialogue to be summarized. The following describes the former case, in which the emotion analysis server 6 executes the emotion analysis process.

The emotion analysis process includes a non-verbal emotion analysis process that uses the dialogue voice data and a verbal emotion analysis process that uses the dialogue voice text obtained by voice recognition.
In the former, voice-data-based process, the emotion analysis server 6, called from the emotion analysis unit 74, takes as input the dialogue voice data supplied from the call recording server 3 and outputs, as the emotion analysis result for each speaker, quantitative indices that express the speaker's emotions numerically, such as joy/anger, satisfaction, stress level, and trust.

The emotion analysis process provided by the emotion analysis server 6 is based on the finding that the activity of a speaker's brain waves is linked to the movement of the vocal cords, so that a person cannot control emotion during speech and the emotion shows in the voice. The emotion analysis server 6 can therefore quantify a speaker's emotions from the dialogue voice data independently of the language spoken.
In the latter, text-based process, the emotion analysis unit 74 of the summary generation server 7 takes as input the dialogue voice text supplied from the voice recognition server 5, extracts emotional words from it, and refers to the emotion word table 37 to convert them into emotional expressions to be included in the summary sentence.

FIG. 22 shows a non-limiting example of the output obtained when the emotion analysis server 6 executes the emotion analysis process on the dialogue voice data of one speaker (the customer) in one call unit. In FIG. 22, the transition of the customer's (CS) emotions during one call is output as a time series. FIG. 22 shows a case of handling a customer complaint in which the operator satisfied the customer during the call. The "joy/anger" and "satisfaction" indices both rise from the middle of the call to its latter half, while the "stress level" index falls over the same period. From this, one can read an emotional transition in which the customer's anger and stress subside and dissatisfaction turns into satisfaction from the middle to the latter half of the call unit.

The customer emotion analysis result illustrated in FIG. 22 can also yield an index for evaluating the quality of the response given by the other speaker, the operator.
For example, if the "joy/anger" index is negative and shows strong anger from the start of the call, but by the end of the call it has turned to zero or positive, showing a tendency toward joy, and the "satisfaction" index has likewise turned to zero or positive, showing a tendency toward satisfaction, the operator's response history may be evaluated as "excellent response" (応対優良).
However, if at the end of the call the customer's "trust" index is negative and shows a tendency toward distrust, the content of the customer's utterances can be judged to have low reliability, so a "customer caution" (顧客注意) note may be added to indicate that the customer's statements need attention.

On the other hand, if the "joy/anger" index suddenly turns strongly negative in the middle of a call, the "satisfaction" index also turns strongly negative, and the tendency toward anger and dissatisfaction then persists, the operator's statement immediately before the negative turn can be judged to have provoked the customer's anger or dissatisfaction. The evaluation may therefore be "response caution" (応対注意), indicating that the operator's response needs to be reviewed.
In this case too, if at the end of the call the customer's "trust" index is negative and shows a tendency toward distrust, the content of the customer's utterances can be judged to have low reliability, so a "customer caution" note may be added.
If none of the above tendencies is observed, the evaluation may be "normal response" (応対通常), indicating an adequate response.
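The evaluation rules above can be written as a small rule-based classifier. This is a sketch under stated assumptions: the function name `evaluate_call`, the English label strings, the per-utterance list representation, and the `drop` threshold that defines a "strongly negative" turn are all choices made for this example and do not come from the source.

```python
def evaluate_call(joy_anger, satisfaction, trust, drop=-5):
    """Classify one call from per-utterance customer emotion indices (equal-length number lists)."""
    label = "normal response"

    # Rule 1: anger at the start, but joy/anger and satisfaction are both non-negative at the end.
    if joy_anger[0] < 0 and joy_anger[-1] >= 0 and satisfaction[-1] >= 0:
        label = "excellent response"
    else:
        # Rule 2: a sudden strongly negative turn mid-call that persists until the end.
        for i in range(1, len(joy_anger)):
            sudden = (
                joy_anger[i - 1] >= 0
                and joy_anger[i] <= drop
                and satisfaction[i] <= drop
            )
            persists = all(
                j < 0 and s < 0 for j, s in zip(joy_anger[i:], satisfaction[i:])
            )
            if sudden and persists:
                label = "response caution"
                break

    # Independent note: a negative trust index at the end of the call flags the customer's statements.
    notes = ["customer caution"] if trust[-1] < 0 else []
    return label, notes
```

When rule 2 fires, the loop index `i - 1` points at the utterance just before the negative turn, which is where a reviewer would look for the operator statement that provoked it.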

FIG. 23 shows a non-limiting example of the output obtained when the emotion analysis server 6 executes the emotion analysis process on the dialogue voice data of the other speaker (the operator) in one call unit. FIG. 23 shows a case in which the operator feels stress during a call with a customer. In FIG. 23, the "stress level" index rises from the start of the call to its end, from which one can read an emotional transition in which the operator's stress is increasing.
In this case, if, for example, the stress in the current call is higher than the transition of the stress level index in the calls up to the previous one, the operator's evaluation index may be set to "response caution", indicating that the operator's stress state should continue to be monitored.

FIG. 24 shows the transition of emotions across a plurality of calls (15 calls in FIG. 24) within a certain period (one day, one week, etc.). In FIG. 24, the average value of the operator's "stress level" index gradually rises as the number of calls increases, from which one can read an emotional transition in which the operator's stress grows with the number of calls.
In this case, the operator's evaluation index may be set to "stop responding" (応対中止), indicating that the operator should be taken off calls and interviewed immediately.
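One way to detect the gradual rise described above is to fit a straight line to the per-call average stress values and inspect its slope. The source does not say how the rise is detected, so the least-squares fit, the function names, the `slope_limit` threshold, and the "continue" label are all assumptions made for this sketch.

```python
def stress_trend(avg_stress_per_call):
    """Least-squares slope of per-call average stress against call order (needs at least 2 calls)."""
    n = len(avg_stress_per_call)
    mean_x = (n - 1) / 2
    mean_y = sum(avg_stress_per_call) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(avg_stress_per_call)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def operator_status(avg_stress_per_call, slope_limit=0.5):
    """Return 'stop responding' when stress rises faster than the assumed limit per call."""
    if stress_trend(avg_stress_per_call) > slope_limit:
        return "stop responding"
    return "continue"
```

A slope is less sensitive to a single stressful call than comparing only the first and last values, which is why it is used here.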

FIG. 25 shows a non-limiting example of the emotion word table 37 referred to by the emotion analysis unit 74 of the summary generation server 7. In the emotion word table 37 of FIG. 25, the left column defines the source emotional words ("oh well, fine" (まあいいか), "that's fine, thank you" (それでいいよ。ありがとう), "I'm disappointed" (がっかりしたよ), "it'll be all right, won't it?" (大丈夫だよな), "do something about it" (なんとかしろよ), "give me a break" (いい加減にしろよ), etc.), and the right column defines the target emotional expressions ("reluctant consent" (渋々承諾), "willing consent" (快諾), "disappointment" (落胆), "anxiety" (不安), "displeasure" (不快), etc.). The emotion analysis unit 74 of the summary generation server 7 searches the dialogue voice text for the source emotional words defined in the emotion word table 37 and converts each one found into the corresponding emotional expression defined in the table. In this way, emotional words in the dialogue voice text are converted into concise emotional expressions.
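The table lookup can be sketched as a dictionary-driven replacement. The entries below reuse the example words from the description, but their pairing is assumed from the order in which they are listed, since FIG. 25 itself is not reproduced here; the names `EMOTION_WORDS` and `replace_emotion_words` are hypothetical.

```python
# Hypothetical stand-in for the emotion word table 37 (source word -> emotional expression).
# The pairing is assumed from the listing order in the description.
EMOTION_WORDS = {
    "まあいいか": "渋々承諾",
    "それでいいよ。ありがとう": "快諾",
    "がっかりしたよ": "落胆",
    "大丈夫だよな": "不安",
    "なんとかしろよ": "不快",
}


def replace_emotion_words(text, table=EMOTION_WORDS):
    """Replace each source emotional word in text; return the new text and the expressions used."""
    found = []
    # Try longer source words first so a short entry cannot break up a longer one.
    for word in sorted(table, key=len, reverse=True):
        if word in text:
            text = text.replace(word, table[word])
            found.append(table[word])
    return text, found
```

The returned list of expressions could also be attached to the summary as metadata rather than substituted inline.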

FIG. 27 shows a non-limiting example in which the emotion analysis unit 74 refers to the emotion word table 37 of FIG. 25 and generates, from the dialogue voice text obtained by voice recognition, a summary sentence that incorporates an emotional expression. Referring to FIG. 27, the emotion analysis unit 74 converts the dialogue voice text in the upper part of FIG. 27, "I had the device replaced, but it broke again and I'm disappointed" (「機器を交換したけど、また壊れて、がっかりだよ」), into the summary sentence in the lower part of FIG. 27, "Device replaced but failed; disappointment" (「機器交換したが故障し落胆」). The summary sentence to be output can thus include an emotional expression derived from the dialogue voice text. The converted word "disappointment" (落胆) expresses the emotion of the speaker (the customer) and is included in the output summary sentence.

FIG. 26, on the other hand, shows a non-limiting example in which an emotional expression obtained by the emotion analysis server 6 through emotion analysis of the dialogue voice data (tone of voice) is appended in parentheses to the summary sentence text. Referring to FIG. 26, the emotion analysis server 6 executes the emotion analysis process on the dialogue voice data underlying the dialogue voice text in the upper part of FIG. 26, "There's a bug in the food" (「食品に虫が入っているんだよ」). If, for example, the "trust" index of that voice data is negative and shows a tendency toward distrust, the server generates the emotional expression "customer caution" (顧客注意), indicating that the customer's statement needs attention, and supplies it to the summary sentence generation unit 73 via the emotion analysis unit 74 of the summary generation server 7. The summary sentence generation unit 73 of the summary generation server 7 appends the "customer caution" supplied from the emotion analysis server 6, in parentheses, to the summary sentence in the lower part of FIG. 26, "Bug found in food" (「食品に虫が混入」), which was generated from the dialogue voice text in the upper part of FIG. 26.
By reflecting the speakers' emotional expressions in the generated summary sentence in this way, it becomes easy to grasp how a speaker's emotions changed and to extract automatically the problem calls that require action.

Returning to FIG. 6, in S17 the summary sentence generation unit 73 of the summary generation server 7 uses the emotion analysis results described above to replace emotional words in the summary sentence with more direct, categorized emotional expressions, as shown in FIG. 27, and to append emotional expressions to the summary sentence to be output, as shown in FIG. 26.
In S18, the summary sentence generation unit 73 or the summary sentence shortening unit 75 outputs the finally generated summary sentence to the summary sentence text 38 file.
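The two ways of reflecting an emotional expression in S17, appending it in parentheses (FIG. 26) or substituting it for part of the summary (FIG. 27), can be sketched in one helper. The function name `attach_emotion` and its `mode` and `target` parameters are assumptions for illustration only.

```python
def attach_emotion(summary, expression, mode="append", target=None):
    """Reflect an emotional expression in a summary sentence by appending or replacing."""
    if mode == "append":
        # FIG. 26 style: add the expression in full-width parentheses after the summary.
        return f"{summary}（{expression}）"
    if mode == "replace":
        # FIG. 27 style: substitute the expression for the first occurrence of target.
        if target is None or target not in summary:
            raise ValueError("target text not found in summary")
        return summary.replace(target, expression, 1)
    raise ValueError(f"unknown mode: {mode}")
```

For example, appending 「顧客注意」 to 「食品に虫が混入」 gives 「食品に虫が混入（顧客注意）」, matching the parenthesised form described for FIG. 26.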

With reference to FIGS. 28 to 30, an example of the extraction and conversion process is described, from the dialogue voice text divided into summary units that the voice recognition server 5 outputs to the summary sentence that is finally output.
FIG. 28 shows a non-limiting example of the dialogue voice text of one call unit that is output by the voice recognition server 5 and input to the summary generation server 7. In FIG. 28, each line shows the dialogue voice text of one utterance unit for an identified speaker (operator (OP) or customer (CS)), and summary unit delimiters, shown as squares, are inserted into the text of each line.
FIG. 29 shows a non-limiting example of the intermediate summary sentence text that the summary sentence generation unit 73 of the summary generation server 7 outputs from the dialogue voice text of FIG. 28. As shown in FIG. 29, the texts of 6 utterance units (the 3rd, 6th, 9th, 11th, 14th, and 15th) are extracted from the 20 utterance units of FIG. 28, and each extracted text is converted into a more concise text suitable for a summary sentence. The summary sentence generation unit 73 converts the dialogue voice text of the entire call in FIG. 28 into the intermediate summary sentence text of FIG. 29 by referring to the important word table 34, the unnecessary word table 35, and the various conversion tables 36.

FIG. 30 shows a non-limiting example of the summary sentence text that the summary sentence generation unit 73 or the summary sentence shortening unit 75 finally outputs from the intermediate summary sentence text of FIG. 29. As shown in FIG. 30, five lines of summary sentences are generated from the six utterance-unit texts extracted and converted in FIG. 29, and each summary sentence is converted so that it ends in a noun such as "request" (希望) or "confirmation" (確認). In particular, the pair consisting of the operator's utterance (question) on the 5th line of FIG. 29 and the customer's utterance (response) on the 6th line is consolidated in FIG. 30 into the single summary sentence "Willingly consented to wait two or three days while it is prepared and mailed". The summary sentence generation unit 73 generates the finally output summary sentence text of FIG. 30, which serves as a response history, by referring to the important word table 34 and the various conversion tables 36. The end of the summary sentence on the 5th line of FIG. 30 has been converted, by applying the emotion analysis process described above, into "willing consent" (快諾), which reflects the emotional expression of the speaker (the customer).

FIG. 31 shows a non-limiting example of the user interface output to a display device or the like as the result of querying the dialogue voice text of FIG. 28. Referring to FIG. 31, the user interface may include the identified speaker 311, the response content 312 of each utterance unit, a play button 313, and a speaker emotion analysis result icon 314. Selecting the play button 313 for a desired utterance plays the audio file of that utterance.
FIG. 32 shows, for the call unit queried in FIG. 31, the emotion analysis results obtained from the values of each speaker's emotion indices: for example, joy/anger is "normal", satisfaction is "normal" to "somewhat high", and stress is "none" or "slight". FIGS. 31 and 32 may be displayed on the display device so that both are visible at the same time.

FIG. 33 is a non-limiting display example that lists, for one call unit (recording time 1.25.716), the voice recognition results of the speaker-identified dialogue voice in utterance units, the natural language processing results obtained by referring to the corresponding user dictionary and the like, and the links, start times, and end times of the audio files. As shown at the lower left of FIG. 33, the summary sentence generated for the call unit is displayed, which makes cross-referencing between each processing result and the summary sentence easy. After an audio file is played, the user interface of FIG. 33 may display the voice recognition results and natural language processing results in editable form so that the user can correct errors.
The generated summary sentence at the lower left of FIG. 33 shows that the dialogue ended with the customer "agreeing to view the Saxa Fund prospectus on the Internet". For the "agreed" (了承) part of that summary sentence, the emotion analysis result obtained from the values of the plurality of emotion indices may be appended in parentheses or the like, for example "agreed (willing consent)" or "agreed (reluctant consent)", or "agreed" may be replaced with an expression that contains the emotion analysis result, such as "willing consent" or "reluctant consent".
According to the present embodiment, the dialogue recording data, the voice recognition results of the dialogue voice, the natural language processing results, the emotion analysis results, and the generated summary sentence can thus be output in integrated form.

(Example of hardware configuration of each device)
FIG. 34 is a diagram showing an example of the hardware configuration of each device in the voice processing system. The voice acquisition server 2, the call recording server 3, the control server 4, the voice recognition server 5, the emotion analysis server 6, the summary generation server 7, and the PCs 9 and 10 each include all or part of the hardware components shown in FIG. 34. Each device 100 shown in FIG. 34 may include a CPU 101, a ROM 102, a RAM 103, an external memory 104, an input unit 105, a display unit 106, a communication I/F 107, and a system bus 108.

The CPU 101 controls the overall operation of the device and controls each component (102 to 107) via the system bus 108. The CPU 101 functions as a processing unit that executes processes such as the voice recognition process, the summary generation process, and the emotion analysis process. The ROM 102 is a non-volatile memory that stores the control programs and the like that the CPU 101 needs to execute processing. These programs may instead be stored in the external memory 104 or on a removable storage medium (not shown). The RAM 103 functions as the main memory, work area, and the like of the CPU 101. That is, when executing processing, the CPU 101 loads the necessary programs and the like from the ROM 102 into the RAM 103 and executes them to realize the various functional operations.

The external memory 104 stores, for example, the various data and information that the CPU 101 needs when performing processing using a program. The external memory 104 also stores, for example, the various data and information obtained when the CPU 101 performs processing using a program or the like. The input unit 105 comprises various input devices such as a keyboard and a tablet. The display unit 106 is, for example, a liquid crystal display. The communication I/F 107 is an interface for communicating with external devices and includes, for example, a wireless LAN (Wi-Fi) interface or a Bluetooth (registered trademark) interface. The system bus 108 communicably connects the CPU 101, the ROM 102, the RAM 103, the external memory 104, the input unit 105, the display unit 106, and the communication I/F 107.

As described above, according to the present embodiment, a highly accurate summary sentence that is sufficiently shortened and that sufficiently reflects the emotions in the speakers' utterances during the dialogue can be generated from dialogue voice. This contributes to making dialogue voice summaries more useful.
Two or more of the embodiments described above may be combined.
The present invention can also be realized by a program that implements part of the above embodiments or one or more of their functions. That is, it can be realized by a process in which the program is supplied to a system or device via a network or a storage medium and one or more processors in a computer (or a CPU, MPU, or the like) of the system or device read and execute the program. The program may also be provided recorded on a computer-readable recording medium.
Furthermore, the invention is not limited to the case in which the functions of the embodiments are realized by a computer executing a program it has read. For example, an operating system (OS) or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the above embodiments may be realized by that processing.

Although embodiments of the present invention have been described in detail above, they are merely specific examples of carrying out the invention. The technical scope of the present invention is not limited to these embodiments. The present invention may be modified in various ways without departing from its gist, and such modifications are also included in its technical scope.

1 PBX
2 Voice acquisition server
3 Call recording server
4 Control server
5 Voice recognition server
6 Emotion analysis server
7 Summary generation server
8 Private line
9, 10 PC
31 Dialogue voice
32 Voice recognition dictionary
33 Summary unit text
34 Important word table
35 Unnecessary word table
36 Conversion table
37 Emotion word table
51 Voice recognition pre-processing unit
52 Voice recognition unit
53 Voice recognition post-processing unit
54 Back-channel (aizuchi) analysis unit
71 Text correction unit
72 Redundancy elimination unit
73 Summary sentence generation unit
74 Emotion analysis unit
75 Summary sentence shortening unit

Claims

A dialogue summary generation device comprising:
a speaker identification unit that identifies the speakers of a dialogue from dialogue voice data;
a voice separation unit that separates the dialogue voice data into utterance units for each speaker identified by the speaker identification unit;
a voice recognition unit that performs voice recognition on the dialogue voice data in the utterance units separated by the voice separation unit to generate dialogue voice text;
a summary sentence generation unit that summarizes the dialogue voice text generated by the voice recognition unit to generate summary sentence text; and
an emotion analysis unit that analyzes the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputs the summary sentence text with the derived emotional expression added to it, with part of it replaced by the emotional expression, or with the emotional expression associated with it.

The dialogue summary generation device according to claim 1, wherein the emotion analysis unit further analyzes the dialogue voice data in the utterance units to derive a time-series emotional transition for each speaker within one dialogue, and outputs the derived emotional transition for each speaker in association with the summary text.
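
The time-series emotional transition described above can be illustrated with a minimal Python sketch. It assumes that each utterance has already been given a start time, a speaker label, and an emotion label; how the label is derived from the voice data is not shown. The function name `emotion_transitions` and the label values are hypothetical.

```python
from collections import defaultdict

def emotion_transitions(utterances):
    """Build a per-speaker emotion timeline from (start_sec, speaker, emotion) tuples.

    Assumption: the emotion label of each utterance was produced upstream.
    A new entry is recorded only when a speaker's emotion changes.
    """
    transitions = defaultdict(list)
    # Process utterances in time order regardless of input order.
    for start_sec, speaker, emotion in sorted(utterances, key=lambda u: u[0]):
        history = transitions[speaker]
        if not history or history[-1][1] != emotion:
            history.append((start_sec, emotion))
    return dict(transitions)
```

The resulting dictionary can be stored alongside the summary text so that a reader can see, for example, at which point in the call a customer's emotion changed.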

The dialogue summary generation device according to claim 1 or 2, further comprising a second emotion analysis unit that extracts, from the dialogue voice text generated by the voice recognition unit, an emotion word indicating an emotion for each speaker, converts the extracted emotion word into a corresponding emotional expression, and replaces at least a part of the summary text with the converted emotional expression.
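
As an illustration of this text-based emotion analysis, the following Python sketch looks up emotion words in the recognized text of each speaker and substitutes the corresponding emotional expression into the summary. The table contents, the function name `replace_emotion_words`, and the simple substring matching are assumptions made for the example; the claim does not specify them.

```python
# Hypothetical emotion word table: emotion word -> emotional expression.
EMOTION_WORD_TABLE = {
    "furious": "strongly dissatisfied",
    "fed up": "frustrated",
    "delighted": "very satisfied",
}

def replace_emotion_words(dialogue, summary, table=EMOTION_WORD_TABLE):
    """Extract emotion words per speaker and substitute them in the summary.

    dialogue: list of (speaker, recognized_text) pairs.
    Returns the rewritten summary and the words found for each speaker.
    """
    found = {}
    for speaker, text in dialogue:
        lowered = text.lower()
        for word, expression in table.items():
            if word in lowered:
                found.setdefault(speaker, []).append((word, expression))
    # Replace each extracted emotion word that also occurs in the summary.
    for pairs in found.values():
        for word, expression in pairs:
            summary = summary.replace(word, expression)
    return summary, found
```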

The dialogue summary generation device according to any one of claims 1 to 3, further comprising a redundancy elimination unit that determines whether the same or similar text appears multiple times in the dialogue voice text of one dialogue and, when it does, deletes the text that appears earlier in the time series.
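
The redundancy elimination can be sketched in Python as follows. The claim does not specify how similarity is judged, so this example assumes the character-level ratio from the standard library's `difflib.SequenceMatcher` and a hypothetical threshold of 0.8; the function name `eliminate_redundancy` is likewise illustrative.

```python
from difflib import SequenceMatcher

def eliminate_redundancy(utterances, threshold=0.8):
    """Keep only the last occurrence of each same-or-similar utterance.

    utterances: recognized texts in time-series order.
    An utterance is dropped when a later utterance is at least
    `threshold` similar to it (assumed similarity measure).
    """
    kept = []
    for i, text in enumerate(utterances):
        repeated_later = any(
            SequenceMatcher(None, text, later).ratio() >= threshold
            for later in utterances[i + 1:]
        )
        if not repeated_later:
            kept.append(text)
    return kept
```

Keeping the later occurrence reflects the claim's wording, in which the text appearing earlier in the time series is the one deleted.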

The dialogue summary generation device according to claim 4, wherein the redundancy elimination unit further refers to an important word table in which important words are defined in advance, extracts from the dialogue voice text a text defined in the important word table, searches for a second text that is located immediately before the extracted text and whose reading at least partially matches that of the extracted text, and deletes the found second text from the dialogue voice text.

The dialogue summary generation device according to any one of claims 1 to 5, further comprising a text correction unit that analyzes the dialogue voice text generated by the voice recognition unit to extract numbers, assigns a different unit and weight to each extracted number according to its type, and supplies the numbers, with the assigned units and weights, to the summary generation unit.

The dialogue summary generation device according to any one of claims 1 to 6, further comprising a voice acquisition unit that records call voice or face-to-face dialogue voice to acquire the dialogue voice data.

A dialogue summary generation method comprising:
identifying the speakers of a dialogue from dialogue voice data;
separating the dialogue voice data into utterance units for each identified speaker;
performing voice recognition on the dialogue voice data in the separated utterance units to generate dialogue voice text;
summarizing the generated dialogue voice text to generate summary text; and
analyzing the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputting the derived emotional expression by adding it to the summary text, by replacing a part of the summary text with it, or in association with the summary text.
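
The order of the method steps can be sketched in Python as a small pipeline. This is only an outline of the data flow under stated assumptions: the four callables (`diarize`, `recognize`, `summarize`, `analyze_emotion`) are hypothetical stand-ins supplied by the caller, no real speech or emotion engine is implied, and of the three output options only the "in association with the summary text" option is shown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Utterance:
    speaker: str
    segment: Any      # audio for one utterance (representation is an assumption)
    text: str = ""
    emotion: str = ""

def generate_dialogue_summary(
    dialogue_audio: Any,
    diarize: Callable[[Any], List[Tuple[str, Any]]],
    recognize: Callable[[Any], str],
    summarize: Callable[[List[str]], str],
    analyze_emotion: Callable[[Any], str],
) -> Dict[str, Any]:
    # Steps 1-2: identify speakers and split the audio into per-speaker utterances.
    utterances = [Utterance(spk, seg) for spk, seg in diarize(dialogue_audio)]
    # Step 3: voice recognition per utterance.
    for u in utterances:
        u.text = recognize(u.segment)
    # Step 4: summarize the recognized dialogue text.
    summary = summarize([f"{u.speaker}: {u.text}" for u in utterances])
    # Step 5: derive an emotion from the audio of each utterance and
    # associate the per-speaker results with the summary.
    emotions: Dict[str, List[str]] = {}
    for u in utterances:
        u.emotion = analyze_emotion(u.segment)
        emotions.setdefault(u.speaker, []).append(u.emotion)
    return {"summary": summary, "emotions": emotions}
```

Passing the processing stages in as callables keeps the sketch independent of any particular recognition or emotion analysis engine.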

A dialogue summary generation program for causing a computer to execute dialogue summary generation processing, the processing comprising:
speaker identification processing of identifying the speakers of a dialogue from dialogue voice data;
voice separation processing of separating the dialogue voice data into utterance units for each identified speaker;
voice recognition processing of performing voice recognition on the dialogue voice data in the separated utterance units to generate dialogue voice text;
summary generation processing of summarizing the generated dialogue voice text to generate summary text; and
emotion analysis processing of analyzing the dialogue voice data in the utterance units to derive an emotional expression for each speaker, and outputting the derived emotional expression by adding it to the summary text, by replacing a part of the summary text with it, or in association with the summary text.