KR20060089922A - Data abstraction apparatus by using speech recognition and method thereof

Info

Publication number: KR20060089922A
Application number: KR1020050010035A
Authority: KR
Inventors: 경연정; 엄봉수; 나동원
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2005-02-03
Filing date: 2005-02-03
Publication date: 2006-08-10

Abstract

The present invention relates to an apparatus and method for extracting data using speech recognition. The disclosed data extraction apparatus includes: a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data; a speech recognition unit that performs speech recognition on the input video data or audio data; a text data generation unit that, based on the speech recognition result from the speech recognition unit, generates text data matching the speech contained in the video data or audio data; a time tagging unit that performs time tagging, in sentence units, on the text data matching the summary keyword; a tagging information storage unit that stores the time tagging information produced by the time tagging unit; and a data extraction unit that determines a summary data section within the video data or audio data based on the time tagging information stored in the tagging information storage unit and then extracts only that section to generate summary data. The apparatus can automatically generate summary data for any video data or audio data, and the summary data has high accuracy.

Speech recognition, video data, audio data, summary data, time tagging

Description

Data extraction apparatus and method using speech recognition {DATA ABSTRACTION APPARATUS BY USING SPEECH RECOGNITION AND METHOD THEREOF}

FIG. 1 is a block diagram showing an example of a data extraction apparatus according to the prior art.

FIG. 2 is a block diagram of a data extraction apparatus according to a preferred embodiment of the present invention.

FIG. 3 is a flowchart showing a summary data extraction process according to a first embodiment of the present invention.

FIG. 4 is a flowchart showing a summary data extraction process according to a second embodiment of the present invention.

Explanation of reference numerals for the main parts of the drawings:

201: user interface unit
203: speech recognition unit
205: text data generation unit
207: time tagging unit
209: tagging information storage unit
211: data extraction unit

The present invention relates to data extraction using speech recognition and, more particularly, to an apparatus and method that perform speech recognition on video or audio data and extract only the data satisfying a predetermined condition.

In the modern information age, conventional text-based materials, which are comparatively readable and quick to process, are increasingly giving way to multimedia data, and processing large volumes of multimedia data consequently takes a great deal of time. In addition, as more users feel burdened by multimedia information that demands both eyes and ears at once, the need for suitable summary information has grown, and techniques for providing such summaries have been proposed.

For example, digital video content can be played back and watched in full, but in some cases a highlight video, a condensed form that lets the viewer grasp the content without watching the entire video, is supplied by the program provider or generated automatically by the user's system.

A highlight video plays only part of a stored video and is representative of the entire video stream. It is used to store or play back specific sections of the video stream. When a user wants to choose one of several videos stored on a digital video recorder within a limited time, the user can play only the highlight video of each stream and so save the time needed to find the desired content. Beyond summarizing stored video streams, highlights can also provide a preview function for use in, for example, a program guide device that helps the user select which video to record on a digital video storage device.

Because a highlight must extract the parts that meaningfully represent the content of the video to the user, choosing the sections from which to generate highlights is quite difficult. In the conventional approach, the program provider produces the highlight video separately. Existing multimedia summarization relies on a person who watches the entire data, marks the important parts, and later reassembles the marked parts, so productivity is very low. Moreover, different people may apply different criteria for what counts as important, which has made it difficult to obtain summary information of uniform quality.

As an example of a data extraction apparatus according to the prior art, Korean Patent Laid-Open Publication No. 2003-43198 proposes a "highlight section detection apparatus and method."

Referring to FIG. 1, the conventional highlight section detection apparatus comprises an audio signal extraction unit (101), an audio signal intensity analysis unit (103), a threshold determination unit (105), a highlight determination unit (107), and a display unit (109).

The audio signal extraction unit (101) separates the input video data into a video signal and an audio signal and passes only the audio signal to the audio signal intensity analysis unit (103).

The audio signal intensity analysis unit (103) analyzes the intensity of the audio signal separated by the audio signal extraction unit (101), thereby analyzing the overall audio signal intensity of the video data, and outputs the result.

The threshold determination unit (105) extracts the pattern of change in audio signal intensity from the intensity analysis result and, based on the extracted pattern, determines the threshold of the audio signal used to decide highlight sections, that is, the reference intensity.

The highlight determination unit (107) compares the intensity analysis result with the threshold, extracts a highlight start point and end point according to the comparison, and determines the span from the start point to the end point as a highlight section.

The display unit (109) plays back the video data corresponding to the highlight section determined by the highlight determination unit (107) and outputs it to the screen.

The highlight section detection process performed by the conventional apparatus configured in this way is as follows. First, when video data is input, the audio signal extraction unit (101) separates it into a video signal and an audio signal and passes only the audio signal to the audio signal intensity analysis unit (103).

The audio signal intensity analysis unit (103) then analyzes the intensity of the audio signal separated by the audio signal extraction unit (101), thereby analyzing the overall audio signal intensity of the video data, and outputs the result.

Next, the threshold determination unit (105) receives the intensity analysis result from the audio signal intensity analysis unit (103), extracts the pattern of change in audio signal intensity by, for example, calculating the average value of the audio signal, and based on the extracted pattern determines the threshold of the audio signal used to decide highlight sections, that is, the reference intensity. The average value itself may be used as the threshold, or a signal intensity greater than the average by a predetermined amount may be used as the threshold.

As a result, the highlight determination unit (107) receives the intensity analysis result from the audio signal intensity analysis unit (103) and the reference intensity chosen as the threshold from the threshold determination unit (105).

The highlight determination unit (107) then compares the intensity analysis result received from the audio signal intensity analysis unit (103) with the threshold received from the threshold determination unit (105), that is, the reference intensity of the audio signal.

Here, if the analyzed audio signal intensity satisfies the highlight condition of exceeding the reference intensity, the highlight determination unit (107) sets the current time point as the highlight start point. It then continues scanning the video data, and when the analyzed intensity no longer satisfies the highlight condition, it sets the current time point as the highlight end point and determines the span from the previously determined start point to the end point as a highlight section.
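The threshold comparison described above can be sketched in a few lines of Python. This is only an illustration of the prior-art idea, not code from Korean Patent Laid-Open Publication No. 2003-43198: the function name `detect_highlights`, the `margin` parameter, the use of a list of per-frame intensity values, and the handling of a section that is still open when the data ends are all assumptions.

```python
from statistics import mean


def detect_highlights(intensities, margin=0.0):
    """Return (start, end) index pairs where intensity exceeds the threshold.

    The threshold is the mean intensity plus an optional margin, mirroring the
    two options in the text (mean, or mean plus a predetermined amount).
    `end` is the first index that no longer satisfies the condition.
    """
    threshold = mean(intensities) + margin
    sections = []
    start = None
    for i, level in enumerate(intensities):
        if level > threshold and start is None:
            start = i                       # highlight start point
        elif level <= threshold and start is not None:
            sections.append((start, i))     # highlight end point
            start = None
    if start is not None:                   # assumption: close at end of data
        sections.append((start, len(intensities)))
    return sections
```

For the intensity sequence `[1, 1, 5, 6, 1, 1, 7, 1]` the mean is 2.875, so the sketch reports the index pairs `(2, 4)` and `(6, 7)`. Any loud passage is reported whether or not it is meaningful, which is the weakness the present invention addresses.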

The video data corresponding to the highlight section determined by the highlight determination unit (107) is then played back by the display unit (109) and output to the screen.

However, the data extraction apparatus proposed in Korean Patent Laid-Open Publication No. 2003-43198 detects and determines highlight sections on the basis of audio signal intensity. It is therefore applicable only to a limited range of video data, such as sports videos, and because a loud audio signal does not necessarily indicate a highlight section, the accuracy of the extracted data is low.

The present invention is proposed to solve these problems of the prior art. Its object is to perform speech recognition on video or audio data, determine the data sections that contain a summary keyword, and then extract summary data, so that summary data can be generated for any video data or audio data and extracted with high accuracy.

To achieve this object, a data extraction apparatus using speech recognition according to a first aspect of the present invention includes: a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data; a speech recognition unit that performs speech recognition on the input video data or audio data; a text data generation unit that, based on the speech recognition result from the speech recognition unit, generates text data matching the speech contained in the video data or audio data; a time tagging unit that performs time tagging, in sentence units, on the text data matching the summary keyword; a tagging information storage unit that stores the time tagging information produced by the time tagging unit; and a data extraction unit that determines a summary data section within the video data or audio data based on the time tagging information stored in the tagging information storage unit and then extracts only that section to generate summary data.

A data extraction apparatus using speech recognition according to a second aspect of the present invention includes: a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data; a speech recognition unit that performs speech recognition on the input video data or audio data and provides a speech recognition result for the summary keyword; a time tagging unit that, based on the speech recognition result, performs time tagging in word units on the portions containing data that matches the summary keyword; a tagging information storage unit that stores the time tagging information produced by the time tagging unit; and a data extraction unit that determines a summary data section within the video data or audio data based on the time tagging information stored in the tagging information storage unit and then extracts only that section to generate summary data.

A data extraction apparatus using speech recognition according to a third aspect of the present invention includes: a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data and for designating one of a sentence-unit extraction mode and a word-unit extraction mode; a speech recognition unit that performs speech recognition on the input video data or audio data, providing the full speech recognition result in the sentence-unit extraction mode and the speech recognition result for the summary keyword in the word-unit extraction mode; a text data generation unit that, based on the full speech recognition result from the speech recognition unit, generates text data matching the speech contained in the video data or audio data; a time tagging unit that either performs time tagging in sentence units on the text data matching the summary keyword or performs time tagging in word units on the portions containing data that matches the summary keyword; a tagging information storage unit that stores the time tagging information produced by the time tagging unit; and a data extraction unit that determines a summary data section, in sentence units or word units, within the video data or audio data based on the time tagging information stored in the tagging information storage unit and then extracts only that section to generate summary data.

A data extraction method using speech recognition according to a fourth aspect of the present invention includes the steps of: performing speech recognition on input video data or audio data; generating, based on the speech recognition result, text data matching the speech contained in the video data or audio data; comparing the generated text data with a summary keyword, which is reference information for summarizing the video data or audio data; performing, based on the comparison result, time tagging in sentence units on the text data matching the summary keyword; determining a summary data section within the video data or audio data based on the resulting time tagging information; and extracting only the determined summary data section from the video data or audio data to generate summary data.

A data extraction method using speech recognition according to a fifth aspect of the present invention includes the steps of: performing speech recognition on input video data or audio data to recognize a summary keyword, which is reference information for summarizing the video data or audio data; performing, based on the speech recognition result, time tagging in word units on the portions containing data that matches the summary keyword; determining a summary data section within the video data or audio data based on the resulting time tagging information; and extracting only the determined summary data section from the video data or audio data to generate summary data.

Preferred embodiments are described in detail below with reference to the accompanying drawings. The embodiments will give a better understanding of the objects, features, and advantages of the present invention. The present invention is not, however, limited to these embodiments.

FIG. 2 is a block diagram of a data extraction apparatus according to a preferred embodiment of the present invention.

Referring to FIG. 2, the data extraction apparatus of the present invention includes a user interface unit (201), a speech recognition unit (203), a text data generation unit (205), a time tagging unit (207), a tagging information storage unit (209), and a data extraction unit (211).

The user interface unit (201) can be implemented as a conventional keyboard capable of entering characters such as Korean and English. Through it, the user inputs a summary keyword as reference information for summarizing video data or audio data. The user can also select the summary data extraction function through the user interface unit (201) and designate one of two extraction methods (extraction modes): sentence-unit extraction or word-unit extraction.

The speech recognition unit (203) performs speech recognition on the input video data or audio data. In the sentence-unit extraction mode it provides the speech recognition result to the text data generation unit (205), whereas in the word-unit extraction mode it provides the speech recognition result to the time tagging unit (207).
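The mode-dependent routing of the recognition result can be illustrated with a small dispatcher. This is a minimal sketch, not the patent's implementation: `ExtractionMode`, `route_recognition_result`, and the two callables standing in for units 205 and 207 are hypothetical names introduced here.

```python
from enum import Enum


class ExtractionMode(Enum):
    SENTENCE = "sentence"   # sentence-unit extraction mode
    WORD = "word"           # word-unit extraction mode


def route_recognition_result(mode, result, text_generator, time_tagger):
    """Forward the recognition result to the unit that consumes it in this mode."""
    if mode is ExtractionMode.SENTENCE:
        return text_generator(result)   # stands in for text data generation unit (205)
    if mode is ExtractionMode.WORD:
        return time_tagger(result)      # stands in for time tagging unit (207)
    raise ValueError(f"unknown extraction mode: {mode!r}")
```

Passing the downstream units as callables keeps the dispatcher independent of any particular speech recognition engine, which the source does not specify.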

The text data generation unit (205) generates, based on the speech recognition result from the speech recognition unit (203), text data matching the speech contained in the video data or audio data.

Depending on the extraction mode designated through the user interface unit (201), the time tagging unit (207) either performs time tagging in word units on the portions containing data that matches the summary keyword or performs time tagging in sentence units on the text data matching the summary keyword.

The tagging information storage unit (209) stores the time tagging information produced by the time tagging unit (207) together with the text data.
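One way to picture the stored time tagging information is as a list of records that pair a time span with the tagged text. The sketch below is an assumption about the data layout, since the source does not define one: `TimeTag`, `TaggingStore`, and the use of seconds as the time unit are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TimeTag:
    start: float   # assumed unit: seconds from the start of the media
    end: float
    text: str      # the tagged sentence or word, stored alongside the tag


@dataclass
class TaggingStore:
    tags: list = field(default_factory=list)

    def add(self, start, end, text):
        if end < start:
            raise ValueError("end must not precede start")
        self.tags.append(TimeTag(start, end, text))

    def sections(self):
        """Return (start, end) pairs in chronological order."""
        return sorted((t.start, t.end) for t in self.tags)
```

Keeping the text next to each span matches the statement that the tagging information is stored together with the text data, and the sorted `sections()` view is what a later extraction step would consume.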

The data extraction unit (211) determines a summary data section based on the time tagging information stored in the tagging information storage unit (209) and extracts only that section from the input video data or audio data to generate summary data.

The data extraction process performed by the data extraction apparatus using speech recognition configured in this way is described below with reference to FIGS. 2 to 4.

First, the user selects the summary data extraction function through the user interface unit (201) in order to obtain summary data. The summary data can be produced in either the sentence-unit extraction mode or the word-unit extraction mode, so the user designates the extraction mode when selecting the summary data extraction function through the user interface unit (201).

When the sentence-unit extraction mode is designated, the speech recognition unit (203) performs speech recognition on the entire input video data or audio data and provides the full recognition result to the text data generation unit (205) (S301).

The text data generation unit (205) generates, based on the speech recognition result from the speech recognition unit (203), text data matching the speech contained in the entire video data or audio data (S303).

At this point the user must input, through the user interface unit (201), a summary keyword as reference information for summarizing the video data or audio data. The keyword does not have to be entered after the text data generation unit (205) has generated the text data; the point is only that the subsequent steps cannot proceed until a summary keyword has been entered. The text data generated by the text data generation unit (205) can also be shown on a separate display unit, and the user can choose a summary keyword while reviewing the text data on that display.

The time tagging unit (207) compares the text data generated by the text data generation unit (205) with the summary keyword input through the user interface unit (201) and performs time tagging, in sentence units, on the text data matching the summary keyword (S305). The tagging information storage unit (209) stores the time tagging information produced by the time tagging unit (207) together with the text data.
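Step S305 can be sketched as a scan over time-stamped sentences. The sketch assumes that the recognizer yields each sentence as a `(start_sec, end_sec, text)` tuple and that a case-insensitive substring test counts as a match; neither detail is stated in the source, and `tag_sentences` is a hypothetical name.

```python
def tag_sentences(sentences, keyword):
    """Return the indices of sentences whose text contains the keyword.

    sentences: list of (start_sec, end_sec, text) tuples in playback order.
    Matching is a case-insensitive substring test (an assumption).
    """
    needle = keyword.casefold()
    return [
        i
        for i, (_start, _end, text) in enumerate(sentences)
        if needle in text.casefold()
    ]
```

With the sentences `[(0, 2, "Welcome back"), (2, 5, "The budget was approved"), (5, 9, "Next, the weather")]` and the keyword `"budget"`, only index 1 is tagged, and its time span `(2, 5)` is what would be stored as time tagging information.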

The data extraction unit (211) determines the summary data section based on the time tagging information stored in the tagging information storage unit (209). That is, it either takes only the time-tagged sentences as the summary data section or takes a predetermined number of sentences before and after each time-tagged sentence as the summary data section (S307). It then extracts only the determined summary data section from the input video data or audio data to generate summary data (S309).
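The two options in S307 differ only in how many neighboring sentences are included, so a single `context` parameter can cover both. The sketch below is illustrative: `sentence_sections`, the tuple layout `(start_sec, end_sec, text)`, and the merging of overlapping or touching sections are assumptions rather than details from the source.

```python
def sentence_sections(sentences, tagged, context=0):
    """Build merged (start, end) sections around tagged sentence indices.

    context=0 keeps only the tagged sentences; context=n also keeps n
    sentences before and after each tagged sentence.
    """
    spans = []
    for i in sorted(tagged):
        lo = max(0, i - context)
        hi = min(len(sentences) - 1, i + context)
        start, end = sentences[lo][0], sentences[hi][1]
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous section: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans
```

Merging avoids extracting the same stretch of media twice when two tagged sentences are close together.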

When the user selects the summary data extraction function through the user interface unit (201) and designates the word-unit extraction mode, the user inputs the summary keyword, as reference information for summarizing the video data or audio data, through the user interface unit (201) before the speech recognition unit (203) performs speech recognition (S401).

The speech recognition unit (203) then recognizes the keyword information entered by the user, performs speech recognition on the entire input video data or audio data, and provides the keyword recognition result to the time tagging unit (207) (S403).

The time tagging unit (207) performs time tagging only on the portions containing data that matches the summary keyword, and the time tagging information is stored in the tagging information storage unit (209) (S405).
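In the word-unit mode only keyword occurrences need to be tagged, which can be sketched as a filter over time-stamped words. The sketch assumes the recognizer reports each word as a `(start_sec, end_sec, word)` tuple and that matching is an exact, case-insensitive comparison; `tag_keyword_hits` is a hypothetical name.

```python
def tag_keyword_hits(words, keyword):
    """Return (start, end) time tags for every occurrence of the keyword.

    words: iterable of (start_sec, end_sec, word) tuples from the recognizer.
    """
    target = keyword.casefold()
    return [
        (start, end)
        for start, end, word in words
        if word.casefold() == target
    ]
```

Because no full transcript is built, this path skips the text data generation unit (205) entirely, consistent with the routing described for the word-unit extraction mode.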

The data extraction unit (211) determines the summary data section based on the time tagging information stored in the tagging information storage unit (209). That is, it either takes only the portions containing a time-tagged word as the summary data section or takes a predetermined time interval before and after each time-tagged word position as the summary data section (S407). It then extracts only the determined summary data section from the input video data or audio data to generate summary data (S409).
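Steps S407 and S409 can be sketched as widening each keyword hit by a time window and then copying the matching samples. The names `word_sections` and `extract_samples`, the `before`/`after` window parameters, the clamping to the media duration, and the representation of audio as a flat sample list are all assumptions made for illustration.

```python
def word_sections(hits, before, after, duration):
    """Widen each (start, end) hit by a time window and merge overlaps."""
    spans = []
    for start, end in sorted(hits):
        lo = max(0, start - before)          # do not run past the beginning
        hi = min(duration, end + after)      # do not run past the end
        if spans and lo <= spans[-1][1]:
            spans[-1] = (spans[-1][0], max(spans[-1][1], hi))
        else:
            spans.append((lo, hi))
    return spans


def extract_samples(samples, sample_rate, sections):
    """Concatenate the samples that fall inside the given sections."""
    out = []
    for start, end in sections:
        out.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return out
```

Setting `before` and `after` to zero corresponds to keeping only the portions that contain the tagged word; positive values correspond to the predetermined time interval around each word position.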

Although the detailed description above is confined to embodiments of the present invention, it is evident that those skilled in the art can readily modify and practice the invention within the scope of the technical idea set out in the claims below.

As described above, the present invention performs speech recognition on video or audio data, determines the data sections that contain the summary keyword, and then extracts summary data. It can therefore generate summary data automatically for any video data or audio data, and the summary data has high accuracy.

Claims

An apparatus for extracting data using speech recognition, comprising:

a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data;

a speech recognition unit that performs speech recognition on the input video data or audio data;

a text data generation unit that generates text data matching the speech contained in the video data or audio data, based on the speech recognition result from the speech recognition unit;

a time tagging unit that performs time tagging, in sentence units, on the text data matching the summary keyword;

a tagging information storage unit that stores the time tagging information produced by the time tagging unit; and

a data extraction unit that determines a summary data section within the video data or audio data based on the time tagging information stored in the tagging information storage unit and then extracts only the summary data section to generate summary data.

2. The apparatus of claim 1, further comprising a display unit for externally displaying the text data generated by the text data generation unit.

3. The apparatus of claim 1, wherein the data extraction unit determines only the time-tagged sentences of the video data or audio data as the summary data section.

4. The apparatus of claim 1, wherein the data extraction unit determines, as the summary data section, a plurality of sentences of the video data or audio data before and after the time-tagged sentence.

5. An apparatus for extracting data using speech recognition, comprising:

a speech recognition unit for performing speech recognition on input video data or audio data and providing a speech recognition result for the summary keyword;

a time tagging unit for performing time tagging, in word units, on the portion where data matching the summary keyword is present, based on the speech recognition result;

6. The apparatus of claim 5, wherein the data extraction unit determines only the portion of the video data or audio data containing the time-tagged word as the summary data section.

7. The apparatus of claim 5, wherein the data extraction unit determines, as the summary data section, a predetermined time interval of the video data or audio data before and after the position of the time-tagged word.

8. An apparatus for extracting data using speech recognition, comprising:

a user interface unit for inputting a summary keyword as reference information for summarizing video data or audio data, and for designating either a sentence-unit extraction mode or a word-unit extraction mode;

a speech recognition unit for performing speech recognition on input video data or audio data, providing a full speech recognition result in the sentence-unit extraction mode and a speech recognition result for the summary keyword in the word-unit extraction mode;

a text data generation unit for generating text data matching the speech contained in the video data or audio data, based on the full speech recognition result from the speech recognition unit;

a time tagging unit for performing time tagging, in sentence units, on the text data that matches the summary keyword, or performing time tagging, in word units, on the portion where data matching the summary keyword is present; and

a data extraction unit for determining a summary data section, in sentence units or word units, in the video data or audio data based on the time tagging information stored in the tagging information storage unit, and then generating summary data by extracting only the summary data section.

9. The apparatus of claim 8, further comprising

10. The apparatus of claim 8, wherein

11. The apparatus of claim 8, wherein

12. The apparatus of claim 8, wherein

13. The apparatus of claim 8, wherein the data extraction unit determines, as the summary data section, a predetermined time interval of the video data or audio data before and after the position of the time-tagged word.

14. A method for extracting data using speech recognition, comprising:

performing speech recognition on input video data or audio data;

generating text data matching the speech contained in the video data or audio data, based on the speech recognition result;

comparing the generated text data with a summary keyword that serves as reference information for summarizing the video data or audio data;

performing time tagging, in sentence units, on the text data that matches the summary keyword, based on the comparison result;

determining a summary data section of the video data or audio data based on the time tagging information produced by the time tagging; and

generating summary data by extracting only the determined summary data section from the video data or audio data.

15. The method of claim 14, further comprising displaying the generated text data externally.

16. The method of claim 14, wherein determining the summary data section comprises determining only the time-tagged sentences of the video data or audio data as the summary data section.

17. The method of claim 14, wherein determining the summary data section comprises determining, as the summary data section, a predetermined number of sentences of the video data or audio data before and after the time-tagged sentence.

18. A method for extracting data using speech recognition, comprising:

recognizing, by performing speech recognition on input video data or audio data, a summary keyword that serves as reference information for summarizing the video data or audio data;

performing time tagging, in word units, on the portion of the data that matches the summary keyword, based on the speech recognition result;

extracting only the determined summary data section from the video data or audio data to generate summary data.

19. The method of claim 18, wherein determining the summary data section comprises determining only the portion of the video data or audio data containing the time-tagged word as the summary data section.

20. The method of claim 18, wherein determining the summary data section comprises determining, as the summary data section, a predetermined time interval of the video data or audio data before and after the position of the time-tagged word.

21. A recording medium storing a program for performing the data extraction method of any one of claims 14 to 20.