KR20100062413A

KR20100062413A - Method and apparatus for controling speech recognition of telematics apparatus

Info

Publication number: KR20100062413A
Application number: KR1020080121049A
Authority: KR
Inventors: 안성호; 방준학; 권기구; 이은령; 김종욱; 김재영
Original assignee: 한국전자통신연구원
Priority date: 2008-12-02
Filing date: 2008-12-02
Publication date: 2010-06-10

Abstract

PURPOSE: A method and an apparatus for controlling speech recognition of a telematics apparatus are provided to improve the accuracy of voice-controlled operation by detecting and checking the speaker's lip movement. CONSTITUTION: A voice recognition unit (200) recognizes the speaker's voice received through a voice input unit (1), and a lip reading processor (100) receives and recognizes the speaker's lip shape captured by a photographing unit (2). Depending on whether the lip reading result matches the voice recognition result, a multimodal recognition unit (300) outputs the corresponding command for controlling a telematics device (400).

Description

Speech recognition device for telematics device and method thereof {METHOD AND APPARATUS FOR CONTROLING SPEECH RECOGNITION OF TELEMATICS APPARATUS}

The present invention relates to a speech recognition device for a telematics device and a method thereof. In particular, it relates to a speech recognition device and method that increase the reliability of speech recognition by recognizing both the speaker's voice and the speaker's lip shape when a voice command for the telematics device is input, and comparing the two.

The present invention is derived from research conducted as part of the IT New Growth Engine Core Technology Development program of the Ministry of Knowledge Economy and the Institute for Information Technology Advancement [Project name: Development of embedded SW platform technology for future intelligent vehicle electronics].

In general, telematics refers to the combination of telecommunication and information science. In its early form, wireless communication connected a vehicle to a service center, and the service center provided the information and services requested while the vehicle was in operation. Telematics has since developed in the multimedia field (for example, car radio, car tape players, car CD players, and car TV) and in the information and communication field (for example, electronic devices that provide road information through navigation technology). The input/output methods of telematics electronic devices are likewise moving from simple input such as keypads and buttons toward more convenient methods such as touch screens.

Recently, speech recognition has increasingly been used as the input method for telematics devices; in-vehicle multimedia and navigation systems, for example, accept voice input. Speech recognition technology receives the speaker's voice through a speech recognition device and compares it with preset voice patterns to determine its meaning and recognize the command. For example, when the speaker utters 'instruction', the utterance is compared with the preset voice patterns and recognized as the command 'instruction'. A telematics electronic device such as a multimedia or navigation system that receives the speaker's voice in this way executes the corresponding command, so it can be operated easily without separate manual operation by the user.

However, one of the biggest problems in speech recognition is the acoustic noise collected together with the voice. This noise ranges from the operating noise of the device itself to communication network noise and environmental noise. As a result, although speech recognition is easier to operate than conventional input methods, its accuracy is not yet satisfactory. In particular, because telematics electronic devices are used inside automobiles, the recognition rate falls further when the vehicle interior is not quiet. In other words, ambient noise considerably lowers the accuracy of recognizing the speaker's voice. A method is therefore needed to improve the accuracy of operating telematics electronic devices through in-vehicle speech recognition.

To solve the above problem, an object of the present invention is to provide a speech recognition device for a telematics device, and a method thereof, that improve speech recognition accuracy through a process of detecting and checking lip shape, thereby raising the recognition rate of existing in-vehicle voice-controlled devices.

To solve the above problem, a speech recognition device for a telematics device according to a preferred embodiment of the present invention comprises: speech recognition means for receiving a speaker's voice through voice input means and recognizing the voice to generate a speech recognition result; lip-reading processing means for receiving the speaker's lip shape through photographing means and recognizing the lip shape to generate a lip-reading result; and multimodal recognition means for determining whether the generated speech recognition result and lip-reading result match and, on that basis, outputting the corresponding voice command for controlling the telematics device.

Preferably, the lip-reading processing means includes: a feature extraction unit that receives, through the photographing means, an image signal capturing the speaker's lip shape, recognizes the speaker's face region in the image signal, and tracks the position of the lips within the face region to detect lip features of the speaker's lip shape; and a comparison and determination unit that compares the lip features detected by the feature extraction unit with preset lip-reading patterns to generate the lip-reading result.

Preferably, the speech recognition means includes: a speech recognition unit that receives a voice signal generated by converting the speaker's voice through the voice input means and recognizes the voice by comparing the voice signal with preset voice patterns; and a speech processing unit that processes the voice recognized by the speech recognition unit to generate the speech recognition result.

Preferably, the multimodal recognition means includes: a multimodal interface that receives the speech recognition result and the lip-reading result simultaneously in a multimodal manner; an analysis and determination unit that compares the speech recognition result and the lip-reading result received through the multimodal interface to determine whether they match; and an operation control unit that outputs the associated voice command when the analysis and determination unit determines that the speech recognition result and the lip-reading result match.

Meanwhile, a speech recognition method for a telematics device according to a preferred embodiment of the present invention comprises the steps of: receiving a speaker's voice and lip shape; recognizing the received voice to generate a speech recognition result; recognizing the received lip shape to generate a lip-reading result; and determining whether the speech recognition result and the lip-reading result match and, on that basis, outputting the corresponding voice command for controlling the telematics device.

Preferably, the step of recognizing the received lip shape to generate a lip-reading result includes: receiving an image signal capturing the speaker's lip shape; recognizing the speaker's face region in the image signal; tracking the position of the lips within the face region to detect lip features of the speaker's lip shape; and comparing the lip features with preset lip-reading patterns to generate the lip-reading result.

Preferably, the step of recognizing the received voice to generate a speech recognition result includes: receiving a voice signal generated by converting the speaker's voice; recognizing the voice by comparing the voice signal with preset voice patterns; and processing the recognized voice to generate the speech recognition result.

Preferably, the step of outputting the corresponding voice command for controlling the telematics device includes: receiving the speech recognition result and the lip-reading result simultaneously in a multimodal manner; comparing the received speech recognition result and lip-reading result to determine whether they match; and outputting the associated voice command when the speech recognition result and the lip-reading result match.

Here, the step of receiving the speech recognition result and the lip-reading result simultaneously in a multimodal manner preferably receives both results together, corrects the time difference between them, and generates a single piece of multimodal information.
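As an illustration of the time-difference correction described above, the following Python sketch pairs a speech recognition result with a lip-reading result when their timestamps fall within a tolerance. The names `Result`, `MultimodalInfo`, and `merge`, and the 0.5-second `max_skew`, are assumptions made for this example; the specification does not state how the correction is performed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One recognizer output (hypothetical structure)."""
    label: str        # recognized word, e.g. "start"
    timestamp: float  # seconds at which the utterance was observed


@dataclass(frozen=True)
class MultimodalInfo:
    """Speech and lip-reading results merged into one record."""
    speech: str
    lip: str
    timestamp: float


def merge(speech: Result, lip: Result, max_skew: float = 0.5) -> Optional[MultimodalInfo]:
    """Pair the two results if they are at most max_skew seconds apart.

    max_skew is an assumed tolerance, not a value from the specification.
    """
    if abs(speech.timestamp - lip.timestamp) > max_skew:
        return None  # too far apart to belong to the same utterance
    # Use the earlier timestamp as the time of the merged record.
    return MultimodalInfo(speech.label, lip.label, min(speech.timestamp, lip.timestamp))
```

Results that cannot be paired are simply dropped in this sketch; a real device might instead wait for the slower modality.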

Preferably, the voice command is a preset control command for controlling the telematics device.

According to the present invention, lip movement is detected and checked as an auxiliary user-interface input so that voice control works more smoothly, not only for existing in-vehicle telematics terminals but also for control devices mounted on the vehicle. This improves the accuracy of control through speech recognition.

As a result, when a telematics device is controlled by speech recognition, the invention improves the reliability of speech recognition by reducing the probability that recognition fails or is wrong because of noise or the speaker's unclear pronunciation.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted. The embodiments are provided to explain the invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

FIG. 1 is a block diagram showing a speech recognition device for a telematics device according to an embodiment of the present invention.

Referring to FIG. 1, the speech recognition device for a telematics device according to an embodiment of the present invention comprises voice input means (1), photographing means (2), lip-reading processing means (100), speech recognition means (200), and multimodal recognition means (300).

The voice input means (1) receives the voice uttered by the speaker. Preferably, a microphone or similar device capable of picking up the speaker's voice is used as the voice input means (1). The voice input means (1) receives sound continuously, so it collects not only the speaker's voice but also noise from the surrounding environment. The voice input means (1) thus receives the speaker's voice for speech recognition and generates a voice signal.

The photographing means (2) captures the speaker's lip shape while the speaker is giving voice input. Preferably, a video camera, digital camera, or similar device capable of capturing the speaker's face and lip shape is used as the photographing means (2). The photographing means (2) thus captures the face, and preferably the lip shape, of the speaker who is speaking and generates an image signal.

The lip-reading processing means (100) receives the image signal of the speaker's lip shape from the photographing means (2) and generates the corresponding lip-reading result. It tracks the speaker's lip shape in the image signal to detect lip features, and then determines whether the detected lip features match a preset lip-reading pattern. That is, when the lip features detected from the speaker's lip shape match a preset lip-reading pattern, the lip-reading processing means (100) outputs the lip-reading result for the matching pattern. Because the lip-reading processing means (100) can be readily understood by those skilled in the art, a detailed description is omitted.

In other words, the lip-reading processing means (100) recognizes the speaker's face region in the image signal input from the photographing means (2). It then tracks the position of the lips within the recognized face region to detect lip features of the speaker's lip shape. For example, if a single frame of the image signal contains the speaker's face and a background, the lip-reading processing means (100) recognizes the face region based on the color or shape of the face. It then tracks the lip position within the face region based on the color and shape of the lips, and extracts the features of the lip shape. Any method of extracting such lip-shape features may be applied.

The lip-reading processing means (100) then determines whether the extracted lip features match a lip-reading pattern preset in the database, and generates the corresponding lip-reading result. It searches the database using the speaker's lip features to find a lip-reading pattern that matches them. When the lip features detected from the speaker's lip shape match a preset lip-reading pattern, it generates the lip-reading result for the matching pattern. The lip-reading processing means (100) outputs these lip-reading results sequentially.
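The database search described above can be sketched as a nearest-pattern lookup. This is a minimal Python sketch under stated assumptions: lip features are represented as fixed-length numeric vectors, similarity is Euclidean distance, and a match requires the distance to stay under `max_distance`. None of these choices come from the specification, which leaves the feature representation and matching rule open; `match_lip_pattern` and the example patterns are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def match_lip_pattern(
    feature: Sequence[float],
    patterns: Dict[str, Sequence[float]],
    max_distance: float = 1.0,
) -> Optional[str]:
    """Return the label of the closest stored pattern, or None if nothing is close enough.

    All vectors must have the same length as `feature`.
    """
    best_label: Optional[str] = None
    best_dist = math.inf
    for label, reference in patterns.items():
        dist = math.dist(feature, reference)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None


# Hypothetical preset patterns: two-dimensional features per command word.
PATTERNS = {"start": [1.0, 0.2], "end": [0.1, 0.9]}
```

Returning `None` when no stored pattern is close enough corresponds to the case in which no lip-reading result is generated.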

The speech recognition means (200) recognizes the voice signal generated from the speaker's voice received by the voice input means (1) and outputs a speech recognition result. It recognizes the voice signal of the speaker's utterance, processes the recognized information, and outputs it as the speech recognition result. For example, the speech recognition means (200) generates the corresponding binary or text data by comparing the voice signal, by fixed interval, frequency band, or pitch, against preset voice patterns.

In other words, the speech recognition means (200) recognizes the voice by comparing the voice signal received from the voice input means (1) with preset voice patterns and the like. It then processes the recognized voice to generate and output the speech recognition result.

The multimodal recognition means (300) analyzes multimodal information and outputs information in a single form. It receives the lip-reading result from the lip-reading processing means (100) and the speech recognition result from the speech recognition means (200), compares and analyzes the two, and outputs the final voice command. Here, multimodal refers to a scheme that receives two or more different types of input at the same time and produces a single, optimal output. For example, when the output of the speech recognition unit (210) could be either "jido" (지도, "map") or "sido" (시도, "attempt"), two words that sound alike, and the lip movement is determined to correspond to "jido", the device concludes that the user said "jido".
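The example above, in which lip reading selects between two acoustically similar candidates, can be sketched as follows. This is a hedged Python illustration: the function name `disambiguate` and the representation of recognizer output as a list of candidate strings are assumptions for the example, not part of the specification.

```python
from typing import Optional, Sequence


def disambiguate(speech_candidates: Sequence[str], lip_result: Optional[str]) -> Optional[str]:
    """Pick the speech candidate that the lip-reading result confirms.

    Returns None when lip reading produced nothing or confirms no candidate.
    """
    if lip_result is not None and lip_result in speech_candidates:
        return lip_result
    return None
```

For instance, `disambiguate(["jido", "sido"], "jido")` selects `"jido"`, while a lip-reading result that matches neither candidate yields `None`, leaving the decision to whatever fallback the device applies.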

In other words, the multimodal recognition means (300) is configured to receive multimodal information. It receives two types of input at the same time: for example, the lip-reading result for the speaker's lip movement and the speech recognition result for the speaker's voice. That is, the two inputs to the multimodal recognition means (300) are the outputs of the lip-reading processing means (100) and the speech recognition means (200), respectively, and the multimodal recognition means (300) is configured to receive them together whether or not there is a time difference between them.

In addition, the multimodal recognition means (300) makes its determination based on the arrival time of each input signal (audio and video) and on the analyzed information. It compares the voice command in the speech recognition result with the voice command in the lip-reading result and, if they are the same, outputs that voice command. If two or more different voice command candidates arise, it infers the correct voice command from the preset history in the database and from context awareness. A control signal corresponding to the decision is then generated and output. The control signal is input to a telematics device (400) that can be controlled by in-vehicle speech recognition, and operates that device. For example, when the lip-reading result that matches the speaker's voice is 'start' or 'end', the voice command 'start' or 'end' is output, and the telematics device performs the corresponding operation.

Here, the database is a storage means containing the voice patterns, voice recognition patterns, voice command information, and the like described above. The voice command information stored in the database is preferably set according to the telematics electronic device to which it is applied. For example, when the invention is applied to a navigation system, the stored voice command information includes route-finding information such as place names, roads, shops, and personal names, and control commands for operating the navigation system such as start, end, search, and guide. The database may take any configuration as long as it uses a known storage format. Of course, the database may be provided separately for each component or integrated into one.

Here, the telematics device (400) may include any device usable in a vehicle as a telematics device, such as a navigation system or a multimedia system.

In this way, lip movement is detected and checked as an auxiliary user-interface input so that voice control works more smoothly, not only for existing in-vehicle telematics terminals but also for control devices mounted on the vehicle, which improves the accuracy of control.

FIG. 2 is a block diagram showing the internal configuration of the lip-reading processing means (100) shown in FIG. 1, FIG. 3 is a block diagram showing the internal configuration of the speech recognition means (200) shown in FIG. 1, and FIG. 4 is a block diagram showing the internal configuration of the multimodal recognition means (300) shown in FIG. 1.

Describing the lip-reading processing means (100) in more detail with reference to FIG. 2, it may comprise a feature extraction unit (110), a comparison and determination unit (120), a first output unit (130), and a database (140).

The feature extraction unit (110) processes the image signal input through the photographing means (2) to extract the speaker's lip features. It recognizes the speaker's face region in the image signal, and then tracks the position of the lips within the recognized face region to detect lip features of the speaker's lip shape.

For example, if a single frame of the image signal contains the speaker's face and a background, the feature extraction unit (110) recognizes the speaker's face region based on the color or shape of the face. It then tracks the lip position within the face region based on the color and shape of the lips, and extracts the features of the lip shape.
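The color-based localization in this example can be sketched with plain Python on a tiny RGB frame. This is only an illustration under loud assumptions: the thresholds in `is_skin` and `is_lip` are made up for the example, the frame is assumed to be a non-empty list of equally long rows, and the "feature" is reduced to the width and height of the lip region. The specification allows any extraction method and gives no thresholds.

```python
from typing import Callable, List, Optional, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B), each 0-255
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), inclusive
Frame = List[List[Pixel]]


def is_skin(p: Pixel) -> bool:
    """Assumed skin-tone rule for this example only."""
    r, g, b = p
    return r > 150 and g > 100 and b > 80 and r > b


def is_lip(p: Pixel) -> bool:
    """Assumed lip-color rule (strong red, weak green) for this example only."""
    r, g, b = p
    return r > 150 and g < 90 and b < 110


def bounding_box(frame: Frame, match: Callable[[Pixel], bool],
                 within: Optional[Box] = None) -> Optional[Box]:
    """Smallest box containing every matching pixel, optionally limited to `within`."""
    top, left, bottom, right = within or (0, 0, len(frame) - 1, len(frame[0]) - 1)
    hits = [(r, c) for r in range(top, bottom + 1)
            for c in range(left, right + 1) if match(frame[r][c])]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


def lip_size(frame: Frame) -> Optional[Tuple[int, int]]:
    """Find the face region first, then the lips inside it; return (width, height)."""
    face = bounding_box(frame, is_skin)
    if face is None:
        return None
    lips = bounding_box(frame, is_lip, within=face)
    if lips is None:
        return None
    top, left, bottom, right = lips
    return (right - left + 1, bottom - top + 1)


# A 5x5 toy frame: dark background, a skin-colored face, two lip pixels.
BG, SK, LP = (20, 20, 20), (220, 170, 140), (200, 60, 80)
FRAME = [
    [BG, BG, BG, BG, BG],
    [BG, SK, SK, SK, BG],
    [BG, SK, LP, LP, BG],
    [BG, SK, SK, SK, BG],
    [BG, BG, BG, BG, BG],
]
```

Searching for lips only inside the face box mirrors the two-stage order in the text (face region first, then lip position), which keeps lip-colored background objects from being picked up.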

The comparison and determination unit (120) determines whether the lip features extracted by the feature extraction unit (110) match a lip-reading pattern preset in the database (140), and generates the corresponding lip-reading result. On receiving lip features from the feature extraction unit (110), it searches the database (140) for a lip-reading pattern that matches them. When the lip features detected from the speaker's lip shape match a preset lip-reading pattern, it generates the lip-reading result for the matching pattern. The first output unit (130) then outputs the lip-reading results generated by the comparison and determination unit (120) sequentially.

Describing the speech recognition means (200) in more detail with reference to FIG. 3, it comprises a speech recognition unit (210), a speech processing unit (220), and a second output unit (230). The speech recognition unit (210) recognizes the voice by comparing the voice signal received from the voice input means (1) with preset voice patterns and the like. The speech processing unit (220) processes the voice recognized by the speech recognition unit (210) to generate the speech recognition result.

Describing the multimodal recognition means (300) in more detail with reference to FIG. 4, it comprises a multimodal interface (310), an analysis and determination unit (320), a database (340), and an operation control unit (330).

The multimodal interface (310) is configured to receive multimodal information. It receives two types of input at the same time: for example, the lip-reading result for the speaker's lip movement and the speech recognition result for the speaker's voice.

The analysis and determination unit (320) makes its determination based on the arrival time of each input signal (audio and video) received by the multimodal interface (310) and on the analyzed information. It compares the voice command in the speech recognition result with the voice command in the lip-reading result and, if they are the same, outputs that voice command. If two or more different voice command candidates arise, the analysis and determination unit (320) infers the correct voice command from the speaker's history stored in the database (340) and from context awareness. The operation control unit (330) then generates and outputs a control signal corresponding to the decision. The control signal is input to a telematics device (400) that can be controlled by in-vehicle speech recognition, and operates that device. For example, when the lip-reading result that matches the speaker's voice is 'start' or 'end', the voice command 'start' or 'end' is output.
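The decision flow of the analysis and determination unit (320) and the operation control unit (330) might be sketched as below. This Python example substitutes a simple usage-frequency count for the "history and context awareness" inference, which the specification does not define; the names `decide`, `control_signal`, and `CONTROL_CODES`, and the numeric codes themselves, are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# Hypothetical mapping from voice commands to device control codes.
CONTROL_CODES: Dict[str, int] = {"start": 0x01, "end": 0x02, "search": 0x03, "guide": 0x04}


def decide(speech_cmd: str, lip_cmd: str, history: Sequence[str]) -> Optional[str]:
    """Return the agreed command, or fall back to the speaker's command history.

    Assumption: when the two modalities disagree, the candidate the speaker
    has used more often is chosen; a tie gives no basis for a decision.
    """
    if speech_cmd == lip_cmd:
        return speech_cmd
    counts = Counter(history)  # missing commands count as 0
    if counts[speech_cmd] == counts[lip_cmd]:
        return None
    return speech_cmd if counts[speech_cmd] > counts[lip_cmd] else lip_cmd


def control_signal(command: Optional[str]) -> Optional[int]:
    """Translate a decided command into a control code, if one is defined."""
    if command is None:
        return None
    return CONTROL_CODES.get(command)
```

A usage example: `control_signal(decide("start", "start", []))` yields the code for 'start', while disagreeing results with an empty history yield no control signal at all, so the device takes no action on an unconfirmed command.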

The database 340 is a storage means that holds voice command information and the like. The voice command information stored in the database 340 is preferably set according to the telematics electronic device to which the invention is applied. For example, when the present invention is applied to a navigation system, the voice command information stored in the database includes route-finding information such as place names, roads, shops, and personal names, as well as control commands for operating the navigation system such as start, end, search, and guide.
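One way to picture device-specific voice command information is a lookup table keyed by device type. The sketch below is only an assumption about how such a table might be laid out; `COMMAND_DB`, `lookup`, and the route-finding entries are hypothetical, and only the four navigation control commands come from the description.

```python
from typing import Dict, Optional, Set

# Hypothetical command tables keyed by device type. Only the navigation
# control commands (start, end, search, guide) are named in the description;
# the route-finding entries are placeholders.
COMMAND_DB: Dict[str, Dict[str, Set[str]]] = {
    "navigation": {
        "control": {"start", "end", "search", "guide"},
        "route": {"city hall", "main street"},
    },
}


def lookup(device: str, word: str) -> Optional[str]:
    """Return the category of `word` for `device`, or None if it is not registered."""
    for category, words in COMMAND_DB.get(device, {}).items():
        if word in words:
            return category
    return None
```

Keeping a separate table per device means a word is accepted only if it is registered for the device being controlled.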

As described above, when a telematics device is controlled by voice recognition, the present invention reduces the probability that voice recognition fails or misrecognizes because of noise or the speaker's unclear pronunciation, which has the excellent effect of improving the reliability of voice recognition.

Hereinafter, a speech recognition method for a telematics device according to an embodiment of the present invention will be described.

FIG. 5 is a flowchart illustrating a speech recognition method for a telematics device according to an embodiment of the present invention. In the description, the same reference numerals as those shown in FIGS. 1 to 4 denote elements that perform the same functions.

Briefly, the speech recognition method for a telematics device according to a preferred embodiment of the present invention raises the accuracy of voice recognition through a process of detecting and confirming the lip shape, in order to improve the recognition rate of existing in-vehicle control devices operated by voice recognition. Lip movement is detected and confirmed as an auxiliary to the user interface, so that voice-recognition control of not only existing in-vehicle telematics terminals but also control devices mounted on the vehicle becomes smoother, which improves the accuracy of control.

The speech recognition method for a telematics device according to a preferred embodiment of the present invention will now be described in more detail with reference to FIG. 5. The description uses the speech recognition device for a telematics device according to the preferred embodiment described above as an example.

The method begins when the speaker utters a voice command to operate a telematics device in the vehicle. The voice input means 1 receives the voice uttered by the speaker, and the photographing means 2 photographs the speaker's lip shape while the speaker is speaking (S10). Because the voice input means 1 is a microphone capable of picking up the speaker's voice, it is continuously receiving sound and collects not only the speaker's voice but also noise from the surrounding environment. The voice input means 1 thus receives the speaker's voice and generates a voice signal for voice recognition. When the photographing means 2 is a video camera capable of photographing the speaker's face and lip shape, it continuously detects the face and lip shape of the speaker who is speaking and generates an image signal.

Next, the voice recognition means 200 compares the voice signal received from the voice input means 1 with preset voice patterns and the like to recognize the corresponding voice, and processes the recognized voice to generate a voice recognition result (S20). That is, the voice recognition means 200 generates an initial voice recognition result from the speaker's utterance, and this result may contain information misrecognized because of in-vehicle noise or the speaker's unclear pronunciation.

Meanwhile, while the voice recognition means 200 is recognizing the voice, the lip reading processing means 100 receives the image signal of the speaker's lip shape from the photographing means 2 and generates the corresponding lip reading result (S30). The lip reading processing means 100 processes the image signal input through the photographing means 2, extracts the speaker's lip features, and compares them with preset lip reading patterns to output the lip reading result.

In other words, the lip reading processing means 100 determines whether the lip features extracted from the speaker's lip shape match a lip reading pattern preset in the database, and generates the corresponding lip reading result. That is, it searches the database for a lip reading pattern that matches the input lip features. When the lip features detected from the speaker's lip shape match a preset lip reading pattern, the lip reading processing means 100 generates the lip reading result for the matching pattern and outputs the results sequentially.
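The description does not say how a "match" between lip features and a preset pattern is measured. As one possible reading, the sketch below uses a nearest-neighbor search over feature vectors with a Euclidean distance threshold; the feature vectors, the `LIP_PATTERNS` table, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
from typing import Dict, Optional, Sequence

# Hypothetical pattern database: command word -> reference lip feature vector.
LIP_PATTERNS: Dict[str, Sequence[float]] = {
    "start": (0.8, 0.3, 0.5),
    "end": (0.2, 0.7, 0.4),
}


def match_lip_pattern(features: Sequence[float],
                      patterns: Dict[str, Sequence[float]] = LIP_PATTERNS,
                      threshold: float = 0.25) -> Optional[str]:
    """Return the label of the closest preset pattern, or None if none is close enough."""
    best_label, best_dist = None, float("inf")
    for label, reference in patterns.items():
        dist = math.dist(features, reference)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    # Accept the closest pattern only when it lies within the assumed threshold.
    return best_label if best_dist <= threshold else None
```

Returning `None` when no pattern is close enough lets the later comparison stage treat the lip reading result as absent rather than forcing a wrong label.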

If the image signal captured by the photographing means 2 contains the speaker's face, the background, and so on along with the lip shape, the lip reading processing means 100 first recognizes the speaker's face region in the image signal input from the photographing means 2. It then tracks the position of the lips within the recognized face region and detects the lip features of the speaker's lip shape. For example, if a single frame of the image signal contains the speaker's face and the background, the face region is recognized on the basis of the color or shape of the face. The lip position is then tracked within the face region on the basis of the color and shape of the lips, and the features of the lip shape are extracted.

Next, the multi-modal recognition means 300 receives the voice recognition result and the lip reading result simultaneously (S40). That is, it simultaneously receives two types of input, for example, the lip reading result for the speaker's lip movement and the voice recognition result for the speaker's voice. These two inputs are the outputs of the lip reading processing means 100 and the voice recognition means 200, respectively.

At this time, if a time difference exists between the input voice recognition result and the input lip reading result, the multi-modal recognition means 300 corrects it. The time correction can be performed by comparing the number of data items in the voice recognition result and the lip reading result, or by inserting start and end flags into each piece of information in advance.
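The two correction strategies mentioned here, start and end flags and a comparison of data counts, might be combined along the following lines. This is a sketch under stated assumptions: the flag tokens `<START>` and `<END>`, the list-of-strings stream format, and the function names are invented for illustration.

```python
from typing import List, Tuple

START, END = "<START>", "<END>"  # assumed flag tokens


def trim_to_flags(stream: List[str]) -> List[str]:
    """Keep only the items between the first START flag and the following END flag.

    Raises ValueError if either flag is missing from the stream.
    """
    begin = stream.index(START) + 1
    finish = stream.index(END, begin)
    return stream[begin:finish]


def align(speech: List[str], lips: List[str]) -> Tuple[List[str], List[str], bool]:
    """Trim both streams to their flagged span and report whether the counts agree."""
    trimmed_speech = trim_to_flags(speech)
    trimmed_lips = trim_to_flags(lips)
    return trimmed_speech, trimmed_lips, len(trimmed_speech) == len(trimmed_lips)
```

Trimming to the flags discards items that arrived before or after the utterance in either stream, and the count check gives a cheap signal that the two streams still cover the same utterance.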

Next, the multi-modal recognition means 300 receives the lip reading result produced by the lip reading processing means 100 and the voice recognition result produced by the voice recognition means 200, and compares and analyzes the two (S50). In making its determination, the multi-modal recognition means 300 compares the voice command in the voice recognition result with the voice command in the lip reading result and, if they are the same, outputs that voice command (S70). The operation control unit 330 then generates and outputs the corresponding control signal (S70). The control signal is input to the telematics device 400, which may be any in-vehicle device controllable by voice recognition, and causes the telematics device 400 to operate. For example, when the lip reading result that matches the speaker's voice is 'start' or 'end', the voice command 'start' or 'end' is output. As described above, the voice command information stored in the database is set according to the telematics electronic device to which the invention is applied. For example, when the present invention is applied to a navigation system, it comprises route-finding information such as place names, roads, shops, and personal names, and control commands for operating the navigation system such as start, end, search, and guide.

Meanwhile, if two or more different voice command candidates arise in step S50, the multi-modal recognition means 300 infers the correct voice command from the speaker's history stored in the database and from situation awareness (S60). The operation control unit 330 then generates and outputs the corresponding control signal (S70). The control signal is input to the telematics device 400, which may be any in-vehicle device controllable by voice recognition, and causes the telematics device 400 to operate. For example, when no lip reading result matches the speaker's voice and the lip reading result is 'start' or 'end', the voice command 'start' or 'end' is output.
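The description leaves open how the speaker's history and situation awareness are used in step S60. One simple interpretation, offered only as a sketch, filters the candidates by the commands that are valid in the current state and then picks the one the speaker has used most often. The `infer_command` function, the usage-count history, and the `allowed` set are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def infer_command(candidates: Iterable[str],
                  history: Counter,
                  allowed: Set[str]) -> Optional[str]:
    """Pick the most plausible candidate, or None if no candidate is currently valid.

    history: assumed per-speaker usage counts of past commands.
    allowed: assumed set of commands valid in the current situation.
    """
    viable = [c for c in candidates if c in allowed]
    if not viable:
        return None
    # A Counter returns 0 for unseen commands; ties keep the earliest candidate.
    return max(viable, key=lambda c: history[c])
```

With a history of `Counter({"search": 5, "start": 2})`, the candidates `["start", "search"]` resolve to `"search"` when both are allowed, and to `"start"` when only `"start"` is valid in the current state.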

As described above, lip movement is detected and confirmed as an auxiliary to the user interface, so that voice-recognition control of not only existing in-vehicle telematics terminals but also control devices mounted on the vehicle becomes smoother, which improves the accuracy of control.

Although a preferred embodiment of the present invention has been shown and described above, the present invention is not limited to the specific embodiment described. Those of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical idea or outlook of the present invention.

FIG. 1 is a block diagram showing a telematics speech recognition device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the internal configuration of the lip reading processing means shown in FIG. 1.

FIG. 3 is a block diagram showing the internal configuration of the voice recognition means shown in FIG. 1.

FIG. 4 is a block diagram showing the internal configuration of the multi-modal recognition means shown in FIG. 1.

FIG. 5 is a flowchart showing a telematics speech recognition method according to an embodiment of the present invention.

<Description of main parts of the drawings>

1: voice input means 2: photographing means

100: lip reading processing means 110: feature extraction unit

120: comparison and determination unit 130: first output unit

200: voice recognition means 210: voice recognition unit

220: voice processing unit 230: second output unit

300: multi-modal recognition means 310: multi-modal interface

320: analysis and determination unit 330: operation control unit

140, 340: database 400: telematics device

Claims

A speech recognition device for a telematics device, comprising:

voice recognition means for receiving a voice of a speaker through voice input means and recognizing the voice of the speaker to generate a voice recognition result;

lip reading processing means for receiving a lip shape of the speaker through photographing means and recognizing the lip shape of the speaker to generate a lip reading result; and

multi-modal recognition means for determining whether the generated voice recognition result and the lip reading result match, and outputting a corresponding voice command for controlling the telematics device.

The speech recognition device according to claim 1,

wherein the lip reading processing means comprises:

a feature extraction unit for receiving, through the photographing means, an image signal in which the lip shape of the speaker is photographed, recognizing a face region of the speaker from the input image signal, and tracking the position of the lips within the face region to detect lip features of the speaker's lip shape; and

a comparison and determination unit for comparing the lip features detected by the feature extraction unit with a preset lip reading pattern to generate the lip reading result.

The speech recognition device according to claim 1,

wherein the voice recognition means comprises:

a voice recognition unit for receiving a voice signal generated by converting the voice of the speaker through the voice input means, and recognizing the voice by comparing the input voice signal with a preset voice pattern; and

a voice processing unit for processing the voice recognized by the voice recognition unit to generate the voice recognition result.

The speech recognition device according to claim 1,

wherein the multi-modal recognition means comprises:

a multi-modal interface for simultaneously receiving the voice recognition result and the lip reading result in a multi-modal recognition manner;

an analysis and determination unit for comparing the voice recognition result and the lip reading result input through the multi-modal interface to determine whether they match; and

an operation control unit for outputting the voice command associated with the voice recognition result and the lip reading result when, according to the determination of the analysis and determination unit, the two results match.

A speech recognition method for a telematics device, comprising:

receiving a voice and a lip shape of a speaker;

recognizing the received voice of the speaker to generate a voice recognition result;

recognizing the received lip shape of the speaker to generate a lip reading result; and

determining whether the voice recognition result and the lip reading result match, and outputting, based on the result of the determination, a corresponding voice command for controlling the telematics device.

The method according to claim 5,

wherein recognizing the received lip shape of the speaker to generate a lip reading result comprises:

receiving an image signal in which the lip shape of the speaker is photographed;

recognizing a face region of the speaker from the input image signal;

tracking the position of the lips within the face region of the speaker to detect lip features of the speaker's lip shape; and

comparing the lip features with a preset lip reading pattern to generate the lip reading result.

The method according to claim 5,

wherein recognizing the received voice of the speaker to generate a voice recognition result comprises:

receiving a voice signal generated by converting the voice of the speaker;

recognizing the voice by comparing the input voice signal with a preset voice pattern; and

processing the recognized voice to generate the voice recognition result.

The method according to claim 5,

wherein outputting the voice command for controlling the telematics device comprises:

simultaneously receiving the voice recognition result and the lip reading result in a multi-modal recognition manner;

comparing the input voice recognition result with the input lip reading result to determine whether they match; and

outputting the voice command associated with the voice recognition result and the lip reading result when the two results match.

The method according to claim 8,

wherein simultaneously receiving the voice recognition result and the lip reading result in a multi-modal recognition manner comprises

receiving the voice recognition result and the lip reading result simultaneously and correcting the time difference between the respective pieces of multi-modal information.

The method according to claim 5 or 8,

wherein the voice command is a preset control command for controlling the telematics device.