KR100822880B1

KR100822880B1 - Speaker Recognition System and Method through Audio-Video Based Sound Tracking in Intelligent Robot Environment

Info

Publication number: KR100822880B1
Application number: KR1020060104171A
Authority: KR
Inventors: 곽근창; 김혜진; 배경숙; 지수영
Original assignee: 한국전자통신연구원
Priority date: 2006-10-25
Filing date: 2006-10-25
Publication date: 2008-04-17

Abstract

The present invention relates to a system and method that perform speaker recognition (user identification) through sound localization based on audio and video information when a user calls an intelligent robot by voice, in an intelligent robot environment with a built-in multi-channel sound board. The method comprises: performing speech recognition on a voice input in the intelligent robot environment to confirm whether it is a call voice actually used for calling; performing speaker recognition on the confirmed call voice to determine who the caller is; comparing the confirmed call voice with an already built universal background model to verify whether the caller is a family member; and, if the caller is a family member, rotating the intelligent robot by the azimuth obtained through sound localization and performing face detection and face recognition to find and approach the caller. As a result, humans and robots can interact naturally, with better accuracy and reliability, in a complex environment where several people are present.

Description

Speaker identification system and method through sound localization based audio-visual under robot environments

FIG. 1 is a block diagram of a speaker recognition system through audio-video based sound localization in an intelligent robot environment;

FIG. 2 shows the arrangement of the microphones and camera used to perform audio-video based sound localization on an intelligent robot; and

FIG. 3 is a flowchart of a speaker recognition method through audio-video based sound localization in an intelligent robot environment, using the system of FIG. 1.

Explanation of reference numerals for the main parts of the drawings:

10: Call voice recognition unit
11: Call voice input unit
12: Voice detection unit
13: Speech recognition unit
20: Speaker recognition/verification unit
30: Sound localization unit
40: Robot control unit
50: Face detection/recognition unit
51: Face detection unit
52: Face recognition unit
60: Camera
M: Microphone

The present invention relates to a system and method by which, in an intelligent robot environment, a robot localizes a sound source from the caller's specific call voice or sound and approaches the caller to provide a service. More specifically, it relates to a system and method in which, when a user calls an intelligent robot by voice in an intelligent robot environment with a built-in multi-channel sound board, the robot performs sound localization based on audio and video information and thereby performs speaker recognition (user identification).

Among the human-robot interaction technologies recently used to give robots intelligence, the most basic is sound localization, which lets a robot approach the person who called it.

Sound localization in robot environments is currently the subject of much research. The best-known approach relies only on the microphones' audio information: the robot reacts to and tracks only a specific sound, such as the caller's handclap, and rotates toward the direction of the call. The technique is relatively simple, but it is prone to error in environments with ambient noise.

To address this problem, multimodal sound localization methods that fuse audio information with the camera's video information have attracted wide attention. To compensate for localization error, they use face detection, for example approaching the person at the center of the robot camera's image or approaching the nearest user. In this way the localization error described above can be compensated with a face detection algorithm, so sound localization can be performed more effectively by using both audio and video information.

However, when several people are present after sound localization, these methods use face detection to approach the nearest person, that is, the person whose face image is largest. Because such a method does not know who the caller is, the robot may approach someone other than the caller. In other words, when several people appear in the camera image, face detection alone makes it difficult to approach the user who actually called the robot.

In addition, rather than performing sound localization immediately upon hearing a voice, it is more human-friendly to first recognize whether the voice is a call voice (speech recognition) and then localize the sound source.

Therefore, more accurate sound localization requires that sound localization and user recognition be used together, and for a robot to have human-like reasoning ability, a technique is needed that determines in advance whose voice the call is and then performs sound localization.

Furthermore, to approach the caller already identified by speaker recognition after sound localization, the method must be combined with face recognition that operates on faces detected in the video information obtained from the camera.

Accordingly, the present invention has been made to solve the problems of the prior art described above, and its object is to provide a multimodal sound localization and speaker recognition system and method based on audio and video information obtained from a multi-channel sound board and a camera, respectively, for human-robot interaction of an intelligent robot.

Specifically, the object is to provide a multimodal sound localization and speaker recognition system and method, based on audio and video information, that use: speech recognition to determine from the input voice whether it is a call or a robot command; speaker recognition and verification to determine which family member is calling and whether the caller is a family member at all; and face detection and face recognition to find, after the robot has localized the sound source and rotated, the family member already identified by speaker recognition among the several people in the robot camera's image.

To achieve the above object, the speaker recognition system through audio-video based sound localization in an intelligent robot environment according to the present invention comprises: a call voice recognition unit that recognizes an input voice and determines whether it is an actual call voice; a speaker recognition/verification unit that first recognizes the caller of the call voice and then verifies the recognized caller; a sound localization unit that performs sound localization using the time delay of the call voice; a robot control unit that controls the robot to rotate by the azimuth obtained through the sound localization or to approach the caller; and a face detection/recognition unit that finds the verified caller through face detection and face recognition in the image input through a camera.

Likewise, the speaker recognition method through audio-video based sound localization in an intelligent robot environment according to the present invention comprises: performing speech recognition on a voice input in the intelligent robot environment to confirm whether it is a call voice actually used for calling; performing speaker recognition on the confirmed call voice to determine who the caller is; comparing the confirmed call voice with an already built universal background model to verify whether the caller is a family member; and, if the caller is a family member, rotating the intelligent robot by the azimuth obtained through sound localization and performing face detection and face recognition to find and approach the caller.
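The order of the method steps can be sketched as a small control flow. This is only an illustration: every callable passed in (`is_call_voice`, `identify_speaker`, `is_family_member`, `localize_azimuth`, `rotate_robot`, `find_caller_face`, `approach`) and every return string is a hypothetical name standing in for a component the text describes, not an interface defined by the patent.

```python
def respond_to_voice(audio, is_call_voice, identify_speaker, is_family_member,
                     localize_azimuth, rotate_robot, find_caller_face, approach):
    """Run the method steps in order and report how far the flow got.

    All callables are hypothetical stand-ins supplied by the caller.
    """
    if not is_call_voice(audio):             # speech recognition
        return "ignored"
    caller = identify_speaker(audio)         # speaker recognition
    if not is_family_member(audio, caller):  # speaker verification
        return "rejected"
    rotate_robot(localize_azimuth(audio))    # sound localization, then rotation
    if not find_caller_face(caller):         # face detection and face recognition
        return "caller not found"
    approach(caller)
    return "approached"
```

Passing the stages in as functions keeps the sketch independent of any particular recognizer or robot platform.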

Thus, when a user calls the robot by its name, the robot determines through speech recognition whether it has been called and performs speaker recognition on that voice. This reveals both which family member called and whether the caller is a family member at all. The sound source is then localized with a sound localization algorithm, and the robot rotates in that direction.

Even if several people appear in the image acquired by the robot camera, the robot can perform face detection and face recognition and naturally approach the caller already identified by speaker recognition.

Hereinafter, the speaker recognition system through audio-video based sound localization in an intelligent robot environment according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 shows the speaker recognition system through audio-video based sound localization in an intelligent robot environment. To perform multimodal sound localization in a complex environment where several people, including the caller, are present, the system of the present invention comprises a call voice recognition unit (10), a speaker recognition/verification unit (20), a sound localization unit (30), a robot control unit (40), and a face detection/recognition unit (50).

The call voice recognition unit (10) includes a call voice input unit (11), a voice detection unit (12), and a speech recognition unit (13), and determines through speech recognition whether an input is an actual call voice. A sound used to call the robot could be a specific sound (such as a handclap), a specific sound pattern, or a call voice such as the robot's name; in the present invention the robot performs sound localization only for a call voice such as the robot's name. This is because a specific sound or pattern such as a handclap can be confused with similar sounds, causing the robot to perform sound localization regardless of whether it was actually called. Speech recognition is therefore used so that the robot responds with sound localization only to a specific call voice.

Specifically, the call voice input unit (11) receives the user's voice and passes it to the voice detection unit (12), which applies an endpoint detection algorithm based on log energy to find the start and end points and thereby detect the speech. The voice detection unit (12) then removes noise from the detected speech with a Wiener filter and passes it to the speech recognition unit (13). The speech recognition unit (13) performs speech recognition on the denoised speech using the well-known hidden Markov model (HMM) to determine whether it is a call voice actually used for calling, and passes the recognition result to the speaker recognition/verification unit (20).
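The log-energy endpoint detection step can be illustrated with a minimal sketch. The patent does not give the frame length, the threshold, or the exact decision rule, so `frame_len`, `threshold`, and the rule "first and last frame whose log energy exceeds the threshold" are assumptions made for illustration.

```python
import math


def log_energy(frame):
    """Log of the frame's mean squared amplitude (a small floor avoids log(0))."""
    return math.log(sum(s * s for s in frame) / len(frame) + 1e-12)


def detect_endpoints(samples, frame_len=160, threshold=-6.0):
    """Return (start, end) sample indices of the speech region, or None.

    frame_len and threshold are illustrative values, not taken from the patent.
    """
    active = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        if log_energy(samples[i:i + frame_len]) > threshold:
            active.append(i)
    if not active:
        return None
    # The region runs from the first active frame to the end of the last one.
    return active[0], active[-1] + frame_len
```

A practical detector would usually add smoothing or hangover frames so that short pauses inside the utterance do not split it; that refinement is omitted here.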

To perform speaker recognition, the speaker recognition/verification unit (20) divides a predetermined utterance, such as the robot's name, into frames and computes a feature vector (mel-frequency cepstral coefficients) for each frame. It then performs speaker recognition by finding the maximum log-likelihood of the call voice under the speaker models already built during speaker enrollment. For speaker verification, it subtracts the score obtained under the already built universal background model (UBM) from the score obtained through speaker recognition and compares the difference with a preset threshold (described in detail with Equation 5 in the description of FIG. 3). If the difference is at or below the threshold, the call is rejected; if it is above the threshold, the caller is accepted as a family member.
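The accept/reject rule for speaker verification reduces to a comparison of a score difference against a threshold. The sketch below assumes both scores are log-likelihoods already computed elsewhere; the function and parameter names are hypothetical.

```python
def verify_family_member(speaker_score, background_score, threshold):
    """Return True to accept the caller as a family member, False to reject.

    speaker_score:    log-likelihood under the recognized speaker's model
    background_score: log-likelihood under the universal background model
    A difference at or below the threshold is rejected, matching the text.
    """
    return (speaker_score - background_score) > threshold
```

Subtracting the background score normalizes for how speech-like the input is in general, so the threshold tests how much better the claimed speaker's model explains the utterance than a generic speaker model does.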

The speaker models are built as follows: each speaker is enrolled online with several sentences, including the family member's call voice; mel-frequency cepstral coefficient (MFCC) feature vectors are extracted from the speech obtained on each channel; and a Gaussian mixture model (GMM) speaker model is created for every family member.

The sound localization unit (30) performs sound localization using the well-known time-delay approach. The robot rotates by the azimuth obtained through sound localization.

The face detection/recognition unit (50) includes a face detection unit (51) and a face recognition unit (52). The face detection unit (51) uses the camera (60) to capture the people in the direction the robot faces after rotating by the azimuth obtained by the sound localization unit (30), and performs face detection based on eye detection and skin-color information to determine how many people are in the captured image.

The face recognition unit (52) performs face recognition on the detected face images using multiple principal component analysis and edge information.

Here, the face recognition unit (52) reconfirms the caller identified by speaker recognition by performing face recognition against the face images of family members already enrolled online, and the robot approaches the final caller who matches.

FIG. 2 shows the arrangement of three microphones (M) and a camera for audio-video based sound localization on an intelligent robot. The number of microphones on the multi-channel sound board can be chosen from three to eight; for simplicity, three microphones are assumed here.

Next, the speaker recognition method through audio-video based sound localization in an intelligent robot environment, using the configuration described above, will be described with reference to FIG. 3.

FIG. 3 shows the flow of the speaker recognition method through multimodal sound localization based on audio-video information in the intelligent robot environment proposed by the present invention, using the system of FIG. 1. Again, the multi-channel sound board can have from three to eight microphones, and three are assumed here.

In a complex environment where several people, including the caller, are present, the method performs speech recognition on the input voice (S300) to confirm whether it is an actual call voice (S310).

Specifically, as a preprocessing step for speech recognition, an endpoint detection algorithm based on log energy is applied to the input voice (S300) to find the start and end points and detect the call utterance. A Wiener filter then removes ambient noise to enhance the detected speech. Speech recognition is performed on the denoised speech using the well-known hidden Markov model (HMM), and it is determined whether the utterance is a voice actually used for calling (S310). The present invention is limited to call voices uttered within 3 m.

If the detected speech is recognized as a call voice (S310), speaker recognition (S320) and speaker verification (S330) are performed on it to determine who the speaker is and whether the speaker is a family member.

Specifically, when the actual call voice passed on from S310 arrives as input, the predetermined utterance, such as the robot's name, is divided into frames and a feature vector (mel-frequency cepstral coefficients, MFCC) is computed for each frame. Speaker recognition is then performed by finding the maximum log-likelihood of the call voice under the speaker models already built during speaker enrollment (S320). Next, the score obtained under the already built universal background model (UBM) is subtracted from the score obtained through speaker recognition (S320), and the difference is compared with a preset threshold to verify whether the speaker is a family member (S330; described below with Equation 5). That is, if the difference is at or below the threshold, the call is rejected; if it is above the threshold, the caller is accepted as a family member.

The speaker models are built as follows: each speaker is enrolled online with several sentences, including the family member's call voice; MFCC feature vectors are extracted from the speech obtained on each channel; and a Gaussian mixture model (GMM) speaker model is created for every family member.

Thus, if family member "A" is recognized through steps S300 to S330 (Yes), the robot can respond with a message such as "Did you call, A?" Conversely, even if the call voice is recognized by speech recognition, if speaker verification shows that the speaker is not a family member (No), the robot can refuse the call with a message such as "You are not a family member." (S380).

Then sound localization using the well-known time-delay approach is performed (S350), and the robot rotates by the azimuth obtained (S360). Even with multimodal sound localization, a large error at this step (S350) can make the caller difficult to find.

However, because the present invention knows the caller through speaker recognition, after sound localization the robot can look around and find the speaker through face detection and face recognition.

After the robot has rotated (S360) by the azimuth obtained through sound localization (S350), face detection and face recognition reveal how many people are in the robot camera's image and who they are, so the actual caller is reconfirmed (S370). The robot then approaches the actual caller.

The present invention thus solves two problems: with conventional methods that use only face detection, it is uncertain which of several people the robot should approach, and even when face recognition alone is performed, the caller cannot be determined when several people are present.

To describe the steps above more concretely with equations: in S320, the distribution of the feature vectors extracted from each speaker's speech acquired during online speaker enrollment is modeled by a Gaussian mixture density. For a D-dimensional feature vector, the mixture density for a speaker is given by Equation 1.

[Equation 1]

Here w_i is the mixture weight, b_i is the probability obtained from the Gaussian mixture model, and D is the dimension of the feature vector x. The density p is a weighted linear combination of M Gaussian components, each parameterized by a mean vector μ_i and a covariance matrix Σ_i.
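The image for Equation 1 is not present in this text. Assuming the standard Gaussian mixture speaker-model form, which is consistent with the symbols defined above (M components, mixture weights, component densities b_i, mean vectors μ_i, covariance matrices Σ_i, dimension D), the density would read as follows. The vector symbol x, the model symbol λ, and the weight constraint are part of that assumption rather than recovered from the source.

```latex
\begin{aligned}
p(x \mid \lambda) &= \sum_{i=1}^{M} w_i \, b_i(x), \qquad \sum_{i=1}^{M} w_i = 1, \\
b_i(x) &= \frac{1}{(2\pi)^{D/2} \, \lvert \Sigma_i \rvert^{1/2}}
          \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
\end{aligned}
```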

Given speech enrolled online from a speaker, the parameters of the Gaussian mixture model are estimated. A well-known method is maximum likelihood estimation. For an utterance consisting of T frames, the likelihood of the Gaussian mixture model is given by Equation 2.

[Equation 2]

Here the speaker model parameters λ_s are the set {w_i, μ_i, Σ_i} of weights, means, and covariances, with i = 1, 2, ..., M, and X = {x_1, x_2, ..., x_T}.
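The image for Equation 2 is likewise missing. Under the usual assumption that the T frames are treated as independent, the likelihood of an utterance and its logarithm take the form below; this is a reconstruction from the surrounding definitions, not a copy of the original figure.

```latex
\begin{aligned}
p(X \mid \lambda_s) &= \prod_{t=1}^{T} p(x_t \mid \lambda_s), \\
\log p(X \mid \lambda_s) &= \sum_{t=1}^{T} \log p(x_t \mid \lambda_s)
\end{aligned}
```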

For convenience, the likelihood in Equation 2 is converted to a logarithm. The maximum likelihood parameter estimates are obtained with the well-known expectation-maximization (EM) algorithm, as in Equation 3.

[Equation 3]

The re-estimated weights, means, and covariances in Equation 3 are the parameters estimated by the EM algorithm. Applying these values to Equation 4 yields the speaker recognition result, which identifies which family member the caller is (S320).

[Equation 4]

Here k = 1, ..., S, and each of the S speakers is represented by that speaker's own model λ_k.
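Speaker recognition as described here amounts to scoring the utterance under each enrolled model and picking the highest log-likelihood. The sketch below assumes diagonal covariance matrices and represents each model as a list of `(weight, mean, variance)` tuples; both choices, and all function names, are illustrative assumptions rather than details given in the patent.

```python
import math


def component_log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, model):
    """Sum over frames of log(sum_i w_i * b_i(x_t)), using log-sum-exp for stability."""
    total = 0.0
    for x in frames:
        logs = [
            math.log(w) + component_log_density(x, mean, var)
            for w, mean, var in model
        ]
        peak = max(logs)
        total += peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total


def identify_speaker(frames, models):
    """Return the enrolled name whose model gives the highest log-likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, models[name]))
```

For example, with `models = {"A": [(1.0, [0.0, 0.0], [1.0, 1.0])], "B": [(1.0, [5.0, 5.0], [1.0, 1.0])]}`, frames clustered near (5, 5) are attributed to "B".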

In addition, when the value obtained with Equation 5 is at or below the preset threshold θ, speaker verification reports that the speaker is not a family member (S330).

[Equation 5]

Here, given the feature vectors X, S is a family member and S' is an impostor. log L(X) is the normalized log-likelihood score. H0 is the hypothesis that X comes from the hypothesized speaker S, and H1 is the hypothesis that it does not.
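The image for Equation 5 is missing. A form consistent with the definitions above and with the score subtraction described for the speaker recognition/verification unit would be the log-likelihood ratio test below. The model symbols λ_S and λ_{S'} are assumed notation, with λ_{S'} standing for the universal background model.

```latex
\log L(X) = \log p(X \mid \lambda_{S}) - \log p(X \mid \lambda_{S'})
\quad
\begin{cases}
> \theta & \Rightarrow \ \text{accept } H_0 \\
\le \theta & \Rightarrow \ \text{accept } H_1
\end{cases}
```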

Next, to detect the sound source, its energy is computed using Equation 6. For N analysis intervals, this equation gives the energy of the samples from the start of the analysis interval up to n samples. The energy threshold must be set in advance for the specific call voice.

[Equation 6]

Then, to determine the interval in which the sound source occurred, the delay time is obtained from the well-known generalized cross-correlation using Equation 7.

[Equation 7]

Here the correlation term in Equation 7 denotes the autocorrelation function, and d is set to a value between -10,000 and 10,000.
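The idea of estimating a delay from the peak of a correlation function can be illustrated as follows. Note the substitution: this sketch uses a plain time-domain cross-correlation between two microphone signals, not the generalized cross-correlation of Equation 7 (which is missing from this text and typically adds a frequency-domain weighting). The function name and the `max_lag` parameter are assumptions.

```python
def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag d (in samples) that maximizes sum_n sig_a[n] * sig_b[n + d].

    A positive result means sig_b is a delayed copy of sig_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, a in enumerate(sig_a):
            m = n + lag
            if 0 <= m < len(sig_b):
                score += a * sig_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Dividing the returned lag by the sampling rate converts it to seconds. The direct double loop costs on the order of `len(sig_a) * max_lag` operations, which is why practical systems compute the correlation with an FFT instead.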

Once the delay times are obtained, the microphone signal showing the smallest delay time is selected, the azimuth is calculated for it using Equation 8 (S350), and the robot rotates by that azimuth (S360).

[Equation 8]

Here the delay term is the delay time obtained above, R is the distance from the center of the circle to each microphone, and C is the speed of sound. In addition, when there are three microphones, [expression missing in source].
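Equation 8 itself is not recoverable from this text, so the following is not the patent's formula. It shows the common far-field relation for a single microphone pair, sin(θ) = C·τ / d, where d is the spacing between the two microphones. The spacing parameter, the 343 m/s default for the speed of sound, and the function name are all assumptions made for illustration.

```python
import math


def azimuth_from_delay(delay_s, mic_spacing_m, sound_speed=343.0):
    """Far-field azimuth in degrees for one microphone pair.

    Uses sin(theta) = sound_speed * delay / spacing. The ratio is clamped to
    [-1, 1] so that measurement noise cannot push asin outside its domain.
    """
    ratio = sound_speed * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

For microphones placed evenly on a circle of radius R, the spacing between two adjacent microphones in a three-microphone layout is R·√3, which is one way the R in the text could enter such a relation.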

Next, after sound localization (S350), face detection based on eye detection and skin-color information is performed to determine how many people are in the camera image (S370). Face recognition is then performed on the detected face images using multiple principal component analysis and edge information (S370).

The face images of the family members are enrolled online in advance. The caller identified by speaker recognition is therefore reconfirmed by face recognition (S370), and the robot approaches that caller.

Although the present invention has been described in detail above with reference to several embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical spirit.

As described above, the present invention provides a system and method for audio-video based sound source tracking and user recognition for human-robot interaction in an intelligent robot. Audio-video based sound source tracking has mainly approached the caller by fusing sound source tracking with face detection. The present invention instead combines speech recognition, speaker recognition, and face recognition so that sound source tracking is performed in a human-like way. This provides better accuracy and reliability in complex environments with several people, together with more human-friendly behavior, and is therefore highly effective in enabling natural interaction between humans and robots in the field of intelligent robots.

Claims

A speaker recognition system through audio-video based sound source tracking in an intelligent robot environment, comprising:

a call voice recognition unit comprising a voice input unit that receives a voice, a voice detection unit that detects a start point and an end point of the input voice, and a speech recognition unit that removes noise from the detected voice and recognizes it to determine whether it is the call voice used for an actual call, so that the intelligent robot reacts only to the call voice;

a speaker recognition/verification unit that first recognizes who the actual caller is from the call voice and then verifies the recognized caller;

a sound source tracking unit that performs sound source tracking using the delay time of the call voice;

a robot controller that controls the robot to rotate by the azimuth obtained through the sound source tracking or to approach the caller; and

a face detection/recognition unit that finds the verified caller through face detection and face recognition in the image input through a camera.

delete

The system of claim 1, wherein the speaker recognition/verification unit:

divides the specific call voice into frames, obtains a Mel-Frequency Cepstral Coefficients (MFCC) feature vector for each frame, and performs speaker recognition using the maximum log-likelihood value of the call voice based on a Gaussian Mixture Model already constructed through speaker registration; and

performs speaker verification by comparing the difference between the score obtained from the speaker recognition and the score obtained from an already constructed Universal Background Model (UBM) with a predetermined threshold.

The system of claim 3, wherein the sound source tracking unit

uses an energy threshold for sound source detection and calculates the azimuth by selecting the microphone signal showing the smallest delay time.

The system of claim 4, wherein the face detection/recognition unit comprises:

a face detection unit that determines how many people are in the image captured by the camera through face detection based on eye detection and skin color information; and

a face recognition unit that reconfirms the caller by performing face recognition on the detected face image using multiple principal component analysis and edge information.

delete

The system of claim 3, wherein the speaker verification using the threshold

rejects the call if the compared value is below a certain value and accepts it if the value is at or above that value.

The system of claim 3, wherein the constructed Gaussian Mixture Model

is built for each family member by registering the speaker through several sentences received online, including the family member's specific call voice, and obtaining Mel-Frequency Cepstral Coefficients (MFCC) feature vectors from the speech of the input sentences.

A speaker recognition method through audio-video based sound source tracking in an intelligent robot environment, comprising:

(a) performing speech recognition on a voice input in the intelligent robot environment to determine whether the voice is a call voice used for an actual call;

(b) performing speaker recognition on the confirmed call voice to determine who the caller is;

(c) performing speaker verification, in which the call is rejected if the difference between the score obtained by the speaker recognition and the score obtained from an already constructed Universal Background Model (UBM) is less than or equal to a certain threshold, and the caller is otherwise recognized as a family member; and

(d) if the caller is a family member, rotating the intelligent robot by the azimuth obtained through sound source tracking, and performing face detection and face recognition to find and approach the caller.

The method of claim 9, wherein step (a) comprises:

detecting a start point and an end point of the input voice through an endpoint detection algorithm using log energy;

removing ambient noise from the detected voice using a Wiener filter; and

performing speech recognition on the noise-removed voice using a Hidden Markov Model (HMM).

The method of claim 10, wherein step (b) comprises:

dividing the recognized voice into frames and obtaining Mel-Frequency Cepstral Coefficients (MFCC) as the feature vector for each frame; and

performing speaker recognition by obtaining the maximum log-likelihood value of the voice based on a previously constructed Gaussian Mixture Model.

The method of claim 11, wherein step (c)

performs speaker verification on the confirmed call voice so that the call can be rejected if the caller is not a family member.

The method of claim 12, wherein step (d) comprises:

rotating the intelligent robot by the azimuth obtained through sound source tracking;

determining how many people are in the image captured by the camera through face detection based on eye detection and skin color information; and

reconfirming the caller by performing face recognition on the detected face image using multiple principal component analysis and edge information.

delete

The method of claim 11, wherein the constructed Gaussian Mixture Model

is built for each family member by registering the speaker through several sentences received online, including the family member's specific call voice, and obtaining Mel-Frequency Cepstral Coefficients (MFCC) feature vectors from the speech of the input sentences.

delete