KR20230085744A

KR20230085744A - Learning method and learning device for AI agent according to Modular Object-Centric Approach model including dual task stream of Interactive Perception and Action Policy, and testing method and testing device using the same

Info

Publication number: KR20230085744A
Application number: KR1020210174204A
Authority: KR
Inventors: 김병휘; 루즈베 모타기; 스완시 밤브리; 최종현; 쿠날 프라탑 싱
Original assignee: 광주과학기술원
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2023-06-14
Also published as: WO2023106446A1

Abstract

Disclosed is a learning method for an AI agent according to a Modular Object-Centric Approach (MOCA) model that includes a dual task stream of Interactive Perception and Action Policy. The method comprises: (a) when a t-th input frame is acquired, where t is an integer greater than or equal to 1, the AI agent inputting a t-th visual feature for the t-th input frame and instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th predicted class and t-th predicted action information from the IPM and the APM, respectively; and (b) the AI agent generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th predicted classes including the t-th predicted class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th predicted action information including the t-th predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM.

Description

AI agent learning method and learning device according to the Modular Object-Centric Approach (MOCA) model including interactive perception and action policy dual task streams, and test method and test device using the same {LEARNING METHOD AND LEARNING DEVICE FOR AI AGENT ACCORDING TO MODULAR OBJECT-CENTRIC APPROACH MODEL INCLUDING DUAL TASK STREAM OF INTERACTIVE PERCEPTION AND ACTION POLICY, AND TESTING METHOD AND TESTING DEVICE USING THE SAME}

The present invention relates to a learning method and a learning device for an AI agent according to a Modular Object-Centric Approach (MOCA) model including a dual task stream of Interactive Perception and Action Policy, and to a testing method and a testing device using the same.

Everyone dreams of having an AI assistant that can carry out tedious tasks such as household chores by following verbal instructions. To perform such tasks on a person's behalf, the AI must be able to navigate, interact with objects, and reason interactively in visually rich 3D environments. Going one step further, it would be ideal if the AI were capable of "Interactive Instruction Following", that is, exploring the environment, interacting with objects, and completing long-horizon tasks by following natural language instructions on the basis of egocentric vision.

To achieve the goal of "Interactive Instruction Following", the AI must be able to infer both the sequence of actions and the interactions with objects. Action prediction requires global semantic cues, whereas object localization requires a pixel-level understanding of the environment, and this difference makes the two semantically different tasks. However, conventional techniques that sought to develop instruction-driven AI under the "Interactive Instruction Following" concept designed their architectures without taking this difference into account, and therefore did not achieve sufficient performance.

An object of the present invention is to solve the problems described above.

Another object of the present invention is to provide a method of implementing a model that includes Interactive Perception and Action Policy as separate task streams in order to realize "Interactive Instruction Following".

Still another object of the present invention is to provide an Object-Centric localization algorithm and an obstacle avoidance mechanism for performing a task.

The characteristic configuration of the present invention for achieving the objects described above and for realizing the characteristic effects of the present invention described later is as follows.

According to one aspect of the present invention, there is disclosed a learning method for an AI agent according to a Modular Object-Centric Approach (MOCA) model including a dual task stream of Interactive Perception and Action Policy, the method comprising: (a) when a t-th input frame is acquired, where t is an integer greater than or equal to 1, the AI agent inputting a t-th visual feature for the t-th input frame and instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th predicted class and t-th predicted action information from the IPM and the APM, respectively; and (b) the AI agent generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th predicted classes including the t-th predicted class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th predicted action information including the t-th predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM.
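
Step (b) can be illustrated with a minimal sketch. The source does not state which loss function is used, so the cross-entropy below is an assumption, as are the names `cross_entropy` and `sequence_loss` and the toy probabilities. Backpropagation itself is omitted; the sketch only shows how one loss per stream could be formed from the predictions for steps 1 to N and their GT labels.

```python
import math

def cross_entropy(probs, gt_index):
    """Negative log-likelihood of the ground-truth index (assumed loss type)."""
    return -math.log(probs[gt_index])

def sequence_loss(pred_probs, gt_indices):
    """Average the per-step loss over the time steps that are supplied."""
    losses = [cross_entropy(p, g) for p, g in zip(pred_probs, gt_indices)]
    return sum(losses) / len(losses)

# Hypothetical predictions for N = 2 time steps.
class_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # IPM stream: predicted classes
action_probs = [[0.5, 0.5], [0.25, 0.75]]          # APM stream: predicted actions
gt_classes = [0, 1]                                # GT class per step
gt_actions = [1, 1]                                # GT action per step

ipm_loss = sequence_loss(class_probs, gt_classes)
apm_loss = sequence_loss(action_probs, gt_actions)
```

Keeping `ipm_loss` and `apm_loss` separate mirrors the two task streams: each loss would update the parameters of its own module.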

As an example, there is disclosed a method wherein step (a) comprises: (a11) the AI agent causing a natural language encoder for IPM included in the IPM to apply, to the instruction data, an operation using an encoding BiLSTM for IPM included in the encoder, thereby obtaining a t-th attention natural language feature for IPM; (a12) the AI agent causing a dynamic filter operator for IPM included in the IPM to apply, to the t-th attention natural language feature for IPM, an operation using a filter-generating Fully-Connected (FC) network for IPM included in the operator, thereby generating at least one t-th dynamic filter for IPM, and then to apply an operation using the t-th dynamic filter for IPM to the t-th visual feature, thereby obtaining a t-th attention visual feature for IPM; and (a13) the AI agent causing a class decoder included in the IPM to apply, to a t-th integrated vector for IPM that includes the t-th attention natural language feature for IPM, the t-th attention visual feature for IPM, and (t-1)-th predicted action information, an operation using a decoding LSTM for IPM included in the decoder, thereby generating a t-th decoding hidden state vector for IPM, and then to apply, to the t-th decoding hidden state vector for IPM, an operation using a decoding FC network for IPM included in the decoder, thereby generating the t-th predicted class.
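
The dynamic filter operation of step (a12) can be sketched as follows. The source only says that FC networks generate the filters from the language feature and that the filters are then applied to the visual feature; treating each filter as a 1x1 convolution (a per-position dot product over channels) is an assumption, and all function names and toy weights are hypothetical.

```python
def fc(weights, bias, x):
    """Fully-connected layer; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def generate_dynamic_filters(lang_feature, fc_layers):
    """Generate one filter per FC layer from the attention language feature."""
    return [fc(w, b, lang_feature) for (w, b) in fc_layers]

def apply_dynamic_filters(visual_feature, filters):
    """visual_feature: list of spatial positions, each a C-dim channel vector.
    Returns, per position, one response per filter (assumed 1x1 convolution)."""
    return [[sum(f_c * v_c for f_c, v_c in zip(f, pos)) for f in filters]
            for pos in visual_feature]

# Toy example: 2-dim language feature, 3-channel visual feature, 2 filters.
lang_feature = [1.0, 2.0]
fc_layers = [
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),  # filter [1, 2, 0]
    ([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]], [0.0, 0.0, 1.0]),  # filter [0, 0, 4]
]
visual_feature = [[1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]

filters = generate_dynamic_filters(lang_feature, fc_layers)
attended = apply_dynamic_filters(visual_feature, filters)
```

Because the filters are recomputed from the instruction at every step, the same visual feature yields different attended features for different instructions.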

As an example, there is disclosed a method wherein step (a) comprises: (a21) the AI agent causing a natural language encoder for APM included in the APM to apply, to the instruction data, an operation using an encoding BiLSTM for APM included in the encoder, thereby obtaining a t-th attention natural language feature for APM; (a22) the AI agent causing a dynamic filter operator for APM included in the APM to apply, to the t-th attention natural language feature for APM, an operation using a filter-generating FC network for APM included in the operator, thereby generating at least one t-th dynamic filter for APM, and then to apply an operation using the t-th dynamic filter for APM to the t-th visual feature, thereby obtaining a t-th attention visual feature for APM; and (a23) the AI agent causing an action decoder included in the APM to apply, to a t-th integrated vector for APM that includes the t-th attention natural language feature for APM, the t-th attention visual feature for APM, and (t-1)-th predicted action information, an operation using a decoding LSTM for APM included in the decoder, thereby generating a t-th decoding hidden state vector for APM, and then to apply, to a vector that includes the t-th integrated vector for APM and the t-th decoding hidden state vector for APM, an operation using a decoding FC network for APM included in the decoder, thereby generating the t-th predicted action information.
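
A minimal sketch of the action decoder in step (a23) follows. It assumes a standard single-layer LSTM cell, concatenation as the way the integrated vector is formed, and a one-hot encoding of the previous action; none of these details are fixed by the source, and all names and toy weights are hypothetical. The point it illustrates is that the decoding FC network receives the integrated vector together with the hidden state, whereas the class decoder of the IPM feeds only the hidden state to its FC network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x, h_prev, c_prev, weights, bias):
    """Standard LSTM cell; rows of `weights` are ordered as gates i, f, g, o."""
    n = len(h_prev)
    z = linear(weights, bias, x + h_prev)
    i = [sigmoid(v) for v in z[0:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    g = [math.tanh(v) for v in z[2 * n:3 * n]]
    o = [sigmoid(v) for v in z[3 * n:4 * n]]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def action_decoder_step(lang_feat, vis_feat, prev_action, h_prev, c_prev,
                        lstm_w, lstm_b, fc_w, fc_b):
    # Integrated vector: language feature, visual feature, previous action.
    integrated = lang_feat + vis_feat + prev_action
    h, c = lstm_step(integrated, h_prev, c_prev, lstm_w, lstm_b)
    # The FC network sees the integrated vector AND the hidden state.
    scores = linear(fc_w, fc_b, integrated + h)
    return scores, h, c

# Toy dimensions: integrated vector of size 4, hidden size 1, 2 actions.
lstm_w = [[0.0] * 5 for _ in range(4)]
lstm_b = [0.0] * 4
fc_w = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
fc_b = [0.0, 0.0]

scores, h, c = action_decoder_step([1.0], [2.0], [0.0, 1.0], [0.0], [0.0],
                                   lstm_w, lstm_b, fc_w, fc_b)
predicted_action = max(range(len(scores)), key=lambda k: scores[k])
```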

According to another aspect of the present invention, there is disclosed a testing method for an AI agent according to a Modular Object-Centric Approach (MOCA) model including a dual task stream of Interactive Perception and Action Policy, the method comprising: (a) in a state in which learning has been completed by performing (1) a process of, when a t-th training input frame is acquired, where t is an integer greater than or equal to 1, inputting a t-th training visual feature for the t-th training input frame and training instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th training predicted class and t-th training predicted action information from the IPM and the APM, respectively, and (2) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th training predicted classes including the t-th training predicted class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th training predicted action information including the t-th training predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM, the AI agent, when a t-th test input frame is acquired, inputting a t-th test visual feature for the t-th test input frame and test instruction data in natural language form for the AI agent into the IPM and the APM, thereby obtaining a t-th test predicted mask and t-th test predicted action information from the IPM and the APM, respectively; and (b) the AI agent performing at least part of an operation according to the test instruction data by referring to the t-th test predicted mask and the t-th test predicted action information.

As an example, there is disclosed a method wherein step (a) comprises: (a11) the AI agent causing a natural language encoder for IPM included in the IPM to apply, to the test instruction data, an operation using an encoding BiLSTM for IPM included in the encoder, thereby obtaining a t-th test attention natural language feature for IPM; (a12) the AI agent causing a dynamic filter operator for IPM included in the IPM to apply, to the t-th test attention natural language feature for IPM, an operation using a filter-generating Fully-Connected (FC) network for IPM included in the operator, thereby generating at least one t-th test dynamic filter for IPM, and then to apply an operation using the t-th test dynamic filter for IPM to the t-th test visual feature, thereby obtaining a t-th test attention visual feature for IPM; (a13) the AI agent causing a class decoder included in the IPM to apply, to a t-th test integrated vector for IPM that includes the t-th test attention natural language feature for IPM, the t-th test attention visual feature for IPM, and (t-1)-th test predicted action information, an operation using a decoding LSTM for IPM included in the decoder, thereby generating a t-th test decoding hidden state vector for IPM, and then to apply, to the t-th test decoding hidden state vector for IPM, an operation using a decoding FC network for IPM included in the decoder, thereby generating a t-th test predicted class; and (a14) the AI agent causing an Object-Centric localizer included in the IPM to generate the t-th test predicted mask by referring to the t-th test predicted class.

As an example, there is disclosed a method wherein step (a14) comprises: (a141) the AI agent causing the Object-Centric localizer included in the IPM to generate, using a mask generator included in the localizer, at least one t-th test candidate mask corresponding to the t-th test predicted class on the t-th test input frame; and (a142) the AI agent causing the Object-Centric localizer included in the IPM to determine, using an instance determiner included in the localizer, the t-th test predicted mask, which is one of the at least one t-th test candidate mask.

As an example, there is disclosed a method wherein, in step (a142), the AI agent causes the Object-Centric localizer to use the instance determiner to (i) compare the t-th test predicted class with the (t-1)-th test predicted class, and then (ii-1) when the t-th test predicted class and the (t-1)-th test predicted class are the same, determine as the t-th test predicted mask the t-th test candidate mask that has the highest similarity to the (t-1)-th test predicted mask, and (ii-2) when the t-th test predicted class and the (t-1)-th test predicted class are not the same, determine as the t-th test predicted mask the t-th test candidate mask that has the highest confidence score.
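
The instance determiner rule in step (a142) can be sketched as below. The source does not define how the similarity between masks is measured, so intersection-over-union (IoU) over pixel sets is used here purely as an assumption; the function names, the mask representation, and the toy candidates are likewise hypothetical.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels.
    IoU is an assumed similarity measure, not one named in the source."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0

def choose_mask(candidates, pred_class, prev_class, prev_mask):
    """candidates: list of (mask, confidence) pairs for the predicted class."""
    if prev_mask is not None and pred_class == prev_class:
        # Same class as the previous step: keep tracking the same instance.
        return max(candidates, key=lambda c: iou(c[0], prev_mask))[0]
    # Class changed (or no history): take the most confident candidate.
    return max(candidates, key=lambda c: c[1])[0]

# Two candidate masks of the same class, e.g. two cups in the frame.
candidates = [
    ({(0, 0), (0, 1)}, 0.9),   # high confidence, far from the previous mask
    ({(5, 5), (5, 6)}, 0.6),   # lower confidence, overlaps the previous mask
]
prev_mask = {(5, 5), (5, 6), (5, 7)}

same_class_choice = choose_mask(candidates, "Cup", "Cup", prev_mask)
new_class_choice = choose_mask(candidates, "Cup", "Sink", prev_mask)
```

When the class is unchanged, the lower-confidence candidate that overlaps the previous mask is selected, which keeps the agent interacting with the same object instance across consecutive steps.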

As an example, there is disclosed a method wherein step (a) comprises: (a21) the AI agent causing a natural language encoder for APM included in the APM to apply, to the test instruction data, an operation using an encoding BiLSTM for APM included in the encoder, thereby obtaining a t-th test attention natural language feature for APM; (a22) the AI agent causing a dynamic filter operator for APM included in the APM to apply, to the t-th test attention natural language feature for APM, an operation using a filter-generating FC network for APM included in the operator, thereby generating at least one t-th test dynamic filter for APM, and then to apply an operation using the t-th test dynamic filter for APM to the t-th test visual feature, thereby obtaining a t-th test attention visual feature for APM; and (a23) the AI agent causing an action decoder included in the APM to apply, to a t-th test integrated vector for APM that includes the t-th test attention natural language feature for APM, the t-th test attention visual feature for APM, and (t-1)-th test predicted action information, an operation using a decoding LSTM for APM included in the decoder, thereby generating a t-th test decoding hidden state vector for APM, and then to apply, to a vector that includes the t-th test integrated vector for APM and the t-th test decoding hidden state vector for APM, an operation using a decoding FC network for APM included in the decoder, thereby generating the t-th test predicted action information.

As an example, there is disclosed a method wherein, in step (a23), the AI agent causes an obstacle avoider included in the APM to determine whether an obstacle has been encountered by referring to the t-th test visual feature and the (t-1)-th test visual feature and, when it is determined that the AI agent has encountered the obstacle, after the operation using the decoding FC network for APM has been performed, to select as the t-th test predicted action information the action corresponding to the highest value among the values output by the output nodes of the decoding FC network for APM, excluding the specific value output by the specific output node that corresponds to the (t-1)-th test predicted action information.

As an example, there is disclosed a method wherein, in step (a23), the AI agent causes the obstacle avoider included in the APM to calculate the similarity between the t-th test visual feature and the (t-1)-th test visual feature, and to determine that the AI agent has encountered the obstacle when the similarity is greater than or equal to a threshold.
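
The obstacle avoidance mechanism of the two preceding examples can be sketched together. The source specifies neither the similarity measure nor the threshold value, so the cosine similarity and the default threshold of 0.99 below are assumptions, and the function names are hypothetical. The idea is that if the view barely changed after the previous action, the agent is probably blocked, so the previous action is excluded and the next-best action is taken.

```python
import math

def cosine_similarity(a, b):
    """Assumed similarity measure between two flattened visual features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_action(scores, prev_action, vis_t, vis_prev, threshold=0.99):
    """scores: outputs of the decoding FC network, one per action index."""
    # Similarity at or above the threshold is read as "obstacle encountered".
    blocked = cosine_similarity(vis_t, vis_prev) >= threshold
    allowed = [k for k in range(len(scores))
               if not (blocked and k == prev_action)]
    return max(allowed, key=lambda k: scores[k])

scores = [0.1, 0.7, 0.2]   # action 1 has the highest score
prev_action = 1            # and was also taken at step t-1

# View unchanged after the previous action: action 1 is excluded.
blocked_choice = select_action(scores, prev_action, [1.0, 2.0], [1.0, 2.0])
# View changed: the highest-scoring action is kept.
free_choice = select_action(scores, prev_action, [1.0, 0.0], [0.0, 1.0])
```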

According to still another aspect of the present invention, there is disclosed an AI agent that performs learning according to a Modular Object-Centric Approach (MOCA) model including a dual task stream of Interactive Perception and Action Policy, the AI agent comprising: one or more memories storing instructions; and one or more processors configured to execute the instructions, wherein the processor is configured to perform: (I) a process of, when a t-th input frame is acquired, where t is an integer greater than or equal to 1, inputting a t-th visual feature for the t-th input frame and instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th predicted class and t-th predicted action information from the IPM and the APM, respectively; and (II) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th predicted classes including the t-th predicted class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th predicted action information including the t-th predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM.

As an example, there is disclosed an AI agent wherein process (I) comprises: (I11) a process of causing a natural language encoder for IPM included in the IPM to apply, to the instruction data, an operation using an encoding BiLSTM for IPM included in the encoder, thereby obtaining a t-th attention natural language feature for IPM; (I12) a process of causing a dynamic filter operator for IPM included in the IPM to apply, to the t-th attention natural language feature for IPM, an operation using a filter-generating Fully-Connected (FC) network for IPM included in the operator, thereby generating at least one t-th dynamic filter for IPM, and then to apply an operation using the t-th dynamic filter for IPM to the t-th visual feature, thereby obtaining a t-th attention visual feature for IPM; and (I13) a process of causing a class decoder included in the IPM to apply, to a t-th integrated vector for IPM that includes the t-th attention natural language feature for IPM, the t-th attention visual feature for IPM, and (t-1)-th predicted action information, an operation using a decoding LSTM for IPM included in the decoder, thereby generating a t-th decoding hidden state vector for IPM, and then to apply, to the t-th decoding hidden state vector for IPM, an operation using a decoding FC network for IPM included in the decoder, thereby generating the t-th predicted class.

As an example, there is disclosed an AI agent wherein process (I) comprises: (I21) a process of causing a natural language encoder for APM included in the APM to apply, to the instruction data, an operation using an encoding BiLSTM for APM included in the encoder, thereby obtaining a t-th attention natural language feature for APM; (I22) a process of causing a dynamic filter operator for APM included in the APM to apply, to the t-th attention natural language feature for APM, an operation using a filter-generating FC network for APM included in the operator, thereby generating at least one t-th dynamic filter for APM, and then to apply an operation using the t-th dynamic filter for APM to the t-th visual feature, thereby obtaining a t-th attention visual feature for APM; and (I23) a process of causing an action decoder included in the APM to apply, to a t-th integrated vector for APM that includes the t-th attention natural language feature for APM, the t-th attention visual feature for APM, and (t-1)-th predicted action information, an operation using a decoding LSTM for APM included in the decoder, thereby generating a t-th decoding hidden state vector for APM, and then to apply, to a vector that includes the t-th integrated vector for APM and the t-th decoding hidden state vector for APM, an operation using a decoding FC network for APM included in the decoder, thereby generating the t-th predicted action information.

According to still another aspect of the present invention, there is disclosed an AI agent that performs testing according to a Modular Object-Centric Approach (MOCA) model including a dual task stream of Interactive Perception and Action Policy, the AI agent comprising: one or more memories storing instructions; and one or more processors configured to execute the instructions, wherein the processor is configured to perform: (I) a process of, in a state in which learning has been completed by performing (1) a process of, when a t-th training input frame is acquired, where t is an integer greater than or equal to 1, inputting a t-th training visual feature for the t-th training input frame and training instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th training predicted class and t-th training predicted action information from the IPM and the APM, respectively, and (2) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th training predicted classes including the t-th training predicted class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th training predicted action information including the t-th training predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM, when a t-th test input frame is acquired, inputting a t-th test visual feature for the t-th test input frame and test instruction data in natural language form for the AI agent into the IPM and the APM, thereby obtaining a t-th test predicted mask and t-th test predicted action information from the IPM and the APM, respectively; and (II) a process of performing at least part of an operation according to the test instruction data by referring to the t-th test predicted mask and the t-th test predicted action information.

As an example, there is disclosed the AI agent wherein the process (I) includes: (I11) a process of causing an IPM natural language encoder included in the IPM to apply, to the instruction data for testing, an operation using an IPM encoding BiLSTM included therein, thereby acquiring a t-th IPM attention natural language feature for testing; (I12) a process of causing an IPM dynamic filter operator included in the IPM to apply, to the t-th IPM attention natural language feature for testing, an operation using an IPM filter-generation Fully-Connected (FC) network included therein to generate at least one t-th IPM dynamic filter for testing, and then to apply an operation using the t-th IPM dynamic filter for testing to the t-th visual feature for testing, thereby acquiring a t-th IPM attention visual feature for testing; (I13) a process of causing a class decoder included in the IPM to apply an operation using an IPM decoding LSTM included therein to a t-th IPM integrated vector for testing, which includes the t-th IPM attention natural language feature for testing, the t-th IPM attention visual feature for testing, and (t-1)-th predicted action information for testing, to generate a t-th IPM decoding hidden-state vector for testing, and then to apply an operation using an IPM decoding FC network included therein to the t-th IPM decoding hidden-state vector for testing, thereby generating a t-th predicted class for testing; and (I14) a process of causing an Object-Centric localizer included in the IPM to generate the t-th predicted mask for testing by referring to the t-th predicted class for testing.

As an example, there is disclosed the AI agent wherein the process (I14) includes: (I141) a process of causing the Object-Centric localizer included in the IPM to generate, using a mask generator included therein, at least one t-th candidate mask for testing corresponding to the t-th predicted class for testing on the t-th input frame for testing; and (I142) a process of causing the Object-Centric localizer included in the IPM to determine, using an instance determiner included therein, the t-th predicted mask for testing, which is one of the at least one t-th candidate mask for testing.

As an example, there is disclosed the AI agent wherein, in the process (I142), the processor causes the Object-Centric localizer to use the instance determiner to (i) compare the t-th predicted class for testing with the (t-1)-th predicted class for testing, and then (ii-1) if the two classes are the same, determine, as the t-th predicted mask for testing, the t-th candidate mask for testing that has the highest similarity to the (t-1)-th predicted mask for testing, and (ii-2) if the two classes are not the same, determine, as the t-th predicted mask for testing, the t-th candidate mask for testing that has the highest confidence score.
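The two-branch mask selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Candidate` type, the function names, and the use of intersection over union as the similarity measure are all assumptions, since the text does not specify how similarity between masks is computed.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Pixel = Tuple[int, int]


@dataclass
class Candidate:
    pixels: Set[Pixel]  # pixels covered by this candidate mask
    confidence: float   # confidence score reported for this candidate


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union; an assumed stand-in for mask similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def determine_mask(
    candidates: List[Candidate],
    cls_t: str,
    cls_prev: Optional[str],
    mask_prev: Optional[Set[Pixel]],
) -> Candidate:
    """Pick the predicted mask for step t from the candidate masks."""
    if not candidates:
        raise ValueError("at least one candidate mask is required")
    if mask_prev is not None and cls_t == cls_prev:
        # Same class as the previous step: choose the candidate most
        # similar to the previously predicted mask (association-based).
        return max(candidates, key=lambda c: iou(c.pixels, mask_prev))
    # Class changed or no history: choose the most confident candidate.
    return max(candidates, key=lambda c: c.confidence)
```

When the predicted class is unchanged between steps, the similarity branch keeps the agent attached to the same object instance even if another instance of the same class scores a higher confidence.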

As an example, there is disclosed the AI agent wherein the process (I) includes: (I21) a process of causing an APM natural language encoder included in the APM to apply, to the instruction data for testing, an operation using an APM encoding BiLSTM included therein, thereby acquiring a t-th APM attention natural language feature for testing; (I22) a process of causing an APM dynamic filter operator included in the APM to apply, to the t-th APM attention natural language feature for testing, an operation using an APM filter-generation FC network included therein to generate at least one t-th APM dynamic filter for testing, and then to apply an operation using the t-th APM dynamic filter for testing to the t-th visual feature for testing, thereby acquiring a t-th APM attention visual feature for testing; and (I23) a process of causing an action decoder included in the APM to apply an operation using an APM decoding LSTM included therein to a t-th APM integrated vector for testing, which includes the t-th APM attention natural language feature for testing, the t-th APM attention visual feature for testing, and (t-1)-th predicted action information for testing, to generate a t-th APM decoding hidden-state vector for testing, and then to apply an operation using an APM decoding FC network included therein to a vector including the t-th APM integrated vector for testing and the t-th APM decoding hidden-state vector for testing, thereby generating the t-th predicted action information for testing.

As an example, there is disclosed the AI agent wherein, in the process (I23), the processor causes an obstacle avoider included in the APM to determine whether an obstacle has been encountered by referring to the t-th visual feature for testing and the (t-1)-th visual feature for testing, and, when it is determined that the AI agent has encountered the obstacle, to perform the operation using the APM decoding FC network and then select, as the t-th predicted action information for testing, the action corresponding to the highest value among the values output from the output nodes of the APM decoding FC network, excluding the specific value output from the specific output node that corresponds to the (t-1)-th predicted action information for testing.

As an example, there is disclosed the AI agent wherein, in the process (I23), the processor causes the obstacle avoider included in the APM to calculate a similarity between the t-th visual feature for testing and the (t-1)-th visual feature for testing, and to determine that the AI agent has encountered the obstacle when the similarity is greater than or equal to a threshold.
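The obstacle-avoidance rule in the two preceding examples can be sketched as below. The cosine similarity, the default threshold of 0.99, and the function names are assumptions made for illustration; the text only states that a similarity is compared against a threshold and that the previous action's output node is excluded when an obstacle is detected. The sketch also assumes at least two possible actions.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened visual features (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_action(
    scores: Sequence[float],
    feat_t: Sequence[float],
    feat_prev: Optional[Sequence[float]],
    prev_action: Optional[int],
    threshold: float = 0.99,  # hypothetical value, not given in the text
) -> int:
    """Return the index of the action to take at step t."""
    # If the view barely changed after the last action, treat it as an obstacle.
    blocked = (
        feat_prev is not None
        and prev_action is not None
        and cosine_similarity(feat_t, feat_prev) >= threshold
    )
    # When blocked, drop the output node of the previous action and take
    # the best of the remaining scores; otherwise take the overall best.
    allowed = [k for k in range(len(scores)) if not (blocked and k == prev_action)]
    return max(allowed, key=lambda k: scores[k])
```

The intuition is that a nearly unchanged frame after an action suggests the action had no effect, so repeating it is unlikely to help.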

To implement "Interactive Instruction Following", the present invention has the effect of enabling a model that includes Interactive Perception and Action Policy as separate task streams.

The present invention also has the effect of providing an Object-Centric localization algorithm and an obstacle avoidance mechanism for performing tasks.

FIG. 1 is a diagram showing the configuration, in the learning stage, of an AI agent according to a Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a learning method of an AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.
FIG. 3 is a diagram showing the dual task streams of Interactive Perception and Action Policy adopted in the learning method of the AI agent according to the MOCA model, according to an embodiment of the present invention.
FIG. 4 is a diagram showing the Confidence-Based and Association-Based mask determination algorithms that are selectively used in a testing method of the AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.
FIG. 5 is a diagram showing the obstacle avoidance algorithm used in the testing method of the AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.

The detailed description of the present invention that follows refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the present invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it.

FIG. 1 is a diagram showing the configuration, in the learning stage, of an AI agent according to a Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.

Referring to FIG. 1, the AI agent 100 may include a CNN 130, an Interactive Perception Module (IPM) 200, and an Action Policy Module (APM) 300. The input/output and the computation of the CNN 130, the IPM 200, and the APM 300 may be carried out by a communication unit 110 and a processor 120, respectively. The specific connections between the communication unit 110 and the processor 120 are omitted from FIG. 1. A memory 115 may store the various instructions described later, and the processor 120 may carry out the present invention by executing the instructions stored in the memory 115 and thereby performing the processes described below. Describing the AI agent 100 in this way does not exclude the case in which the AI agent 100 includes an integrated processor in which a medium, a processor, and a memory for carrying out the present invention are combined.

Here, the CNN 130 may be any neural network capable of processing images, regardless of its structure; as an example, a ResNet according to the prior art may be used. The IPM 200 and the APM 300 may include natural language encoders 210 and 310 and dynamic filter operators 220 and 320, respectively, and may include a class decoder 230 and an action decoder 330, respectively, as decoders. As for the difference between the IPM 200 and the APM 300, the Action Prediction performed by the APM 300 requires a global, scene-level understanding of the visually observed information, i.e., the input image, whereas the Object Interaction performed by the IPM 200 requires object-specific features in addition to the global scene-level understanding. Because of this difference, the AI agent 100 is designed to separate the two tasks and perform them through the IPM 200 and the APM 300, respectively. The IPM 200 and the APM 300 may have similar configurations, as shown in the drawing; however, as described later, the functional difference described above can be realized through a slight structural difference between the decoders 230 and 330 and through a design in which each module operates with its own separate LSTM.

In addition, the AI agent 100 in the testing process may be mounted on a robot capable of acting as an assistant in the real world, or on a server that communicates with and controls such a robot. The AI agent 100 in the learning process may learn by simulating a situation in which it works as an assistant in a virtual world, or may learn while working as an assistant in the real world. The following description assumes that, in the learning process, the AI agent 100 learns while simulating the role of an assistant in a virtual world and that, in the testing process, the AI agent 100 operates while mounted on a robot; however, the scope of rights is not limited thereto.

The configuration of the AI agent 100 as a learning device according to an embodiment of the present invention has been described above. Next, the learning method of the AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention, is described in overview.

FIG. 2 is a flowchart showing a learning method of an AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.

Referring to FIG. 2, when a t-th input frame (t being an integer of 1 or more) is acquired, the AI agent 100 may input a t-th visual feature for the t-th input frame and instruction data in natural-language form for the AI agent into the IPM 200 and the APM 300 (S01). The AI agent 100 may then acquire a t-th predicted class and t-th predicted action information from the IPM 200 and the APM 300, respectively (S02). Next, the AI agent 100 may generate an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th predicted classes (N being an integer of t or more) including the t-th predicted class, (i-2) at least some of first to N-th predicted action information including the t-th predicted action information, and (ii) Ground-Truth (GT) information for each of them, and may then backpropagate the losses to learn at least some of the parameters of the IPM 200 and the APM 300 (S03). Each step is described in detail below with reference to FIG. 3.

FIG. 3 is a diagram showing the dual task streams of Interactive Perception and Action Policy adopted in the learning method of the AI agent according to the MOCA model, according to an embodiment of the present invention.

Referring to FIG. 3, the more detailed configuration of each component included in the IPM 200 and the APM 300, and the process flow among them, can be seen. In this figure, one subscript marks quantities related to the IPM 200 and a different subscript marks quantities related to the APM 300. The process flow of the IPM 200 is described first; because the APM 300 follows a broadly similar flow, the APM 300 is then described in terms of how its processes differ from those of the IPM 200.

The IPM 200 may first acquire instruction data in natural-language form for the AI agent 100. The instruction data may be provided in the form of a Goal or in the form of an Instruction. Providing it in the form of a Goal may mean that only the final result of the operation the AI agent 100 is expected to perform is given, whereas providing it in the form of an Instruction may mean that each process the AI agent 100 must perform to obtain that final result is given one by one. As an example, a Goal may be "Find an empty box using the light of the desk lamp", and an Instruction may be "Go forward, turn left, look for the boxes on the table, find the empty one among them, pick up the empty box from the table, turn right".

The IPM natural language encoder 210 included in the IPM 200 may process the instruction data using the IPM encoding BiLSTM. Because BiLSTM is a technique well known to those skilled in the field of deep learning, a description of how it operates is omitted. The output of the IPM encoding BiLSTM may then be processed by an attn operation, which may derive weights from the hidden state of the IPM decoding LSTM at time t-1 and use them to compute a weighted sum of the components of the output of the IPM encoding BiLSTM. That is, the output of the IPM encoding BiLSTM may be a vector whose components are values in Key-Value Pair form; the attn operation may be performed by computing the similarity between the values of the Key part and the corresponding values of that hidden state, and then computing a weighted sum of the Values using each similarity as a weight.
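The attn operation described above can be sketched in plain Python as follows. The dot product as the similarity measure and the softmax normalization of the similarities are assumptions made for illustration; the text states only that similarities between the Key part and the hidden state are used as weights for a weighted sum of the Values. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax (assumed normalization of the weights)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    hidden: Sequence[float],
) -> List[float]:
    """Weighted sum of values, weighted by key/hidden-state similarity.

    keys[j] and values[j] belong to the j-th word of the instruction;
    hidden is the decoder hidden state from step t-1.
    """
    # Similarity of each key to the hidden state (dot product, assumed).
    sims = [sum(k_i * h_i for k_i, h_i in zip(key, hidden)) for key in keys]
    weights = softmax(sims)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Because the weights depend on the decoder state from the previous step, the attended language feature changes from step to step even though the instruction text is fixed.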

When the operation has been completed in this way for each constituent unit (for example, each word) of the instruction data, the AI agent 100 may input the t-th IPM attention natural language feature thus acquired into the IPM dynamic filter operator 220. The IPM dynamic filter operator 220 may use the t-th IPM attention natural language feature to generate t-th IPM dynamic filters to be applied to the t-th visual feature described later. Because the t-th IPM dynamic filters generated in this way are derived from the instruction data, they can be regarded as Language-Guided Filters, and through this process the information obtained from the instruction data can be reflected when the t-th visual feature is processed. This can be expressed as the following formula.
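One possible form of this formula, consistent with the surrounding description, is sketched below. All symbols are assumptions introduced for illustration and are not taken from the source: $\hat{x}_t$ is the t-th IPM attention natural language feature, $f_{DF_i}$ the i-th filter-generation FC network, $N_{DF}$ the number of dynamic filters, $w_{t,i}$ the i-th dynamic filter, $v_t$ the t-th visual feature, and $\hat{v}_t$ the t-th IPM attention visual feature. The operator $*$ stands for applying a filter to the visual feature, and $[\cdot\,;\cdot]$ for concatenation, which is only one way the attention maps could be integrated.

```latex
\begin{align}
w_{t,i} &= f_{DF_i}(\hat{x}_t), \qquad i \in \{1, \dots, N_{DF}\} \\
\hat{v}_t &= \left[\, w_{t,1} * v_t \,;\; \dots \,;\; w_{t,N_{DF}} * v_t \,\right]
\end{align}
```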

Here, the filter-generation operation may be an operation using the i-th IPM filter-generation Fully-Connected (FC) network included in the IPM dynamic filter operator 220, and the filter count may be the number of t-th IPM dynamic filters to be generated (depicted as three in the drawing, but not limited thereto). Each resulting filter is the i-th of the t-th IPM dynamic filters, which FIG. 3 shows under its own label. The visual input is the t-th visual feature, which may be generated by the CNN 130 operating on the t-th input frame.
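The filter generation and application described above can be illustrated with a small pure-Python sketch. It assumes that each filter is a channel-sized vector applied at every spatial location as a dot product (a 1x1 convolution) and that the resulting attention maps are integrated by concatenation; neither detail is stated in the text, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def linear(weight: Matrix, bias: Sequence[float], x: Sequence[float]) -> List[float]:
    """One fully connected layer: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def attended_visual_feature(
    lang_feat: Sequence[float],
    visual_feat: Sequence[Sequence[float]],
    fc_params: Sequence[Tuple[Matrix, Sequence[float]]],
) -> List[float]:
    """Generate one filter per FC network and apply it to the visual feature.

    visual_feat holds one C-dimensional vector per spatial location;
    each FC network maps the language feature to a C-dimensional filter.
    """
    maps: List[List[float]] = []
    for weight, bias in fc_params:
        filt = linear(weight, bias, lang_feat)  # language-guided filter
        # Attention map: response of the filter at every location.
        maps.append([sum(f * v for f, v in zip(filt, loc)) for loc in visual_feat])
    # Integrate the attention maps by concatenation (assumed).
    return [value for attention_map in maps for value in attention_map]
```

Since the filters are produced from the instruction at run time rather than learned as fixed weights, the same visual feature can yield different attention maps for different instructions.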

Then, as shown in the drawing, when attention maps are generated by the t-th IPM dynamic filters operating on the t-th visual feature, the IPM dynamic filter operator 220 may integrate them to generate the t-th IPM attention visual feature. Next, the AI agent 100 may cause the class decoder 230 included in the IPM 200 to integrate the t-th IPM attention natural language feature, the t-th IPM attention visual feature, and the (t-1)-th predicted action information to generate a t-th IPM integrated vector. The AI agent 100 may then cause the class decoder 230 to apply an operation using the IPM decoding LSTM to the t-th IPM integrated vector to generate a t-th IPM decoding hidden-state vector. This can be expressed as the following formula.
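A plausible form of this step, written with assumed notation that is not taken from the source, is given below. Here $\hat{x}_t$ and $\hat{v}_t$ denote the t-th IPM attention natural language and visual features, $a_{t-1}$ the (t-1)-th predicted action information, $u_t$ the t-th IPM integrated vector (with integration taken to be concatenation, which is an assumption), and $h_t$ the t-th IPM decoding hidden-state vector.

```latex
\begin{align}
u_t &= \left[\, \hat{x}_t \,;\; \hat{v}_t \,;\; a_{t-1} \,\right] \\
h_t &= \mathrm{LSTM}\left(u_t,\; h_{t-1}\right)
\end{align}
```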

The LSTM here is distinct from the BiLSTM of the IPM natural language encoder 210 described above, and the t-th IPM decoding hidden-state vector here may be the result of updating the hidden state at time t-1 that was used in the IPM natural language encoder 210.

The AI agent 100 may then cause the class decoder 230 to apply, to the t-th IPM decoding hidden-state vector, an operation using the IPM decoding FC network included therein, thereby generating the t-th predicted class. The IPM decoding FC network here is different from the IPM filter-generation FC network of the IPM dynamic filter operator 220. The above process can be expressed as the following formula.
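Under assumed notation (not taken from the source), with $h_t$ the t-th IPM decoding hidden-state vector, $\mathrm{FC}$ the IPM decoding FC network, $N_{class}$ the number of classes, and $c_t$ the t-th predicted class, the step might be written as:

```latex
\[
c_t = \operatorname*{arg\,max}_{k \in \{1, \dots, N_{class}\}} \; \mathrm{FC}(h_t)_k
\]
```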

Here, the class count may be the number of classes that can be predicted. With this formula, the class corresponding to the highest of the scores computed by the IPM decoding FC network may be selected as the t-th predicted class.

At inference time, i.e., during testing, additional processes may be carried out after the t-th predicted class is generated; in the learning process, however, only the processes up to this point are performed, after which a loss is generated and learning through backpropagation is carried out. That is, the AI agent 100 may generate the IPM loss using at least some of the first to N-th predicted classes (N being an integer of t or more) including the t-th predicted class and the corresponding GT information, i.e., at least some of first to N-th GT class information, and may then backpropagate the loss to learn at least some of the parameters of the IPM encoding BiLSTM, the IPM filter-generation FC network, the IPM decoding LSTM, and the IPM decoding FC network. Here, the IPM loss may be calculated using a cross-entropy loss formula.
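The cross-entropy loss over a sequence of predictions can be sketched as follows. This is a minimal illustration under stated assumptions: the per-step scores are treated as unnormalized logits, and the per-step losses are averaged over the steps, which the text does not specify. The function names are hypothetical, and backpropagation itself is left to whatever training framework is used.

```python
import math
from typing import Sequence


def cross_entropy(scores: Sequence[float], target: int) -> float:
    """Cross-entropy of softmax(scores) against a single GT class index."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def sequence_loss(
    step_scores: Sequence[Sequence[float]],
    gt_classes: Sequence[int],
) -> float:
    """Average cross-entropy over the steps that have GT labels (assumed reduction)."""
    if len(step_scores) != len(gt_classes) or not gt_classes:
        raise ValueError("need one GT class per step, and at least one step")
    losses = [cross_entropy(s, g) for s, g in zip(step_scores, gt_classes)]
    return sum(losses) / len(losses)
```

The same computation would apply to the APM loss, with action scores and GT action indices in place of class scores and GT class indices.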

The learning process of the APM 300 is described next. The APM 300 has a structure largely the same as that of the IPM 200, and its processes up to the APM natural language encoder 310 and the APM dynamic filter operator 320 are identical to the corresponding processes of the IPM 200. That is, the AI agent 100 may cause the APM natural language encoder 310 included in the APM 300 to apply, to the instruction data, an operation using the APM encoding BiLSTM included therein, thereby acquiring a t-th APM attention natural language feature. Next, the AI agent 100 may cause the APM dynamic filter operator 320 included in the APM 300 to apply, to the t-th APM attention natural language feature, an operation using the APM filter-generation FC network included therein to generate at least one t-th APM dynamic filter, and then to apply an operation using the t-th APM dynamic filter to the t-th visual feature, thereby acquiring a t-th APM attention visual feature.

The process is also the same up to the use of the LSTM in the action decoder 330. That is, the AI agent 100 may cause the action decoder 330 included in the APM 300 to generate a t-th APM integrated vector including the t-th APM attention natural language feature, the t-th APM attention visual feature, and the (t-1)-th predicted action information, and to apply an operation using the APM decoding LSTM to it, thereby generating a t-th APM decoding hidden-state vector. The subsequent process differs. Unlike the IPM 200, which inputs the t-th IPM decoding hidden-state vector directly into the IPM decoding FC network, the APM 300 may integrate the t-th APM integrated vector, which was input to the APM decoding LSTM, with the t-th APM decoding hidden-state vector, and apply an operation using the APM decoding FC network to the resulting vector to generate the t-th predicted action information. This can be expressed as the following formula.

Here, the size of the action set appearing in the formula may be the number of actions that the AI agent 100 can perform. Through this process, more information can be reflected in determining the final t-th predicted action information.
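The formula itself is not reproduced in this text. One way to write the computation described in the preceding paragraphs is sketched below. All symbol names are assumptions introduced for illustration: \hat{x}_t for the t-th APM attention natural language feature, \hat{v}_t for the t-th APM attention visual feature, a_{t-1} for the (t-1)-th predicted action information, u_t for the integrated vector, h_t for the decoding hidden state vector, and N_{\text{action}} for the number of actions the agent can perform.

```latex
\begin{aligned}
u_t &= [\hat{x}_t;\ \hat{v}_t;\ a_{t-1}] \\
h_t &= \mathrm{LSTM}_{\text{APM}}(u_t,\ h_{t-1}) \\
a_t &= \operatorname*{arg\,max}_{k \in \{1,\dots,N_{\text{action}}\}}
       \mathrm{FC}_{\text{APM}}\big([u_t;\ h_t]\big)_k
\end{aligned}
```

The last line reflects the stated difference from the IPM: the FC network receives the integrated vector together with the hidden state vector, rather than the hidden state vector alone.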

At inference time, that is, at test time, additional components may be used in generating the t-th predicted action information. During learning, however, only the APM decoding FC network may be used, without those additional components.

Thereafter, the AI agent 100 may generate an APM loss by referring to at least some of the first to N-th predicted action information, which include the t-th predicted action information, and the GT information for each of them, that is, the first to N-th GT action information. The AI agent 100 may then backpropagate the APM loss to learn at least some of the parameters of the APM encoding BiLSTM, the APM filter-generating FC network, the APM decoding LSTM, and the APM decoding FC network. Here, the APM loss may be generated using a cross-entropy loss formula, and the first to N-th GT action information may be action information entered by experts based on their judgment of the situation corresponding to the t-th input frame.
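The cross-entropy APM loss mentioned above can be illustrated with a short sketch. The text only states that a cross-entropy loss formula may be used; averaging over time steps, the softmax normalization of the FC outputs, and the names `softmax` and `apm_cross_entropy` are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one time step's action scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apm_cross_entropy(logits_per_step, gt_actions):
    """Mean cross-entropy between predicted action scores and GT action indices.

    logits_per_step: one list of FC outputs per time step (steps 1..N, or a subset).
    gt_actions:      the GT action index for each of those steps.
    """
    losses = [-math.log(softmax(logits)[gt])
              for logits, gt in zip(logits_per_step, gt_actions)]
    return sum(losses) / len(losses)


# With uniform scores over 4 actions, the loss equals log(4) regardless of the GT index.
uniform_loss = apm_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
# A prediction that favors the GT action yields a smaller loss than one that does not.
good_loss = apm_cross_entropy([[5.0, 0.0, 0.0]], [0])
bad_loss = apm_cross_entropy([[5.0, 0.0, 0.0]], [1])
```

In an actual training setup the gradient of this loss would be backpropagated through the FC network, the LSTM, the filter-generating FC network, and the BiLSTM; the sketch covers only the loss value.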

Hereinafter, a test method for an AI agent according to the Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention, is described. The configuration of the AI agent in the test phase is similar to that in the learning phase, but differs slightly in that the IPM 200 and the APM 300 may include additional components. That is, the IPM 200 may additionally include an object-centric localizer (not shown), and the APM 300 may additionally include an obstacle avoider (not shown). The test method is described in more detail below.

The test is performed in a state in which the AI agent 100 has completed learning by performing (1) a process of, when a t-th training input frame is obtained, inputting a t-th training visual feature for the t-th training input frame and training instruction data in natural language form for the AI agent 100 into the IPM 200 and the APM 300, thereby obtaining a t-th training predicted class and t-th training predicted action information from the IPM 200 and the APM 300, respectively; and (2) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of the first to N-th training predicted classes, which include the t-th training predicted class, (i-2) at least some of the first to N-th training predicted action information, which include the t-th training predicted action information, and (ii) the Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM. In this state, when a t-th test input frame is obtained, the AI agent 100 may input a t-th test visual feature for the t-th test input frame and test instruction data in natural language form for the AI agent into the IPM 200 and the APM 300, thereby obtaining a t-th test predicted mask and t-th test predicted action information from the IPM 200 and the APM 300, respectively. The AI agent 100 may then perform at least part of an operation according to the test instruction data by referring to the t-th test predicted mask and the t-th test predicted action information.

This test process is described in more detail below, module by module.

That is, the AI agent 100 may cause the IPM natural language encoder 210 included in the IPM 200 to apply an operation using its IPM encoding BiLSTM to the test instruction data, thereby obtaining a t-th test IPM attention natural language feature. The AI agent 100 may then cause the IPM dynamic filter operator 220 included in the IPM 200 to apply an operation using its IPM filter-generating Fully-Connected (FC) network to the t-th test IPM attention natural language feature, thereby generating at least one t-th test IPM dynamic filter, and then to apply an operation using the t-th test IPM dynamic filter to the t-th test visual feature, thereby obtaining a t-th test IPM attention visual feature. Next, the AI agent 100 may cause the class decoder 230 included in the IPM 200 to apply an operation using its IPM decoding LSTM to a t-th test IPM integrated vector including the t-th test IPM attention natural language feature, the t-th test IPM attention visual feature, and the (t-1)-th test predicted action information, thereby generating a t-th test IPM decoding hidden state vector, and then to apply an operation using its IPM decoding FC network to the t-th test IPM decoding hidden state vector, thereby generating a t-th test predicted class.

Next, a process that is not performed in the learning phase is described. That is, the AI agent 100 may cause the object-centric localizer (not shown) included in the IPM 200 to generate the t-th test predicted mask by referring to the t-th test predicted class.

That is, the AI agent 100 may cause the object-centric localizer (not shown) included in the IPM 200 to use its mask generator to generate, on the t-th test input frame, at least one t-th test candidate mask corresponding to the t-th test predicted class. Here, the mask generator may be a component that finds an object of the t-th test predicted class on the t-th test input frame, masks its location to generate a t-th test candidate mask, and calculates a confidence score for the masking process. Such a mask generator can be implemented using image segmentation and confidence score generation techniques that are well known to those skilled in the art, so a more detailed description is omitted. In addition, the mask generator may be one that has already been trained and is adopted as is, rather than one that is trained through the learning process of the present invention described above.

When the t-th test candidate masks have been generated through the above process, the AI agent 100 may use an instance determiner to determine one of them as the t-th test predicted mask (if there is only one t-th test candidate mask, that mask naturally becomes the t-th test predicted mask). The process of the instance determiner is as follows. The AI agent 100 may cause the object-centric localizer (not shown) to use the instance determiner to compare the t-th test predicted class with the (t-1)-th test predicted class. If the two classes differ, the object-centric localizer (not shown) may determine the t-th test candidate mask with the highest confidence score as the t-th test predicted mask. If the two classes are the same, the object-centric localizer (not shown) may determine, as the t-th test predicted mask, the t-th test candidate mask with the highest similarity to the (t-1)-th test predicted mask. Here, the similarity may be calculated such that the shorter the distance between the centers of two masks, the higher the similarity. This can be expressed as the following formula for the t-th test predicted mask.

Here, the subscript i corresponds to the i-th of the t-th test candidate masks. Accordingly, the formula uses the confidence score of the i-th t-th test candidate mask and the center coordinates of the (t-1)-th test predicted mask. FIG. 4 is referred to in order to explain why the t-th test predicted mask is determined through this process.
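The formula itself is not reproduced in this text. The selection rule described above can be written, for example, as follows. The symbol names are assumptions introduced for illustration: c_t for the t-th test predicted class, m_i for the i-th t-th test candidate mask, s_i for its confidence score, d_i for its center coordinates, and \hat{d}_{t-1} for the center coordinates of the (t-1)-th test predicted mask. The Euclidean norm is also an assumption, since the text only says that a shorter distance between centers means a higher similarity.

```latex
\hat{m}_t =
\begin{cases}
m_{i^\ast},\quad i^\ast = \operatorname*{arg\,max}_{i}\ s_i
  & \text{if } c_t \neq c_{t-1} \\[4pt]
m_{i^\ast},\quad i^\ast = \operatorname*{arg\,min}_{i}\ \lVert d_i - \hat{d}_{t-1} \rVert_2
  & \text{if } c_t = c_{t-1}
\end{cases}
```

The first case is the confidence-based choice and the second is the association-based choice.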

FIG. 4 is a diagram showing the confidence-based and association-based mask decision algorithms that are selectively used in the test method for an AI agent according to the Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.

FIG. 4 shows how the t-th test predicted mask is determined when the (t-2)-th test predicted class does not correspond to a drawer but the (t-1)-th and t-th test predicted classes both correspond to a drawer. In this case, the object with which the AI agent 100 interacts at the timing of the (t-2)-th test input frame differs from the object with which it interacts at the timing of the (t-1)-th test input frame. Accordingly, the (t-1)-th test candidate mask with the highest confidence score, 0.912, was determined as the (t-1)-th test predicted mask. However, the object with which the AI agent 100 interacts at the timing of the t-th test input frame is very likely to be the same as the one at the timing of the (t-1)-th test input frame. Therefore, in order to interact with the object it interacted with at the timing of the (t-1)-th test input frame, the AI agent 100 determines the t-th test predicted mask in an association-based manner: it selects the candidate that matches the previously selected (t-1)-th test predicted mask, even though its confidence score of 0.760 is not the highest.
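The two selection modes in the FIG. 4 example can be sketched in a few lines of Python. This is an illustration under assumptions: the `Candidate` type, the `select_mask` function, the pixel coordinates, the class name "Cabinet", and the competing scores 0.655 and 0.881 are hypothetical. Only the scores 0.912 and 0.760 come from the description above, and Euclidean distance between mask centers is an assumed distance measure.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    center: tuple   # (x, y) center of the candidate mask, hypothetical pixel units
    score: float    # confidence score reported by the mask generator


def select_mask(candidates, cls_t, cls_prev, prev_center):
    """Pick the predicted mask among the candidates for the current class."""
    if cls_prev is None or prev_center is None or cls_t != cls_prev:
        # Confidence-based: the class changed, so take the most confident candidate.
        return max(candidates, key=lambda c: c.score)
    # Association-based: same class as before, so stay with the nearest candidate.
    return min(candidates, key=lambda c: math.dist(c.center, prev_center))


# Step t-1: the class changed to "Drawer", so the 0.912 candidate is chosen.
step_prev = [Candidate((40, 120), 0.912), Candidate((200, 120), 0.655)]
chosen_prev = select_mask(step_prev, "Drawer", "Cabinet", None)

# Step t: the class is still "Drawer"; the same drawer now scores only 0.760.
step_t = [Candidate((42, 118), 0.760), Candidate((198, 121), 0.881)]
chosen_t = select_mask(step_t, "Drawer", "Drawer", chosen_prev.center)
print(chosen_prev.score, chosen_t.score)  # 0.912 0.76
```

A purely confidence-based rule would have switched to the other drawer at step t, which is the failure the association-based rule is meant to avoid.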

Hereinafter, the process of the APM 300 in the test phase is described. That is, the AI agent 100 may cause the APM natural language encoder 310 included in the APM 300 to apply an operation using its APM encoding BiLSTM to the test instruction data, thereby obtaining a t-th test APM attention natural language feature. The AI agent 100 may then cause the APM dynamic filter operator 320 included in the APM 300 to apply an operation using its APM filter-generating FC network to the t-th test APM attention natural language feature, thereby generating at least one t-th test APM dynamic filter, and then to apply an operation using the t-th test APM dynamic filter to the t-th test visual feature, thereby obtaining a t-th test APM attention visual feature. The AI agent 100 may then cause the action decoder 330 included in the APM 300 to apply an operation using its APM decoding LSTM to a t-th test APM integrated vector including the t-th test APM attention natural language feature, the t-th test APM attention visual feature, and the (t-1)-th test predicted action information, thereby generating a t-th test APM decoding hidden state vector, and then to apply an operation using its APM decoding FC network to a vector including the t-th test APM integrated vector and the t-th test APM decoding hidden state vector, thereby generating t-th test predicted action information. In the learning phase, the predicted action information is generated using only the APM decoding FC network, whereas in the test phase an obstacle avoider (not shown) is used as well, as described below.

That is, the AI agent 100 may cause the obstacle avoider included in the APM 300 to determine whether an obstacle has been encountered by referring to the t-th test visual feature and the (t-1)-th test visual feature. Here, whether an obstacle has been encountered can be inferred by calculating the similarity between the t-th test visual feature and the (t-1)-th test visual feature (the similarity may be calculated with a distance algorithm between vectors, such that the greater the distance, the lower the similarity). For example, if the similarity is greater than or equal to a threshold, it can be inferred that the AI agent 100 is blocked by an obstacle and cannot move (because it cannot move, the features of the input frames will be similar to each other). If the similarity is less than or equal to the threshold, the AI agent 100 can be regarded as moving normally and therefore as not blocked by an obstacle. When the AI agent 100 is blocked by an obstacle, repeating at the timing of the t-th input frame the action performed at the timing of the (t-1)-th input frame would leave it blocked, so the AI agent 100 may prevent that action from being selected. That is, by default, an operation using the APM decoding FC network is applied to a vector including the t-th test APM integrated vector and the t-th test APM decoding hidden state vector, and the action information corresponding to the highest of the resulting values is selected as the t-th test predicted action information. When it is determined that an obstacle has been encountered, however, the value output from the specific output node of the APM decoding FC network that corresponds to the (t-1)-th test predicted action information is excluded, and the action information corresponding to the highest of the remaining values may be selected as the t-th test predicted action information. This can be expressed as the following formula.

Since the meaning of each symbol has already been explained with the preceding formulas, the explanation is omitted here. A practical example of this obstacle avoidance algorithm is described below with reference to FIG. 5.
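The formula itself is not reproduced in this text. One way to write the described rule is shown below. The symbol names are assumptions introduced for illustration: v_t for the t-th test visual feature, \mathrm{sim} for the similarity measure, \epsilon for the threshold, u_t for the t-th test APM integrated vector, h_t for the t-th test APM decoding hidden state vector, and a_{t-1} for the (t-1)-th test predicted action. The description lets both cases apply when the similarity equals the threshold; this sketch assigns that case to the blocked branch.

```latex
a_t =
\begin{cases}
\operatorname*{arg\,max}_{k}\ \mathrm{FC}_{\text{APM}}\big([u_t;\ h_t]\big)_k
  & \text{if } \mathrm{sim}(v_{t-1}, v_t) < \epsilon \\[4pt]
\operatorname*{arg\,max}_{k \neq a_{t-1}}\ \mathrm{FC}_{\text{APM}}\big([u_t;\ h_t]\big)_k
  & \text{if } \mathrm{sim}(v_{t-1}, v_t) \geq \epsilon
\end{cases}
```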

FIG. 5 is a diagram showing the obstacle avoidance algorithm used in the test method for an AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention.

Referring to FIG. 5, the AI agent moved straight ahead at the timing corresponding to the (t-1)-th input frame and thereby ran into an obstacle, and as a result the similarity between the (t-1)-th test visual feature and the t-th test visual feature is greater than or equal to the threshold. Had this been the learning process, the straight-ahead action would have been selected as the t-th test predicted action information. Because of the obstacle avoidance algorithm, however, the AI agent 100 selects the right turn, which has the next highest value. As the example of selecting the (t+1)-th test predicted action information for the (t+1)-th input frame shows, when there is no obstacle the action corresponding to the largest output value of the APM decoding FC network is selected.
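The behavior in the FIG. 5 example can be sketched as follows. This is a minimal illustration under assumptions: the function names, the action ordering, the score values, the threshold of 0.9, and the particular similarity function (the inverse of one plus the Euclidean distance) are hypothetical. The description only requires that similarity decrease as the distance between feature vectors grows.

```python
import math


def similarity(feat_a, feat_b):
    """Assumed similarity: 1.0 for identical vectors, smaller as distance grows."""
    return 1.0 / (1.0 + math.dist(feat_a, feat_b))


def choose_action(scores, prev_action, feat_prev, feat_curr, threshold):
    """Return the index of the selected action.

    scores: outputs of the APM decoding FC network, one per action.
    If consecutive visual features are too similar, the agent is assumed to be
    blocked and the previously taken action is excluded from the choice.
    """
    blocked = (prev_action is not None
               and similarity(feat_prev, feat_curr) >= threshold)
    allowed = [k for k in range(len(scores))
               if not (blocked and k == prev_action)]
    if not allowed:  # degenerate case: only one action exists
        allowed = list(range(len(scores)))
    return max(allowed, key=lambda k: scores[k])


ACTIONS = ["MoveAhead", "RotateRight", "RotateLeft"]  # hypothetical ordering
scores = [0.7, 0.2, 0.1]                              # hypothetical FC outputs

# Step t: the frame did not change after moving ahead, so MoveAhead is excluded.
a_t = choose_action(scores, 0, [0.5, 0.5], [0.5, 0.5], threshold=0.9)
# Step t+1: the frame changed, so the plain argmax is used again.
a_next = choose_action(scores, a_t, [0.0, 0.0], [3.0, 4.0], threshold=0.9)
print(ACTIONS[a_t], ACTIONS[a_next])  # RotateRight MoveAhead
```

The rule needs no extra training: it only masks one output node of the already trained FC network at test time.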

When the t-th test predicted mask and the t-th test predicted action information have been obtained in this way, the AI agent 100 can apply the action corresponding to the t-th test predicted action information to the specific object corresponding to the t-th test predicted mask.

Hereinafter, the advantages of the AI agent according to the MOCA model including dual task streams of Interactive Perception and Action Policy, according to an embodiment of the present invention, are described.

The above experimental results were obtained with the ALFRED benchmark, which runs in the AI2-THOR virtual environment. "Seen" refers to the case where the dataset used at test time is a subset of the dataset used at training time, and "unseen" refers to the case where the two datasets are entirely different. The Task column indicates the completion rate of the final goal that the AI agent is intended to achieve according to the instruction, and the Goal-Cond column may indicate, for each task, the completion rate of the individual sub-tasks that make up the instruction. As the table above shows, the approach applying the MOCA of the present invention recorded an overwhelmingly higher completion rate in every respect. The MOCA of the present invention showed excellent performance not only in these quantitative results but also in qualitative analysis.
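The difference between the two metrics can be made concrete with a toy calculation. The episode data below are hypothetical and are not results from the experiment. Averaging the per-episode ratio for Goal-Cond is also an assumption, since the description only says that it is the completion rate of the sub-tasks making up each instruction.

```python
def task_success_rate(episodes):
    """Fraction of episodes in which every sub-goal was completed."""
    done = sum(1 for e in episodes if e["satisfied"] == e["total"])
    return done / len(episodes)


def goal_condition_rate(episodes):
    """Mean over episodes of the fraction of sub-goals completed (assumed averaging)."""
    return sum(e["satisfied"] / e["total"] for e in episodes) / len(episodes)


# Hypothetical episodes: number of sub-goals completed out of the total required.
episodes = [
    {"satisfied": 4, "total": 4},  # fully completed task
    {"satisfied": 1, "total": 3},  # partial progress
    {"satisfied": 0, "total": 2},  # no progress
]
task_rate = task_success_rate(episodes)   # 1 of 3 episodes fully completed
goal_rate = goal_condition_rate(episodes)  # (1 + 1/3 + 0) / 3
```

Goal-Cond gives credit for partial progress, so it is never lower than the Task rate computed on the same episodes.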

The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations based on this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also everything modified equally or equivalently to those claims falls within the scope of the spirit of the present invention.

Claims

A learning method for an AI agent according to a Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, the method comprising:
(a) when a t-th input frame (t being an integer greater than or equal to 1) is obtained, inputting, by the AI agent, a t-th visual feature for the t-th input frame and instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th predicted class and t-th predicted action information from the IPM and the APM, respectively; and
(b) generating, by the AI agent, an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th predicted classes (N being an integer greater than or equal to t) including the t-th predicted class, (i-2) at least some of first to N-th predicted action information including the t-th predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM.

The method of claim 1, wherein step (a) comprises:
(a11) causing, by the AI agent, an IPM natural language encoder included in the IPM to apply an operation using an IPM encoding BiLSTM included therein to the instruction data, thereby obtaining a t-th IPM attention natural language feature;
(a12) causing, by the AI agent, an IPM dynamic filter operator included in the IPM to apply an operation using an IPM filter-generating Fully-Connected (FC) network included therein to the t-th IPM attention natural language feature, thereby generating at least one t-th IPM dynamic filter, and then to apply an operation using the t-th IPM dynamic filter to the t-th visual feature, thereby obtaining a t-th IPM attention visual feature; and
(a13) causing, by the AI agent, a class decoder included in the IPM to apply an operation using an IPM decoding LSTM included therein to a t-th IPM integrated vector including the t-th IPM attention natural language feature, the t-th IPM attention visual feature, and (t-1)-th predicted action information, thereby generating a t-th IPM decoding hidden state vector, and then to apply an operation using an IPM decoding FC network included therein to the t-th IPM decoding hidden state vector, thereby generating the t-th predicted class.

The method of claim 1, wherein step (a) comprises:
(a21) causing, by the AI agent, an APM natural language encoder included in the APM to apply an operation using an APM encoding BiLSTM included therein to the instruction data, thereby obtaining a t-th APM attention natural language feature;
(a22) causing, by the AI agent, an APM dynamic filter operator included in the APM to apply an operation using an APM filter-generating FC network included therein to the t-th APM attention natural language feature, thereby generating at least one t-th APM dynamic filter, and then to apply an operation using the t-th APM dynamic filter to the t-th visual feature, thereby obtaining a t-th APM attention visual feature; and
(a23) causing, by the AI agent, an action decoder included in the APM to apply an operation using an APM decoding LSTM included therein to a t-th APM integrated vector including the t-th APM attention natural language feature, the t-th APM attention visual feature, and (t-1)-th predicted action information, thereby generating a t-th APM decoding hidden state vector, and then to apply an operation using an APM decoding FC network included therein to a vector including the t-th APM integrated vector and the t-th APM decoding hidden state vector, thereby generating the t-th predicted action information.

A test method for an AI agent according to a Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy, the method comprising:
(a) in a state in which learning has been completed by the AI agent performing (1) a process of, when a t-th training input frame (t being an integer greater than or equal to 1) is obtained, inputting a t-th training visual feature for the t-th training input frame and training instruction data in natural language form for the AI agent into an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th training predicted class and t-th training predicted action information from the IPM and the APM, respectively, and (2) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th training predicted classes (N being an integer greater than or equal to t) including the t-th training predicted class, (i-2) at least some of first to N-th training predicted action information including the t-th training predicted action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM, when a t-th test input frame is obtained, inputting, by the AI agent, a t-th test visual feature for the t-th test input frame and test instruction data in natural language form for the AI agent into the IPM and the APM, thereby obtaining a t-th test predicted mask and t-th test predicted action information from the IPM and the APM, respectively; and
(b) performing, by the AI agent, at least part of an operation according to the test instruction data by referring to the t-th test predicted mask and the t-th test predicted action information.

According to claim 4,
In step (a),
(a11) causing, by the AI agent, the IPM natural language encoder included in the IPM to obtain a t-th IPM attention natural language feature for testing by applying an operation using the IPM encoding BiLSTM included therein to the instruction data for testing;
(a12) causing, by the AI agent, the IPM dynamic filter operator included in the IPM to generate at least one t-th IPM dynamic filter for testing by applying an operation using the IPM filter-generating Fully-Connected (FC) network included therein to the t-th IPM attention natural language feature for testing, and then to obtain a t-th IPM attention visual feature for testing by applying an operation using the t-th IPM dynamic filter for testing to the t-th visual feature for testing;
(a13) causing, by the AI agent, the class decoder included in the IPM to generate a t-th IPM decoding hidden state vector for testing by applying an operation using the IPM decoding LSTM included therein to a t-th IPM integrated vector for testing that includes the t-th IPM attention natural language feature for testing, the t-th IPM attention visual feature for testing, and the (t-1)-th prediction action information for testing, and then to generate a t-th prediction class for testing by applying an operation using the IPM decoding FC network included therein to the t-th IPM decoding hidden state vector for testing; and
(a14) causing, by the AI agent, the object-centric localizer included in the IPM to generate the t-th prediction mask for testing by referring to the t-th prediction class for testing;
A method comprising the above steps.
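
As an illustration of the dynamic filter step (a12), the following minimal sketch generates filters from a language feature with a single fully connected layer and applies each filter to the visual feature as a dot product at every spatial location. The function names, the list-based data layout, the absence of a non-linearity, and the 1x1-convolution form of the filter operation are all assumptions made for illustration; the claim only states that an FC network generates the filters and that the filters are applied to the visual feature.

```python
def linear(weights, bias, x):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def generate_dynamic_filters(lang_feature, fc_weights, fc_bias, num_filters):
    """Map the attended language feature to `num_filters` equally sized filters.

    Hypothetical helper: the FC output is simply cut into consecutive chunks.
    """
    flat = linear(fc_weights, fc_bias, lang_feature)
    size = len(flat) // num_filters
    return [flat[i * size:(i + 1) * size] for i in range(num_filters)]


def apply_dynamic_filters(filters, visual_feature):
    """Apply every filter to every spatial location (assumed 1x1 convolution).

    `visual_feature` is a list of locations, each a list of channel values.
    Returns one response map (list over locations) per filter.
    """
    return [
        [sum(f * c for f, c in zip(filt, channels)) for channels in visual_feature]
        for filt in filters
    ]


# Toy example: a 2-d language feature produces two 2-channel filters.
lang = [1.0, 3.0]
fc_w = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
fc_b = [0.0, 0.0, 0.0, 0.0]
filters = generate_dynamic_filters(lang, fc_w, fc_b, num_filters=2)
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three locations, two channels
attended = apply_dynamic_filters(filters, visual)
```

Because the filters depend on the instruction, the same visual feature yields different response maps for different instructions, which is the point of conditioning the filters on language.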

According to claim 5,
In the step (a14),
(a141) causing, by the AI agent, the object-centric localizer included in the IPM to generate, using a mask generator included therein, at least one t-th candidate mask for testing that corresponds to the t-th prediction class for testing on the t-th input frame for testing; and
(a142) causing, by the AI agent, the object-centric localizer included in the IPM to determine, using an instance determiner included therein, one of the at least one t-th candidate mask for testing as the t-th prediction mask for testing;
A method comprising the above steps.

According to claim 6,
In the step (a142),
A method characterized in that the AI agent causes the object-centric localizer to use the instance determiner to (i) compare the t-th prediction class for testing with the (t-1)-th prediction class for testing, and then (ii-1) if the t-th prediction class for testing and the (t-1)-th prediction class for testing are identical, determine, as the t-th prediction mask for testing, the t-th candidate mask for testing that has the highest similarity to the (t-1)-th prediction mask for testing, and (ii-2) if the two prediction classes are not identical, determine, as the t-th prediction mask for testing, the t-th candidate mask for testing that has the highest confidence score.
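
The selection rule of step (a142) can be sketched as follows. This is a minimal illustration, not the claimed implementation: masks are represented as sets of pixel coordinates, and intersection-over-union is assumed as the similarity measure, since the claim only speaks of "similarity". The names `mask_iou` and `select_mask` are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0


def select_mask(candidates, pred_class, prev_class, prev_mask):
    """Pick the prediction mask among candidates of the current predicted class.

    candidates: list of (mask, confidence) pairs from the mask generator.
    Same class as the previous step: keep the instance closest to the previous
    mask. Different class (or no previous mask): take the most confident one.
    """
    if not candidates:
        return None
    if prev_mask is not None and pred_class == prev_class:
        return max(candidates, key=lambda c: mask_iou(c[0], prev_mask))[0]
    return max(candidates, key=lambda c: c[1])[0]


# Two candidate instances of the same class; the second overlaps the previous mask.
cands = [({(0, 0), (0, 1)}, 0.9), ({(5, 5), (5, 6)}, 0.6)]
previous = {(5, 5), (5, 6), (5, 7)}
same_class_choice = select_mask(cands, "Mug", "Mug", previous)
new_class_choice = select_mask(cands, "Mug", "Sink", previous)
```

The first branch keeps the agent interacting with the same object instance across consecutive steps even when another instance of that class has a higher confidence score.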

According to claim 4,
In step (a),
(a21) causing, by the AI agent, the APM natural language encoder included in the APM to obtain a t-th APM attention natural language feature for testing by applying an operation using the APM encoding BiLSTM included therein to the instruction data for testing;
(a22) causing, by the AI agent, the APM dynamic filter operator included in the APM to generate at least one t-th APM dynamic filter for testing by applying an operation using the APM filter-generating FC network included therein to the t-th APM attention natural language feature for testing, and then to obtain a t-th APM attention visual feature for testing by applying an operation using the t-th APM dynamic filter for testing to the t-th visual feature for testing; and
(a23) causing, by the AI agent, the action decoder included in the APM to generate a t-th APM decoding hidden state vector for testing by applying an operation using the APM decoding LSTM included therein to a t-th APM integrated vector for testing that includes the t-th APM attention natural language feature for testing, the t-th APM attention visual feature for testing, and the (t-1)-th prediction action information for testing, and then to generate the t-th prediction action information for testing by applying an operation using the APM decoding FC network included therein to a vector that includes the t-th APM integrated vector for testing and the t-th APM decoding hidden state vector for testing;
A method comprising the above steps.

According to claim 8,
In the step (a23),
A method characterized in that, in a state in which the AI agent has caused an obstacle avoider included in the APM to determine whether an obstacle has been encountered by referring to the t-th visual feature for testing and the (t-1)-th visual feature for testing, if it is determined that the obstacle has been encountered, the AI agent performs the operation using the APM decoding FC network and then selects, as the t-th prediction action information for testing, the action corresponding to the highest value among the values output from the output nodes of the APM decoding FC network, excluding the specific value output from the specific output node that corresponds to the (t-1)-th prediction action information for testing.

According to claim 9,
In the step (a23),
A method characterized in that the AI agent causes the obstacle avoider included in the APM to calculate the similarity between the t-th visual feature for testing and the (t-1)-th visual feature for testing, and, if the similarity is greater than or equal to a threshold value, determines that the obstacle has been encountered.
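
The obstacle-avoidance rule of the two preceding claims can be sketched as below. It is only an illustration under stated assumptions: cosine similarity stands in for the unspecified similarity measure, the threshold of 0.99 is a placeholder value, visual features are flat lists of numbers, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def obstacle_encountered(curr_feat, prev_feat, threshold):
    """Nearly unchanged consecutive views suggest the last action had no effect."""
    return cosine_similarity(curr_feat, prev_feat) >= threshold


def choose_action(action_scores, prev_action, curr_feat, prev_feat, threshold=0.99):
    """Return the index of the best-scoring action.

    If an obstacle is detected, the output node of the previous action is
    excluded before taking the maximum.
    """
    candidates = list(range(len(action_scores)))
    if prev_action is not None and obstacle_encountered(curr_feat, prev_feat, threshold):
        candidates = [i for i in candidates if i != prev_action]
    return max(candidates, key=lambda i: action_scores[i])


scores = [0.1, 0.7, 0.2]  # outputs of the action decoder's FC network
blocked = choose_action(scores, prev_action=1, curr_feat=[1.0, 2.0, 3.0], prev_feat=[1.0, 2.0, 3.0])
free = choose_action(scores, prev_action=1, curr_feat=[1.0, 0.0, 0.0], prev_feat=[0.0, 1.0, 0.0])
```

The sketch assumes at least two available actions; with a single action, excluding the previous one would leave nothing to choose from.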

In an AI agent that performs learning according to the Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy,
one or more memories to store instructions; and
one or more processors configured to execute the instructions, wherein the processor performs: (I) a process of, when a t-th input frame is obtained, where t is an integer greater than or equal to 1, inputting a t-th visual feature for the t-th input frame and instruction data in natural language form for the AI agent to an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th prediction class and t-th prediction action information from the IPM and the APM, respectively; and (II) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th prediction classes including the t-th prediction class, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th prediction action information including the t-th prediction action information, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM.
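
As a rough illustration of how the two losses could be formed over an episode of N steps, the sketch below averages a per-step cross-entropy between the predicted distributions and the ground-truth indices. The claim does not name a loss function, so cross-entropy, the averaging over steps, and the function names are assumptions; backpropagation itself is left to an automatic-differentiation framework and is not shown.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the ground-truth index under `probs`."""
    return -math.log(probs[target])


def episode_losses(class_probs, gt_classes, action_probs, gt_actions):
    """Return (ipm_loss, apm_loss), each averaged over the supplied time steps.

    class_probs / action_probs: one probability distribution per time step.
    gt_classes / gt_actions: one ground-truth index per time step.
    """
    ipm_loss = sum(
        cross_entropy(p, g) for p, g in zip(class_probs, gt_classes)
    ) / len(gt_classes)
    apm_loss = sum(
        cross_entropy(p, g) for p, g in zip(action_probs, gt_actions)
    ) / len(gt_actions)
    return ipm_loss, apm_loss


ipm, apm = episode_losses(
    class_probs=[[0.5, 0.5], [0.25, 0.75]],
    gt_classes=[0, 1],
    action_probs=[[1.0, 0.0]],
    gt_actions=[0],
)
```

Keeping the two losses separate mirrors the two task streams: each loss updates the parameters of its own module.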

According to claim 11,
The (I) process,
(I11) a process of causing the IPM natural language encoder included in the IPM to obtain a t-th IPM attention natural language feature by applying an operation using the IPM encoding BiLSTM included therein to the instruction data;
(I12) a process of causing the IPM dynamic filter operator included in the IPM to generate at least one t-th IPM dynamic filter by applying an operation using the IPM filter-generating Fully-Connected (FC) network included therein to the t-th IPM attention natural language feature, and then to obtain a t-th IPM attention visual feature by applying an operation using the t-th IPM dynamic filter to the t-th visual feature; and
(I13) a process of causing the class decoder included in the IPM to generate a t-th IPM decoding hidden state vector by applying an operation using the IPM decoding LSTM included therein to a t-th IPM integrated vector that includes the t-th IPM attention natural language feature, the t-th IPM attention visual feature, and the (t-1)-th prediction action information, and then to generate the t-th prediction class by applying an operation using the IPM decoding FC network included therein to the t-th IPM decoding hidden state vector;
An AI agent comprising the above processes.

According to claim 11,
The (I) process,
(I21) a process of causing the APM natural language encoder included in the APM to obtain a t-th APM attention natural language feature by applying an operation using the APM encoding BiLSTM included therein to the instruction data;
(I22) a process of causing the APM dynamic filter operator included in the APM to generate at least one t-th APM dynamic filter by applying an operation using the APM filter-generating FC network included therein to the t-th APM attention natural language feature, and then to obtain a t-th APM attention visual feature by applying an operation using the t-th APM dynamic filter to the t-th visual feature; and
(I23) a process of causing the action decoder included in the APM to generate a t-th APM decoding hidden state vector by applying an operation using the APM decoding LSTM included therein to a t-th APM integrated vector that includes the t-th APM attention natural language feature, the t-th APM attention visual feature, and the (t-1)-th prediction action information, and then to generate the t-th prediction action information by applying an operation using the APM decoding FC network included therein to a vector that includes the t-th APM integrated vector and the t-th APM decoding hidden state vector;
An AI agent comprising the above processes.

In an AI agent that performs a test according to the Modular Object-Centric Approach (MOCA) model including dual task streams of Interactive Perception and Action Policy,
one or more memories to store instructions; and
one or more processors configured to execute the instructions, wherein the processor performs: (I) a process of, in a state in which learning has been completed by performing (1) a process of, when a t-th input frame for learning is obtained, where t is an integer greater than or equal to 1, inputting a t-th visual feature for learning for the t-th input frame for learning and instruction data for learning in natural language form for the AI agent to an Interactive Perception Module (IPM) and an Action Policy Module (APM), thereby obtaining a t-th prediction class for learning and t-th prediction action information for learning from the IPM and the APM, respectively; and (2) a process of generating an IPM loss and an APM loss by referring to (i-1) at least some of first to N-th prediction classes for learning including the t-th prediction class for learning, where N is an integer greater than or equal to t, (i-2) at least some of first to N-th prediction action information for learning including the t-th prediction action information for learning, and (ii) Ground-Truth (GT) information for each of them, and then backpropagating the losses to learn at least some of the parameters of the IPM and the APM, inputting, when a t-th input frame for testing is obtained, a t-th visual feature for testing for the t-th input frame for testing and instruction data for testing in natural language form for the AI agent to the IPM and the APM, thereby obtaining a t-th prediction mask for testing and t-th prediction action information for testing from the IPM and the APM, respectively; and (II) a process of performing at least part of an operation according to the instruction data for testing by referring to the t-th prediction mask for testing and the t-th prediction action information for testing.

According to claim 14,
The (I) process,
(I11) a process of causing the IPM natural language encoder included in the IPM to obtain a t-th IPM attention natural language feature for testing by applying an operation using the IPM encoding BiLSTM included therein to the instruction data for testing;
(I12) a process of causing the IPM dynamic filter operator included in the IPM to generate at least one t-th IPM dynamic filter for testing by applying an operation using the IPM filter-generating Fully-Connected (FC) network included therein to the t-th IPM attention natural language feature for testing, and then to obtain a t-th IPM attention visual feature for testing by applying an operation using the t-th IPM dynamic filter for testing to the t-th visual feature for testing;
(I13) a process of causing the class decoder included in the IPM to generate a t-th IPM decoding hidden state vector for testing by applying an operation using the IPM decoding LSTM included therein to a t-th IPM integrated vector for testing that includes the t-th IPM attention natural language feature for testing, the t-th IPM attention visual feature for testing, and the (t-1)-th prediction action information for testing, and then to generate a t-th prediction class for testing by applying an operation using the IPM decoding FC network included therein to the t-th IPM decoding hidden state vector for testing; and
(I14) a process of causing the object-centric localizer included in the IPM to generate the t-th prediction mask for testing by referring to the t-th prediction class for testing;
An AI agent comprising the above processes.

According to claim 15,
The (I14) process,
(I141) a process of causing the object-centric localizer included in the IPM to generate, using a mask generator included therein, at least one t-th candidate mask for testing that corresponds to the t-th prediction class for testing on the t-th input frame for testing; and
(I142) a process of causing the object-centric localizer included in the IPM to determine, using an instance determiner included therein, one of the at least one t-th candidate mask for testing as the t-th prediction mask for testing;
An AI agent comprising the above processes.

According to claim 16,
The (I142) process,
An AI agent characterized in that the processor causes the object-centric localizer to use the instance determiner to (i) compare the t-th prediction class for testing with the (t-1)-th prediction class for testing, and then (ii-1) if the t-th prediction class for testing and the (t-1)-th prediction class for testing are identical, determine, as the t-th prediction mask for testing, the t-th candidate mask for testing that has the highest similarity to the (t-1)-th prediction mask for testing, and (ii-2) if the two prediction classes are not identical, determine, as the t-th prediction mask for testing, the t-th candidate mask for testing that has the highest confidence score.

According to claim 14,
The (I) process,
(I21) a process of causing the APM natural language encoder included in the APM to obtain a t-th APM attention natural language feature for testing by applying an operation using the APM encoding BiLSTM included therein to the instruction data for testing;
(I22) a process of causing the APM dynamic filter operator included in the APM to generate at least one t-th APM dynamic filter for testing by applying an operation using the APM filter-generating FC network included therein to the t-th APM attention natural language feature for testing, and then to obtain a t-th APM attention visual feature for testing by applying an operation using the t-th APM dynamic filter for testing to the t-th visual feature for testing; and
(I23) a process of causing the action decoder included in the APM to generate a t-th APM decoding hidden state vector for testing by applying an operation using the APM decoding LSTM included therein to a t-th APM integrated vector for testing that includes the t-th APM attention natural language feature for testing, the t-th APM attention visual feature for testing, and the (t-1)-th prediction action information for testing, and then to generate the t-th prediction action information for testing by applying an operation using the APM decoding FC network included therein to a vector that includes the t-th APM integrated vector for testing and the t-th APM decoding hidden state vector for testing;
An AI agent comprising the above processes.

According to claim 18,
The (I23) process,
An AI agent characterized in that, in a state in which the processor has caused the obstacle avoider included in the APM to determine whether an obstacle has been encountered by referring to the t-th visual feature for testing and the (t-1)-th visual feature for testing, if it is determined that the obstacle has been encountered, the processor performs the operation using the APM decoding FC network and then selects, as the t-th prediction action information for testing, the action corresponding to the highest value among the values output from the output nodes of the APM decoding FC network, excluding the specific value output from the specific output node that corresponds to the (t-1)-th prediction action information for testing.

According to claim 19,
The (I23) process,
An AI agent characterized in that the processor causes the obstacle avoider included in the APM to calculate the similarity between the t-th visual feature for testing and the (t-1)-th visual feature for testing, and, if the similarity is greater than or equal to a threshold value, determines that the obstacle has been encountered.