KR102649948B1 - Text augmentation apparatus and method using hierarchy-based word replacement

Info

Publication number: KR102649948B1
Application number: KR1020210159638A
Authority: KR
Inventors: 김남규; 김무성
Original assignee: 국민대학교산학협력단
Priority date: 2021-07-20
Filing date: 2021-11-18
Publication date: 2024-03-22
Also published as: KR20230014040A

Abstract

The present invention relates to an apparatus and method for augmenting text data through hierarchy-based word replacement. The apparatus includes: receiving input text and separating it into a plurality of tokens; selecting at least one candidate token from among the plurality of tokens and defining and classifying a part of speech for each candidate token; and generating augmented text for the input text by replacing the at least one candidate token with a replacement token that forms a hierarchical relationship with it according to its part of speech.

Description

Text data augmentation device and method through hierarchy-based word replacement {TEXT AUGMENTATION APPARATUS AND METHOD USING HIERARCHY-BASED WORD REPLACEMENT}

The present invention relates to data augmentation technology and, more specifically, to an apparatus and method for augmenting text data through hierarchy-based word replacement, which augment text data by replacing words using hierarchy information, one of the part-of-speech-specific features of words.

With the arrival of the artificial intelligence era, the supply of and demand for unstructured data have grown exponentially, and research that uses deep learning algorithms to analyze various types of unstructured data is being actively conducted. A deep learning algorithm uses a neural network model built by stacking multiple layers, and may proceed by discovering features of the data and learning meaningful representations at each layer. Deep learning is widely used in natural language processing, speech recognition, image classification, object detection, and similar tasks.

Recent deep learning research tends to extend to multimodal deep learning, which simultaneously learns from data with different feature dimensions. Unlike single-modal learning, which learns from data with one feature dimension, multimodal deep learning can improve learning performance by using various data such as text, images, audio, and video in a complementary manner. In particular, multimodal deep learning that handles text and image data together is being actively studied, and a representative application is 'Text to Image' synthesis. Text to Image synthesis is a technology that outputs an appropriate image corresponding to input text, and much of the research is based on generative adversarial networks (GAN).

ReStGAN is a well-known application of Text to Image synthesis. ReStGAN is an algorithm developed by Amazon and applied to a clothing search system; it can generate clothing that matches the product description entered by a customer. Text to Image synthesis attracts considerable attention as a technology with high potential in many fields, but generating an image that properly reflects the meaning of the text is quite difficult. This is because many different words can be used to describe the same image in text, and even the same word can be interpreted differently depending on the context. In other words, the key issue is mapping the features of the text data to the features of the image data well.

To map the features of such heterogeneous data, Text to Image synthesis fundamentally requires a vast amount of image and text data for training, and each image must be paired with a plurality of texts describing it. However, because only a limited amount of data composed of such image-text pairs is publicly available, it is very difficult to secure enough training data for Text to Image synthesis.

Korean Patent No. 10-1973642 (2019.04.23)

An embodiment of the present invention seeks to provide an apparatus and method for augmenting text data through hierarchy-based word replacement, which augment text data by replacing words using hierarchy information, one of the part-of-speech-specific features of words.

Among the embodiments, an apparatus for augmenting text data through hierarchy-based word replacement includes: receiving input text and separating it into a plurality of tokens; selecting at least one candidate token from among the plurality of tokens and defining and classifying a part of speech for each candidate token; and generating augmented text for the input text by replacing the at least one candidate token with a replacement token that forms a hierarchical relationship with it according to its part of speech.

The token separation unit may generate the plurality of tokens by applying a tokenizer to the input text.

The part-of-speech classification unit may remove, from the plurality of tokens, the stopwords listed in a predefined stopword dictionary, and may determine n tokens (where n is a natural number) randomly selected from the remaining tokens as the at least one candidate token.
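The candidate-selection step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stopword set `STOPWORDS`, the function name `select_candidates`, and the choice to return (position, token) pairs are all assumptions made for the example.

```python
import random

# Hypothetical stopword dictionary; a real system would load a predefined one.
STOPWORDS = {"a", "an", "the", "is", "on", "in", "of", "and"}

def select_candidates(tokens, n, rng=None):
    """Drop stopwords, then pick up to n of the remaining tokens at random.

    Returns (position, token) pairs so each candidate can later be
    replaced in place within the original token list.
    """
    rng = rng or random.Random()
    pool = [(i, t) for i, t in enumerate(tokens) if t.lower() not in STOPWORDS]
    k = min(n, len(pool))  # never ask for more tokens than are available
    return sorted(rng.sample(pool, k))

tokens = ["a", "small", "bird", "is", "sitting", "on", "the", "branch"]
candidates = select_candidates(tokens, n=2, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the selection reproducible, which is convenient when comparing augmentation runs.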

The part-of-speech classification unit may classify each of the at least one candidate token, based on its part of speech, as either a specific part of speech or another part of speech other than the specific part of speech.

The text augmentation unit may define an independent replacement rule for each part of speech and replace the at least one candidate token with a replacement token chosen by the replacement rule for its part of speech.

The text augmentation unit may, depending on the part of speech, replace a candidate token with one randomly selected from among a plurality of synonyms.

The text augmentation unit may selectively perform an operation of replacing candidate tokens classified as the specific part of speech with a hypernym according to the hierarchical relationship and replacing candidate tokens classified as the other parts of speech with a synonym.

When a first hypernym that forms a hierarchical relationship with a candidate token classified as the specific part of speech exists, and a second hypernym that forms a hierarchical relationship with the first hypernym also exists, the text augmentation unit may replace the candidate token with each of the first and second hypernyms.
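To make the first-hypernym and second-hypernym case concrete, the sketch below walks a small hierarchy and emits one augmented sentence per hypernym level. The `HYPERNYM` table is invented for illustration and merely stands in for a lexical hierarchy such as WordNet; the function names are likewise hypothetical.

```python
# Hypothetical hypernym table standing in for a lexical hierarchy.
HYPERNYM = {"sparrow": "bird", "bird": "animal"}

def hypernym_chain(word, depth=2):
    """Return up to `depth` successive hypernyms of `word`, nearest first."""
    chain = []
    current = word
    for _ in range(depth):
        parent = HYPERNYM.get(current)
        if parent is None:
            break  # no further hypernym is known
        chain.append(parent)
        current = parent
    return chain

def augment_with_hypernyms(tokens, index, depth=2):
    """Produce one augmented sentence per hypernym level for tokens[index]."""
    results = []
    for hyper in hypernym_chain(tokens[index], depth):
        new_tokens = list(tokens)  # copy so the input is left untouched
        new_tokens[index] = hyper
        results.append(" ".join(new_tokens))
    return results

augmented = augment_with_hypernyms(["the", "sparrow", "sings"], index=1)
# augmented == ["the bird sings", "the animal sings"]
```

A single input sentence thus yields two augmented sentences, one per level of the hierarchy.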

Among the embodiments, a method for augmenting text data through hierarchy-based word replacement includes: receiving input text and separating it into a plurality of tokens; selecting at least one candidate token from among the plurality of tokens and defining and classifying a part of speech for each candidate token; and generating augmented text for the input text by replacing the at least one candidate token with a replacement token that forms a hierarchical relationship with it according to its part of speech.

The step of defining and classifying the part of speech may include classifying each of the at least one candidate token, based on its part of speech, as either a specific part of speech or another part of speech other than the specific part of speech.

The step of generating the augmented text may include defining an independent replacement rule for each part of speech and replacing the at least one candidate token with a replacement token chosen by the replacement rule for its part of speech.

The step of generating the augmented text may include selectively performing an operation of replacing candidate tokens classified as the specific part of speech with a hypernym according to the hierarchical relationship and replacing candidate tokens classified as the other parts of speech with a synonym.

The disclosed technology may have the following effects. However, this does not mean that a specific embodiment must include all of the following effects or only the following effects, so the scope of rights of the disclosed technology should not be understood as being limited thereby.

The apparatus and method for augmenting text data through hierarchy-based word replacement according to an embodiment of the present invention can augment text data by replacing words using hierarchy information, one of the part-of-speech-specific features of words.

The apparatus and method for augmenting text data through hierarchy-based word replacement according to an embodiment of the present invention can generate more text information for a limited number of images, and can thereby contribute to improving the accuracy of multimodal deep learning analysis that handles images and text together.

FIG. 1 is a diagram illustrating a data augmentation system according to the present invention.
FIG. 2 is a diagram illustrating the functional configuration of the data augmentation device of FIG. 1.
FIG. 3 is a flowchart illustrating a method of augmenting text data through hierarchy-based word replacement according to the present invention.
FIG. 4 is a diagram illustrating a text data augmentation process through hierarchy-based word replacement according to the present invention.
FIG. 5 is a diagram illustrating the process of constructing a final token set from input text.
FIG. 6 is a diagram illustrating the detailed procedure of the text data augmentation method through hierarchy-based word replacement according to the present invention.
FIGS. 7 to 10 are diagrams illustrating experimental results for the text data augmentation method through hierarchy-based word replacement according to the present invention.
FIG. 11 is a diagram illustrating the overall experimental process for performance analysis.
FIGS. 12 and 13 are diagrams illustrating the results of the comparative performance analysis.

The description of the present invention is merely a set of embodiments for structural or functional explanation, so the scope of the present invention should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only such effects, so the scope of the present invention should not be understood as limited thereby.

Meanwhile, the terms used in this application should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When a component is referred to as being "connected" to another component, it may be directly connected to that other component, but it should be understood that other components may exist in between. In contrast, when a component is referred to as being "directly connected" to another component, it should be understood that no other component exists in between. Other expressions that describe the relationship between components, such as "between" and "immediately between" or "neighboring" and "directly neighboring", should be interpreted in the same way.

Singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Identification codes for the steps (for example, a, b, c) are used for convenience of explanation and do not describe the order of the steps. Unless a specific order is clearly stated in context, the steps may occur in an order different from the one specified. That is, the steps may occur in the specified order, may be performed substantially simultaneously, or may be performed in the reverse order.

The present invention can be implemented as computer-readable code on a computer-readable recording medium, and the computer-readable recording medium includes all types of recording devices that store data readable by a computer system. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In addition, the computer-readable recording medium can be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as consistent with their meaning in the context of the related technology, and should not be interpreted as having an ideal or excessively formal meaning unless clearly so defined in the present application.

Deep learning may correspond to an algorithm that learns by using a neural network structure with deeply stacked hidden layers, and may include convolutional neural networks (CNN), recurrent neural networks (RNN), and generative adversarial networks (GAN). Deep learning algorithms can be used in multimodal learning, which learns representations of heterogeneous data features; for example, they can be used in Text to Image synthesis, a technology that generates an appropriate image corresponding to input text.

Much of the research on Text to Image synthesis is based on GAN, a generative algorithm. A GAN is a neural network in which a generator network and a discriminator network learn by competing adversarially: the generator produces fake data that resembles real data in order to fool the discriminator, and the discriminator learns to distinguish real data from generated data.
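For reference, this adversarial competition is commonly written as the minimax objective below. It is the standard formulation from the original GAN literature rather than something stated in this patent. Here $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the distribution of real data, and $p_z$ is the noise prior from which the generator samples.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by making $D(G(z))$ approach 1.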

In 2016, Scott Reed proposed a methodology for generating images from text with a simple GAN-based model structure. However, the simple model structure of early GANs has the limitation that it cannot generate high-resolution images. To address this limitation, StackGAN stacks two GANs to form two stages. The first stage of StackGAN sketches the object corresponding to the input text, and the second stage corrects defects in the object generated by the first stage and adds details, producing a better, higher-resolution image.

AttnGAN, meanwhile, applies an attention mechanism to a structure of several stacked GANs in order to address the limitation that word-level detail is not well utilized when text is encoded only as a global sentence vector. When generating a sub-region of the image from the input text, AttnGAN can attend more closely to the words related to that region and give them higher weights. In addition, AttnGAN computes a matching loss between the generated image and the input text to guide better generator training, and can thereby generate high-resolution images that reflect the meaning of the text more accurately.

Data augmentation is a technique that increases the amount of data through artificial changes in order to secure enough data for training. It is especially widely used to increase the amount of image data. In addition to traditional methods that increase the number of images through simple transformations such as flipping, color space changes, cropping, rotation, and translation, data augmentation algorithms that apply deep learning, such as transformations in a learned feature space, neural style transfer, or generation of new data using GANs, are also being proposed.

Recently, attempts to use data augmentation have been increasing not only for image data but also in natural language processing, and a representative line of work is text data augmentation based on lexical substitution. This kind of text data augmentation replaces words in a sentence with synonyms, and can use a thesaurus or an embedding model. Thesaurus-based augmentation mainly uses the relationships between synonyms defined in a tree structure in WordNet, and can generate several sentences with similar content by replacing some of the words in a sentence with synonyms. However, thesaurus-based augmentation such as that using WordNet not only requires considerable cost and time to build the thesaurus, but also cannot handle vocabulary that the thesaurus does not contain. Embedding-model-based augmentation, on the other hand, learns from a corpus and finds words whose vectors are similar to the vector of a word in the sentence; it can be implemented with word embedding algorithms such as Word2Vec, fastText, and GloVe.
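The embedding-based approach mentioned above amounts to a nearest-neighbor search under a similarity measure, typically cosine similarity. The sketch below illustrates the idea with three toy vectors that are invented for the example; a real system would use vectors learned by an embedding model such as Word2Vec, fastText, or GloVe.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_word(word):
    """Return the vocabulary word most similar to `word`, excluding itself."""
    others = (w for w in VECTORS if w != word)
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))

replacement = nearest_word("dog")
# replacement == "puppy"
```

Note that this search only captures similarity between words; it carries no notion of which word is the more general term.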

In addition, text data augmentation using deep learning is being actively studied. Examples include back-translation-based augmentation, which uses machine translation to add paraphrased sentences that preserve the meaning of the original sentence, and language-model-based augmentation, which fine-tunes large pre-trained language models such as BERT and GPT-2.

However, these text data augmentation methods augment text based only on similarity relationships between words, and have the limitation that they do not take the hierarchical relationships of words into account.

Hereinafter, the apparatus and method for augmenting text data through hierarchy-based word replacement according to the present invention are described in more detail with reference to FIGS. 1 to 6.

FIG. 1 is a diagram illustrating a data augmentation system according to the present invention.

Referring to FIG. 1, the data augmentation system 100 may include a user terminal 110, a data augmentation device 130, and a database 150.

The user terminal 110 may correspond to a computing device that is connected to the data augmentation device 130 to provide text data and receive the augmented data produced by data augmentation. The user terminal 110 may be implemented as a smartphone, a laptop, or a computer, but is not limited thereto and may also be implemented as various other devices such as a tablet PC. The user terminal 110 may be connected to the data augmentation device 130 through a network, and a plurality of user terminals 110 may be connected to the data augmentation device 130 at the same time.

The data augmentation device 130 may be implemented as a server corresponding to a computer or program that performs the method of augmenting text data through hierarchy-based word replacement according to the present invention. The data augmentation device 130 may be connected to the user terminal 110 through a wired or wireless network, and the two may exchange data with each other.

In one embodiment, the data augmentation device 130 may operate in conjunction with various external systems (or servers) while performing the method of augmenting text data through hierarchy-based word replacement according to the present invention. For example, the data augmentation device 130 may access various text documents through SNS services, portal sites, blogs, and the like, and may collect the data needed to build the learning model required for data augmentation.

The database 150 may correspond to a storage device that stores various information required during the operation of the data augmentation device 130. For example, the database 150 may store information for token separation and part-of-speech tagging of text, and may store learning algorithms and model information for building a learning model. It is not limited thereto, and may store information collected or processed in various forms while the data augmentation device 130 performs the method of augmenting text data through hierarchy-based word replacement according to the present invention.

FIG. 2 is a diagram illustrating the functional configuration of the data augmentation device of FIG. 1.

Referring to FIG. 2, the data augmentation device 130 may be implemented by executing the method of augmenting text data through hierarchy-based word replacement according to the present invention. Specifically, as shown in FIG. 4, the processing of the data augmentation device 130 may consist of two phases: Phase 1, which tokenizes the input text data, selects n tokens, and defines and classifies their parts of speech; and Phase 2, which uses WordNet to replace words hierarchically according to their parts of speech.

First, Phase 1 may correspond to a stage that (1) separates the input original text data into tokens using NLTK's tokenizer, (2) selects, from the set of separated tokens, n tokens that are not NLTK stopwords, and (3) distinguishes nouns from other parts of speech through part-of-speech tagging, which defines the part of speech of each selected word. Phase 2 may correspond to a stage that (4) extracts from WordNet a hypernym if the part of speech of a selected token is a noun, or a synonym if it is any other part of speech, and (5) replaces the tokens selected in (2) with the extracted hypernyms or synonyms, finally generating new text data.
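Steps (1) to (5) above can be sketched end to end as follows. The patent names NLTK and WordNet for these steps; to keep the example self-contained, a regular expression and four small invented lookup tables (`STOPWORDS`, `POS`, `HYPERNYMS`, `SYNONYMS`) stand in for them here, so this is an illustration of the control flow under those assumptions, not the actual implementation.

```python
import random
import re

# Invented stand-ins for the NLTK stopword list, a POS tagger, and WordNet.
STOPWORDS = {"a", "the", "is", "on"}
POS = {"bird": "NOUN", "branch": "NOUN", "small": "ADJ", "sitting": "VERB"}
HYPERNYMS = {"bird": "animal", "branch": "stem"}
SYNONYMS = {"small": ["little", "tiny"], "sitting": ["perching"]}

def augment(text, n, rng):
    """Return one augmented sentence for `text`, replacing up to n tokens."""
    # (1) Tokenize the input text.
    tokens = re.findall(r"\w+", text.lower())
    # (2) Select n non-stopword tokens at random (by position).
    pool = [i for i, t in enumerate(tokens) if t not in STOPWORDS]
    chosen = rng.sample(pool, min(n, len(pool)))
    for i in chosen:
        word = tokens[i]
        # (3) Classify the token as noun or non-noun.
        if POS.get(word) == "NOUN":
            # (4a) Nouns are replaced by a hypernym, if one is known.
            replacement = HYPERNYMS.get(word, word)
        else:
            # (4b) Other parts of speech get a randomly chosen synonym.
            replacement = rng.choice(SYNONYMS.get(word, [word]))
        # (5) Substitute the replacement into the sentence.
        tokens[i] = replacement
    return " ".join(tokens)

result = augment("A small bird is sitting on the branch", n=4, rng=random.Random(1))
```

With `n=4` every non-stopword token is replaced, so the nouns become `animal` and `stem`, `sitting` becomes `perching`, and `small` becomes either `little` or `tiny` depending on the random draw.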

In one embodiment, the data augmentation device 130 may include a token separation unit 210, a part-of-speech classification unit 230, a text augmentation unit 250, and a control unit 270. However, the data augmentation device 130 according to an embodiment of the present invention need not include all of these components at once; depending on the embodiment, some of the components may be omitted, or some or all of them may be selectively included. The operation of each component is described in detail below.

The token separation unit 210 may receive input text and separate it into a plurality of tokens. That is, the token separation unit 210 may tokenize the received text data to separate it into a plurality of tokens (S410 in FIG. 4). More specifically, text augmentation based on word replacement may require separating sentence-form text data into word units. Tokenization may correspond to the task of dividing a corpus, that is, a text data set, into meaningful units, and a token may correspond to the output produced by tokenization. The unit of a token may differ depending on the purpose; here, it is used with the same meaning as a word.

In one embodiment, the token separation unit 210 may generate the plurality of tokens by applying a tokenizer to the input text. For example, the token separation unit 210 may separate the sentences of the input text into a plurality of tokens using the tokenizer provided by NLTK. The tokens separated by the token separation unit 210 may then be passed to the part-of-speech classification unit 230 and used in the next step.
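The tokenization step can be sketched as follows. This is only an illustration: the description names NLTK's tokenizer, whereas the sketch substitutes a simple regular expression so that it runs without any NLTK data, and the function name `tokenize` is hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word and punctuation tokens.

    Assumption: a regex stand-in for NLTK's tokenizer, chosen so the
    sketch has no external dependencies.
    """
    # A word (optionally with an apostrophe suffix) or a single punctuation mark.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())


tokens = tokenize("A man standing on a white surfboard holding a long paddle.")
print(tokens)
```

The lowercasing mirrors the preprocessing the description applies before tokenizing the experimental data; a real tokenizer would handle contractions and punctuation more carefully than this regex.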

The part-of-speech classification unit 230 may select at least one candidate token from among the plurality of tokens and define and classify a part of speech for each candidate token. That is, the part-of-speech classification unit 230 may select n tokens from among the tokens separated from the input text (S420 in FIG. 4), and may perform a part-of-speech tagging operation that defines the parts of speech of the selected tokens and a part-of-speech classification operation that separates nouns from other parts of speech (S430 in FIG. 4).

In one embodiment, the part-of-speech classification unit 230 may remove the stopwords of a predefined stopword dictionary from the plurality of tokens and determine n tokens (where n is a natural number) randomly selected from among the remaining tokens as the at least one candidate token. For example, the part-of-speech classification unit 230 may receive the separated tokens from the token separation unit 210 and select as many candidate tokens as the number of words to be finally replaced; NLTK's stopword dictionary may be used for this purpose.

That is, the part-of-speech classification unit 230 may randomly select n tokens, excluding the stopwords of the predefined stopword dictionary, as shown in FIG. 5. FIG. 5 may correspond to an embodiment in which, for input data consisting of pairs of an image and its corresponding text, a final token set 530 is constructed from the input text 510 using a tokenizer and a stopword dictionary. FIG. 5 shows the result of separating the sentence 'a man standing on a white surfboard holding a long paddle' into tokens and finally selecting three tokens: 'man', 'surfboard', and 'holding'. Token selection may be performed by choosing any n tokens from among the tokens that remain after excluding stopwords, and the number of tokens n is a hyperparameter that may be set directly by the data augmentation device 130.
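The candidate selection described above can be sketched as follows. The stopword set here is a small hand-written stand-in for NLTK's stopword dictionary, and the function name `select_candidates`, the `seed` parameter, and the filtering of non-alphabetic tokens are assumptions made for illustration.

```python
import random
from typing import List, Optional

# Assumption: a tiny illustrative stopword set, not NLTK's actual list.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "and", "with", "is", "to"}


def select_candidates(tokens: List[str], n: int,
                      seed: Optional[int] = None) -> List[str]:
    """Drop stopwords and non-alphabetic tokens, then sample up to n candidates."""
    rng = random.Random(seed)
    pool = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    # Never ask for more tokens than remain after filtering.
    return rng.sample(pool, min(n, len(pool)))


tokens = ["a", "man", "standing", "on", "a", "white", "surfboard",
          "holding", "a", "long", "paddle"]
candidates = select_candidates(tokens, n=3, seed=0)
print(candidates)
```

Because n is a hyperparameter, the sketch takes it as an argument; fixing `seed` only makes the random choice repeatable.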

Meanwhile, the part-of-speech classification unit 230 may perform a part-of-speech tagging operation that defines the parts of speech of the tokens. This may be a very important operation, because the way a token is replaced may differ depending on its part of speech. The accuracy of part-of-speech tagging may vary with circumstances such as the size of the data or the number of words; for example, the accuracy of NLTK's part-of-speech tagging on an English corpus may be approximately 97%. The part-of-speech classification unit 230 may identify parts of speech using NLTK's part-of-speech tags, which show a high level of accuracy. In panel (a) of FIG. 6, 'man' is identified as 'NN' (noun), 'surfboard' as 'NN' (noun), and 'holding' as 'VBG' (verb).

In one embodiment, the part-of-speech classification unit 230 may classify each of the at least one candidate token, based on its part of speech, as either a specific part of speech or another part of speech. For example, tokens whose parts of speech have been identified may be classified into two categories: a noun category and a category for parts of speech other than nouns. That is, the specific part of speech here is the noun and the other parts of speech are those other than nouns, although the invention is of course not limited to this. The scope of nouns may be set, from NLTK's part-of-speech tag list, to 'NN' (singular noun), 'NNS' (plural noun), 'NNP' (singular proper noun), and 'NNPS' (plural proper noun). In panel (a) of FIG. 6, of the three tokens, 'man' and 'surfboard' are classified as 'Noun', and the remaining token is classified as 'Others', which denotes parts of speech other than nouns.
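The two-way classification can be sketched as follows. The tagged input is written by hand from the FIG. 6 example; in practice such (token, tag) pairs would come from a part-of-speech tagger. The function name `classify` is hypothetical.

```python
from typing import Dict, List, Tuple

# The four noun tags named in the description.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def classify(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split (token, tag) pairs into the 'Noun' and 'Others' categories."""
    groups: Dict[str, List[str]] = {"Noun": [], "Others": []}
    for token, tag in tagged:
        groups["Noun" if tag in NOUN_TAGS else "Others"].append(token)
    return groups


tagged = [("man", "NN"), ("surfboard", "NN"), ("holding", "VBG")]
print(classify(tagged))
# → {'Noun': ['man', 'surfboard'], 'Others': ['holding']}
```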

The text augmentation unit 250 may generate augmented text for the input text by changing the at least one candidate token into a replacement token that forms a hierarchical relationship according to the corresponding part of speech. In one embodiment, the text augmentation unit 250 may, according to the corresponding part of speech, replace a candidate token with one randomly selected from among a plurality of synonyms. In one embodiment, the text augmentation unit 250 may define an independent replacement rule for each part of speech and change the at least one candidate token into a replacement token according to the replacement rule for its part of speech. That is, the text augmentation unit 250 may augment the text data by replacing the words of the original text in different ways depending on their parts of speech (S440 and S450 in FIG. 4).

In one embodiment, the text augmentation unit 250 may selectively perform an operation of replacing a candidate token classified into the specific part of speech with a hypernym according to the hierarchical relationship and an operation of replacing a candidate token classified into another part of speech with a synonym. For example, tokens divided into nouns and non-nouns may be replaced based on the word association information in WordNet: a token classified as a noun may be replaced with a hypernym, and a token classified as any other part of speech may be replaced with a synonym. There may be a plurality of synonyms, in which case the text augmentation unit 250 may select one synonym per token by random sampling. Panel (b) of FIG. 6 is a hypothetical result of hierarchical word replacement according to part of speech: the nouns 'man' and 'surfboard' are replaced with their hypernyms 'person' and 'board', respectively, and the non-noun 'holding' is replaced with its synonym 'keeping'.
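The part-of-speech-dependent replacement rule can be sketched as follows. The two dictionaries are hypothetical toy lexicons standing in for WordNet lookups, seeded with the FIG. 6 example (the extra synonym 'retaining' is invented purely to exercise the random choice), and the function names are assumptions.

```python
import random
from typing import Dict, List, Optional, Tuple

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Assumption: toy lexicons standing in for WordNet.
HYPERNYMS: Dict[str, List[str]] = {"man": ["person"], "surfboard": ["board"]}
SYNONYMS: Dict[str, List[str]] = {"holding": ["keeping", "retaining"]}


def replacement_for(token: str, tag: str, rng: random.Random) -> str:
    """Nouns get a hypernym, other parts of speech get a synonym."""
    table = HYPERNYMS if tag in NOUN_TAGS else SYNONYMS
    options = table.get(token)
    # Keep the original token when the lexicon has no entry for it.
    return rng.choice(options) if options else token


def augment(tokens: List[str], tagged_candidates: List[Tuple[str, str]],
            seed: Optional[int] = None) -> str:
    """Replace every occurrence of each candidate token and rejoin the text."""
    rng = random.Random(seed)
    swaps = {tok: replacement_for(tok, tag, rng) for tok, tag in tagged_candidates}
    return " ".join(swaps.get(t, t) for t in tokens)


tokens = "a man standing on a white surfboard holding a long paddle".split()
candidates = [("man", "NN"), ("surfboard", "NN"), ("holding", "VBG")]
print(augment(tokens, candidates, seed=1))
```

One simplification to note: the sketch replaces every occurrence of a selected word, whereas an implementation that tracks token positions could replace only the selected occurrence.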

In one embodiment, when a first hypernym that forms a hierarchical relationship with a candidate token classified into the specific part of speech exists, and a second hypernym that forms a hierarchical relationship with the first hypernym also exists, the text augmentation unit 250 may replace the candidate token with each of the first and second hypernyms. For example, the noun 'desk' may form a hierarchical relationship with its first hypernym 'table', and the first hypernym 'table' may form a hierarchical relationship with the second hypernym 'furniture'. That is, hierarchical relationships may be formed repeatedly between tokens, and when a plurality of hypernyms forming a hierarchical relationship with one token exist, the text augmentation unit 250 may replace that token with any one of the plurality of hypernyms.

In addition, when the hierarchical relationship is repeated between hypernyms, the text augmentation unit 250 may independently replace the token with each of the hypernyms reached by repeating the hierarchical relationship and, if necessary, may limit the number of repetitions of the hierarchical relationship that are eligible for replacement. For example, when the number of repetitions is set to 2, and a first hypernym exists for a specific token and a second hypernym exists for the first hypernym, the text augmentation unit 250 may replace the specific token with the first hypernym and with the second hypernym, respectively. If a third hypernym forming a hierarchical relationship with the second hypernym exists, the text augmentation unit 250 may not perform replacement with the third hypernym, because the number of repetitions of the hierarchical relationship would exceed 2.
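The repetition limit on the hypernym hierarchy can be sketched as follows. The single-parent map is a hypothetical stand-in for WordNet: 'desk', 'table', and 'furniture' follow the example in the description, while the 'artifact' level is invented here only to show the cutoff. The function names and the `max_depth` parameter are assumptions.

```python
from typing import Dict, List

# Assumption: toy single-parent hypernym map, not real WordNet data.
PARENT: Dict[str, str] = {
    "desk": "table",
    "table": "furniture",
    "furniture": "artifact",
}


def hypernym_chain(token: str, max_depth: int) -> List[str]:
    """Walk up the hierarchy, collecting at most max_depth hypernyms."""
    chain: List[str] = []
    current = token
    while len(chain) < max_depth and current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain


def variants(tokens: List[str], target: str, max_depth: int = 2) -> List[str]:
    """Produce one augmented text per hypernym within the depth limit."""
    return [" ".join(h if t == target else t for t in tokens)
            for h in hypernym_chain(target, max_depth)]


print(hypernym_chain("desk", max_depth=2))
# → ['table', 'furniture']
print(variants(["a", "wooden", "desk"], "desk"))
# → ['a wooden table', 'a wooden furniture']
```

With the limit set to 2, the third-level hypernym is never reached, which matches the behavior described for the case where the repetition count is exceeded.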

The control unit 270 may control the overall operation of the data augmentation device 130 and manage the control flow or data flow among the token separation unit 210, the part-of-speech classification unit 230, and the text augmentation unit 250.

FIG. 3 is a flowchart illustrating the text data augmentation method through hierarchy-based word replacement according to the present invention.

Referring to FIG. 3, the data augmentation device 130 may receive input text through the token separation unit 210 (step S310) and separate the input text into a plurality of tokens (step S330).

The data augmentation device 130 may, through the part-of-speech classification unit 230, select at least one candidate token from among the plurality of tokens and define and classify a part of speech for each candidate token (step S350).

The data augmentation device 130 may, through the text augmentation unit 250, change the at least one candidate token into a replacement token that forms a hierarchical relationship according to the corresponding part of speech (step S370), and may generate augmented text for the input text based on the token replacement result (step S390).

Hereinafter, experimental results of applying the method according to the present invention to actual data are described with reference to FIGS. 7 to 10.

The MSCOCO data set may be used for the experiment. MSCOCO is mainly used in object recognition, segmentation, and image captioning research, and about 330,000 data items are publicly available. Here, the texts assigned five per image in the data released in 2014 may be used; text data augmentation may be applied only to the texts assigned to the roughly 80,000 training images, and not to the texts assigned to the roughly 40,000 validation images. The experimental environment may be built on Python 3.6 and the NLTK package, AttnGAN implemented in PyTorch may be used as the Text to Image synthesis model, and the experiment may proceed by applying various text augmentation methods to the AttnGAN model.

FIGS. 7 to 9 show the result of dividing the texts assigned to the roughly 80,000 training images in the MSCOCO 2014 data into tokens using the NLTK tokenizer and then selecting the set of n tokens to be replaced. FIG. 7 shows the five texts assigned to an arbitrary image in the training data, and FIG. 8 shows the result of converting the texts to lowercase and separating them into tokens using the NLTK tokenizer. Panel (a) of FIG. 9 shows the result of randomly sampling, from among the separated tokens, one token per sentence that is not among the stopwords provided by NLTK.

Panel (b) of FIG. 9 shows the result of the process corresponding to steps (3) and (4) of FIG. 4, that is, the result of distinguishing noun tokens from tokens of other parts of speech through part-of-speech identification and classification, and finding replacement words by applying a replacement criterion chosen according to the part of speech. As a result of part-of-speech tagging of the tokens to be replaced, 'slim' is defined as 'JJ' (adjective), and 'screen', 'monitor', 'keyboard', and 'desk' are defined as 'NN' (noun). Words identified as nouns may be replaced with hypernyms, and words of other parts of speech may be replaced with synonyms. Replacement words are found and extracted from WordNet, and when there are multiple synonyms or hypernyms, only one may be extracted at random. As a result, the synonym 'slender' was extracted for 'slim', and the hypernyms 'display', 'display', 'device', and 'table' were extracted for 'screen', 'monitor', 'keyboard', and 'desk', respectively.

FIG. 10 shows the result of augmenting each text into a new text by replacing the selected tokens with the extracted words. Text labeled 'Origin' is the original text, and text labeled 'Augmentation' is the text augmented by the method according to the present invention. In this way, text augmented without distorting the meaning of the image can be used in the subsequent training process.

Hereinafter, performance analysis results measured by applying the method according to the present invention to actual data are described with reference to FIGS. 11 to 13.

FIG. 11 illustrates the overall experimental process for the performance analysis. In FIG. 11, (A) is the process of performing Text to Image synthesis using text data augmented by the method according to the present invention, and (B) is the process of performing Text to Image synthesis using text data augmented by a lexical-replacement-based augmentation method that substitutes synonyms. Finally, (C) is the process of performing Text to Image synthesis using the original, unaugmented text data. The number of texts per image ultimately used for Text to Image synthesis is 10 for (A), 10 for (B), and 5 for (C), and training may be performed using the AttnGAN model. The Inception Score may be used to evaluate the quality of the images generated from text. The Inception Score is an evaluation metric commonly used to assess generated images; it is measured on the basis of the quality and diversity of the generated images, and a higher Inception Score may be interpreted as indicating better performance. The Inception Scores for the images generated by the three models are shown in FIGS. 12 and 13.
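The description does not give a formula for the Inception Score. As commonly defined, it is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal class distribution p(y). The sketch below computes it from a hand-written probability matrix; in practice the rows would come from a pretrained Inception classifier applied to the generated images, and the function name `inception_score` is hypothetical.

```python
import math
from typing import List


def inception_score(probs: List[List[float]]) -> float:
    """exp of the mean KL(p(y|x) || p(y)) over all images.

    Each row of probs is one image's class distribution p(y|x).
    """
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over images.
    marginal = [sum(row[j] for row in probs) / n for j in range(k)]
    total_kl = 0.0
    for row in probs:
        # Terms with p = 0 contribute nothing to the KL divergence.
        total_kl += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)


# Confident and diverse predictions score higher than uniform ones.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

This is why the metric reflects both quality (each row is sharply peaked) and diversity (the rows peak on different classes).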

FIG. 12 is a comparison graph showing, at 10-epoch intervals, the Inception Scores of the images generated by the three models compared in the experiment. Up to the middle epochs, the Inception Scores of model (B), which uses the existing text augmentation method, and model (C), which uses no augmentation, are somewhat higher than that of the method according to the present invention; at epoch 130, however, when training has sufficiently progressed, the model according to the present invention shows the highest Inception Score. The epoch with the highest Inception Score may differ from model to model, and a comparison of the maximum Inception Scores likewise shows the method according to the present invention performing best, as shown in FIG. 13.

The experimental results show that a model with text augmentation may outperform a model trained on the small amount of original data alone, and that, among text augmentation methods, the word-meaning-hierarchy-based text augmentation method according to the present invention may outperform the existing synonym-replacement-based text augmentation method.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the invention as set forth in the claims below.

100: Data augmentation system
110: User terminal 130: Data augmentation device
150: Database
210: Token separation unit 230: Part-of-speech classification unit
250: Text augmentation unit 270: Control unit

Claims

A text data augmentation device through hierarchy-based word replacement, comprising:
a token separation unit that receives input text and separates it into a plurality of tokens;
a part-of-speech classification unit that selects at least one candidate token from among the plurality of tokens and defines and classifies a part of speech for each candidate token; and
a text augmentation unit that generates augmented text for the input text by changing the at least one candidate token into a replacement token that forms a hierarchical relationship according to the corresponding part of speech, and that, when a plurality of hypernyms forming the hierarchical relationship with a candidate token classified into a specific part of speech exist, replaces the candidate token with one of the plurality of hypernyms,
wherein the text augmentation unit applies a limit on the number of repetitions of the replaceable hierarchical relationship when the hierarchical relationship is repeated between the plurality of hypernyms.

The device of claim 1, wherein the token separation unit generates the plurality of tokens by applying a tokenizer to the input text.

The device of claim 1, wherein the part-of-speech classification unit removes the stopwords of a predefined stopword dictionary from among the plurality of tokens and determines n tokens (where n is a natural number) randomly selected from among the remaining tokens as the at least one candidate token.

The device of claim 3, wherein the part-of-speech classification unit classifies each of the at least one candidate token, based on its part of speech, as either a specific part of speech or another part of speech other than the specific part of speech.

The device of claim 1, wherein the text augmentation unit defines an independent replacement rule for each part of speech and changes the at least one candidate token into a replacement token according to the replacement rule for the corresponding part of speech.

The device of claim 1, wherein the text augmentation unit replaces a candidate token, according to the corresponding part of speech, with one randomly selected from among a plurality of synonyms.

The device of claim 4, wherein the text augmentation unit selectively performs an operation of replacing a candidate token classified into the specific part of speech with a hypernym according to the hierarchical relationship and an operation of replacing a candidate token classified into the other part of speech with a synonym.

delete

A text data augmentation method through hierarchy-based word replacement, performed in a data augmentation device, the method comprising:
receiving, through a token separation unit, input text and separating it into a plurality of tokens;
selecting, through a part-of-speech classification unit, at least one candidate token from among the plurality of tokens and defining and classifying a part of speech for each candidate token; and
generating, through a text augmentation unit, augmented text for the input text by changing the at least one candidate token into a replacement token that forms a hierarchical relationship according to the corresponding part of speech, and, when a plurality of hypernyms forming the hierarchical relationship with a candidate token classified into a specific part of speech exist, replacing the candidate token with one of the plurality of hypernyms,
wherein the generating of the augmented text includes applying a limit on the number of repetitions of the replaceable hierarchical relationship when the hierarchical relationship is repeated between the plurality of hypernyms.

The method of claim 9, wherein the defining and classifying of the part of speech includes classifying each of the at least one candidate token, based on its part of speech, as either a specific part of speech or another part of speech other than the specific part of speech.

The method of claim 9, wherein the generating of the augmented text includes defining an independent replacement rule for each part of speech and changing the at least one candidate token into a replacement token according to the replacement rule for the corresponding part of speech.

The method of claim 10, wherein the generating of the augmented text includes selectively performing an operation of replacing a candidate token classified into the specific part of speech with a hypernym according to the hierarchical relationship and an operation of replacing a candidate token classified into the other part of speech with a synonym.