KR102416542B1

KR102416542B1 - Apparatus and method for detecting spam based on artificial intelligence

Info

Publication number: KR102416542B1
Application number: KR1020190092374A
Authority: KR
Inventors: 백성복; 김소진; 안태진; 진기범
Original assignee: 주식회사 케이티
Priority date: 2019-07-30
Filing date: 2019-07-30
Publication date: 2022-07-01
Also published as: KR20210014379A

Abstract

A method by which a spam detection apparatus detects spam, the method comprising: collecting message information from a plurality of terminals; processing the message information to generate two-dimensional training images; generating training data in which each training image is paired with the spam determination result of the message information corresponding to that image, and training a spam index calculation model on the training data by supervised learning; upon collecting new message information from an arbitrary terminal, processing the new message information to generate a two-dimensional input image; calculating a spam index for the new message using the input image and the spam index calculation model; and transmitting the calculated spam index to the arbitrary terminal.

Description

AI-based spam detection device and method

The present invention relates to spam detection technology based on artificial intelligence.

Under the current Act on Promotion of Information and Communications Network Utilization and Information Protection, etc., the Korea Internet & Security Agency classifies as spam, and prohibits, commercial advertising information that is sent unilaterally to mobile phones, landline phones, and the like against the recipient's wishes. Nevertheless, the volume of advertising spam increases every year.

Spam, that is, sending product information to mobile phones by text message or phone call, is widely used as an advertising channel because it is cost-effective. However, such spam messages tend to be sent in bulk to an unspecified number of people without the recipients' consent, which has made them a social issue.

Various methods and systems have been implemented and operated to prevent spam. Most are rule-based: they generate a rule set that can detect the transmission characteristics and patterns of spam, and they detect spam by maintaining that rule set periodically.

Rules are set by experts in spam detection on the basis of their operational know-how. As long as a rule is set accurately, spam matching the rule conditions can be detected quickly and accurately. Rules also determine deterministically whether each message is spam or not.

However, once a rule is set, it can filter out only messages that match its conditions, and it fails to detect a message as spam if the message is altered even slightly. Moreover, when hackers notice that spam detection rules are in operation, they immediately change their spamming methods, so an expert must continuously maintain the rule set.

The problem to be solved is to provide a method and system that calculate a spam index for various spam text messages and phone calls using an AI-based algorithm and detect spam using that index.

A further problem to be solved is to provide a method and system that detect spam with no fixed form, or spam-like messages, using an AI-based algorithm that adapts flexibly even when hackers modify their spam.

A method by which a spam detection apparatus according to an embodiment detects spam includes: collecting message information from a plurality of terminals; processing the message information to generate two-dimensional training images; generating training data in which each training image is paired with the spam determination result of the message information corresponding to that image, and training a spam index calculation model on the training data by supervised learning; upon collecting new message information from an arbitrary terminal, processing the new message information to generate a two-dimensional input image; calculating a spam index for the new message using the input image and the spam index calculation model; and transmitting the calculated spam index to the arbitrary terminal.

Generating the training images may include: binarizing the sender-related information included in each piece of message information and arranging the binarized information in two dimensions; dividing the two-dimensionally arranged bits into units of an arbitrary length and converting each unit into an integer; and generating a two-dimensional image by mapping each integer to a value representing a gray scale level or a color.
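As a minimal sketch of the steps just described, the following Python fragment binarizes a few sender-related fields, places one field per row, splits each row into 8-bit units, and reads each unit as an integer in the range 0 to 255, which can be used directly as a gray scale level. The field values, the 32-bit field width, the 8-bit unit length, and the names `to_bits` and `record_to_image` are all illustrative assumptions; the specification leaves the unit length arbitrary and does not fix a layout.

```python
from typing import List


def to_bits(value: int, width: int) -> str:
    """Binarize a non-negative integer as a fixed-width bit string."""
    if value < 0 or value >= 1 << width:
        raise ValueError("value does not fit in the given width")
    return format(value, f"0{width}b")


def record_to_image(fields: List[int], field_width: int = 32, unit: int = 8) -> List[List[int]]:
    """Build a 2D pixel grid: one row per field, one pixel per `unit`-bit group."""
    if field_width % unit != 0:
        raise ValueError("field_width must be a multiple of unit")
    image = []
    for value in fields:
        bits = to_bits(value, field_width)
        # Split the row into fixed-length units and convert each unit to an integer.
        row = [int(bits[i:i + unit], 2) for i in range(0, field_width, unit)]
        image.append(row)
    return image


# Hypothetical fields: part of a caller number, an industry code, an address code.
image = record_to_image([12345678, 42, 7])
print(image)  # [[0, 188, 97, 78], [0, 0, 0, 42], [0, 0, 0, 7]]
```

With an 8-bit unit every pixel is a valid gray scale level; a 24-bit unit could be mapped to an RGB color in the same way.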

The arranging in two dimensions may further include, in the two-dimensional arrangement, information on the number of messages corresponding to a specific item of the sender-related information accumulated during a reference time.

When a plurality of messages are received from a specific caller number within a certain time interval, the information on the accumulated count may include the time interval between the received messages.

The information on the accumulated count may include the number of messages received from a specific caller number during the reference time, or the number of messages received from senders at a specific address during the reference time.
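One way to maintain such accumulated counts and inter-message intervals is a sliding time window per key. The sketch below is only an illustration under stated assumptions: the class name `WindowCounter`, the 60-second reference time, and the use of plain numeric timestamps that arrive in non-decreasing order are not taken from the specification.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class WindowCounter:
    """Counts messages per key (e.g. caller number) within the last `window` seconds."""

    def __init__(self, window: float) -> None:
        self.window = window
        self.times: Dict[str, Deque[float]] = defaultdict(deque)

    def add(self, key: str, timestamp: float) -> Tuple[int, List[float]]:
        """Record a message; return (count in window, gaps between consecutive messages)."""
        q = self.times[key]
        q.append(timestamp)
        # Drop messages that fall outside the reference time.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        stamps = list(q)
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        return len(q), gaps


counter = WindowCounter(window=60.0)
counter.add("0101234", 0.0)
counter.add("0101234", 10.0)
count, gaps = counter.add("0101234", 65.0)
print(count, gaps)  # 2 [55.0]
```

The same class can be keyed by a sender address code instead of a caller number to obtain the per-address count.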

The spam index calculation model may use a convolutional neural network, and the spam index may be a probability value output by the final node of the convolutional neural network included in the spam index calculation model.

The method may further include, after calculating the spam index: identifying the caller number in the new message information and determining whether the caller number is included in a pre-stored blacklist or a pre-stored whitelist; and correcting the calculated spam index according to the determination result.

A spam detection apparatus according to another embodiment includes: a preprocessor that processes message information collected from a plurality of terminals to generate two-dimensional training images; and a model learning unit that generates training data in which each training image is paired with the spam determination result of the corresponding message information, and trains a spam index calculation model on the training data by supervised learning. Upon collecting new message information from an arbitrary terminal, the preprocessor processes the new message information to generate a two-dimensional input image. The apparatus further includes a spam index calculator that calculates a spam index for the new message using the input image and the spam index calculation model.

The spam detection apparatus may further include a post-processing unit that identifies the caller number in the new message information, determines whether the caller number is included in a pre-stored blacklist or a pre-stored whitelist, and corrects the calculated spam index according to the determination result.

The model learning unit may modify the spam index calculation model to reflect the determination result of the post-processing unit.

A method by which a spam detection apparatus according to yet another embodiment processes message information collected from a terminal includes: generating a sender information table containing the sender-related information of the message information; generating a cumulative information table containing the cumulative number, over a reference time, of messages corresponding to an arbitrary item of the sender information table; collecting new message information from an arbitrary terminal; recording the sender-related information included in the new message information in a new row of the sender information table; recording the cumulative counts, as changed by receipt of the new message information, in a new row of the cumulative information table; and binarizing the contents of the row added to the sender information table and the row added to the cumulative information table, arranging them in two dimensions, and converting the two-dimensional array into an image.
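The bookkeeping in this embodiment can be pictured as two tables that each gain one row per received message. The following sketch assumes hypothetical column names (`caller`, `address_code`, `received_at`, `by_caller`, `by_address`), integer timestamps in seconds, and a one-hour reference time; none of these come from the specification, and the final binarization of the two new rows into an image is omitted.

```python
from typing import Dict, List

sender_table: List[Dict[str, int]] = []      # one row per received message
cumulative_table: List[Dict[str, int]] = []  # counts as they stand after each message


def record_message(caller: int, address_code: int, received_at: int, window: int = 3600) -> None:
    """Append one row to each table for a newly received message."""
    sender_table.append(
        {"caller": caller, "address_code": address_code, "received_at": received_at}
    )
    # Only rows inside the reference time contribute to the cumulative counts.
    recent = [r for r in sender_table if received_at - r["received_at"] <= window]
    cumulative_table.append(
        {
            "by_caller": sum(r["caller"] == caller for r in recent),
            "by_address": sum(r["address_code"] == address_code for r in recent),
        }
    )


record_message(caller=1111, address_code=5, received_at=0)
record_message(caller=2222, address_code=5, received_at=100)
record_message(caller=1111, address_code=9, received_at=4000)
print(cumulative_table[-1])  # {'by_caller': 1, 'by_address': 1}
```

The third message arrives more than an hour after the first, so the earlier message from caller 1111 no longer counts toward `by_caller`.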

The cumulative information table may include the number of messages received from a specific caller number during the reference time, the number of messages received from senders at a specific address during the reference time, or the reception time intervals of a plurality of messages received from a specific caller number.

Converting into the image may include: dividing the bits that make up the two-dimensional array into units of an arbitrary length and converting each unit into an integer; and mapping each integer to a value representing a gray scale level or a color.

According to the present invention, a spam probability is calculated instead of a deterministic judgment, so spam that is modified without any fixed format can be detected and recipients can be offered a more precise spam detection service.

In addition, according to the present invention, a model based on artificial intelligence is used instead of fixed rules, so spam can be detected without continuous rule maintenance by experts, which improves management efficiency.

FIG. 1 is a block diagram of a spam detection system according to an embodiment.
FIG. 2 is a block diagram of a spam detection apparatus according to an embodiment.
FIG. 3 is a flowchart of a method by which a spam detection apparatus operates according to an embodiment.
FIG. 4 is a flowchart of a method by which a preprocessor generates an image from message information according to an embodiment.
FIG. 5 is an example of a table in which message information is managed according to an embodiment.
FIG. 6 is an example of a table in which message information is managed according to another embodiment.
FIG. 7 is a diagram explaining how a preprocessor arranges message information according to an embodiment.
FIG. 8 is a diagram explaining how a preprocessor generates a two-dimensional image according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and like reference numerals denote like parts throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise. In addition, terms such as "…unit", "…er", and "module" used in the specification denote a unit that handles at least one function or operation, and such a unit may be implemented in hardware, software, or a combination of hardware and software.

In this specification, a terminal is user equipment and may also be referred to as a device, user equipment (UE), mobile equipment (ME), a mobile station (MS), a mobile terminal (MT), a subscriber station (SS), a portable subscriber station (PSS), an access terminal (AT), and the like. It may include all or some of the functions of a mobile terminal, subscriber station, portable subscriber station, user equipment, and the like.

In addition, the terminal is a mobile communication terminal capable of processing SIP (Session Initiation Protocol) signaling, and may be not only a mobile phone but any communication device capable of IP-based communication, such as a personal portable terminal. For example, the terminal may be a cellular phone, a PCS (Personal Communication Service) phone, a PDA (Personal Digital Assistant) phone, a GSM (Global System for Mobile Communications) phone, a WCDMA (Wideband Code Division Multiple Access) phone, a CDMA (Code Division Multiple Access)-2000 phone, a DMB (Digital Multimedia Broadcasting) phone, an LTE (Long Term Evolution) phone, or the like.

In this specification, spam means advertising information that is sent unilaterally over an information and communications network even though the user does not want it. Spam may be delivered in various forms, such as messages and phone calls; for convenience, this specification describes it in terms of messages. The act of sending spam is called spamming, and the party that sends spam is called a spammer.

FIG. 1 is a block diagram of a spam detection system according to an embodiment.

Referring to FIG. 1, the spam detection system 1000 includes a plurality of terminals 100 and a spam detection apparatus 200. Each terminal 100 is a device capable of communicating with the spam detection apparatus 200. The spam detection apparatus 200 is implemented to perform the operations described in the present invention.

The terminal 100 receives messages, which may include spam, transmits each received message to the spam detection apparatus 200, and receives from the spam detection apparatus 200 the probability that the message is spam.

The terminal 100 includes a spam monitoring unit 110. The spam monitoring unit 110 intercepts a message received by the terminal 100, extracts message information from it, and transmits that information to the spam detection apparatus 200.

The message information extracted by the spam monitoring unit 110 may include at least one item of call detail recording (CDR) information related to telephone use, such as the caller number, the called number, call volume, call duration, the sender's industry code, and the sender's address code. The terminal 100 then receives the spam index calculated by the spam detection apparatus 200 and presents it to the user. The spam index may also be passed to an administrator (not shown) so that its accuracy can be further verified, and the performance of the learning model can be tuned on the basis of the verification result.

The spam detection apparatus 200 collects the message information received from the spam monitoring unit 110 of the user terminal 100. It then preprocesses the collected information, converting it into the form required for training a deep learning model, to generate training images, and uses artificial intelligence to build a learning model. Thereafter, when a message is sent from the spam monitoring unit 110 to the spam detection apparatus 200, the preprocessor 220 generates an input image and the learning model calculates a spam probability for that image. The post-processing unit 260 applies post-processing, such as converting the spam probability into a standardized value, to produce the spam index, which is transmitted to the spam monitoring unit 110 of the terminal 100.

FIG. 2 is a block diagram of a spam detection apparatus according to an embodiment.

Referring to FIG. 2, the spam detection apparatus 200 includes a data collection unit 210, a preprocessor 220, a model learning unit 230, a storage unit 240, a spam probability calculator 250, and a post-processing unit 260.

For model training, the data collection unit 210 collects the message information transmitted from the spam monitoring unit 110 installed in each of the plurality of user terminals 100 and passes the collected message information to the preprocessor 220.

The preprocessor 220 converts the message information received from the data collection unit 210 into the form required for model training and spam index calculation. In the present invention, it converts the information into gray scale images and passes them to the model learning unit 230 and the spam probability calculator 250.

The model learning unit 230 generates training data from the images passed by the preprocessor 220 together with an administrator's spam determination results or terminal users' spam report data, and uses the training data to train a deep learning model that outputs a spam probability. The deep learning model used in the present invention may be a convolutional neural network (CNN), but is not necessarily limited to that algorithm.

The storage unit 240 accumulates message information, that is, data, for model training. The storage unit 240 manages, in tables, the message information that arrives over time and the information that accumulates and changes with each reference time. These tables are called the sender information table and the cumulative information table, respectively, and are described in detail with reference to FIG. 5.

The storage unit 240 also stores the learning model generated by the model learning unit 230 and the spam probability data calculated by the spam probability calculator 250. In addition, the administrator's final determination results may be stored so that the performance of the learning model can be tuned or the model modified.

The spam probability calculator 250 calculates, in real time, a spam probability for the image passed from the preprocessor 220. Using the learning model held in the storage unit 240, it makes a first-pass calculation of the probability that the message is spam.

For the spam probability calculated by the spam probability calculator 250, the post-processing unit 260 may make an additional determination of whether the caller number of the message belongs to a blacklist or a whitelist, or may standardize the spam probability as a standard score (Z-score) or the like. For example, even if a message is calculated to have a high probability of being spam, if its caller number is on the whitelist, the post-processing unit 260 finally determines that the message is not spam.
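The standard score mentioned here is conventionally z = (p − μ) / σ, where μ and σ are the mean and standard deviation of a reference set of values. The sketch below assumes that the reference set is a list of previously calculated spam probabilities and that the population standard deviation is used; the specification states neither, and the function name `z_score` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def z_score(probability: float, history: List[float]) -> float:
    """Standardize a spam probability against previously calculated probabilities."""
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0  # no spread in the history, so no meaningful standard score
    return (probability - mean(history)) / sigma


history = [0.1, 0.2, 0.3, 0.4, 0.5]
score = z_score(0.5, history)
print(round(score, 4))  # 1.4142
```

A positive score means the message looks more spam-like than the messages in the reference set, and a negative score means less.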

The spam index finally produced through this correction may be transmitted to the spam monitoring unit 110 of the user terminal 100 and displayed on the screen of the terminal 100. In addition, the final determination and correction results of the post-processing unit 260 may be stored back in the storage unit 240 and used in model training.

FIG. 3 is a flowchart of a spam detection method according to an embodiment.

Referring to FIG. 3, the terminal 100 passes the message information it has received to the spam detection apparatus 200 (S101). The spam monitoring unit 110 of the user terminal 100 intercepts each message received by the terminal 100, extracts message-related information such as sender information, and transmits it to the data collection unit 210 of the spam detection apparatus 200.

The data collection unit 210 of the spam detection apparatus 200 collects the message information transmitted by the spam monitoring units 110 of the plurality of terminals 100 (S102). The collected information is used as data for generating the learning model.

The spam detection apparatus 200 preprocesses the collected message information to generate training images (S103). Hereinafter, a message processed into a two-dimensional gray scale image is referred to as an image, and the images used to generate the learning model are referred to as training images. The generated training images are stored in the storage unit 240. The way in which the preprocessor 220 converts message information into images is described in detail with reference to FIGS. 4 to 7.

The spam detection apparatus 200 generates training data from the preprocessed training images and the administrator's spam determination results, and generates a learning model from the training data (S104). The deep learning model used here is not tied to any one algorithm, but this specification assumes a method that uses a convolutional neural network model.

The performance of the trained model may also be tuned according to the determinations of the post-processing unit 260, and the final model, reflecting the correction results of the post-processing unit 260, is kept in the storage unit 240.

Thereafter, a spammer 300 sends a spam message it has created to the user terminal 100 (S105). In this specification, the type of spam received by the user terminal 100 is not limited to text messages; it may take any form in which spamming is possible, including phone calls and e-mail.

The spam monitoring unit 110 installed in the terminal 100 intercepts and analyzes the received message, extracts message-related information such as sender information, and transmits it to the spam detection apparatus 200 (S106).

The preprocessor 220 of the spam detection apparatus 200 preprocesses the message information transmitted from the terminal 100 into an input image in the form of a two-dimensional gray scale image (S107).

The spam probability calculator 250 of the spam detection apparatus 200 feeds the input image into the learning model to make a first-pass calculation of the spam probability (S108).

The convolutional neural network model used in this specification is a deep learning algorithm that combines convolution operations with a neural network. The model is described briefly below. A convolutional neural network model is broadly divided into a feature learning stage and a classification stage.

In the feature learning stage, various features of the input image can be extracted by repeating two processes: a convolution process, which applies a plurality of convolution kernels (filters) to the input image to generate feature maps, and a subsampling (pooling) process, which reduces the size or spatial resolution of the feature maps. The convolution and subsampling processes may be repeated several times depending on the size and characteristics of the input image.
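To make the two processes concrete, the sketch below applies one convolution (computed as a stride-1 cross-correlation without padding, as is usual in CNN layers) followed by non-overlapping 2×2 max pooling to a small grid. The 5×5 input, the 2×2 kernel, and the choice of max pooling are illustrative assumptions; in a trained network the kernel weights are learned, and the specification does not fix the pooling type.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding), stride-1 cross-correlation of image with kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b] for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


image = [
    [3, 0, 1, 2, 4],
    [1, 5, 2, 0, 1],
    [0, 2, 6, 1, 3],
    [4, 1, 0, 7, 2],
    [2, 3, 1, 0, 8],
]
kernel = [[1, 0], [0, -1]]  # illustrative filter, not a learned one
feature_map = convolve2d(image, kernel)  # 4x4 feature map
pooled = max_pool(feature_map)           # 2x2 after pooling
print(pooled)  # [[-1, 1], [2, 0]]
```

The 5×5 input shrinks to a 4×4 feature map and then to a 2×2 map, which is the reduction in spatial resolution the paragraph refers to.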

In the subsequent classification stage, the extracted features are fed to a fully connected layer to classify the input image. The final output layer can classify into two classes using logistic regression, or into three or more classes using the softmax function.
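The relationship between the two output options can be illustrated with a short sketch. The logit values below are arbitrary and chosen only for the example; the point is that a logistic (sigmoid) output for two classes gives the same probability as a two-way softmax whose second logit is fixed at zero.

```python
import math


def sigmoid(z):
    """Logistic function used for two-class output."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax for any number of classes."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


p_binary = sigmoid(1.3)                  # two classes via logistic regression
p_two_way = softmax([1.3, 0.0])          # same problem written as a 2-way softmax
p_three_way = softmax([2.0, 1.0, 0.1])   # three classes require softmax
```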

For example, the spam probability calculation unit 250 outputs the probability that a message is spam and the probability that it is not, so this is a two-class classification problem. The last layer of the classification stage therefore consists of two nodes, and every node of the fully connected layer is connected to each of these two nodes to compute the spam probability. The final computed values are the probability of spam and the probability of not spam.
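A minimal sketch of such a two-node output layer is shown below. The feature values, weights, and biases are hypothetical placeholders (in practice they would be learned), and the use of softmax to normalize the two nodes is an assumption consistent with the description rather than something the specification fixes.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_layer(features, weights, biases):
    """Connect every fully connected node to two output nodes.

    features: activations of the last fully connected layer
    weights:  two weight vectors, one per output node (hypothetical values)
    biases:   two bias terms, one per output node
    Returns (p_spam, p_not_spam), which sum to 1.
    """
    logits = [
        sum(w * f for w, f in zip(node_weights, features)) + b
        for node_weights, b in zip(weights, biases)
    ]
    p_spam, p_not_spam = softmax(logits)
    return p_spam, p_not_spam


features = [0.2, 0.9, 0.4]                       # hypothetical activations
weights = [[1.0, 2.0, -1.0], [-1.0, -2.0, 1.0]]  # hypothetical learned weights
biases = [0.0, 0.0]
p_spam, p_not_spam = output_layer(features, weights, biases)
```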

The post-processing unit 260 of the spam detection device 200 corrects the spam probability using additional information to produce the final spam index (S109). Starting from the spam probability computed by the spam probability calculation unit 250, the post-processing unit 260 may raise the spam probability to generate the spam index when the message comes from a sender number on the blacklist, and may lower it when the message comes from a sender number on the whitelist. These corrections may be fed back into the learning model to revise the algorithm. For user convenience, the spam probability may also be converted into a standardized score.
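The correction step can be sketched as follows. The specification only states that the probability is raised for blacklisted numbers and lowered for whitelisted ones; the additive adjustment `delta`, the clamping to the range 0 to 1, and the 0 to 100 score scale are all assumptions introduced for this illustration, as are the sender labels.

```python
def correct_spam_probability(p_spam, sender, blacklist, whitelist, delta=0.2):
    """Raise or lower the model's spam probability based on sender lists.

    delta is a hypothetical adjustment size; the result is clamped to [0, 1].
    """
    if sender in blacklist:
        return min(1.0, p_spam + delta)
    if sender in whitelist:
        return max(0.0, p_spam - delta)
    return p_spam


def to_score(probability, scale=100):
    """Convert a probability into an integer score (assumed 0 to 100 scale)."""
    return round(probability * scale)


blacklist = {"sender-A"}   # hypothetical sender numbers
whitelist = {"sender-B"}

raised = correct_spam_probability(0.5, "sender-A", blacklist, whitelist)
lowered = correct_spam_probability(0.5, "sender-B", blacklist, whitelist)
unchanged = correct_spam_probability(0.5, "sender-C", blacklist, whitelist)
```

Other correction schemes (for example multiplicative scaling) would satisfy the description equally well.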

The spam detection device 200 transmits the calculated spam index to the spam monitoring unit 110 of the terminal 100 (S110). At this point, the spam detection device 200 may also transmit the spam index to an administrator, who may tune the model's performance according to the relationship between the spam index and the spam messages.

The terminal 100 then displays the spam index on its screen to notify the user (S111). For example, the terminal 100 may display the spam index together with the message, or next to the sender number that sent the spam message.

The following describes, with examples, how the preprocessing unit 220 generates learning images and input images from the message information, and what the generated images look like.

FIG. 4 is a flowchart showing how the preprocessing unit generates an image from message information according to one embodiment. FIG. 5 is an example of a table in which message information is managed according to one embodiment. FIG. 6 is an example of a table in which message information is managed according to another embodiment. FIG. 7 is an explanatory diagram showing how the preprocessing unit arranges message information according to one embodiment. FIG. 8 is an explanatory diagram showing how the preprocessing unit generates a two-dimensional image according to one embodiment.

Referring to FIG. 4, the preprocessing unit 220 processes the message information to generate a sender information table containing sender-related information (S210). The sender information table arranges, in tabular form, the sender-related items among the message information delivered from the spam monitoring unit 110 installed in the user terminal 100 to the spam detection device 200.

Referring to FIG. 5, the fields of the sender information table may include the sending number, the transmission time of the message, sender information, the sender's business type, the sender's address, and the like. When the information received by the terminal 100 is not in the form of a message, the table may also include a field for entering the transmission type.

The preprocessing unit 220 generates a cumulative information table containing the cumulative number of messages that match a given item of the sender information during a reference time (S220). The cumulative information table accumulates the events that occurred at each terminal 100 in chronological order and arranges the accumulated values in tabular form; it can be used to improve the model's accuracy by including more varied information in the input image.
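Counting messages per sender over several reference times might look like the sketch below. The row layout `(sender, timestamp_in_seconds)`, the sender labels, and the windows of 60, 180, and 3600 seconds (1, 3, and 60 minutes) are assumptions chosen for illustration, not the table format of the specification.

```python
from collections import Counter


def cumulative_counts(rows, now, windows=(60, 180, 3600)):
    """Count messages per sender inside each reference window.

    rows:    iterable of (sender, timestamp_in_seconds), oldest first
    now:     current time in seconds
    windows: reference times in seconds (assumed 1, 3, and 60 minutes)
    Returns {window: Counter({sender: count})}.
    """
    counts = {w: Counter() for w in windows}
    for sender, timestamp in rows:
        age = now - timestamp
        for w in windows:
            if 0 <= age < w:
                counts[w][sender] += 1
    return counts


# Hypothetical sender information rows.
rows = [
    ("sender-A", 1),
    ("sender-A", 3500),
    ("sender-A", 3590),
    ("sender-B", 3595),
]
counts = cumulative_counts(rows, now=3600)
```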

Referring to FIG. 6, the cumulative information table may be generated from the information recorded in the sender information table or from newly collected message information. For example, fields may be created that indicate how many messages were received from a specific sender number during a reference time (1 minute, 3 minutes, 60 minutes, etc.), and the corresponding cell may be marked with 0 or 1, O or X, or Yes or No; the mark may be any predefined letter, number, or symbol.

For example, FIG. 6(a) turns the cumulative number of messages that the terminal 100 received from a specific sender number over a given period into fields, and FIG. 6(b) does the same for the cumulative number of messages that the terminal 100 received from a sender at a specific address over a given period. The reference times are set to 1 minute, 3 minutes, and 60 minutes, but they may be changed by the user or administrator, and the count ranges that separate the columns may also be changed.

FIG. 6(c) is a format that turns the time intervals between messages previously collected from the same sender number into fields. When a plurality of messages are received from a specific sender number, the differences between their reception times are used as learning data. This makes it possible to determine whether a message is spam sooner than rule-based or deep-learning-based spam detection methods that rely on cumulative counts. Spam can therefore be blocked quickly, reducing the damage it causes.

In FIG. 6(c), the time-interval boundaries are set to 1 second, 5 seconds, 10 seconds, and 60 seconds for three messages, but the number of messages for which intervals are computed and the interval boundaries may be set differently or dynamically.

Meanwhile, the spam collected in the present invention is not limited to messages; for spam calls, FIG. 6(c) may be adapted to represent the time intervals between calls received from the same caller number.

When the preprocessing unit 220 receives new message information from the data collection unit 210, it extracts the sender-related information and records it in the sender information table (S230). The preprocessing unit 220 manages the message information received from the terminal 100 by entering each item into the sender information table in chronological order. The most recently received message information is therefore entered in the bottom row of the sender information table.

When the preprocessing unit 220 receives new message information from the data collection unit 210, it records the updated cumulative count in the cumulative information table (S240). For example, when a total of 70 messages have been received from a specific sender number within 3 minutes as a result of collecting new message information, a 1 or any other mark may be entered in the cell labeled "60~89" under "3-minute cumulative count" in the cumulative information table shown in FIG. 6(a).
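The 70-message example can be reproduced with a small sketch. Only the "60~89" range comes from the text above; the other bucket boundaries are hypothetical, since the columns of FIG. 6(a) are not listed here.

```python
def mark_bucket(count, buckets):
    """Return one 0/1 flag per bucket; an upper bound of None is open-ended."""
    flags = []
    for low, high in buckets:
        inside = count >= low and (high is None or count <= high)
        flags.append(1 if inside else 0)
    return flags


# Assumed column ranges for the "3-minute cumulative count" field.
# Only (60, 89) is taken from the description; the rest are placeholders.
THREE_MINUTE_BUCKETS = [(0, 29), (30, 59), (60, 89), (90, None)]

flags = mark_bucket(70, THREE_MINUTE_BUCKETS)  # 70 messages in 3 minutes
```

Exactly one cell in the field is set, which is why each cumulative item needs only a single bit when it is later binarized.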

As another example, suppose three messages are received from the same sender number, and, relative to the current message, the interval to the immediately preceding message is 0.7 seconds and the interval to the message before that is 4 seconds. A 1 or any other mark may then be entered in the cell labeled "less than 1 second" under "interval to the (n-1)th message" and in the cell labeled "1 to less than 5 seconds" under "interval to the (n-2)th message" in the cumulative information table shown in FIG. 6(c).
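The interval example can be sketched as follows, using the boundaries of 1, 5, 10, and 60 seconds mentioned for FIG. 6(c). The final open-ended bucket (60 seconds or more), the function names, and the sample timestamps are assumptions for illustration.

```python
from bisect import bisect_right


def interval_flags(interval_seconds, bounds=(1, 5, 10, 60)):
    """One-hot flags for: <1s, 1 to <5s, 5 to <10s, 10 to <60s, >=60s."""
    flags = [0] * (len(bounds) + 1)
    flags[bisect_right(bounds, interval_seconds)] = 1
    return flags


def previous_intervals(timestamps, lookback=2):
    """Intervals from the newest message to each of the `lookback`
    messages before it, nearest first."""
    current = timestamps[-1]
    earlier = timestamps[-1 - lookback:-1]
    return [current - t for t in reversed(earlier)]


# Hypothetical arrival times (seconds) of three messages from one sender:
# the newest is 0.7 s after the previous one and 4 s after the one before.
timestamps = [96.0, 99.3, 100.0]
intervals = previous_intervals(timestamps)
flags = [interval_flags(i) for i in intervals]
```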

The preprocessing unit 220 binarizes the content added to the sender information table and the cumulative information table and arranges it in two dimensions (S250).

The information in the sender information table is in numeric or character form, whereas the information in the cumulative information table only indicates whether a given cell applies. The binarized form of each item in the sender information table will therefore have a longer bit length than that of each item in the cumulative information table. Accordingly, a rectangular shape can be formed by first arranging the items of the sender information table, which have long bit lengths, and then arranging the items of the cumulative information table, which have short bit lengths.
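One possible way to binarize the two tables and lay the bits out as a rectangle is sketched below. The specification does not fix an encoding or a row width; UTF-8 encoding of each field, a fixed row width, and zero padding of the last row are assumptions made for this illustration.

```python
def text_to_bits(value):
    """Binarize a field as the bits of its UTF-8 bytes (assumed encoding)."""
    return "".join(f"{byte:08b}" for byte in str(value).encode("utf-8"))


def arrange_rows(sender_fields, cumulative_flags, width):
    """Place the long sender fields first, then the one-bit cumulative flags,
    and cut the bit stream into rows of equal width (zero-padded)."""
    bits = "".join(text_to_bits(v) for v in sender_fields)
    bits += "".join(str(int(flag)) for flag in cumulative_flags)
    bits += "0" * ((-len(bits)) % width)
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# Hypothetical one-character sender field and three cumulative flags.
rows = arrange_rows(["A"], [1, 0, 1], width=8)
```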

Although FIG. 7 assumes one sender information table and one cumulative information table, more than one cumulative information table may be used. The sender information table shown in FIG. 5 and all of the cumulative information tables shown in FIG. 6 may be used together.

The preprocessing unit 220 divides the bits of the two-dimensional array into units of a preset length and converts each bit unit into an integer (S260). Each item arranged in the rectangle is expressed in binary as 0s and 1s; the arranged bits are divided into units of a specific length, and each unit is then converted into a decimal integer.
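The splitting and conversion step can be sketched as follows, assuming each row of the array is held as a string of '0' and '1' characters and that the row length is a multiple of the unit length. The sample rows are arbitrary.

```python
def bits_to_integers(rows, unit=8):
    """Split each row of bits into `unit`-bit groups and convert every
    group to a decimal integer."""
    result = []
    for row in rows:
        if len(row) % unit != 0:
            raise ValueError("row length must be a multiple of the unit length")
        result.append([int(row[i:i + unit], 2) for i in range(0, len(row), unit)])
    return result


rows = ["0000000011111111",
        "0101010100000001"]
integers = bits_to_integers(rows, unit=8)   # 8-bit units give values 0..255
single_bits = bits_to_integers(["0110"], unit=1)
```

With a unit length of 1 the integers are simply the bits themselves, which corresponds to a two-level black-and-white image.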

The preprocessing unit 220 generates a two-dimensional image by mapping each converted integer to a value representing color information (S270). The value representing color information may be a grayscale value representing brightness or one of the RGB values.

For example, the two-dimensional array generated in step S250 consists of 0s and 1s. If the arranged bits are divided into units of length 1 and each unit is converted into an integer, a 0 bit remains 0 in decimal and a 1 bit becomes 1. Mapping 0 to black and 1 to white therefore produces a black-and-white image with two brightness levels.

As another example, referring to FIG. 8, assume that the rectangle generated as in FIG. 7 is divided into 8-bit units. Each bit unit then takes a value from '00000000' to '11111111'. When each unit is converted into decimal, '00000000' becomes 0, '11111111' becomes 255, and '01010101' becomes 85. That is, one bit unit corresponds to a number from 0 to 255, and mapping 0 to black and 255 to white yields a grayscale image with 2^8, that is, 256, brightness levels.
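The mapping from 8-bit units to gray levels can be made concrete by writing the pixels in the plain-text PGM (P2) image format, where 0 is black and the stated maximum value is white. The choice of PGM as an output format and the sample bit units are assumptions for illustration; the specification does not name an image file format.

```python
def to_pgm(pixels, max_value=255):
    """Serialize a 2D list of gray levels as a plain-text PGM (P2) image."""
    height = len(pixels)
    width = len(pixels[0])
    lines = ["P2", f"{width} {height}", str(max_value)]
    for row in pixels:
        if len(row) != width:
            raise ValueError("all rows must have the same length")
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Four 8-bit units arranged as a 2x2 image.
units = [["00000000", "11111111"],
         ["01010101", "00000001"]]
pixels = [[int(unit, 2) for unit in row] for row in units]
pgm_text = to_pgm(pixels)
```

The resulting text can be saved with a `.pgm` extension and opened in most image viewers to inspect the generated pattern.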

In addition, because the present invention uses an artificial-intelligence-based model rather than fixed rules, spam can be detected without continuous rule maintenance by experts, which improves management efficiency.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A method by which a spam detection device detects spam, the method comprising:
collecting message information from a plurality of terminals;
generating two-dimensional learning images by processing the message information;
generating learning data in which each learning image is paired with the spam determination result of the message information corresponding to that learning image;
performing supervised learning of a spam index calculation model using the learning data;
upon collecting new message information from an arbitrary terminal, generating a two-dimensional input image by processing the new message information;
calculating a spam index for the new message using the input image and the spam index calculation model;
checking the sender number in the new message information and determining whether the sender number is included in a blacklist or a whitelist;
correcting the calculated spam index according to the determination result; and
transmitting the corrected spam index to the arbitrary terminal,
wherein the correcting of the spam index comprises
correcting to a spam index higher than the spam index calculated using the spam index calculation model when the sender number is included in the blacklist, and correcting to a spam index lower than the spam index calculated using the spam index calculation model when the sender number is included in the whitelist.

The spam detection method of claim 1,
wherein the generating of the learning images comprises:
binarizing the sender-related information included in each item of message information, and arranging the binarized information in two dimensions;
dividing the two-dimensionally arranged bits into units of arbitrary length, and converting the divided bit units into integers; and
generating a two-dimensional image by mapping the converted integers to values representing gray scale or color.

The spam detection method of claim 2,
wherein the arranging in two dimensions
further includes, in the two-dimensional arrangement, information on the cumulative number of messages that correspond to a specific item of the sender-related information during a reference time.

The spam detection method of claim 3,
wherein the information on the cumulative number
includes, when a plurality of messages are received from a specific sender number within a predetermined time interval, the time intervals between the received messages.

The spam detection method of claim 3,
wherein the information on the cumulative number
includes the cumulative number of messages received from a specific sender number during the reference time, or the cumulative number of messages received from a sender at a specific address during the reference time.

The spam detection method of claim 1,
wherein the spam index calculation model uses a convolutional neural network, and
the spam index is a probability value computed at the final node of the convolutional neural network included in the spam index calculation model.

(Deleted)

A spam detection device comprising:
a preprocessing unit that generates two-dimensional images by processing message information collected from a plurality of terminals;
a model learning unit that receives, from the preprocessing unit, learning images for training a spam index calculation model, generates learning data in which each learning image is paired with a spam determination result, and performs supervised learning of the spam index calculation model using the learning data;
a spam index calculation unit that receives, from the preprocessing unit, a two-dimensional input image generated for new message information, and calculates a spam index for the new message using the input image and the spam index calculation model; and
a post-processing unit that, when the sender number in the new message information is included in a blacklist, corrects the spam index to a spam index higher than the spam index calculated using the spam index calculation model, and, when the sender number is included in a whitelist, corrects the spam index to a spam index lower than the spam index calculated using the spam index calculation model.

(Deleted)

The spam detection device of claim 8,
wherein the model learning unit
modifies the spam index calculation model by reflecting the determination result of the post-processing unit.

A method by which a spam detection device processes message information collected from terminals, the method comprising:
generating a cumulative information table for a specific sender number using message information sent from the specific sender number;
collecting new message information from an arbitrary terminal;
recording the sender-related information included in the new message information in a sender information table, and updating the cumulative information table corresponding to the sender number of the new message information; and
arranging in two dimensions the cells of the sender information table for the new message information and of the cumulative information table updated by the new message information, and converting the two-dimensional array into an image.

The message information processing method of claim 11,
wherein the cumulative information table
includes the cumulative number of messages received from a specific sender number during a reference time, the cumulative number of messages received from a sender at a specific address during the reference time, or the reception time intervals of a plurality of messages received from a specific sender number.

The message information processing method of claim 11,
wherein the converting into the image comprises:
dividing the bits constituting the two-dimensional array into units of arbitrary length, and converting the divided bits into integers; and
mapping the converted integers to values representing gray scale or color.