RU2698916C1

RU2698916C1 - Method and system of searching for relevant news

Info

Publication number: RU2698916C1
Application number: RU2019107328A
Authority: RU
Inventors: Федор Борисович Федоров; Александра Евгеньевна Липачева; Владимир Алексеевич Кузнецов; Роман Владиславович Черкасов
Original assignee: Публичное Акционерное Общество "Сбербанк России" (Пао Сбербанк)
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2019-09-02
Also published as: WO2020185110A1; EA201990538A1; EA038241B1

Abstract

FIELD: information technology. SUBSTANCE: the invention relates to the field of information technology. The technical result is achieved by: receiving a set of news items from a news aggregator server on a management server; analyzing the received set on the management server, the analysis including lemmatization of the text of each news item in said set; processing the lemmas of the news texts with a machine learning model that contains a set of company data and a list of events, wherein a predefined set of lemmas is established for each event in the machine learning model; identifying news items containing lemmas that identify the specified events and forming a link between the detected events and at least one company; and generating a list of relevant news items based on the analysis of the news set. EFFECT: formation of a linked set of information from news sources, grouped by the companies that are the objects of the news and by the specified event types. 16 cl, 5 dwg

Description

FIELD OF TECHNOLOGY

[0001] This technical solution relates generally to the field of information technology, and in particular to search mechanisms designed to identify relevant information from heterogeneous data sources.

BACKGROUND

[0002] At present, data mining is an important component of various business areas, especially analytics and forecasting. Publicly available Internet resources, for example news resources (websites, messenger channels, and the like), are often the source of data on topics of interest.

[0003] In data analysis, the main problem is aggregating an array of news sources, in particular linking actual events to companies for the purposes of subsequent search. At present there are generally no effective means of filtering collected news content to create aggregated arrays of information linked to news objects, for example companies.

[0004] Various data collection algorithms are known from the prior art, for example the solution described in application WO 1999005614 (author: Louis Gay et al., published February 4, 1999), which aggregates data from multiple sources and tracks the retrospective relevance of the collected data.

[0005] Patent RU 2382401 (patent holder: Microsoft Corporation, published February 20, 2010) discloses an approach for analyzing and comparing collections of documents, in which documents can be organized into groups by content or source and analyzed for inter-group and intra-group differences and commonalities. For example, comparing two groups of documents on the same topic obtained from two different sources, such as news coverage of an incident in different parts of the world, can reveal interesting differences in opinion and in the general interpretation of situations. By moving from static collections to sets of articles generated over time, the development of the content can be examined. For example, a stream of news articles on a common story can be examined over time in order to highlight genuinely informative fresh news and to filter out the many articles that largely convey "almost the same thing".

[0006] A common drawback of existing approaches is the lack of a method for identifying relevant news with respect to a news object, for example a company, and the corresponding event associated with it, which prevents efficient collection of relevant information from multiple data sources.

SUMMARY OF THE INVENTION

[0007] The technical problem addressed by the claimed approach is to provide a process for searching for and forming a set of news items linked to a given set of company names, as news objects, and to events about which information appears in open data sources.

[0008] The technical result achieved in solving the above technical problem is the formation of a linked set of information from news sources, grouped by the companies that are the objects of the news and by the specified event types.

[0009] An additional technical result is increased accuracy in identifying information about companies for a given event type in publicly available information sources.

[0010] The stated technical result is achieved by a computer-implemented method for searching for relevant news, in which:

a set of news items is received on a management server from at least one news aggregator server;

the received set of news items is analyzed on the management server, the analysis including

lemmatization of the text of each news item in said set;

processing of the resulting lemmas of the news texts with a machine learning model that contains an established set of company data and a list of events, wherein a predefined set of lemmas is established for each event in the machine learning model;

identification of news items containing lemmas that identify the specified events, and formation of a link between the identified events and at least one company;

a list of relevant news items is formed based on the analysis performed.

[0011] In one particular embodiment of the method, duplicate news items are filtered out when the set of news items is received.

[0012] In another particular embodiment of the method, the filtering is performed by computing the Jaccard measure between news signatures.

[0013] In another particular embodiment of the method, the events assigned to news about a company are stored in a database.

[0014] In another particular embodiment of the method, the news aggregator updates the list of news items using information channels.

[0015] In another particular embodiment of the method, the information channels are Internet websites and/or messenger channels.

[0016] In another particular embodiment of the method, the news analysis determines whether an event belongs to the parent company or to a subsidiary.

[0017] In another particular embodiment of the method, the affiliation of a company mentioned in a news item is determined using a decision tree algorithm.

[0018] In another particular embodiment of the method, during lemmatization the news texts are cleaned of punctuation marks, stop words, and named entities.

[0019] In another particular embodiment of the method, during lemmatization a statistical measure is calculated for each lemma of the news text.

[0020] In another particular embodiment of the method, the machine learning algorithm is a logistic regression that classifies whether a news item belongs to an event based on analysis of the statistical measure of the lemmas.

[0021] In another particular embodiment of the method, frequent phrases of 2 to 10 lemmas in length are identified for each news text.

[0022] In another particular embodiment of the method, the machine learning algorithm is gradient boosting trained to classify an event based on the number of sentences containing lemmas that identify the event from the search query.

[0023] In another particular embodiment of the method, after an event from a news item has been assigned to a company, the lemmas and/or the sentences containing lemmas that identify said event are highlighted.

[0024] In another preferred embodiment of the claimed solution, a system for searching for relevant news is provided, comprising at least one processor and at least one memory containing machine-readable instructions which, when executed by the at least one processor, carry out the above method.

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] The features and advantages of this technical solution will become apparent from the following detailed description and the accompanying drawings, in which:

[0026] FIG. 1 illustrates the interaction of the elements of the claimed solution.

[0027] FIG. 2 illustrates the general process of carrying out the method.

[0028] FIG. 3 illustrates the process of text data processing.

[0029] FIG. 4 shows an example of a graphical user interface for interacting with the relevant news selection service.

[0030] FIG. 5 illustrates a general view of a computing device.

DETAILED DESCRIPTION OF THE INVENTION

[0031] FIG. 1 shows the general computing architecture (100) of the presented solution. The main functionality for collecting and processing information is performed on the management server (110), which receives, over a data channel, information from the news aggregator server (120), which is in turn connected via the Internet (150) to a plurality of news resources (130). The server (110) interacts with users (10) to display the collected news information and provides additional functionality that is disclosed later in the application.

[0032] The data channel between the management server (110) and the news aggregator server (120) may be the Internet or an intranet. The news aggregator server (120) may consist of several devices belonging to different network environments, for example a collection of servers, routers, clusters, and the like. The data channel may be organized using various known data transfer protocols, both wired and wireless, for example TCP/IP, 802.11, Ethernet, FTP, and others, forming various kinds of networks, in particular LAN, WAN, PAN, WLAN, and the like.

[0033] The management server (110) performs the main processing of information received from the news aggregator server (120), and stores and prepares data for display to users (10). The information may be displayed through a dedicated graphical user interface. Users (10) may interact with the management server (110) through a web portal or another type of software application that provides access to the aggregated news information. Access may be provided, for example, via an API. Users (10) may interact using various electronic devices, for example a computer, laptop, smartphone, tablet, game console, smart wearable device, thin client, or an augmented, mixed, or virtual reality device.

[0034] The news aggregator server (120) is connected via the Internet (150) to various information resources (130), or information channels, that provide news information. Such resources (130) may be, for example, websites, messenger channels (Telegram™, WhatsApp™, Viber™, and others), and social networks (Facebook™, Вконтакте™, and the like). The received information may be saved on the server (110) in JSON format in a data store, for example a database. The source of the news information and the date of its publication on the corresponding resource (130) may also be recorded.
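As an illustration of the JSON storage mentioned above, the following minimal sketch builds one stored record that keeps the source and the publication date alongside the text. The field names (`text`, `source`, `published_at`, `companies`) and the helper `make_record` are hypothetical; the description does not specify a schema.

```python
import json
from datetime import datetime, timezone


def make_record(text, source, published_at, companies):
    """Build a JSON-serializable news record (hypothetical schema)."""
    return {
        "text": text,
        "source": source,                          # resource (130) the item came from
        "published_at": published_at.isoformat(),  # publication date on that resource
        "companies": list(companies),              # companies mentioned in the text
    }


record = make_record(
    text="Court freezes accounts of Acme.",
    source="https://news.example/item/1",
    published_at=datetime(2019, 3, 14, 9, 30, tzinfo=timezone.utc),
    companies=["Acme"],
)
serialized = json.dumps(record, ensure_ascii=False)
```

Storing the date as an ISO 8601 string keeps the record sortable by publication time without a custom JSON encoder.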

[0035] FIG. 2 shows the general process of the claimed method of searching for relevant news information (200). Information from news sources, collected and stored on the news aggregator server (120), is transmitted (201) to the management server (110). The information may be transmitted in online or offline mode. In online mode, data from the Internet (150) is transmitted as soon as it appears on a web resource to which the news aggregator server (120) is connected. In offline mode, news items are stored on the news aggregator server (120), for example in a database, and are transmitted to the management server (110) at a set time (for example, every hour or once a day) or on request from the management server (110).

[0036] Data from the news aggregator server (120) may be transmitted in various formats, for example XML, HTML, TXT, and the like. The data format may also vary depending on the mode in which information is transferred to the management server (110). In addition to the news text itself, the data contains information about the companies mentioned in the text.

[0037] The management server (110) holds a generated list containing the names of companies (2021) and events (2022) against which incoming news information from the news aggregator server (120) is analyzed. This data is stored in the database of the management server (110). Events may include, for example, seizure or freezing of a company's accounts, bankruptcy of a company, lawsuits against a company, a collapse or rise in share price, and the like. The list of events (2022) and companies (2021) may be updated or changed over time.

[0038] Relevant information is found in the data received from the news aggregator server (120) by processing (202) the received data array with a machine learning model that is trained to search the text for company names (2021) and the corresponding events (2022) and to output a judgment on the relevance of the information. Data processing on the server (110) is performed when a new data array is received from the news aggregator server (120), or according to a predefined scenario. The scenario may be an automatic script that activates the machine learning model for data processing (202) at a set time.

[0039] When the processing step (202) is performed, the information store of the management server (110) is accessed; it contains the data from news sources (130) received from the news aggregator server (120). The information stored on the management server (110) is processed (202) with the machine learning model to identify relevant data and to link data (203) from the news items to the corresponding event types.

[0040] FIG. 3 shows the process (300) of processing the news data received from the news aggregator server (120), which is carried out during steps (202) and (203). In the first step (301), the news text data received from the news aggregator server (120) undergoes lemmatization, during which the body text of each news item is split into lemmas. The news text and the metadata are extracted from the received files.

[0041] During text lemmatization (301), the body of the news item is split into words at all punctuation delimiters, after which each word is reduced to its normal form, for example using the pymorphy2 library. The text is then transformed; in particular, it is cleaned of punctuation marks, stop words (prepositions, conjunctions, pronouns), and named entities. A named entity here is any word that begins with a capital letter and is not the first word of a sentence. N-gram extraction (https://ru.wikipedia.org/wiki/N-грамма) may also be performed, in which the most frequent phrases of 2 to 10 lemmas are identified in the text. The list of most frequent phrases was obtained by automatic analysis of a large text corpus and contains more than 9 million entries.
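The cleaning steps of paragraph [0041] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `STOP_WORDS` is a tiny hypothetical list, the `normalize` argument stands in for a real lemmatizer such as pymorphy2 (lowercasing is used here so the sketch has no dependencies), and phrases are counted within a single text, whereas the description matches against a precomputed list of frequent phrases.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a production list would be much larger.
STOP_WORDS = {"the", "a", "of", "in", "and", "it", "its", "on"}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def clean_tokens(text, normalize=str.lower):
    """Tokenize and drop punctuation, stop words, and 'named entities'.

    Following the heuristic in the description, a named entity is any
    capitalized word that is not the first word of its sentence.
    """
    tokens = []
    for sentence in split_sentences(text):
        words = re.findall(r"\w+", sentence)  # punctuation is discarded here
        for position, word in enumerate(words):
            if position > 0 and word[0].isupper():
                continue  # treated as a named entity
            lemma = normalize(word)
            if lemma in STOP_WORDS:
                continue
            tokens.append(lemma)
    return tokens


def count_ngrams(tokens, n_min=2, n_max=10):
    """Count every phrase of n_min..n_max consecutive lemmas."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


tokens = clean_tokens("The court froze accounts of Acme. The court froze assets.")
phrases = count_ngrams(tokens)
```

Passing a function that returns the pymorphy2 normal form as `normalize` would replace the lowercasing stand-in without changing the rest of the pipeline.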

[0042] Incoming news items also undergo a deduplication procedure, in which repeated news items are filtered out. During deduplication, a MinHash signature is computed for each news item (see https://en.wikipedia.org/wiki/MinHash), after which the similarity of the signatures is computed for each pair of news items using the Jaccard measure (also called the Jaccard coefficient). If the similarity of a pair of news items exceeds a given threshold, for example 0.7, the shorter news item of the pair is considered a duplicate and is not processed further.
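A minimal sketch of this deduplication step is shown below. The signature length (64), the shingle size (3 tokens), and the use of seeded MD5 hashes are assumptions chosen for illustration; the description fixes only the MinHash signature, the Jaccard measure, the example threshold of 0.7, and the rule that the shorter item of a pair is dropped.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length


def shingles(tokens, size=3):
    """Set of overlapping token windows; short texts yield one shingle."""
    if len(tokens) < size:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def _seeded_hash(seed, shingle):
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum over all shingles."""
    items = shingles(tokens)
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(news, threshold=0.7):
    """news: list of token lists. Returns the indices of items to keep."""
    signatures = [minhash_signature(tokens) for tokens in news]
    dropped = set()
    for i in range(len(news)):
        for j in range(i + 1, len(news)):
            if i in dropped or j in dropped:
                continue
            if estimated_jaccard(signatures[i], signatures[j]) > threshold:
                # The shorter item of the pair is treated as the duplicate.
                dropped.add(i if len(news[i]) < len(news[j]) else j)
    return [k for k in range(len(news)) if k not in dropped]
```

The pairwise loop is quadratic in the number of news items; for large batches, locality-sensitive hashing over signature bands is the usual way to avoid comparing every pair.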

[0043] In the next step (302), after the news texts have been lemmatized, the normalized text is processed. The news text is searched for the names of companies that have no homonyms (for example, «Сбербанк™»). All capitalized phrases and all phrases in quotation marks are found, after which the lemmas of the found phrases are looked up in the list of companies stored in the database of the server (110).
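The lookup in paragraph [0043] might be sketched like this. The company set, the two regular expressions, and the lowercasing `normalize` stand-in for lemmatization are all hypothetical; the description states only that capitalized and quoted phrases are collected and their lemmas are matched against the stored company list.

```python
import re

# Hypothetical company list, keyed by normalized name.
COMPANIES = {"sberbank", "acme holdings"}

# Phrases inside straight, curly, or angle quotation marks.
QUOTED = re.compile(r'[«"“]([^»"”]+)[»"”]')
# Runs of one or more consecutive capitalized words (Latin or Cyrillic).
CAPITALIZED = re.compile(r"\b[A-ZА-ЯЁ][\w-]*(?:\s+[A-ZА-ЯЁ][\w-]*)*")


def candidate_names(text):
    """Collect quoted phrases first, then capitalized phrases."""
    found = [m.group(1) for m in QUOTED.finditer(text)]
    found += [m.group(0) for m in CAPITALIZED.finditer(text)]
    return found


def find_companies(text, companies=COMPANIES, normalize=str.lower):
    """Return known companies mentioned in the text, without repeats."""
    hits = []
    for candidate in candidate_names(text):
        key = normalize(candidate.strip())
        if key in companies and key not in hits:
            hits.append(key)
    return hits


mentions = find_companies('Shares of Sberbank rose after "Acme Holdings" filed a claim.')
```

Sentence-initial words such as "Shares" are also picked up as candidates, but they are discarded at the lookup stage because they do not appear in the company list.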

[0044] The company names found are classified as a "main" or an "additional" company (that is, one mentioned only indirectly in the news text). A company is considered "main" if it is the subject of the news item and "additional" if its name is merely mentioned in the body. The classification is performed with a machine learning model, in particular a decision algorithm, for example decision trees. The features of the decision tree are as follows:

1) The number of the sentence in which the company is first mentioned (0 if it is the headline);

2) The number of the sentence in which the company is first mentioned, normalized by the number of sentences;

3) The length of the text in characters.

The classification threshold is 0.5.
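The three features listed above could be computed as in the following sketch. The sentence splitter, the choice to normalize by the number of body sentences, and the function names are assumptions; the trained decision tree itself is not shown, and `is_main_company` only illustrates applying the 0.5 threshold to a probability that such a model would return.

```python
import re

def split_sentences(body):
    """Naive sentence splitter (an assumption; the description does not specify one)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]

def company_features(title, body, company):
    """Return (first-mention sentence number, normalized number, text length) or None."""
    sentences = [title] + split_sentences(body)  # index 0 is the headline
    name = company.lower()
    first = next((i for i, s in enumerate(sentences) if name in s.lower()), None)
    if first is None:
        return None  # the company is not mentioned at all
    body_sentences = max(len(sentences) - 1, 1)  # assumed normalizer: body sentences only
    return (first, first / body_sentences, len(body))

def is_main_company(probability, threshold=0.5):
    """Apply the classification threshold to the probability produced by a trained model."""
    return probability >= threshold
```

A company named in the headline gets the feature values 0 and 0.0, which a trained tree would typically associate with the "main" class.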

[0045] To determine whether a given event is relevant to the companies named in the news body, the lemmas obtained from the news body are processed in step (303) using machine learning models.

[0046] As one example of a machine learning model, logistic regression over the TF-IDF statistic computed for the lemmas of the text can be used (see https://ru.wikipedia.org/wiki/TF-IDF). The statistic is computed for each lemma in the text, after which a pre-trained logistic regression makes its decision on the resulting features. In addition to processing by the machine learning model, a list of predefined lemmas is compiled; for example, the list may contain the 30-40 lemmas with the greatest weight in the logistic regression. The list is built for each event after the logistic regression has been trained. The model yields a weight for each lemma, and lemmas are selected for the list on the basis of these weights.

[0047] If the probability output by the model (303) is above a preset threshold and the news text contains at least one lemma from said list, then at least one event is assigned to the news item (304) for the company name identified in the text.
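The decision rule of this first example can be sketched as follows, assuming the logistic regression has already been trained. The IDF table, the weights, the bias, and the default threshold of 0.5 are hypothetical inputs; the description only says the threshold is preset and that the list holds the highest-weight lemmas.

```python
import math
from collections import Counter

def tf_idf(lemmas, idf):
    """TF-IDF per lemma: relative frequency in the text times a precomputed IDF value."""
    counts = Counter(lemmas)
    total = len(lemmas)
    return {lemma: (count / total) * idf.get(lemma, 0.0) for lemma, count in counts.items()}

def top_lemmas(weights, n=40):
    """Select the n lemmas with the greatest logistic-regression weight."""
    return set(sorted(weights, key=weights.get, reverse=True)[:n])

def event_probability(lemmas, idf, weights, bias):
    """Probability from a pre-trained logistic regression over the TF-IDF features."""
    features = tf_idf(lemmas, idf)
    z = bias + sum(weights.get(lemma, 0.0) * value for lemma, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def assign_event(lemmas, idf, weights, bias, key_lemmas, threshold=0.5):
    """Assign the event only if the probability is high AND a listed lemma occurs in the text."""
    probability = event_probability(lemmas, idf, weights, bias)
    return probability > threshold and any(lemma in key_lemmas for lemma in lemmas)
```

Requiring a listed lemma in addition to the probability threshold acts as a second, rule-based check on the classifier output.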

[0048] Additionally, a set of lemmas, for example the 10-15 lemmas most relevant to the event, can be specified for each event; these are drawn from the previously defined list of lemmas. If the event was assigned to the news item in step (304), all lemmas from this set that are found in the text are highlighted in the text.

[0049] A second example of applying a machine learning model is a classification algorithm based on gradient boosting, for example LightGBM (https://lightgbm.readthedocs.io). For each news text, the number of sentences containing pairs of lemmas characteristic of the event is counted. The pairs of characteristic lemmas are selected for each event while the classifier is trained. The characteristic lemmas (and their number) are selected automatically during training.

[0050] The machine learning model (303) makes its decision on the features obtained in this way (pairs of lemmas). If the resulting probability is above a preselected threshold, the event is assigned to the news item (304) for one or more of the companies named in it. Additionally, the sentences containing pairs of lemmas characteristic of the event can be highlighted in each text.
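The features of this second example, counts of sentences that contain a characteristic pair of lemmas, could be built as in the sketch below. The example pairs and function names are hypothetical, and the gradient boosting classifier that would consume the resulting vector (for example LightGBM) is deliberately left out.

```python
def pair_sentence_counts(sentences, pairs):
    """For each lemma pair, count the sentences that contain both lemmas.

    sentences: list of lemma lists, one per sentence
    pairs: list of (lemma_a, lemma_b) tuples characteristic of an event
    """
    lemma_sets = [set(sentence) for sentence in sentences]
    return [sum(1 for s in lemma_sets if a in s and b in s) for a, b in pairs]

def sentences_to_highlight(sentences, pairs):
    """Indices of sentences containing at least one characteristic pair."""
    return [
        index for index, sentence in enumerate(sentences)
        if any(a in sentence and b in sentence for a, b in pairs)
    ]
```

The count vector has one entry per pair, so its length is fixed for a given event and it can be passed to a classifier as a feature row.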

[0051] If processing the news data does not reveal relevant events for the given company names, that information is disregarded (305).

[0052] FIG. 4 shows an example of a graphical user interface (400) for interacting with the service for selecting relevant news information. The interface (400) provides functionality for displaying and managing the content of the data provided. A search query is formed using the company name input panel (401). The main field (404), which displays current or found information, presents a list of the companies for which relevant information is being identified from the database of the server (110).

[0053] Companies in the field (404) can be displayed in various orders, for example alphabetically or by number of news items. The information can be filtered by a time range set in the date input field (402).

[0054] The interface (400) also contains a control panel (403) for configuring search query parameters. The control panel (403) can be used to configure the detection of particular event types, link companies, configure service parameters, and so on. The field (405) displays a list of the news sources identified in accordance with the events specified for the companies.

[0055] Users (10) can also set up a notification function for selected company names. Notifications about newly arrived news can be delivered by e-mail, push notifications, SMS notifications, and the like. When setting up the notification function, the user (10) can configure the required parameters, for example the company name and the type of events associated with the companies.

[0056] The information generated from the processed news can also be displayed through a filter configured according to the role of the user (10) interacting with the interface (400). Depending on the parameters of the account of the user (10), the user may be shown only those news items that contain an event type associated with that role.

[0057] FIG. 5 shows an example of a general view of a device (500) that implements the presented solution. A range of computing devices can be built on the basis of the device (500), for example the control server (110), the news aggregator server (120), user devices (10), and so on.

[0058] In general, the device (500) comprises one or more processors (501), memory such as RAM (502) and ROM (503), input/output interfaces (504), input/output devices (505), and a networking device (506), all connected by a common data bus.

[0059] The processor (501) (or several processors, a multi-core processor, etc.) can be chosen from the range of devices widely used today, for example from manufacturers such as Intel™, AMD™, Apple™, Samsung Exynos™, MediaTEK™, Qualcomm Snapdragon™, and the like. The processor, or one of the processors used in the device (500), should also be understood to include a graphics processor, for example an NVIDIA GPU or Graphcore, which is likewise suitable for fully or partially executing the method (200) and can also be used for training and applying machine learning models in various information systems.

[0060] The RAM (502) is random-access memory intended to store machine-readable instructions executed by the processor (501) to perform the required logical data processing operations. The RAM (502) typically holds executable instructions of the operating system and of the relevant software components (applications, program modules, etc.). The available memory of a graphics card or graphics processor can also serve as the RAM (502).

[0061] The ROM (503) is one or more persistent data storage means, for example a hard disk drive (HDD), a solid-state drive (SSD), flash memory (EEPROM, NAND, etc.), optical storage media (CD-R/RW, DVD-R/RW, Blu-ray Disc, MD), and others.

[0062] Various types of I/O interfaces (504) are used to operate the components of the device (500) and externally connected devices. The choice of interfaces depends on the specific design of the computing device; they may include, without limitation: PCI, AGP, PS/2, IrDa, FireWire, LPT, COM, SATA, IDE, Lightning, USB (2.0, 3.0, 3.1, micro, mini, type C), TRS/Audio jack (2.5, 3.5, 6.35), HDMI, DVI, VGA, Display Port, RJ45, RS232, etc.

[0063] Various I/O means (505) are used for user interaction with the computing system (500), for example a keyboard, display (monitor), touch display, touchpad, joystick, mouse, light pen, stylus, touch panel, trackball, speakers, microphone, augmented reality devices, optical sensors, tablet, indicator lights, projector, camera, biometric identification means (retina scanner, fingerprint scanner, voice recognition module), etc.

[0064] The networking means (506) provides data transmission over an internal or external computer network, for example an intranet, the Internet, a LAN, etc. One or more of the means (506) may be, without limitation: an Ethernet card, GSM modem, GPRS modem, LTE modem, 5G modem, satellite communication module, NFC module, Bluetooth and/or BLE module, Wi-Fi module, and others.

[0065] Satellite navigation means, for example GPS, GLONASS, BeiDou, or Galileo, may additionally be included in the device (500).

[0066] The specific choice of elements of the device (500) for implementing various hardware and software architectures may vary, provided the functionality required of the given type of device is preserved.

[0067] The application materials presented disclose preferred embodiments of the technical solution and should not be construed as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to those skilled in the relevant art.

Claims

1. A computer-implemented method for searching for relevant news, comprising the steps of:

receiving, on a control server, a set of news items from at least one news aggregator server;

analyzing the received set of news items on the control server, the analysis including:

lemmatizing the text of each news item in said set;

processing the resulting lemmas of the news texts using a machine learning model that contains a specified set of company data and a list of events, wherein a predefined set of lemmas is specified for each event in the machine learning model;

identifying news items that contain lemmas identifying the specified events, and linking the identified events to at least one company;

generating a list of relevant news based on the analysis of the set of news items.

2. The method according to claim 1, characterized in that, upon receipt of the set of news items, duplicate news items are filtered out.

3. The method according to claim 2, characterized in that the filtering is carried out by calculating the Jaccard measure between the signatures of the news items.

4. The method according to claim 1, characterized in that the events assigned to news about a company are stored in a database.

5. The method according to claim 1, characterized in that the news aggregator updates the list of news using information channels.

6. The method according to claim 5, characterized in that the information channels are websites on the Internet and/or messenger channels.

7. The method according to claim 1, characterized in that, during the analysis of the news, it is determined whether the event belongs to a main or an additional company.

8. The method according to claim 7, characterized in that the affiliation of the company referred to in the news is determined using a decision tree algorithm.

9. The method according to claim 1, characterized in that, during lemmatization, the news texts are cleared of punctuation marks, stop words and personal entities.

10. The method according to claim 1, characterized in that, during lemmatization, a statistical measure is calculated for each lemma of the news text.

11. The method according to claim 10, characterized in that the machine learning algorithm is a logistic regression that classifies whether a news item belongs to an event based on an analysis of the statistical measure of the lemmas.

12. The method according to claim 1, characterized in that, for each news text, frequently occurring phrases from 2 to 10 lemmas in length are determined.

13. The method according to claim 1, characterized in that the machine learning algorithm is gradient boosting trained to classify an event based on the number of sentences containing lemmas identifying the event from the search query.

14. The method according to claim 1, characterized in that, after an event from the news is assigned to the company, lemmas and/or sentences containing lemmas that identify the event are highlighted.

15. A device for searching for relevant news, containing at least one processor and at least one memory containing machine-readable instructions that, when executed by the at least one processor, perform the method according to any one of claims 1-14.

16. A system for searching for relevant news, comprising:

at least one control server;

at least one news aggregator server configured to receive news data from at least one news source,

wherein the control server is configured for:

receiving news data from the at least one news aggregator server;

analyzing the data, during which:

the text of each news item in said set of news is lemmatized;

generating a list of relevant news based on the analysis of the set of news.