CN110569502A - Method and device for identifying forbidden slogans, computer equipment and storage medium

Info

Publication number: CN110569502A
Application number: CN201910701299.6A
Authority: CN
Inventors: 洪帅; 叶国华; 黄坤; 吕锡海; 厉智
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Suning Cloud Computing Co Ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2019-12-13

Abstract

The invention discloses a method and a device for identifying forbidden slogans, computer equipment and a storage medium, and belongs to the field of artificial intelligence application. The method comprises the following steps: performing word segmentation processing on a text to be recognized to obtain a word segmentation result; identifying whether the text to be identified contains forbidden slogans or not based on the word segmentation result, and if the text to be identified does not contain the forbidden slogans, identifying whether the text to be identified contains suspected forbidden slogans or not; and if the text to be recognized contains suspected forbidden slogans, inputting the text to be recognized into a pre-trained semantic analysis model, and outputting a result of whether the text to be recognized contains the forbidden slogans. In the process of identifying whether the text to be identified contains the forbidden slogans, the forbidden slogans can be identified more accurately by two-stage text identification and the deep learning semantic analysis model, compared with the prior art which only uses single-stage text identification.

Description

Method and device for identifying forbidden slogans, computer equipment and storage medium

Technical Field

The invention relates to the field of artificial intelligence application, in particular to a method and a device for identifying forbidden slogans, computer equipment and a storage medium.

Background

With the rapid development of internet and e-commerce technologies, online shopping has become an important part of people's daily life. However, while consumers enjoy its convenience, they are also exposed to fraud and misleading claims caused by exaggerated advertising language.

Exaggerated advertising means that an operator uses advertising to present fabricated facts or conceal the truth, causing consumers and users to misunderstand the goods or services and to trade on that basis, so that the operator wins market share and obtains benefits. This behavior violates the principle of good faith and recognized business ethics, and is a serious form of unfair competition. Ordinary advertising and other commercial promotion mostly revolves around information about the goods or services, focusing on their characteristics, status, price, quality, composition, performance, usage, producer, expiration date and other conditions; false advertising content is just as broad and can relate to any of these aspects of the goods or services. Such false advertisements or false promotions are socially harmful because they mislead. Limit expressions are a typical false-publicity strategy, for example "national level", "world level", "highest level" and the like; limit expressions or exaggerated publicity in advertising content violate the new advertising law.

At present, forbidden slogans are mainly identified by extracting feature words from the slogan to be identified and matching them against a preset lexicon of illegal feature words to judge whether a forbidden slogan exists. However, this single-stage identification method does not use information such as lexical semantics, so its identification accuracy is not ideal.

Therefore, how to improve the accuracy of identifying the forbidden slogans becomes a technical problem to be solved urgently by those skilled in the art.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method, an apparatus, a computer device, and a storage medium for identifying a forbidden slogan, which can improve the accuracy of identifying the forbidden slogan.

The embodiments of the invention provide the following specific technical scheme:

In a first aspect, the present invention provides a method for identifying a forbidden slogan, including:

performing word segmentation processing on a text to be recognized to obtain a word segmentation result;

identifying whether the text to be identified contains forbidden slogans or not based on the word segmentation result, and if the text to be identified does not contain the forbidden slogans, identifying whether the text to be identified contains suspected forbidden slogans or not;

And if the text to be recognized contains suspected forbidden slogans, inputting the text to be recognized into a pre-trained semantic analysis model, and outputting a result of whether the text to be recognized contains the forbidden slogans.

Further, the text to be recognized is one of an input text, an image recognition result text and a voice recognition result text.

Further, performing word segmentation processing on the text to be recognized to obtain a word segmentation result includes:

performing word segmentation processing on the text to be recognized according to a preset word segmentation word bank, a white list word bank, a forbidden word bank and a suspected forbidden word bank to obtain the word segmentation result.

Further, identifying whether the text to be identified contains forbidden slogans based on the word segmentation result includes:

Performing intersection operation on the word segmentation result and the forbidden word bank, and judging whether the text to be identified contains forbidden slogans;

The identifying whether the text to be identified contains suspected forbidden slogans includes:

And performing intersection operation on the word segmentation result and the suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans.

Further, the pre-trained semantic analysis model is obtained by training through the following process:

dividing the labeled text data set into a training set, a verification set and a test set according to a preset proportion;

performing iterative training on the initial classification model according to the training set, judging whether the iterative training is finished or not by using the verification set, and outputting the trained initial classification model after judging that the iterative training is finished;

and testing the trained initial classification model according to the test set until reaching a preset test accuracy to obtain the semantic analysis model.

In a second aspect, the present invention provides an apparatus for identifying forbidden slogans, the apparatus comprising:

The word segmentation module is used for performing word segmentation processing on the text to be recognized to obtain a word segmentation result;

The first identification module is used for identifying whether the text to be identified contains forbidden slogans or not based on the word segmentation result, and identifying whether the text to be identified contains suspected forbidden slogans or not if the text to be identified does not contain the forbidden slogans;

and the second identification module is used for inputting the text to be identified into a pre-trained semantic analysis model and outputting a result whether the text to be identified contains the forbidden slogans or not if the text to be identified contains the suspected forbidden slogans.

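Further, the word segmentation module is specifically configured to:

perform word segmentation processing on the text to be recognized according to a preset word segmentation word bank, a white list word bank, a forbidden word bank and a suspected forbidden word bank to obtain a word segmentation result.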

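Further, the first identification module is specifically configured to:

Performing intersection operation on the word segmentation result and the forbidden word library, and judging whether the text to be identified contains forbidden slogans;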

The first identification module is specifically further configured to:

Inquiring each word in the word segmentation result in the forbidden word library, and judging whether the text to be identified contains forbidden slogans;

The first identification module is specifically further configured to:

Performing intersection operation on the word segmentation result and the suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans;

The first identification module is specifically further configured to:

And inquiring each word in the word segmentation result in the suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans.

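Further, the device further comprises a training module, and the training module is specifically configured to:

divide the labeled text data set into a training set, a verification set and a test set according to a preset proportion;

perform iterative training on the initial classification model according to the training set, judge whether the iterative training is finished by using the verification set, and output the trained initial classification model after judging that the iterative training is finished;

and test the trained initial classification model according to the test set until a preset test accuracy is reached to obtain the semantic analysis model.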

In a third aspect, the present invention provides a computer device comprising:

One or more processors;

Storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the method for identifying forbidden slogans according to any one of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for identifying forbidden slogans according to any one of the first aspect.

The invention provides a method, a device, computer equipment and a storage medium for identifying forbidden slogans, which are used for carrying out word segmentation processing on a text to be identified to obtain word segmentation results; identifying whether the text to be identified contains forbidden slogans or not based on the word segmentation result, and if the text to be identified does not contain the forbidden slogans, identifying whether the text to be identified contains suspected forbidden slogans or not; and if the text to be recognized contains suspected forbidden slogans, inputting the text to be recognized into a pre-trained semantic analysis model, and outputting a result of whether the text to be recognized contains the forbidden slogans. In the process of identifying whether the text to be identified contains the forbidden slogans or not, through two-stage text identification and the application of a deep learning semantic analysis model, the forbidden slogans can be identified more accurately compared with the prior art in which only single-stage text identification is used; in addition, the deep learning semantic analysis model is used for identifying whether the text contains the forbidden slogans only when the text to be identified contains the suspected forbidden slogans, so that the deep learning semantic analysis model can be applied more specifically, the computing resources of the semantic analysis model are saved, and the delay time caused by the forbidden slogans identification process is reduced.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic view illustrating an application scenario of a method for identifying a forbidden slogan according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a method for identifying a forbidden slogan according to an embodiment of the present invention;

Fig. 3 shows a block diagram of a device for identifying a forbidden slogan according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It is to be understood that throughout the specification and claims, unless the context clearly requires otherwise, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

It will be further understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Referring to fig. 1, fig. 1 is a schematic view illustrating an application scenario of the method for identifying a forbidden slogan according to an embodiment of the present invention, where a terminal 102 communicates with a server 104 through a network. The terminal 102 may be a variety of electronic devices having a display screen and supporting information browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.

Illustratively, a merchant user uploads commodity description information to the server 104 of an e-commerce platform through the terminal 102. After receiving the commodity description information, the server 104 identifies whether it contains forbidden slogans. When the server 104 identifies that the commodity description information contains forbidden slogans, it can send a reminder message to the terminal 102 to prompt the merchant user to modify the commodity description information; when the server 104 identifies that the commodity description information does not contain forbidden slogans, the commodity description information can be published on the e-commerce platform.

Referring to Fig. 2, Fig. 2 is a flowchart illustrating a method for identifying a forbidden slogan according to an embodiment of the present invention. The method is described by taking its application to the server in Fig. 1 as an example, and may include:

Step 201, performing word segmentation processing on the text to be recognized to obtain word segmentation results.

The text to be recognized may be an input text, for example, a merchant user inputs a title of a commodity, a subtitle of the commodity, detailed information, and the like through a terminal.

The text to be recognized may also be an image recognition result text, for example, a text message in a product image is recognized and extracted, and the product image may be an image directly uploaded by a merchant user through a terminal, or a key frame containing the product image extracted by sampling from a video uploaded by the merchant user through the terminal.

In addition, the text to be recognized may also be a speech recognition result text; for example, speech recognition is performed on audio or video uploaded by a merchant user through a terminal, and the speech recognition result is converted into text information.

Specifically, performing word segmentation processing on the text to be recognized to obtain a word segmentation result may include:

performing word segmentation processing on the text to be recognized according to a preset word segmentation lexicon, a white list lexicon, a forbidden word lexicon and a suspected forbidden word lexicon to obtain a word segmentation result.

The word segmentation lexicon is a basic lexicon, such as the jieba lexicon. The white list lexicon is a lexicon of normal proper names and technical terms that contain limit expressions, for example professional terms such as "maximum power". The forbidden word lexicon is a lexicon of extreme words, for example forbidden words such as "unique" and "far ahead". The suspected forbidden word lexicon is a lexicon of suspected forbidden words, that is, words that can appear both in normal sentences and in forbidden slogans. For example, "best" is a suspected forbidden word: "the best clothes" is a limit expression, whereas "it is best to refuel in a ventilated place" is not a limit expression but expresses a suggestion.

In a specific implementation, the white list lexicon, the forbidden word lexicon and the suspected forbidden word lexicon can be constructed in advance, and the jieba word segmentation algorithm is used to segment the text to be recognized according to the preset word segmentation lexicon, white list lexicon, forbidden word lexicon and suspected forbidden word lexicon. The word segmentation result is a set containing one or more words.

In addition, other word segmentation algorithms in the prior art can be adopted to perform word segmentation processing on the text to be recognized, and the invention is not limited to this.

In this embodiment, segmenting the text to be recognized according to the preset word segmentation lexicon, white list lexicon, forbidden word lexicon and suspected forbidden word lexicon prevents a forbidden slogan or suspected forbidden slogan from being split into several words, and also prevents a normal technical term containing extreme semantics from being split and treated as a forbidden slogan. This improves the accuracy of word segmentation, so that whether the text contains forbidden slogans can be identified accurately in the subsequent steps.
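The effect of segmenting against all four lexicons can be illustrated with a minimal sketch. It does not use jieba; it substitutes a simple forward maximum matching segmenter so that the example stays self-contained. The function name `segment` and every lexicon entry below are hypothetical illustrations, not the contents of the actual word banks.

```python
# Hypothetical lexicons; every entry is illustrative only.
BASE_LEXICON = {"产品", "最大", "功率"}      # product, largest, power
WHITELIST_LEXICON = {"最大功率"}              # "maximum power", a normal technical term
FORBIDDEN_LEXICON = {"最大", "遥遥领先"}      # "largest", "far ahead"
SUSPECTED_LEXICON = {"最好"}                  # "best" / "it is best to"


def segment(text, lexicons):
    """Forward maximum matching: at each position take the longest lexicon word."""
    vocab = set().union(*lexicons)
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "本产品最大功率遥遥领先"  # "this product's maximum power is far ahead"
with_whitelist = segment(
    text, [BASE_LEXICON, WHITELIST_LEXICON, FORBIDDEN_LEXICON, SUSPECTED_LEXICON]
)
without_whitelist = segment(text, [BASE_LEXICON, FORBIDDEN_LEXICON, SUSPECTED_LEXICON])
print(with_whitelist)     # ['本', '产品', '最大功率', '遥遥领先']
print(without_whitelist)  # ['本', '产品', '最大', '功率', '遥遥领先']
```

With the white list lexicon loaded, the technical term stays in one piece and the hypothetical forbidden word inside it is never produced as a separate token; without it, the term is split and would be flagged.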

Step 202, based on the word segmentation result, identifying whether the text to be identified contains forbidden slogans, and if the text to be identified does not contain the forbidden slogans, identifying whether the text to be identified contains suspected forbidden slogans.

Specifically, identifying whether the text to be identified contains a forbidden slogan based on the word segmentation result may include:

performing intersection operation on the word segmentation result and the forbidden word bank, detecting whether the intersection operation result of the word segmentation result and the forbidden word bank is empty, if so, determining that the text to be identified does not contain forbidden slogans, and if not, determining that the text to be identified contains the forbidden slogans.

In addition, each word in the word segmentation result can be queried in the forbidden word bank, if the query result is empty, the text to be recognized is determined not to contain forbidden slogans, and if the query result is not empty, the text to be recognized is determined to contain the forbidden slogans.
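The two alternatives just described, set intersection and per-word query, can be sketched as follows. The function names are hypothetical; the two lexicon entries are the examples named in the description. Both variants should give the same answer, and the per-word query can stop at the first hit.

```python
def contains_by_intersection(tokens, lexicon):
    """Return True when the token set and the lexicon share at least one word."""
    return bool(set(tokens) & set(lexicon))


def contains_by_query(tokens, lexicon):
    """Return True as soon as one token is found in the lexicon."""
    lexicon = set(lexicon)
    return any(token in lexicon for token in tokens)


FORBIDDEN_LEXICON = {"unique", "far ahead"}

flagged_tokens = ["this", "phone", "is", "far ahead"]
clean_tokens = ["this", "phone", "is", "blue"]

print(contains_by_intersection(flagged_tokens, FORBIDDEN_LEXICON))  # True
print(contains_by_query(flagged_tokens, FORBIDDEN_LEXICON))         # True
print(contains_by_intersection(clean_tokens, FORBIDDEN_LEXICON))    # False
print(contains_by_query(clean_tokens, FORBIDDEN_LEXICON))           # False
```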

When the text to be identified contains the forbidden slogans, the server can send a reminder message to the terminal of the merchant user to prompt the merchant user to modify the commodity description information, or directly delete or hide the text containing the forbidden slogans, so that fraud and the misleading of consumers caused by false publicity of commodities are effectively avoided.

Specifically, identifying whether the text to be identified contains suspected forbidden slogans may include:

And performing intersection operation on the word segmentation result and the suspected forbidden word bank, detecting whether the intersection operation result of the word segmentation result and the suspected forbidden word bank is empty, if so, determining that the text to be identified does not contain suspected forbidden slogans, and if not, determining that the text to be identified contains the suspected forbidden slogans.

In addition, each word in the word segmentation result can be queried in a suspected forbidden word bank, if the query result is empty, the text to be recognized does not contain suspected forbidden slogans, and if the query result is not empty, the text to be recognized contains the suspected forbidden slogans.

When the text to be recognized does not contain suspected forbidden slogans, the server can publish the text, or the original image or original audio and video from which the text was obtained, to the e-commerce platform.

In this embodiment, through the pre-established forbidden word library and suspected forbidden word library, whether the text to be recognized contains forbidden slogans or suspected forbidden slogans can be accurately recognized.
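Step 202 has three possible outcomes: forbidden, suspected, or clean. A minimal sketch of that ordering follows, assuming the word segmentation result is already available as a list of tokens; the names `FirstStage` and `first_stage` and the lexicon contents are hypothetical.

```python
from enum import Enum


class FirstStage(Enum):
    FORBIDDEN = "forbidden"   # reject or ask the merchant to revise
    SUSPECTED = "suspected"   # hand over to the semantic analysis model
    CLEAN = "clean"           # may be published


def first_stage(tokens, forbidden_lexicon, suspected_lexicon):
    """Check the forbidden lexicon first, and the suspected lexicon only if that fails."""
    token_set = set(tokens)
    if token_set & forbidden_lexicon:
        return FirstStage.FORBIDDEN
    if token_set & suspected_lexicon:
        return FirstStage.SUSPECTED
    return FirstStage.CLEAN


FORBIDDEN = {"unique", "far ahead"}
SUSPECTED = {"best"}

print(first_stage(["a", "unique", "best", "phone"], FORBIDDEN, SUSPECTED))  # FirstStage.FORBIDDEN
print(first_stage(["the", "best", "clothes"], FORBIDDEN, SUSPECTED))        # FirstStage.SUSPECTED
print(first_stage(["a", "blue", "phone"], FORBIDDEN, SUSPECTED))            # FirstStage.CLEAN
```

A text containing both a forbidden word and a suspected word is reported as forbidden, because the suspected lexicon is consulted only when the forbidden check finds nothing.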

Step 203, if the text to be recognized contains suspected forbidden slogans, inputting the text to be recognized into a pre-trained semantic analysis model, and outputting a result of whether the text to be recognized contains the forbidden slogans.

Specifically, whether the text to be recognized contains the forbidden slogans can be determined according to the result output by the semantic analysis model: when the output result is a positive label, the text to be recognized is determined to contain the forbidden slogans; otherwise, the text to be recognized is determined not to contain the forbidden slogans.

When the text to be identified contains the forbidden slogans, the server can send a reminder message to the terminal of the merchant user to prompt the merchant user to modify the commodity description information, or directly delete or hide the text containing the forbidden slogans, so that fraud and the misleading of consumers caused by false publicity of commodities are effectively avoided. When the text to be recognized does not contain suspected forbidden slogans, the server can issue the original image of the text or the original audio and video of the text to the e-commerce platform.
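Steps 201 to 203 can be tied together in a short sketch in which the semantic analysis model is passed in as a callable and is invoked only for suspected texts. Everything here is an assumption made for illustration: `str.split` stands in for the word segmenter, and `toy_classifier` is a placeholder rule, not a trained model.

```python
def identify_forbidden_slogan(text, segment, forbidden, suspected, classify):
    """Return True if the text is judged to contain a forbidden slogan."""
    tokens = set(segment(text))
    if tokens & forbidden:
        return True                # stage one: direct lexicon hit
    if not (tokens & suspected):
        return False               # nothing suspicious, model is skipped
    return classify(text)          # stage two: positive label means forbidden


model_calls = []  # records which texts actually reached the model


def toy_classifier(text):
    """Placeholder for the pre-trained semantic analysis model."""
    model_calls.append(text)
    return "clothes" in text


FORBIDDEN = {"unique"}
SUSPECTED = {"best"}
texts = [
    "a unique phone",
    "the best clothes",
    "best to refuel in a ventilated place",
    "a blue phone",
]
results = [
    identify_forbidden_slogan(t, str.split, FORBIDDEN, SUSPECTED, toy_classifier)
    for t in texts
]
print(results)           # [True, True, False, False]
print(len(model_calls))  # 2
```

Only the two texts containing the suspected word reach the classifier, which is the source of the resource saving claimed for the two-stage design.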

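The pre-trained semantic analysis model is obtained by training through the following process:

dividing the labeled text data set into a training set, a verification set and a test set according to a preset proportion;

performing iterative training on the initial classification model according to the training set, judging whether the iterative training is finished by using the verification set, and outputting the trained initial classification model after judging that the iterative training is finished;

and testing the trained initial classification model according to the test set until the preset test accuracy is reached to obtain the semantic analysis model.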

In a specific implementation, a text data set can be manually labeled to obtain the labeled text data set, which contains two classes: texts containing forbidden slogans, labeled with a positive label, and texts not containing forbidden slogans, labeled with a negative label. The labeled text data set can be divided into a training set, a verification set and a test set according to a preset proportion. For example, the manually labeled text data set includes 50,000 labeled texts, of which 35,000 are used as the training set, 5,000 as the verification set and 10,000 as the test set. In model training, Google's open-source BERT model can be used as the initial classification model: the relevant training parameters are set, the initial classification model is iteratively trained on the training set, and the verification set is used to judge whether the iterative training is finished. After the iterative training is judged to be finished, the trained initial classification model is output; it is then tested on the test set until the preset test accuracy is reached, giving the semantic analysis model.
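The data split and the use of the verification set to end training can be sketched without any deep learning framework. The 7:1:2 proportion matches the 35,000 / 5,000 / 10,000 example above. The stopping rule shown here (stop when the verification score has not improved for `patience` consecutive epochs) is an assumption, since the description does not state the criterion, and the function names are hypothetical.

```python
import random


def split_dataset(samples, parts=(7, 1, 2), seed=0):
    """Shuffle and split into training, verification and test sets by integer parts."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(parts)
    n_train = len(items) * parts[0] // total
    n_val = len(items) * parts[1] // total
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def train_until_validation_stalls(train_one_epoch, validate, max_epochs=20, patience=2):
    """Run epochs until the verification score fails to improve `patience` times in a row."""
    best_score = float("-inf")
    stalled = 0
    epochs_run = 0
    for _ in range(max_epochs):
        train_one_epoch()
        epochs_run += 1
        score = validate()
        if score > best_score:
            best_score, stalled = score, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return epochs_run, best_score


train_set, verification_set, test_set = split_dataset(range(50_000))
print(len(train_set), len(verification_set), len(test_set))  # 35000 5000 10000

# Simulated verification scores stand in for a real model evaluation.
scores = iter([0.60, 0.70, 0.69, 0.68, 0.90])
epochs_run, best_score = train_until_validation_stalls(lambda: None, lambda: next(scores))
print(epochs_run, best_score)  # 4 0.7
```

In a real setup, `train_one_epoch` and `validate` would wrap fine-tuning and evaluation of the BERT classifier; here they are stubs so that the control flow can be run as is.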

It should be noted that, if higher forbidden-slogan identification accuracy is desired for a specific commodity category, the amount of text data drawn from that commodity category can be increased when training the initial classification model, so as to obtain better forbidden-slogan identification and generalization for that category.
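One way to raise the share of a commodity category in the training data is to draw texts per category with explicit quotas. The sketch below assumes each labeled sample carries a category field; the tuple layout, the function name `sample_by_category` and the category names are hypothetical.

```python
import random
from collections import defaultdict


def sample_by_category(pool, quotas, seed=0):
    """Draw up to quotas[category] samples from each category of the labeled pool."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for sample in pool:
        _text, _label, category = sample
        by_category[category].append(sample)
    selected = []
    for category, quota in quotas.items():
        candidates = by_category.get(category, [])
        selected.extend(rng.sample(candidates, min(quota, len(candidates))))
    rng.shuffle(selected)
    return selected


pool = [(f"appliance text {i}", i % 2, "appliances") for i in range(10)]
pool += [(f"clothing text {i}", i % 2, "clothing") for i in range(10)]

# Raising the quota for "appliances" increases its weight in the training data.
training_texts = sample_by_category(pool, {"appliances": 8, "clothing": 2})
appliance_count = sum(1 for _, _, category in training_texts if category == "appliances")
print(len(training_texts), appliance_count)  # 10 8
```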

In this embodiment, a deep learning semantic analysis model is used to identify whether the text contains forbidden slogans, which solves the deep semantic ambiguity that a lexicon alone cannot resolve in the prior art, so that forbidden slogans can be identified more accurately. Moreover, the deep learning semantic analysis model is used only when the text to be recognized is found to contain suspected forbidden slogans, so the model is applied in a more targeted way, which both saves the computing resources of the semantic analysis model and reduces the delay introduced by the forbidden-slogan identification process.

The invention provides a method for identifying forbidden slogans, which comprises the steps of carrying out word segmentation processing on a text to be identified to obtain word segmentation results; identifying whether the text to be identified contains forbidden slogans or not based on the word segmentation result, and if the text to be identified does not contain the forbidden slogans, identifying whether the text to be identified contains suspected forbidden slogans or not; and if the text to be recognized contains suspected forbidden slogans, inputting the text to be recognized into a pre-trained semantic analysis model, and outputting a result of whether the text to be recognized contains the forbidden slogans. In the process of identifying whether the text to be identified contains the forbidden slogans or not, through two-stage text identification and the application of a deep learning semantic analysis model, the forbidden slogans can be identified more accurately compared with the prior art in which only single-stage text identification is used; in addition, the deep learning semantic analysis model is used for identifying whether the text contains the forbidden slogans only when the text to be identified contains the suspected forbidden slogans, so that the deep learning semantic analysis model can be applied more specifically, the computing resources of the semantic analysis model are saved, and the delay time caused by the forbidden slogans identification process is reduced.

As an implementation of the method for identifying forbidden slogans in the foregoing embodiment, an embodiment of the present invention further provides an apparatus for identifying forbidden slogans. As shown in Fig. 3, the apparatus includes:

the word segmentation module 31 is configured to perform word segmentation processing on the text to be recognized to obtain a word segmentation result;

the first identification module 32 is configured to identify whether the text to be identified contains forbidden slogans based on the word segmentation result, and if the text to be identified does not contain the forbidden slogans, identify whether the text to be identified contains suspected forbidden slogans;

and the second identification module 33 is configured to, if the text to be identified contains suspected forbidden slogans, input the text to be identified into a pre-trained semantic analysis model, and output a result of whether the text to be identified contains the forbidden slogans.

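Further, the word segmentation module 31 is specifically configured to:

perform word segmentation processing on the text to be recognized according to a preset word segmentation lexicon, a white list lexicon, a forbidden word lexicon and a suspected forbidden word lexicon to obtain a word segmentation result.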

Further, the first identification module 32 is specifically configured to:

performing intersection operation on the word segmentation result and the forbidden word library, and judging whether the text to be identified contains forbidden slogans;

The first identification module 32 is further specifically configured to:

Inquiring each word in the word segmentation result in a forbidden word library, and judging whether the text to be identified contains forbidden slogans;

The first identification module 32 is further specifically configured to:

performing intersection operation on the word segmentation result and the suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans;

The first identification module 32 is further specifically configured to:

And inquiring each word in the word segmentation result in a suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans.

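Further, the apparatus further comprises a training module 34, where the training module 34 is specifically configured to:

divide the labeled text data set into a training set, a verification set and a test set according to a preset proportion;

perform iterative training on the initial classification model according to the training set, judge whether the iterative training is finished by using the verification set, and output the trained initial classification model after judging that the iterative training is finished;

and test the trained initial classification model according to the test set until the preset test accuracy is reached to obtain the semantic analysis model.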

The device for identifying forbidden slogans provided by the embodiment of the invention and the method for identifying forbidden slogans provided by the embodiment of the invention belong to the same inventive concept. The device can execute the method for identifying forbidden slogans provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to executing that method. For technical details that are not described in detail in the embodiments of the present invention, reference may be made to the method for identifying forbidden slogans provided in the embodiments of the present invention, and details are not described here again.

In addition, another embodiment of the present invention further provides a computer device, including:

One or more processors;

A memory;

A program stored in the memory, which when executed by the one or more processors, causes the processors to perform the steps of the method for identifying forbidden slogans as described in the embodiments above.

Furthermore, another embodiment of the present invention also provides a computer-readable storage medium, which stores a program that, when executed by a processor, causes the processor to perform the steps of the method for identifying forbidden slogans as described in the above embodiment.

As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for identifying forbidden slogans, the method comprising:

2. The method according to claim 1, wherein the text to be recognized is one of input text, image recognition result text, and voice recognition result text.

3. The method according to claim 1, wherein the performing word segmentation processing on the text to be recognized to obtain a word segmentation result comprises:

4. The method of claim 3, wherein the identifying whether the text to be identified contains forbidden slogans based on the word segmentation result comprises:

Performing intersection operation on the word segmentation result and the forbidden word bank, and judging whether the text to be identified contains forbidden slogans; or

Performing intersection operation on the word segmentation result and the suspected forbidden word library, and judging whether the text to be identified contains suspected forbidden slogans; or

5. The method according to any one of claims 1 to 4, wherein the pre-trained semantic analysis model is obtained by training through the following process:

6. An apparatus for identifying forbidden slogans, the apparatus comprising:

7. The apparatus of claim 6, wherein the text to be recognized is one of input text, image recognition result text, and voice recognition result text.

8. The apparatus of claim 6, wherein the word segmentation module is specifically configured to:

9. A computer device, comprising:

one or more processors;

Storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the method for identifying forbidden slogans as claimed in any of claims 1 to 5.

10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method for identifying forbidden slogans according to any one of claims 1 to 5.