CN110264318A - Data processing method and device, electronic equipment and storage medium - Google Patents

Data processing method and device, electronic equipment and storage medium

Info

Publication number
CN110264318A
Authority
CN
China
Prior art keywords
product
sample
keyword
category
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910563737.7A
Other languages
Chinese (zh)
Inventor
赵呈路
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lazas Network Technology Shanghai Co Ltd
Original Assignee
Lazas Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lazas Network Technology Shanghai Co Ltd
Priority to CN201910563737.7A
Publication of CN110264318A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/12 Hotels or restaurants

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Accounting & Taxation (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Engineering & Computer Science (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present disclosure disclose a data processing method and device, electronic equipment, and a storage medium. The method comprises: acquiring sample data, wherein the sample data comprises a text description of a sample product and a category to which the sample product belongs; extracting keywords from the text description; determining the importance degree of each keyword; and training a product identification model by using feature data of the sample product and the category to which the sample product belongs, wherein the feature data includes the importance degree of the keywords corresponding to the sample product. A product identification model trained in this way can learn, from the text description of a product, how strongly each keyword in the description influences identification of the product under its category. This can improve the accuracy of product category identification, and the model can distinguish even different products that have similar text descriptions.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of internet technology, more and more products appear on online operation platforms. To express the similarities and differences among products, an online operation platform usually generates portrait data for its products, which facilitates classifying and identifying them in scenarios such as retrieval. However, because products are so varied, the same product may have different text descriptions (for example, different product names), and different products may have the same or similar text descriptions. Product portrait data is therefore usually produced by manually screening keywords, and because people differ in their ability to abstract and summarize, the resulting error is large. As a result, labeling product portrait data is time-consuming and labor-intensive, and its accuracy is low.
Disclosure of Invention
The embodiment of the disclosure provides a data processing method and device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a data processing method.
Specifically, the data processing method includes:
acquiring sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
extracting key words in the text description;
determining the importance degree of the keyword;
training a product identification model by using the characteristic data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining sample data includes:
acquiring text descriptions of a plurality of sample products in a preset category;
and performing de-duplication processing on the text descriptions of the sample products.
With reference to the first aspect and/or the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the performing de-duplication processing on the text descriptions of the plurality of sample products includes:
and uniformly mapping a plurality of different text descriptions corresponding to the same sample product into the same text description.
With reference to the first aspect, the first implementation manner of the first aspect, and/or the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the extracting keywords in the text description includes:
segmenting the text description;
and determining, as the keywords, the segmented words whose correlation with the category to which the sample product belongs is higher than a preset threshold.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, and/or the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the determining, as the keyword, a segmented word whose correlation with a category to which the sample product belongs is higher than a preset threshold includes:
and determining the correlation between the segmented word and the category to which the sample product belongs by using a chi-square test of independence.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, and/or the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the determining the importance degree of the keyword includes:
and determining the TF-IDF value of the keyword as the importance degree of the keyword.
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, and/or the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the determining the TF-IDF value of the keyword as the importance degree of the keyword includes:
determining a TF-IDF value of the keyword under the category to which the sample product belongs;
and when the keyword corresponds to a plurality of TF-IDF values under different categories, selecting the smallest TF-IDF value as the importance degree of the keyword.
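A minimal sketch of the per-category scoring and minimum selection described above, assuming the value in question is the standard term frequency-inverse document frequency (TF-IDF) statistic. The disclosure does not give a formula at this point, so the choices here are assumptions: each category is treated as a single document made of all its segmented words, and the inverse document frequency uses a common smoothed form. The function names and sample data are hypothetical.

```python
import math
from collections import Counter

def tfidf_by_category(tokens_by_category):
    """Return {category: {term: tf-idf}}, treating each category as one document (assumption)."""
    n_categories = len(tokens_by_category)
    counts = {cat: Counter(tokens) for cat, tokens in tokens_by_category.items()}
    # Document frequency: the number of categories in which each term occurs.
    df = Counter(term for c in counts.values() for term in c)
    scores = {}
    for cat, c in counts.items():
        total = sum(c.values())
        scores[cat] = {
            # Term frequency times a smoothed inverse document frequency (assumed form).
            term: (n / total) * (math.log((1 + n_categories) / (1 + df[term])) + 1.0)
            for term, n in c.items()
        }
    return scores

def keyword_importance(scores):
    """When a term has values under several categories, keep the smallest one."""
    importance = {}
    for per_term in scores.values():
        for term, value in per_term.items():
            importance[term] = min(value, importance.get(term, value))
    return importance

# Hypothetical segmented descriptions, pooled per category.
tokens_by_category = {
    "egg dishes": ["tomato", "egg", "egg", "fried"],
    "noodles": ["tomato", "noodle", "beef", "noodle", "soup"],
}
scores = tfidf_by_category(tokens_by_category)
importance = keyword_importance(scores)
```

Under these assumptions, a word such as "tomato" that occurs in both categories receives one value per category, and the smaller of the two is kept as its importance degree.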
With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, the fifth implementation manner of the first aspect, and/or the sixth implementation manner of the first aspect, in a seventh implementation manner of the first aspect, the determining the importance degree of the keyword includes:
and when the correlation between every segmented word corresponding to the sample product and the category to which the sample product belongs is lower than a preset threshold, taking a default value as the importance degree of the keyword corresponding to the sample product.
In a second aspect, a product identification method is provided in an embodiment of the present disclosure.
Specifically, the product identification method includes:
acquiring text description of a product to be identified;
extracting keywords of the text description;
determining the importance degree of the keyword;
inputting the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product identification model is trained using the method of the first aspect.
With reference to the second aspect, in a first implementation manner of the second aspect, the extracting keywords in the text description includes:
segmenting the text description;
and matching each segmented word against a keyword set to determine whether the segmented word is a keyword.
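The matching step at identification time can be sketched as a simple set lookup. This is an illustration only: the keyword set is assumed to have been collected beforehand (for example, while the model was trained), and the segmenter output shown is hypothetical.

```python
def extract_keywords(segments, keyword_set):
    """Keep the segmented words that appear in the keyword set, preserving their order."""
    return [seg for seg in segments if seg in keyword_set]

keyword_set = {"tomato", "egg", "noodle"}  # hypothetical keyword set
segments = ["tomato", "fried", "egg"]      # hypothetical output of a word segmenter
keywords = extract_keywords(segments, keyword_set)
```

Using a set keeps each membership check constant-time on average, which matters when the keyword set is large.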
In a third aspect, a data processing apparatus is provided in an embodiment of the present disclosure.
Specifically, the data processing apparatus includes:
a first obtaining module configured to obtain sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
a first extraction module configured to extract keywords in the text description;
a first determination module configured to determine a degree of importance of the keyword;
a training module configured to train a product recognition model using the feature data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
In a fourth aspect, a product identification device is provided in embodiments of the present disclosure.
Specifically, the product identification device includes:
a second obtaining module configured to obtain a text description of a product to be identified;
a second extraction module configured to extract keywords of the text description;
a second determination module configured to determine a degree of importance of the keyword;
an identification module configured to input the degree of importance of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product identification model is obtained by training with the apparatus of the third aspect.
The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the data processing apparatus and/or the product identification apparatus includes a memory and a processor, the memory is used for storing one or more computer instructions for supporting the data processing apparatus and/or the product identification apparatus to execute the data processing method and/or the product identification method, and the processor is configured to execute the computer instructions stored in the memory. The data processing apparatus and/or the product identifying apparatus may further comprise a communication interface for the data processing apparatus and/or the product identifying apparatus to communicate with other devices or a communication network.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of:
acquiring sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
extracting key words in the text description;
determining the importance degree of the keyword;
training a product identification model by using the characteristic data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
With reference to the fifth aspect, in a first implementation manner of the fifth aspect, the obtaining sample data includes:
acquiring text descriptions of a plurality of sample products in a preset category;
and performing de-duplication processing on the text descriptions of the sample products.
With reference to the fifth aspect and/or the first implementation manner of the fifth aspect, in a second implementation manner of the fifth aspect, the performing de-duplication processing on the text descriptions of the plurality of sample products includes:
and uniformly mapping a plurality of different text descriptions corresponding to the same sample product into the same text description.
With reference to the fifth aspect, the first implementation manner of the fifth aspect, and/or the second implementation manner of the fifth aspect, in a third implementation manner of the fifth aspect, the extracting keywords in the text description includes:
segmenting the text description;
and determining, as the keywords, the segmented words whose correlation with the category to which the sample product belongs is higher than a preset threshold.
With reference to the fifth aspect, the first implementation manner of the fifth aspect, the second implementation manner of the fifth aspect, and/or the third implementation manner of the fifth aspect, in a fourth implementation manner of the fifth aspect, the determining, as the keyword, a segmented word whose correlation with a category to which the sample product belongs is higher than a preset threshold includes:
determining the correlation between the segmented word and the category to which the sample product belongs by using a chi-square test of independence.
With reference to the fifth aspect, the first implementation manner of the fifth aspect, the second implementation manner of the fifth aspect, the third implementation manner of the fifth aspect, and/or the fourth implementation manner of the fifth aspect, in a fifth implementation manner of the fifth aspect, the determining the importance degree of the keyword includes:
and determining the TF-IDF value of the keyword as the importance degree of the keyword.
With reference to the fifth aspect, the first implementation manner of the fifth aspect, the second implementation manner of the fifth aspect, the third implementation manner of the fifth aspect, the fourth implementation manner of the fifth aspect, and/or the fifth implementation manner of the fifth aspect, in a sixth implementation manner of the fifth aspect, the determining the TF-IDF value of the keyword as the importance degree of the keyword includes:
determining a TF-IDF value of the keyword under the category to which the sample product belongs;
and when the keyword corresponds to a plurality of TF-IDF values under different categories, selecting the smallest TF-IDF value as the importance degree of the keyword.
With reference to the fifth aspect, the first implementation manner of the fifth aspect, the second implementation manner of the fifth aspect, the third implementation manner of the fifth aspect, the fourth implementation manner of the fifth aspect, the fifth implementation manner of the fifth aspect, and/or the sixth implementation manner of the fifth aspect, in a seventh implementation manner of the fifth aspect, the determining the importance degree of the keyword includes:
and when the correlation between every segmented word corresponding to the sample product and the category to which the sample product belongs is lower than a preset threshold, taking a default value as the importance degree of the keyword corresponding to the sample product.
In a sixth aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor; wherein the memory is to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of:
acquiring text description of a product to be identified;
extracting keywords of the text description;
determining the importance degree of the keyword;
inputting the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product identification model is obtained by training with the electronic device of the fifth aspect.
With reference to the sixth aspect, in a first implementation manner of the sixth aspect, the extracting keywords of the text description includes:
segmenting the text description;
and matching each segmented word against a keyword set to determine whether the segmented word is a keyword.
In a seventh aspect, the disclosed embodiments provide a computer-readable storage medium for storing computer instructions for a data processing apparatus and/or a product identification apparatus, which includes computer instructions for performing any of the methods described above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the data processing method, the text description of the sample product and the category to which the product belongs are obtained, the keywords of the text description are extracted, the importance degree of the extracted keywords under the category to which the sample product belongs is determined, and a product identification model is then trained on feature data including the importance degree of the keywords together with the category to which the product belongs. A product identification model trained in this way can learn, from the text description of a product, how strongly each keyword in the description influences identification of the product under its category. This can improve the accuracy of product category identification, and the model can distinguish even different products that have similar text descriptions.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Other features, objects, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of step S101 according to the embodiment shown in FIG. 1;
FIG. 3 shows a flowchart of step S102 according to the embodiment shown in FIG. 1;
FIG. 4 is a flow diagram illustrating a portion of determining importance of keywords in accordance with the embodiment shown in FIG. 1;
FIG. 5 illustrates a flow diagram of a method of product identification according to an embodiment of the present disclosure;
FIG. 6 shows a flowchart of step S502 according to the embodiment shown in FIG. 5;
FIG. 7 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of the first obtaining module 701 according to the embodiment shown in FIG. 7;
FIG. 9 illustrates a block diagram of the first extraction module 702 according to the embodiment illustrated in FIG. 7;
FIG. 10 is a block diagram illustrating a structure of a portion for determining importance of a keyword according to an embodiment of the present disclosure;
FIG. 11 illustrates a block diagram of a product recognition apparatus according to an embodiment of the present disclosure;
FIG. 12 is a block diagram illustrating a second extraction module 1102 according to the embodiment shown in FIG. 11;
FIG. 13 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the present disclosure, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 shows a flow diagram of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the data processing method includes the following steps:
in step S101, sample data is acquired; wherein the sample data comprises a textual description of a sample product and a category of the sample product;
in step S102, extracting keywords in the text description;
in step S103, determining the importance degree of the keyword under the category;
in step S104, training a product identification model using the feature data of the sample product and the category; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product under the category.
In this embodiment, the sample product may be a product currently offered on the online platform, such as dishes on a take-away ordering platform, or clothes, daily necessities, household goods, and the like on an e-commerce platform. The text description of the sample product includes, but is not limited to, descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, and quantity. For example, the text description of a dish on a take-away ordering platform may include the name of the dish, its ingredients, its preparation method, and so on.
In some embodiments, the text description of the sample product may also include a text description of the operator to which the sample product belongs. In general, the products offered by one operator are relatively similar, and some operators offer only one type of product. Therefore, when the product identification model is trained, the operator data is also used as input data, so that the model learns from the operator data the features that can influence the product category, which further improves its identification accuracy.
The data of the operator to which the sample product belongs may include, but is not limited to, the name of the operator, its main business scope, and the broad class to which the operator's products belong (such as a cuisine in the catering industry).
The category of the sample product can be determined from existing data of the online platform, or can be manually labeled. For example, sample data may be collected from the existing products of the online platform. Because the online platform generally has its own classification of products, the sample data required for training can be obtained by collecting the text descriptions of the products under each category.
For each piece of sample data obtained, one or more keywords can be extracted from the text description of the sample product, and the importance degree of each keyword is then determined. The importance degree indicates how much the keyword contributes to product identification: a keyword that plays an important role in product identification has a higher importance degree, and a keyword that does not has a lower one.
The importance degree of a keyword may be determined in advance by counting the number of times the keyword appears in the text descriptions of all sample products in the same category. For example, if a keyword appears many times in the text descriptions of all sample products in the same category, its importance degree may be considered high, and if it appears only a few times, its importance degree may be considered low.
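The counting approach just described can be sketched as follows. This is a rough illustration under stated assumptions: each description has already been segmented into a list of words, all descriptions belong to one category, and the function name and sample data are hypothetical.

```python
from collections import Counter

def importance_by_count(descriptions, keywords):
    """Count how many times each keyword occurs across one category's segmented descriptions."""
    counts = Counter()
    for segments in descriptions:
        counts.update(seg for seg in segments if seg in keywords)
    return {kw: counts[kw] for kw in keywords}

# Hypothetical segmented descriptions of sample products in a single category.
descriptions = [
    ["tomato", "egg"],
    ["egg", "soup"],
    ["egg", "fried", "rice"],
]
counts = importance_by_count(descriptions, {"tomato", "egg"})
```

Here "egg" is counted three times and "tomato" once, so "egg" would be treated as the more important keyword for this category.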
The product identification model may be an XGBoost model, a GBDT model, a neural network model, or the like. When the product identification model is trained, the importance degree can be converted into vector form, and the vectors corresponding to the keywords are combined to form the input data of the model. In each iteration, the feature data of one sample is used as the input of the product identification model; the output of the model is compared with the category to which the sample product belongs, and the model parameters are then updated so that the output moves closer to that category. As training proceeds over a large amount of sample data, the model parameters are continuously updated, and after training is finished, the product identification model can provide a relatively accurate output for given input data.
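One way to turn keyword importance degrees into model input is sketched below. The disclosure does not specify the vector layout, so this assumes a fixed keyword vocabulary in which each position holds the importance degree of one keyword and is 0.0 when the keyword is absent. The names `build_feature_vector`, `vocabulary`, and the sample values are hypothetical.

```python
def build_feature_vector(sample_keywords, importance, vocabulary):
    """Place each keyword's importance at its vocabulary index; other positions stay 0.0."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for kw in sample_keywords:
        if kw in index:
            vector[index[kw]] = importance.get(kw, 0.0)
    return vector

vocabulary = ["egg", "noodle", "tomato"]                    # assumed fixed keyword order
importance = {"egg": 0.7, "noodle": 0.5, "tomato": 0.2}     # hypothetical importance degrees
samples = [
    (["tomato", "egg"], "egg dishes"),
    (["tomato", "noodle"], "noodles"),
]
X = [build_feature_vector(kws, importance, vocabulary) for kws, _ in samples]
y = [label for _, label in samples]
```

The resulting `X` and `y` have the row-per-sample shape that gradient-boosted tree and neural network classifiers commonly accept as training input.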
According to the data processing method, the text description of the sample product and the category to which the product belongs are obtained, the keywords of the text description are extracted, the importance degree of the extracted keywords is determined, and a product identification model is then trained on feature data including the importance degree of the keywords together with the category to which the product belongs. A product identification model trained in this way can learn, from the text description of a product, how strongly each keyword in the description influences identification of the product under its category. This can improve the accuracy of product identification, and the model can distinguish even different products that have similar text descriptions.
In an optional implementation manner of this embodiment, as shown in FIG. 2, step S101, namely the step of obtaining sample data, further includes the following steps:
in step S201, obtaining text descriptions of a plurality of sample products in a preset category;
in step S202, performing de-duplication processing on the text descriptions of the plurality of sample products.
In this optional implementation manner, when sample data is collected, the text descriptions of a plurality of sample products may be obtained from a plurality of preset categories for which the online platform already has classification data. The text description may include, but is not limited to, descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, and quantity. To avoid collecting duplicate sample products, the text descriptions may be de-duplicated.
In an optional implementation manner of this embodiment, the step S202, that is, the step of performing deduplication processing on the text descriptions of the plurality of sample products, further includes the following steps:
and uniformly mapping a plurality of different text descriptions corresponding to the same sample product into the same text description.
In this optional implementation manner, a plurality of sample products under the same preset category may in fact be the same product even though their text descriptions differ. For example, on a take-away ordering platform, different merchants may upload menus that list the same dish, tomato fried eggs, under differently worded names. Because these entries are substantially the same product under different names, they can be uniformly mapped to the same product name. It is understood that other content in the text description may be uniformly mapped in the same way.
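The uniform mapping can be illustrated with a lookup table from variant names to one canonical name. How such a table would be built is not described in the disclosure, so the table, the dish names, and the function name below are hypothetical.

```python
def deduplicate(descriptions, alias_to_canonical):
    """Map each description to its canonical form, then drop repeats while keeping order."""
    seen = set()
    unique = []
    for text in descriptions:
        canonical = alias_to_canonical.get(text, text)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique

# Hypothetical alias table: variant name -> canonical name.
alias_to_canonical = {"tomato fried eggs": "scrambled eggs with tomato"}
names = ["tomato fried eggs", "scrambled eggs with tomato", "beef noodles"]
unique_names = deduplicate(names, alias_to_canonical)
```

After mapping, the two differently worded entries collapse into a single sample, so the same product is not collected twice.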
In an optional implementation manner of this embodiment, as shown in FIG. 3, step S102, namely the step of extracting the keywords in the text description, further includes the following steps:
in step S301, performing word segmentation on the text description;
in step S302, a segmented word having a correlation higher than a preset threshold with the category to which the sample product belongs is determined as the keyword.
In this optional implementation manner, when extracting the keywords in the text description, the text description is first segmented into words, and the keywords are then determined according to the relevance of each segmented word to the category to which the sample product belongs. For example, on a takeaway ordering platform, a segmented word such as "pan" obtained from the description of "stir-fried eggs with tomatoes" is not important for identifying the category of the dish; that is, the relevance of the word "pan" to the identification of the dish is low, so the word may be removed rather than used as a keyword. The preset threshold may be set according to actual conditions and is not limited herein. Extracting keywords from the word segmentation results of the text description eliminates segmented words with low relevance, which avoids the low training efficiency that an excessively large feature dimension would otherwise cause when the product identification model is subsequently trained.
In an optional implementation manner of this embodiment, the step S302 of determining, as the keyword, a segmented word having a correlation with the category to which the sample product belongs higher than a preset threshold, further includes the following steps:
and determining the relevance of the segmented word to the category to which the sample product belongs by using a chi-square test of independence.
In this alternative implementation, the chi-square test of independence can determine the association and dependency between two categorical variables. Therefore, in the embodiment of the present disclosure, text descriptions of sample products in different preset categories are collected, and after the text descriptions are segmented, for each preset category, the relevance between each segmented word obtained from the collected text descriptions and that preset category may be determined by the chi-square test of independence, and a segmented word whose relevance is higher than a preset threshold is determined as a keyword.
After a large amount of sample data is collected, the keywords under the different preset categories are extracted from the text descriptions of the sample products by the chi-square test of independence to form a keyword set. The chi-square test of independence is prior art and is not described herein.
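As a rough illustration of how a chi-square score could be used to filter segmented words, the sketch below applies the standard statistic for a 2x2 contingency table (word present or absent versus description inside or outside the category). The function names, the counts in `WORD_COUNTS`, and the threshold of 3.84 (the 5% critical value for one degree of freedom) are assumptions for the example; the disclosure leaves the threshold open.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: descriptions in the category that contain the word
    b: descriptions outside the category that contain the word
    c: descriptions in the category that lack the word
    d: descriptions outside the category that lack the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column: no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


def select_keywords(word_counts, threshold):
    """Keep words whose chi-square score exceeds the preset threshold."""
    return [w for w, counts in word_counts.items() if chi_square(*counts) > threshold]


# Hypothetical counts (a, b, c, d) for two segmented words in one category.
WORD_COUNTS = {"tomato": (40, 5, 60, 95), "of": (50, 50, 50, 50)}
print(select_keywords(WORD_COUNTS, threshold=3.84))
```

Note that the statistic measures the strength of association only; a word that is strongly under-represented in a category also scores high, so an implementation may want to check the direction of the association separately.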
In an optional implementation manner of this embodiment, the step S103, namely, the step of determining the importance degree of the keyword, further includes the following steps:
and determining the TF-IDF value of the keyword as the importance degree of the keyword.
In this alternative implementation, TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, where TF denotes the term frequency and IDF denotes the inverse document frequency. The TF-IDF value of a keyword can be understood as follows: if the frequency TF with which the current keyword appears in the text descriptions of all sample products in the current preset category is high, and the keyword rarely appears in the text descriptions of sample products in other preset categories, the keyword can be considered to have good category-distinguishing capability and to be suitable for classification. The keyword can therefore be considered important relative to the current preset category, and its TF-IDF value can be used to measure its importance.
The TF-IDF values of the keywords can be obtained in advance by collecting the text descriptions of all sample products in the preset categories on the online platform, extracting the keywords from those text descriptions, and determining the TF-IDF value of each keyword. As described above, each keyword in the keyword set formed from the sample data may also be associated with its TF-IDF value, so that during online identification the keyword set and the associated TF-IDF values can be used directly to obtain the keywords and TF-IDF values corresponding to the product to be identified.
In this embodiment, the TF value of a keyword may be obtained by dividing the number of times that the keyword appears in the text descriptions of all sample products in the preset category by the number of those text descriptions (i.e., the number of sample products in the preset category). The IDF of the keyword may be determined from the number of preset categories in which the keyword appears and the total number of preset categories, and the calculation formula is IDF = log(n/m), where n is the total number of preset categories and m is the number of preset categories in which the keyword appears. For example, if keyword A appears in the text descriptions of sample products in preset category 1, preset category 2, and preset category 3, and there are 5 preset categories in total, then the keyword appears in three preset categories and its IDF is log(5/3).
The TF-IDF value of a keyword is the product of the TF value and the IDF value of the keyword.
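The category-level calculation described above can be sketched as follows. The function name `tf_idf`, the toy data in `DOCS`, and the use of the natural logarithm are assumptions for illustration; the disclosure does not state the base of the logarithm.

```python
import math


def tf_idf(keyword, category, docs_by_category):
    """TF-IDF of a keyword for one category.

    docs_by_category maps a category name to a list of segmented
    descriptions (each description is a list of words).
    """
    docs = docs_by_category[category]
    if not docs:
        return 0.0  # no descriptions in this category
    # TF: occurrences in the category divided by the number of descriptions.
    occurrences = sum(doc.count(keyword) for doc in docs)
    tf = occurrences / len(docs)

    # IDF: log(n / m), n = all categories, m = categories containing the keyword.
    n = len(docs_by_category)
    m = sum(
        1
        for cat_docs in docs_by_category.values()
        if any(keyword in doc for doc in cat_docs)
    )
    if m == 0:
        return 0.0  # keyword never appears anywhere
    idf = math.log(n / m)  # natural logarithm is an assumption
    return tf * idf


# Hypothetical segmented descriptions for three categories.
DOCS = {
    "home style": [["tomato", "fried egg"], ["braised", "eggplant"]],
    "wine": [["beer"], ["rice wine"]],
    "roast meat": [["roast", "chicken"], ["tomato", "roast"]],
}
print(tf_idf("beer", "wine", DOCS))
print(tf_idf("tomato", "home style", DOCS))
```

In this toy data "beer" appears in only one of three categories and so receives a larger IDF than "tomato", which appears in two.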
For example, the TF-IDF values of the keywords under each preset category in the sample data collected on one takeaway ordering platform are as follows:
[ "fried egg", "braised meat", "home style", "cold and dressed with sauce", "braised eggplant", "palace chicken", "diced", "sugar and vinegar", "shredded meat", "jelly", "potato", "mugwort", "bean curd", "kidney", "tomato", "inner ridge", "spelling", "dish", "small" ] home style vegetable [0.34, 0.33, 0.28, 0.26, 0.22, 0.19, 0.17, 0.13, 0.12, 0.1, 0.07, 0.08, 0.05, 0.07, 0.06, 0.07, 0.05, 0.05, 0.06]
[ "beer", "rice wine", "Yanjing beer", "snowflake", "wheat", "Harbin", "tin", "Skyo", "protoplasm", "Islands", "wine egg", "courage", "bizard", "Belgium", "refreshing", "king", "white", "sweet osmanthus", "Xiao", "liqueur" ] wine [1.68, 0.42, 0.37, 0.37, 0.22, 0.19, 0.18, 0.17, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.12, 0.1, 0.1, 0.11, 0.07]
[ "roast meat", "crispy skin", "roast", "roasted", "crusty", "sausage", "roast string", "barbeque", "Orleans", "brazilian", "tendon", "Cumin", "Salix", "honeydew", "salad", "New Orleans", "Chicken", "tendon", "croissant", "baked bread", "roasted", "Turkey" ] roast meat [1.57, 0.21, 0.19, 0.18, 0.15, 0.14, 0.11, 0.12, 0.1, 0.1, 0.1, 0.08, 0.08, 0.09, 0.07, 0.08, 0.06, 0.05]
[ "braised in brown sauce", "chicken", "rice", "extreme hot", "tofu skin", "golden mushroom", "Wu' o", "abalone", "of", "chicken small", "potato block", "not", "earthenware", "dry pot", "big hot", "give tofu skin" ] yellow braised chicken rice [1.81, 1.26, 0.4, 0.14, 0.13, 0.12, 0.1, 0.09, 0.1, 0.08, 0.07, 0.07, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
[ "teppanyaki", "kebab", "whisker", "pungency", "big chicken chop", "select", "squid", "old dry mother", "pancake", "iron plate", "fish", "eggplant", "juice", "tofu", "sesame cake", "egg" ] teppanyaki [2.94, 0.43, 0.43, 0.41, 0.4, 0.37, 0.33, 0.32, 0.32, 0.29, 0.28, 0.27, 0.25, 0.25, 0.23, 0.21]
[ "meatball", "beef meatball", "pellet", "meat meatball", "cabbage", "stew", "soup", "meatball", "urination", "white gourd", "casserole", "cooking", "octopus", "croquette", "delicacy", "handmade", "four happiness", "vegetarian", "vermicelli", "meatball" ] meatballs [1.14, 0.35, 0.19, 0.19, 0.18, 0.16, 0.14, 0.12, 0.11, 0.1, 0.11, 0.1, 0.09, 0.07, 0.07, 0.06, 0.07, 0.06, 0.05]
[ "dessert", "foreign exchange", "double poetry", "macarons", "gift box", "afternoon tea", "Rui +", "Rubi", "Brown", "Nib", "butter", "curl" ] dessert [2.82, 0.66, 0.66, 0.62, 0.6, 0.6, 0.58, 0.55, 0.54, 0.53, 0.5, 0.42]
In each entry above, the first part is the list of keywords extracted for the dish category, such as "fried egg" and "braised meat"; the middle part is the name of the dish category, such as "home style vegetable"; and the last part is the list of TF-IDF values corresponding to these keywords, such as 0.34 and 0.33.
For example, for the dish "Kung Pao chicken rice bowl", the keywords and feature data shown in the following table can be extracted from the relevant text description:
Dish name: Kung Pao chicken rice bowl; restaurant: millions; cuisine: Beijing cuisine; main business: snack food
Keyword | Kung Pao | Diced chicken | ... | Snack food | ...
TF-IDF value | 0.3 | 0.5 | ... | 1 | ...
The first row of the table lists the keywords, and the second row lists the feature data corresponding to the dish "Kung Pao chicken rice bowl", namely the TF-IDF value corresponding to each keyword.
In an optional implementation manner of this embodiment, as shown in fig. 4, the step of determining the TF-IDF value of the keyword as the importance degree of the keyword further includes the following steps:
in step S401, determining the TF-IDF value of the keyword under the category to which the sample product belongs;
in step S402, when the keyword corresponds to a plurality of TF-IDF values under different categories, selecting the smallest TF-IDF value as the importance degree of the keyword.
In this alternative implementation, if the same keyword occurs in multiple preset categories, a TF-IDF value of the keyword can be calculated for each of those categories. So that each keyword has a single importance degree, the minimum of these TF-IDF values can be selected as the importance degree of the keyword.
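Selecting the minimum value across categories can be sketched as below. The function name `keyword_importance` and the values in `TFIDF_BY_CATEGORY` are illustrative assumptions, not data from the disclosure.

```python
def keyword_importance(tfidf_by_category):
    """Collapse per-category TF-IDF values into one importance per keyword.

    tfidf_by_category maps category -> {keyword: TF-IDF value}.
    """
    importance = {}
    for scores in tfidf_by_category.values():
        for keyword, value in scores.items():
            # Keep the smallest value seen for this keyword so far.
            importance[keyword] = min(value, importance.get(keyword, value))
    return importance


# Hypothetical per-category values; "tofu" occurs in two categories.
TFIDF_BY_CATEGORY = {
    "home style": {"tomato": 0.06, "tofu": 0.05},
    "teppanyaki": {"tofu": 0.25, "squid": 0.33},
}
print(keyword_importance(TFIDF_BY_CATEGORY))
```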
In an optional implementation manner of this embodiment, the step S103, namely, the step of determining the importance degree of the keyword, further includes the following steps:
and when the relevance of all the participles corresponding to the sample product and the category to which the sample product belongs is lower than a preset threshold value, taking a default value as the importance degree of the keyword corresponding to the sample product.
In this optional implementation manner, when extracting the keywords in the text description, segmented words that have low relevance to the category to which the sample product belongs are removed. If the relevance of every segmented word in the text description of a sample product to the category to which the sample product belongs is lower than the preset threshold, the importance degree of the keyword in the feature data of that sample product may be set to a default value. The default value may be set according to actual conditions and is not limited herein.
Fig. 5 illustrates a flow diagram of a product identification method according to an embodiment of the present disclosure. As shown in fig. 5, the product identification method includes the steps of:
in step S501, a text description of a product to be identified is acquired;
in step S502, extracting keywords of the text description;
in step S503, the importance degree of the keyword is determined;
in step S504, the importance degree of the keyword is input into a pre-trained product identification model to identify the product to be identified, wherein the product identification model is trained by using the data processing method described above.
In this embodiment, the product to be identified may be a product currently related to the online platform, such as dishes on a take-out ordering platform, clothing on an e-commerce platform, living goods, household goods, and the like. The textual description of the product to be identified includes, but is not limited to, textual descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, quantity, and the like. For example, the textual description of the dish on the takeaway ordering platform may include the name of the dish, the food material of the dish, the method of doing the dish, and so on.
In some embodiments, the text description of the product to be identified may also include a text description of the operator to which the product to be identified belongs. In general, the product types handled by one operator are relatively similar, and some operators handle products of only one type. Therefore, when the product identification model is trained, the operator data is also used as input data, so that the product identification model learns from the operator data the features that can influence the product type, which further improves the identification accuracy of the product identification model.
The data of the operator to which the product to be identified belongs may include, but is not limited to, the name of the operator, the main operation range, the large range to which the product operated by the operator belongs (such as a cuisine in the catering industry), and the like.
For a product to be identified, one or more keywords can be extracted from its text description, and the importance degree of the one or more keywords is then determined. The importance degree indicates how much the keyword contributes to product identification: a keyword that plays an important role in product identification has a higher importance degree, and a keyword that does not has a lower importance degree.
The product identification model is obtained by training the data processing method, so the specific details of the product identification model can be referred to the above description of the data processing method, and are not described herein again.
In an optional implementation manner of this embodiment, as shown in fig. 6, the step S502, namely the step of extracting the keywords in the text description, further includes the following steps:
in step S601, performing word segmentation on the text description;
in step S602, the segmented words are matched against a keyword set to determine whether each segmented word is a keyword.
In this optional implementation manner, as described in the above data processing method, during training the keywords in the text descriptions of all collected sample products are extracted to form a keyword set corresponding to the sample products, and the importance degrees of these keywords are determined in subsequent steps. Therefore, after the training of the product identification model is completed, the keyword set can be retained. The segmented words obtained from the text description of the product to be identified are matched against the keyword set; if a segmented word matches, it can be determined as a keyword corresponding to the product to be identified, and its importance degree can be read directly from the stored values.
For determining the keywords and determining the importance degree, reference may be made to the above description of the data processing method, which is not described herein again.
It should be noted that, when no segmented word in the text description of the product to be identified matches the keyword set, the importance degree of the keyword corresponding to the product to be identified may be set to a default value.
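The matching step and the default-value fallback can be sketched as follows. This is one possible reading, not the method of the disclosure: it assumes the retained keyword set is a mapping from keyword to importance, that the feature data is a fixed-length vector with one position per stored keyword, and that positions without a match take the default value. The names `build_feature_vector` and `KEYWORD_IMPORTANCE` and the sample values are hypothetical.

```python
def build_feature_vector(words, keyword_importance, default=0.0):
    """Match segmented words against the stored keyword set.

    keyword_importance maps each stored keyword to its importance; its
    insertion order fixes the position of each feature. Positions whose
    keyword does not occur in `words` receive the default value, so a
    description with no match yields an all-default vector.
    """
    present = set(words)
    return [
        importance if keyword in present else default
        for keyword, importance in keyword_importance.items()
    ]


# Hypothetical keyword set retained after training.
KEYWORD_IMPORTANCE = {"diced": 0.17, "tomato": 0.06, "beer": 1.68}

print(build_feature_vector(["diced", "chicken"], KEYWORD_IMPORTANCE))
print(build_feature_vector(["pan"], KEYWORD_IMPORTANCE))
```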
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 7 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 7, the data processing apparatus includes:
a first obtaining module 701 configured to obtain sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
a first extraction module 702 configured to extract keywords in the text description;
a first determining module 703 configured to determine the importance degree of the keyword;
a training module 704 configured to train a product recognition model using the feature data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
In this embodiment, the sample product may be a product currently related to the online platform, such as dishes on a take-away ordering platform, clothes on an e-commerce platform, living goods, household goods, and the like. The textual description of the sample product includes, but is not limited to, textual descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, quantity, and the like. For example, the textual description of the dish on the takeaway ordering platform may include the name of the dish, the food material of the dish, the method of doing the dish, and so on.
In some embodiments, the text description of the sample product may also include a text description of the operator to which the sample product belongs. In general, the product types handled by one operator are relatively similar, and some operators handle products of only one type. Therefore, when the product identification model is trained, the operator data is also used as input data, so that the product identification model learns from the operator data the features that can influence the product type, which further improves the identification accuracy of the product identification model.
The data of the operator to which the sample product belongs may include, but is not limited to, the name of the operator, the main operating scope, the large scope to which the product operated by the operator belongs (such as a cuisine in the catering industry), and the like.
The type of the sample product can be determined according to the existing data of the online platform, and can also be manually marked. For example, sample data may be collected from existing products of the online platform, and the online platform generally has its own classification of the products, so that sample data required for the training may be obtained by collecting text descriptions related to the products under each category.
For each item of sample data obtained, one or more keywords can be extracted from the text description of the sample product, and the importance degree of the one or more keywords is then determined. The importance degree indicates how much the keyword contributes to product identification: a keyword that plays an important role in product identification has a higher importance degree, and a keyword that does not has a lower importance degree.
The importance degree of the keyword may be determined in advance by counting the number of times that the keyword appears in the text descriptions of all sample products in the same category, for example, if the number of times that a certain keyword appears in the text descriptions of all sample products in the same category is large, the importance degree of the keyword may be considered to be high, and if the number of times that the keyword appears is small, the importance degree of the keyword may be considered to be low.
The product identification model may employ an xgboost model, a GBDT model, a neural network model, or the like. When the product identification model is trained, the importance degree can be converted into a vector form, and a plurality of vectors corresponding to the keywords are combined to form input data of the model. In each iteration cycle process, the characteristic data in one sample data is used as the input of the product identification model, after the output result of the product identification model is obtained, the output result can be compared with the class to which the sample product in the sample data belongs, and then the model parameters of the product identification model are updated, so that the output result of the product identification model is closer to the class to which the sample product belongs. After training of a large amount of sample data, model parameters of the product identification model are continuously updated, and after training is finished, the product identification model can provide a relatively accurate output result aiming at input data.
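The iterative update described above (feed the feature data, compare the output with the labeled category, adjust the parameters) can be illustrated with a deliberately simple stand-in. The sketch below uses a multi-class perceptron, which is not one of the models named in the disclosure (xgboost, GBDT, neural network); it is substituted here only because it shows the same compare-and-update loop in a few lines. The function name, feature vectors, and category names are assumptions for the example.

```python
def train_perceptron(samples, categories, n_features, epochs=10):
    """Train one weight vector per category with the perceptron update rule.

    samples is a list of (feature_vector, category) pairs. Returns a
    function that predicts the category of a feature vector.
    """
    weights = {c: [0.0] * n_features for c in categories}

    def predict(x):
        # Score every category; ties go to the category listed first.
        return max(categories, key=lambda c: sum(w * v for w, v in zip(weights[c], x)))

    for _ in range(epochs):
        for x, label in samples:
            guess = predict(x)
            if guess != label:
                # Move the true category toward x and the wrong guess away.
                for i, v in enumerate(x):
                    weights[label][i] += v
                    weights[guess][i] -= v
    return predict


# Hypothetical feature vectors (keyword importance values) with labels.
CATEGORIES = ["home style", "wine"]
SAMPLES = [([0.34, 0.05], "home style"), ([0.05, 1.68], "wine")]

predict = train_perceptron(SAMPLES, CATEGORIES, n_features=2)
print(predict([0.34, 0.05]), predict([0.05, 1.68]))
```

A gradient-boosted or neural model would replace the update rule but keep the same overall shape: feature vectors built from keyword importance go in, and the predicted category is compared with the labeled one.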
In the data processing device of the embodiment of the disclosure, the text description of the sample product and the category to which the product belongs are obtained, the keywords of the text description are extracted, the importance degree of the extracted keywords is determined, and then the product identification model is trained according to the feature data including the importance degree of the keywords and the category to which the product belongs. The product identification model trained in the mode can learn the influence degree of the keywords in the text description on the product identification under the product category from the text description of the product, the accuracy of the product identification can be improved, and even different products with similar text descriptions can be identified by the product identification model.
In an optional implementation manner of this embodiment, as shown in fig. 8, the first obtaining module 701 includes:
a first obtaining sub-module 801 configured to obtain text descriptions of a plurality of sample products in a preset category;
a deduplication sub-module 802 configured to perform deduplication processing on textual descriptions of a plurality of the sample products.
In this optional implementation manner, when collecting sample data, the text descriptions of a plurality of sample products may be obtained from a plurality of preset categories for which the online platform already has classification data. The text description may include, but is not limited to, descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, and quantity. To avoid collecting duplicate sample products, the text descriptions may be deduplicated.
In an optional implementation manner of this embodiment, the duplication elimination sub-module 802 includes:
a mapping sub-module configured to uniformly map a plurality of different text descriptions corresponding to the same sample product into the same text description.
In this alternative implementation, there may be a plurality of sample products under the same preset category that are in fact the same product even though their text descriptions differ. For example, on a takeaway ordering platform, some merchants upload menus that list "tomato fried eggs" under one name, while other merchants list the same dish under a different, synonymous name. The two entries are substantially the same product with different names, so they can be uniformly mapped to the same product name. Of course, it is understood that other content in the text description may be mapped uniformly in the same way.
In an optional implementation manner of this embodiment, as shown in fig. 9, the first extracting module 702 includes:
a first word segmentation sub-module 901 configured to segment the text description;
a first determining sub-module 902 configured to determine, as the keyword, a segmented word having a correlation higher than a preset threshold with respect to a category to which the sample product belongs.
In this optional implementation manner, when extracting the keywords in the text description, the text description is first segmented into words, and the keywords are then determined according to the relevance of each segmented word to the category to which the sample product belongs. For example, on a takeaway ordering platform, a segmented word such as "pan" obtained from the description of "stir-fried eggs with tomatoes" is not important for identifying the category of the dish; that is, the relevance of the word "pan" to the identification of the dish is low, so the word may be removed rather than used as a keyword. The preset threshold may be set according to actual conditions and is not limited herein. Extracting keywords from the word segmentation results of the text description eliminates segmented words with low relevance, which avoids the low training efficiency that an excessively large feature dimension would otherwise cause when the product identification model is subsequently trained.
In an optional implementation manner of this embodiment, the first determining sub-module 902 includes:
a second determining sub-module configured to determine the relevance of the segmented word to the category to which the sample product belongs by using a chi-square test of independence.
In this alternative implementation, the chi-square test of independence can determine the association and dependency between two categorical variables. Therefore, in the embodiment of the present disclosure, text descriptions of sample products in different preset categories are collected, and after the text descriptions are segmented, for each preset category, the relevance between each segmented word obtained from the collected text descriptions and that preset category may be determined by the chi-square test of independence, and a segmented word whose relevance is higher than a preset threshold is determined as a keyword.
After a large amount of sample data is collected, the keywords under the different preset categories are extracted from the text descriptions of the sample products by the chi-square test of independence to form a keyword set. The chi-square test of independence is prior art and is not described herein.
In an optional implementation manner of this embodiment, the first determining module 703 includes:
a third determining sub-module configured to determine the TF-IDF value of the keyword as the importance degree of the keyword.
In this alternative implementation, TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining, where TF denotes the term frequency and IDF denotes the inverse document frequency. The TF-IDF value of a keyword can be understood as follows: if the frequency TF with which the current keyword appears in the text descriptions of all sample products in the current preset category is high, and the keyword rarely appears in the text descriptions of sample products in other preset categories, the keyword can be considered to have good category-distinguishing capability and to be suitable for classification. The keyword can therefore be considered important relative to the current preset category, and its TF-IDF value can be used to measure its importance.
The TF-IDF values of the keywords can be obtained in advance by collecting the text descriptions of all sample products in the preset categories on the online platform, extracting the keywords from those text descriptions, and determining the TF-IDF value of each keyword. As described above, each keyword in the keyword set formed from the sample data may also be associated with its TF-IDF value, so that during online identification the keyword set and the associated TF-IDF values can be used directly to obtain the keywords and TF-IDF values corresponding to the product to be identified.
In this embodiment, the TF value of a keyword may be obtained by dividing the number of times that the keyword appears in the text descriptions of all sample products in the preset category by the number of those text descriptions (i.e., the number of sample products in the preset category). The IDF of the keyword may be determined from the number of preset categories in which the keyword appears and the total number of preset categories, and the calculation formula is IDF = log(n/m), where n is the total number of preset categories and m is the number of preset categories in which the keyword appears. For example, if keyword A appears in the text descriptions of sample products in preset category 1, preset category 2, and preset category 3, and there are 5 preset categories in total, then the keyword appears in three preset categories and its IDF is log(5/3).
The TF-IDF value of a keyword is the product of the TF value and the IDF value of the keyword.
For example, the TF-IDF values of the keywords under each preset category in the sample data collected on one takeaway ordering platform are as follows:
[ "fried egg", "braised in soy sauce", "home", "cold mix", "braised eggplant", "palace chicken", "diced", "sugar and vinegar", "shredded fish", "agaric", "potato", "mugwort", "tofu", "kidney", "tomato", "inner ridge", "jigsaw", "dish", "small" ] home vegetables [0.34, 0.33, 0.28, 0.26, 0.22, 0.19, 0.17, 0.13, 0.12, 0.1, 0.07, 0.08, 0.05, 0.07, 0.06, 0.07, 0.05, 0.05, 0.06]
[ "beer", "rice wine", "Yanjing beer", "snowflake", "wheat", "Harbin", "can", "Skyo", "Naja", "protoplasm", "Qingdao", "egg wine", "courage", "Belgium", "refreshing", "king", "white", "sweet osmanthus", "Xiao", "liqueur" ] wine [1.68, 0.42, 0.37, 0.37, 0.22, 0.19, 0.18, 0.17, 0.14, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.12, 0.1, 0.1, 0.11, 0.07]
[ "roast meat", "crispy skin", "roast", "roasted", "crusty", "sausage", "kebab", "Orleans", "muscle", "tendon", "cumin", "Salix", "honeydew", "salad", "New Orleans", "chicken", "tendon", "croissant", "baked bread", "roasted", "Turkey" ] roast meat [1.57, 0.21, 0.19, 0.18, 0.15, 0.14, 0.11, 0.12, 0.1, 0.1, 0.1, 0.08, 0.08, 0.09, 0.07, 0.08, 0.06, 0.08, 0.06, 0.05]
[ "braised", "chicken", "rice", "extremely spicy", "tofu skin", "golden mushroom", "Wu Ji", "abalone", "of", "chicken meat", "potato piece", "not yet", "casserole", "thousand pot", "big spicy", "giving skin of beans" ] braised chicken rice [1.81, 1.26, 0.4, 0.14, 0.13, 0.12, 0.1, 0.09, 0.1, 0.08, 0.07, 0.07, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
[ "iron plate roasting", "skewered meat", "whisker", "pungency", "big chicken chop", "select", "squid", "old dry mother", "pancake", "iron plate", "fish", "eggplant", "juice", "tofu", "sesame cake", "egg" ] teppanyaki [2.94, 0.43, 0.43, 0.41, 0.4, 0.37, 0.33, 0.32, 0.32, 0.29, 0.28, 0.27, 0.25, 0.25, 0.23, 0.21]
[ "meatball", "beef meatball", "pellet", "stew", "soup", "meatball", "urination", "white gourd", "casserole", "cooking", "octopus", "croquette", "delicacy", "handmade", "four happiness", "vegetarian", "vermicelli", "meatball" ] meatballs [1.14, 0.35, 0.19, 0.19, 0.18, 0.16, 0.14, 0.12, 0.11, 0.1, 0.11, 0.1, 0.09, 0.07, 0.07, 0.07, 0.06, 0.07, 0.06, 0.05]
[ "dessert", "outman", "double poetry", "macarons", "gift box", "afternoon tea", "Rui + -", "Rubi", "Brown", "Nib", "butter", "curl" ] dessert [2.82, 0.66, 0.66, 0.62, 0.6, 0.6, 0.58, 0.55, 0.54, 0.53, 0.5, 0.42]
In each entry above, the first part is the list of keywords extracted for the dish category, such as "fried egg" and "braised in soy sauce"; the middle part is the name of the dish category, such as "home vegetables"; and the last part is the list of TF-IDF values corresponding to these keywords, such as 0.34 and 0.33.
For example, for the dish "Kung Pao chicken rice bowl", the keywords and feature data shown in the following table can be extracted from the relevant text description:
Dish name: Kung Pao chicken rice bowl; restaurant: millions; cuisine: Beijing cuisine; main business: snack food
Keyword | Kung Pao | Diced chicken | ... | Snack food | ...
TF-IDF value | 0.3 | 0.5 | ... | 1 | ...
The first row of the table lists the keywords, and the second row lists the feature data corresponding to the dish "Kung Pao chicken rice bowl", namely the TF-IDF value corresponding to each keyword.
In an optional implementation manner of this embodiment, as shown in fig. 10, the third determining sub-module includes:
a fourth determination submodule 1001 configured to determine a TF-IDF value of the keyword under the category to which the sample product belongs;
a selecting sub-module 1002 configured to select the smallest TF-IDF value as the importance degree of the keyword when the keyword corresponds to a plurality of TF-IDF values under different categories.
In this alternative implementation, if the same keyword occurs in multiple predetermined categories, a TF-IDF value of the keyword can be calculated for each of those categories. So that each keyword has a single, uniform importance degree, the minimum of these TF-IDF values can be selected as the importance degree of the keyword.
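The selection rule described here can be sketched as follows. The nested-dictionary input format and the function name `keyword_importance` are assumptions of this sketch, and the numbers are illustrative rather than taken from the listings above.

```python
def keyword_importance(per_category_tfidf):
    """Collapse per-category TF-IDF values into one importance per keyword.

    When a keyword has values under several categories, the smallest
    value is kept, as described in the text.
    """
    importance = {}
    for weights in per_category_tfidf.values():
        for word, value in weights.items():
            if word not in importance or value < importance[word]:
                importance[word] = value
    return importance


# Illustrative values: "casserole" occurs under two categories.
per_category = {
    "braised chicken rice": {"casserole": 0.05, "chicken": 1.26},
    "meatballs": {"casserole": 0.11, "meatball": 1.14},
}
importance = keyword_importance(per_category)
```

Here "casserole" ends up with 0.05, the smaller of its two values, while keywords that occur under a single category keep that category's value.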
In an optional implementation manner of this embodiment, the first determining module 703 includes:
a fifth determining sub-module configured to take a default value as the importance degree of the keyword corresponding to the sample product when the relevance between every segmented word of the sample product and the category to which the sample product belongs is lower than a preset threshold value.
In this optional implementation manner, when the keywords in the text description are extracted, segmented words with low relevance to the category to which the sample product belongs are removed. If every segmented word in the text description of a sample product has a relevance to that category below the preset threshold, no keyword remains, and the importance degree of the keyword in the feature data of the sample product may be set to a default value. The default value may be set according to practical situations, and is not limited herein.
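The fallback can be sketched as below. This is one possible reading, in which the feature data is a vector over a keyword vocabulary and the whole vector is filled with the default value when no segmented word passes the threshold; the function name, the vector layout, and the default of 0.01 are assumptions of this sketch.

```python
def sample_feature_vector(words, relevance, importance, vocabulary,
                          threshold, default=0.0):
    """Build feature data for one sample product.

    words      -- segmented words of the product's text description
    relevance  -- relevance of each word to the product's category
    importance -- importance degree (e.g. TF-IDF value) of each keyword
    """
    # Keep only words whose relevance is higher than the threshold.
    keywords = {w for w in words if relevance.get(w, 0.0) > threshold}

    if not keywords:
        # Every word is below the threshold: fall back to the default value.
        return [default] * len(vocabulary)

    return [importance.get(w, default) if w in keywords else 0.0
            for w in vocabulary]


vocabulary = ["chicken", "rice"]                     # hypothetical
relevance = {"chicken": 5.0, "rice": 0.2, "of": 0.1}  # hypothetical scores
importance = {"chicken": 1.26, "rice": 0.4}

vec_a = sample_feature_vector(["chicken", "of"], relevance, importance,
                              vocabulary, threshold=1.0, default=0.01)
vec_b = sample_feature_vector(["of", "rice"], relevance, importance,
                              vocabulary, threshold=1.0, default=0.01)
```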
Fig. 11 shows a block diagram of a product identification device according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 11, the product recognition apparatus includes:
a second obtaining module 1101 configured to obtain a text description of the product to be identified;
a second extraction module 1102 configured to extract keywords of the text description;
a second determining module 1103 configured to determine the importance degree of the keyword;
a recognition module 1104 configured to input the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product identification model is trained by the data processing apparatus.
In this embodiment, the product to be identified may be a product currently related to the online platform, such as dishes on a take-out ordering platform, clothing on an e-commerce platform, living goods, household goods, and the like. The textual description of the product to be identified includes, but is not limited to, textual descriptions of attributes such as product name, product material, manufacturing process, efficacy, size, quantity, and the like. For example, the textual description of the dish on the takeaway ordering platform may include the name of the dish, the food material of the dish, the method of doing the dish, and so on.
In some embodiments, the textual description of the product to be identified may also include a textual description of the operator to which the product belongs. In general, the product categories handled by one operator are relatively similar, and some operators handle only one category of product. Therefore, when the product identification model is trained, the operator data is also used as input data, so that the model learns from it the features that can affect the product category, which further improves the identification accuracy of the model.
The data of the operator to which the product to be identified belongs may include, but is not limited to, the name of the operator, the main operation range, the large range to which the product operated by the operator belongs (such as a cuisine in the catering industry), and the like.
For a product to be identified, one or more keywords can be extracted from its text description, and the importance degree of each keyword is then determined. The importance degree indicates how much the keyword contributes to product identification: a keyword that plays an important role in identifying the product has a higher importance degree, and a keyword that does not has a lower one.
The product identification model is obtained by training the data processing device, so the specific details of the product identification model can be referred to the above description of the data processing device, and are not described herein again.
In an optional implementation manner of this embodiment, as shown in fig. 12, the second extracting module 1102 includes:
a second word segmentation sub-module 1201 configured to segment the text description;
a matching sub-module 1202 configured to match the segmented words against a keyword set and determine whether each segmented word is a keyword.
In this optional implementation manner, as described above for the data processing apparatus, during training the keywords in the text descriptions of all collected sample products are extracted to form a keyword set corresponding to the sample products, and the importance degrees of these keywords are determined in subsequent steps. Therefore, after training of the product identification model is completed, the keyword set can be retained. The segmented words obtained from the text description of the product to be identified are then matched against the keyword set; a segmented word that matches is determined to be a keyword of the product to be identified, and its importance degree can likewise be determined directly.
The determination of the keywords and of their importance degree is described above for the data processing apparatus and is not repeated here.
It should be noted that, when no segmented word in the text description of the product to be identified matches the keyword set, the importance degree of the keyword corresponding to the product to be identified may be set to a default value.
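The matching step and its fallback can be sketched as follows, assuming the retained keyword set is stored as a dictionary from keyword to importance degree and that the text description has already been segmented. The function name `identify_features` and the return format are assumptions of this sketch.

```python
def identify_features(segmented_words, keyword_importance, default=0.0):
    """Return (keywords, importances) for a product to be identified.

    segmented_words    -- segmented words of the product's text description
    keyword_importance -- keyword set retained from training, mapping each
                          keyword to its importance degree
    """
    # A segmented word is a keyword only if it is in the retained set.
    keywords = [w for w in segmented_words if w in keyword_importance]

    if not keywords:
        # No word matches the keyword set: use the default importance.
        return [], [default]

    return keywords, [keyword_importance[w] for w in keywords]


retained = {"chicken": 1.26, "rice": 0.4}  # illustrative values

matched = identify_features(["spicy", "chicken", "rice"], retained)
unmatched = identify_features(["unknown"], retained, default=0.01)
```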
The disclosed embodiment also provides an electronic device, as shown in fig. 13, including a processor 1301; and memory 1302 communicatively coupled to the processor 1301; wherein the memory 1302 stores instructions executable by the processor 1301, the instructions being executable by the processor 1301 to implement:
acquiring sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
extracting key words in the text description;
determining the importance degree of the keyword;
training a product identification model by using the characteristic data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
Wherein, obtaining sample data comprises:
acquiring text descriptions of a plurality of sample products in a preset category;
and performing de-duplication processing on the text descriptions of the sample products.
Wherein de-duplicating the textual descriptions of the plurality of sample products comprises:
and uniformly mapping a plurality of different text descriptions corresponding to the same sample product into the same text description.
Wherein, extracting the keywords in the text description comprises:
segmenting the text description;
and determining the segmented words whose relevance to the category to which the sample product belongs is higher than a preset threshold value as the keywords.
Wherein, determining the segmented words whose relevance to the category to which the sample product belongs is higher than a preset threshold value as the keywords comprises:
determining the relevance between the segmented words and the category to which the sample product belongs by using a chi-square independence test.
Wherein, determining the importance degree of the keyword comprises:
determining the TF-IDF value of the keyword as the importance degree of the keyword.
Wherein, determining the TF-IDF value of the keyword as the importance degree of the keyword comprises:
determining a TF-IDF value of the keyword under the category to which the sample product belongs;
and when the keyword corresponds to a plurality of TF-IDF values under different categories, selecting the smallest TF-IDF value as the importance degree of the keyword.
Wherein, determining the importance degree of the keyword comprises:
when the relevance between every segmented word of the sample product and the category to which the sample product belongs is lower than a preset threshold value, taking a default value as the importance degree of the keyword corresponding to the sample product.
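The chi-square test mentioned in the steps above can be sketched with the standard 2x2 contingency-table statistic, counting how many sample products contain a given word inside and outside a given category. The data layout, the function names, and the toy samples are assumptions of this sketch; the disclosure does not fix how the counts are gathered.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 contingency table.

    a -- samples in the category that contain the word
    b -- samples outside the category that contain the word
    c -- samples in the category that do not contain the word
    d -- samples outside the category that do not contain the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty; treat as no evidence
    return n * (a * d - b * c) ** 2 / denominator


def word_category_chi_square(samples, word, category):
    """samples is a list of (set of segmented words, category) pairs."""
    a = b = c = d = 0
    for words, label in samples:
        has_word = word in words
        in_category = label == category
        if has_word and in_category:
            a += 1
        elif has_word:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


# Toy samples (hypothetical data).
samples = [
    ({"braised", "chicken"}, "braised chicken rice"),
    ({"chicken", "rice"}, "braised chicken rice"),
    ({"meatball", "soup"}, "meatballs"),
    ({"soup", "of"}, "meatballs"),
]
chi_chicken = word_category_chi_square(samples, "chicken", "braised chicken rice")
chi_of = word_category_chi_square(samples, "of", "braised chicken rice")
```

A word whose statistic exceeds the preset threshold would be kept as a keyword. Note that the statistic alone does not distinguish a word that is typical of a category from one that is typically absent from it, so an implementation may want to check the sign of `a * d - b * c` as well.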
The present implementations also provide an electronic device comprising a memory and a processor; wherein,
the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of:
acquiring text description of a product to be identified;
extracting keywords of the text description;
determining the importance degree of the keyword;
inputting the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product identification model is trained by using the electronic device shown in fig. 13.
Wherein, extracting the keywords in the text description comprises:
segmenting the text description;
and matching the segmented words against a keyword set to determine whether each segmented word is a keyword.
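To make the train-then-identify flow concrete, the sketch below uses a nearest-centroid classifier with cosine similarity as a stand-in for the product identification model. The model type is not specified in this part of the text, so the classifier, the function names, and the toy feature vectors are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import defaultdict


def train_centroids(feature_vectors, categories):
    """Average the feature vectors of each category (stand-in for training)."""
    sums = {}
    counts = defaultdict(int)
    for vector, category in zip(feature_vectors, categories):
        if category not in sums:
            sums[category] = [0.0] * len(vector)
        sums[category] = [s + v for s, v in zip(sums[category], vector)]
        counts[category] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def identify(centroids, vector):
    """Return the category whose centroid is most similar to the vector."""
    return max(centroids, key=lambda c: cosine(centroids[c], vector))


# Feature vectors hold keyword importance degrees in the hypothetical
# vocabulary order ["chicken", "rice", "meatball", "soup"].
train_x = [
    [1.26, 0.4, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 1.14, 0.18],
    [0.0, 0.0, 0.9, 0.3],
]
train_y = ["braised chicken rice", "braised chicken rice",
           "meatballs", "meatballs"]

centroids = train_centroids(train_x, train_y)
predicted = identify(centroids, [0.8, 0.0, 0.0, 0.1])
```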
Specifically, the processor 1301 and the memory 1302 may be connected by a bus or in other manners, and fig. 13 illustrates an example of connection by a bus. Memory 1302, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The processor 1301 executes various functional applications of the apparatus and data processing by running nonvolatile software programs, instructions, and modules stored in the memory 1302, that is, implements the above-described method in the embodiments of the present disclosure.
The memory 1302 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for functions; the storage data area may store historical data of shipping network traffic, and the like. Further, the memory 1302 may include high speed random access memory and may also include non-volatile memory, such as a magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the electronic device optionally includes a communication component 1303, and the memory 1302 optionally includes memory remotely located from the processor 1301, which may be connected to an external device through the communication component 1303. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory 1302, which when executed by the one or more processors 1301, perform the methods described above in the embodiments of the present disclosure.
The product can execute the method provided by the embodiment of the disclosure, has corresponding functional modules and beneficial effects of the execution method, and reference can be made to the method provided by the embodiment of the disclosure for technical details which are not described in detail in the embodiment.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the present disclosure also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the present disclosure.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is possible without departing from the inventive concept. One example is a technical solution formed by replacing the above features with features of similar function disclosed in (but not limited to) the present disclosure.

Claims (10)

1. A data processing method, comprising:
acquiring sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
extracting key words in the text description;
determining the importance degree of the keyword;
training a product identification model by using the characteristic data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
2. The method of claim 1, wherein obtaining sample data comprises:
acquiring text descriptions of a plurality of sample products in a preset category;
and performing de-duplication processing on the text descriptions of the sample products.
3. The method of claim 1, wherein de-duplicating the textual descriptions of the plurality of sample products comprises:
and uniformly mapping a plurality of different text descriptions corresponding to the same sample product into the same text description.
4. The method according to any one of claims 1-3, wherein extracting keywords from the textual description comprises:
segmenting the text description;
and determining the segmented words whose relevance to the category to which the sample product belongs is higher than a preset threshold value as the keywords.
5. A method of product identification, comprising:
acquiring text description of a product to be identified;
extracting keywords of the text description;
determining the importance degree of the keyword;
inputting the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product recognition model is trained using the method of any one of claims 1-4.
6. A data processing apparatus, comprising:
a first obtaining module configured to obtain sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
a first extraction module configured to extract keywords in the text description;
a first determination module configured to determine a degree of importance of the keyword;
a training module configured to train a product recognition model using the feature data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
7. A product identification device, comprising:
the second acquisition module is configured to acquire a text description of the product to be identified;
a second extraction module configured to extract keywords of the textual description;
a second determination module configured to determine a degree of importance of the keyword;
the recognition module is configured to input the importance degree of the keyword into a pre-trained product recognition model so as to recognize the product to be recognized; wherein the product recognition model is trained using the apparatus of claim 6.
8. An electronic device comprising a memory and a processor; wherein,
the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of:
acquiring sample data; wherein the sample data comprises a textual description of a sample product and a category to which the sample product belongs;
extracting key words in the text description;
determining the importance degree of the keyword;
training a product identification model by using the characteristic data of the sample product and the category to which the sample product belongs; wherein the feature data includes the degree of importance of the keyword corresponding to the sample product.
9. An electronic device comprising a memory and a processor; wherein,
the memory is for storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of:
acquiring text description of a product to be identified;
extracting keywords of the text description;
determining the importance degree of the keyword;
inputting the importance degree of the keyword into a pre-trained product identification model so as to identify the product to be identified; wherein the product recognition model is trained using the electronic device of claim 8.
10. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method of any one of claims 1-5.
CN201910563737.7A 2019-06-26 2019-06-26 Data processing method and device, electronic equipment and storage medium Pending CN110264318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910563737.7A CN110264318A (en) 2019-06-26 2019-06-26 Data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110264318A true CN110264318A (en) 2019-09-20

Family

ID=67921955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910563737.7A Pending CN110264318A (en) 2019-06-26 2019-06-26 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110264318A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951430A (en) * 2014-03-27 2015-09-30 携程计算机技术(上海)有限公司 Product feature tag extraction method and device
US20160239865A1 (en) * 2013-10-28 2016-08-18 Tencent Technology (Shenzhen) Company Limited Method and device for advertisement classification
CN106095996A (en) * 2016-06-22 2016-11-09 量子云未来(北京)信息科技有限公司 Method for text classification
CN106156372A (en) * 2016-08-31 2016-11-23 北京北信源软件股份有限公司 The sorting technique of a kind of internet site and device
CN106294355A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 A kind of determination method and apparatus of business object attribute
CN107609160A (en) * 2017-09-26 2018-01-19 联想(北京)有限公司 A kind of file classification method and device
CN108595418A (en) * 2018-04-03 2018-09-28 上海透云物联网科技有限公司 A kind of commodity classification method and system
US20190005121A1 (en) * 2017-06-29 2019-01-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for pushing information
CN109388712A (en) * 2018-09-21 2019-02-26 平安科技(深圳)有限公司 A kind of trade classification method and terminal device based on machine learning
CN109522544A (en) * 2018-09-27 2019-03-26 厦门快商通信息技术有限公司 Sentence vector calculation, file classification method and system based on Chi-square Test
CN109614475A (en) * 2018-12-07 2019-04-12 广东工业大学 A kind of product feature based on deep learning determines method

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110837867A (en) * 2019-11-08 2020-02-25 深圳市深视创新科技有限公司 Method for automatically distinguishing similar and heterogeneous products based on deep learning
CN110941719A (en) * 2019-12-02 2020-03-31 中国银行股份有限公司 Data classification method, test method, device and storage medium
CN110941719B (en) * 2019-12-02 2023-12-19 中国银行股份有限公司 Data classification method, testing method, device and storage medium
CN111190635A (en) * 2020-01-03 2020-05-22 拉扎斯网络科技(上海)有限公司 Method, device and equipment for determining characteristic data of application program and storage medium
CN111190635B (en) * 2020-01-03 2021-10-29 拉扎斯网络科技(上海)有限公司 Method, device and equipment for determining characteristic data of application program and storage medium
CN111429184A (en) * 2020-03-27 2020-07-17 北京睿科伦智能科技有限公司 User portrait extraction method based on text information
CN111522945A (en) * 2020-04-10 2020-08-11 南通大学 Poetry style analysis method based on chi-square test
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN114416992A (en) * 2022-01-18 2022-04-29 新华智云科技有限公司 Entity text relevance calculation method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190920)