US11003638B2 - System and method for building an evolving ontology from user-generated content - Google Patents
- Publication number
- US11003638B2 (application US16/174,140)
- Authority
- US
- United States
- Prior art keywords
- concept
- themes
- ontology
- concepts
- data entries
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/211—Schema design and management
- G06F16/213—Schema design and management with details for schema evolution support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2379—Updates performed during online database operations; commit processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
Definitions
- the present disclosure relates generally to building an evolving ontology from complex and dynamic data, and more particularly to systems and methods for building an evolving ontology from user generated content on an e-commerce website.
- Computer-mediated communication is becoming the most convenient and important way of sharing and exchanging information in society. People can submit their feedback directly to a particular merchant or manufacturer, and can conduct online research by reading other users' reviews before making many of their traditional consumer purchase decisions. However, it is hard to efficiently utilize the large volume of diverse user-generated content on the web by simply checking a single review score or the number of positive or negative reviews.
- the present disclosure relates to a method for constructing an evolving ontology database.
- the method includes:
- calculating, by the computing device, semantic similarity scores between any two of the data entries based on feature sources and feature similarities of the data entries; clustering, by the computing device, the data entries into a plurality of current themes based on the semantic similarity scores;
- the semantic similarity score between any two of the data entries is calculated by: $\text{semantic similarity score} = \sum_{i=1}^{n} s_i \sum_{j=1}^{k} w_j f_j$
- $s_i$ is the weight of the feature sources
- $f_j$ is one of the feature similarities between the two data entries
- $w_j$ is a weight of $f_j$
- $j$, $k$ and $n$ are positive integers.
- the data entries are user-generated feedback
- the step of calculating semantic similarity scores includes: predicting sentiment similarity values by a sentiment analyzer, the sentiment similarity values representing similarity between the two data entries in regard to positive feedback, negative feedback, neutral feedback, very negative feedback, and internet abuse; predicting text similarity values by a similarity calculator, the text similarity values representing similarity between semantic meaning of text extracted from the two data entries; and predicting syntactic similarity values by a natural language parser, the syntactic similarity values representing syntactic complexity of the text of the two data entries.
- the step of clustering the data entries further includes: calculating a semantic similarity score for the two data entries using the sentiment similarity values, the text similarity values, and the syntactic similarity values.
- the step of selecting the new concepts from the current themes includes: retrieving the current themes and the previous themes; identifying near duplicate themes from the current themes and the previous themes; removing the near duplicate themes from the current themes to obtain non-duplicate themes; comparing the non-duplicate themes to concepts in the ontology database to obtain novel concept candidates, wherein the novel concept candidates are the non-duplicate themes that have low similarity to any of the concepts in the ontology database; and verifying the novel concept candidates according to an instruction from a manager of the ontology database, to obtain the new concepts.
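The selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: themes and concepts are modeled as sets of keywords, Jaccard overlap stands in for whatever similarity measure an embodiment uses, the function names and both thresholds are hypothetical, and the verification by the manager of the ontology database remains a manual step outside the code.

```python
from typing import FrozenSet, Iterable, List

Theme = FrozenSet[str]  # assumption: a theme is summarized by a set of keywords


def jaccard(a: Theme, b: Theme) -> float:
    """Keyword overlap in [0, 1]; a stand-in for the real similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def novel_concept_candidates(
    current: Iterable[Theme],
    previous: Iterable[Theme],
    concepts: Iterable[Theme],
    dup_threshold: float = 0.8,      # hypothetical cut-off for "near duplicate"
    novelty_threshold: float = 0.3,  # hypothetical cut-off for "low similarity"
) -> List[Theme]:
    previous = list(previous)
    concepts = list(concepts)
    # Remove current themes that nearly duplicate a previous theme.
    non_duplicates = [
        t for t in current
        if all(jaccard(t, p) < dup_threshold for p in previous)
    ]
    # Keep only themes that have low similarity to every known concept.
    return [
        t for t in non_duplicates
        if all(jaccard(t, c) < novelty_threshold for c in concepts)
    ]


current = [
    frozenset({"battery", "drain", "fast"}),
    frozenset({"color", "fade", "wash"}),
    frozenset({"delivery", "late"}),
]
previous = [frozenset({"battery", "drain", "fast"})]
concepts = [frozenset({"delivery", "late", "courier"})]
candidates = novel_concept_candidates(current, previous, concepts)
```

In this toy run the battery theme is dropped as a near duplicate of a previous theme, the delivery theme is dropped because it overlaps an existing concept, and only the color theme is left as a candidate for manual verification.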
- the step of updating the evolving ontology database includes: detecting a most relevant parent concept by comparing the at least one verified concept with the concepts in the ontology; computing similarity between the at least one verified concept and sibling concepts to obtain a most similar sibling concept, wherein the sibling concepts are child concepts of the most relevant parent concept; proposing ontology adjustments based on the most relevant parent concept and the most similar sibling concept; and using an optimal adjustment from the proposed ontology adjustments to update the ontology.
- the proposed adjustment includes an insertion adjustment, and in the insertion adjustment, the new concept is defined as a child node of the most relevant parent concept.
- the proposed adjustment includes a lift adjustment, and in the lift adjustment, the new concept is defined as a sibling node of the most relevant parent concept.
- the proposed adjustment includes a shift adjustment, and in the shift adjustment, the new concept is defined as a child node of the most similar sibling concept.
- the proposed adjustment includes a merge adjustment, and in the merge adjustment, the new theme is combined with the most similar sibling concept to form a combined concept, the combined concept is defined as a child node of the most relevant parent concept, and the new theme and the most similar sibling concept are defined as child nodes of the combined concept.
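The four proposed adjustments (insertion, lift, shift, and merge) can be read as operations on a tree of concepts. The sketch below is one possible rendering under assumptions: the `Concept` class, the function names, and the example concept names are hypothetical, and the choice of the optimal adjustment among the proposals is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Concept:
    name: str
    parent: Optional["Concept"] = None
    children: List["Concept"] = field(default_factory=list)

    def attach(self, child: "Concept") -> "Concept":
        """Make `child` a child of this node, detaching it from any old parent."""
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)
        return child


def insertion(new: Concept, parent: Concept) -> None:
    parent.attach(new)  # new concept becomes a child of the most relevant parent


def lift(new: Concept, parent: Concept) -> None:
    if parent.parent is None:
        raise ValueError("cannot lift above the root")
    parent.parent.attach(new)  # new concept becomes a sibling of the parent


def shift(new: Concept, sibling: Concept) -> None:
    sibling.attach(new)  # new concept becomes a child of the most similar sibling


def merge(new: Concept, sibling: Concept, parent: Concept, name: str) -> Concept:
    combined = parent.attach(Concept(name))  # combined concept under the parent
    combined.attach(sibling)                 # the sibling moves under it
    combined.attach(new)                     # and so does the new concept
    return combined


# Hypothetical ontology fragment: apparel -> shirts -> color
root = Concept("apparel")
shirts = root.attach(Concept("shirts"))
color = shirts.attach(Concept("color"))
fit = Concept("fit")
merged = merge(fit, color, shirts, "look and fit")
```

After the merge, `shirts` has the combined concept as its only child, and the combined concept has `color` and `fit` as children.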
- each concept in the ontology database is defined by a classification model
- the classification model comprises a logistic regression model and a gradient boosting classifier.
- the method further includes: tuning the classification model according to the updated ontology.
- the method further includes: cleaning and tokenizing the data entries before the step of calculating semantic similarity scores.
- the present disclosure relates to a system for constructing an evolving ontology database.
- the system includes a computing device.
- the computing device has a processor and a storage device storing computer executable code.
- the computer executable code when executed at the processor, is configured to perform the method described above.
- the present disclosure relates to a non-transitory computer readable medium storing computer executable code.
- the computer executable code when executed at a processor of a computing device, is configured to perform the method as described above.
- FIG. 1 schematically depicts an evolving ontology system according to certain embodiments of the present disclosure.
- FIG. 2A schematically depicts an emerging theme detector according to certain embodiments of the present disclosure.
- FIG. 2B schematically depicts a new concept verifier according to certain embodiments of the present disclosure.
- FIG. 2C schematically depicts an ontology adjusting module according to certain embodiments of the present disclosure.
- FIG. 2D schematically depicts an ontology updating module according to certain embodiments of the present disclosure.
- FIG. 3A schematically depicts a current ontology (partial) according to certain embodiments of the present disclosure.
- FIG. 3B schematically depicts a lift operation of adjusting an ontology according to certain embodiments of the present disclosure.
- FIG. 3C schematically depicts a shift operation of adjusting an ontology according to certain embodiments of the present disclosure.
- FIG. 3D schematically depicts a merge operation of adjusting an ontology according to certain embodiments of the present disclosure.
- FIG. 4 schematically depicts a flow chart to build and update an evolving ontology from user-generated content according to certain embodiments of the present disclosure.
- FIG. 5 schematically depicts a method for detecting emerging themes according to certain embodiments of the present disclosure.
- FIG. 6 schematically depicts a method for verifying new themes to obtain new concepts according to certain embodiments of the present disclosure.
- FIG. 7 schematically depicts a method for proposing ontology adjustments based on verified new concepts and updating ontology using an optimal adjustment according to certain embodiments of the present disclosure.
- “around”, “about”, “substantially” or “approximately” shall generally mean within 20 percent, preferably within 10 percent, and more preferably within 5 percent of a given value or range. Numerical quantities given herein are approximate, meaning that the term “around”, “about”, “substantially” or “approximately” can be inferred if not expressly stated.
- the phrase at least one of A, B, and C should be construed to mean a logical (A or B or C), using a non-exclusive logical OR. It should be understood that one or more steps within a method may be executed in different order (or concurrently) without altering the principles of the present disclosure. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
- module may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC); an electronic circuit; a combinational logic circuit; a field programmable gate array (FPGA); a processor (shared, dedicated, or group) that executes code; other suitable hardware components that provide the described functionality; or a combination of some or all of the above, such as in a system-on-chip.
- the term module may include memory (shared, dedicated, or group) that stores code executed by the processor.
- code may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, and/or objects.
- shared means that some or all code from multiple modules may be executed using a single (shared) processor. In addition, some or all code from multiple modules may be stored by a single (shared) memory.
- group means that some or all code from a single module may be executed using a group of processors. In addition, some or all code from a single module may be stored using a group of memories.
- interface generally refers to a communication tool or means at a point of interaction between components for performing data communication between the components.
- an interface may be applicable at the level of both hardware and software, and may be a uni-directional or bi-directional interface.
- Examples of physical hardware interfaces may include electrical connectors, buses, ports, cables, terminals, and other I/O devices or components.
- the components in communication with the interface may be, for example, multiple components or peripheral devices of a computer system.
- computer components may include physical hardware components, which are shown as solid line blocks, and virtual software components, which are shown as dashed line blocks.
- these computer components may be implemented in, but not limited to, the forms of software, firmware or hardware components, or a combination thereof.
- the apparatuses, systems and methods described herein may be implemented by one or more computer programs executed by one or more processors.
- the computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium.
- the computer programs may also include stored data.
- Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
- the present disclosure provides an ontology structure for such datasets, so as to improve the efficiency of many downstream semantic analysis tasks.
- the challenges to construct ontology on such data may stem from two characteristics of user-generated content.
- a hierarchy structure is curated manually, and hierarchical machine learning classifiers are trained for semantic analysis.
- this method heavily depends on human efforts to understand content and to label training data, and this method cannot track the changes of the data automatically.
- data stream is partitioned into temporal segments, semantic analysis is applied on each segment, and then the emerging themes are identified within the segment.
- entities from dataset are extracted, and they are linked to a well-built universal knowledge graph. Further semantic analysis and inference can be conducted based on the knowledge graph.
- the limit of this method is that the universal knowledge graph is stable and thus cannot catch the pace of quickly changing semantic structures of user generated data. Also, it is costly since the universal knowledge graph needs to be maintained by a large group of experts. Moreover, this method is not able to discover concepts absent in the existing knowledge graphs.
- the present disclosure provides a semantic analysis framework to detect emerging themes from large-scale, evolving data streams, and a set of methods are further provided to verify new concepts and optimize relevant ontology structures.
- the present disclosure provides a system using natural language processing, active learning, semi-supervised learning technology together with principled human-computer interactions.
- this system is composed of two parts: 1) a real time semantic analysis pipeline which automatically mines and detects emerging themes and new concepts from user-generated data; and 2) management interfaces to demonstrate the analysis results and facilitate system administrators to search, verify and adjust the ontology structures.
- the semantic analysis pipeline contains three modules:
- a semantic analyzer continuously clusters items from the data stream that belong to the same topics.
- the temporal analysis module is in charge of predicting whether the detected emerging themes relate to known topics or to new concepts.
- An ontology optimization module is designed to maintain and adjust the semantic relations between the concepts, and start the training process of the machine learning models according to analysis results and verification.
- the management interfaces provide the following utilities:
- Management interfaces to verify the validity of detected concepts, to edit the semantic relations inside ontology structures, to control the training procedure of the machine learning models, and to supervise the model prediction results.
- FIG. 1 schematically depicts an evolving ontology system according to certain embodiments of the present disclosure.
- the system 100 includes a computing device 110 .
- the computing device 110 may be a server computer, a cluster, a cloud computer, a general-purpose computer, a mobile device, a tablet, or a specialized computer, which constructs an ontology based on historical data and/or current data, and updates the ontology based on new inputs of the data, so as to make the ontology model evolve with the updated data automatically with minimal supervision.
- the computing device 110 may communicate with other computing devices or services, so as to obtain user generated data from those computing devices to update the ontology, and provide the ontology to those computing devices.
- the communication is performed via a network, which may be a wired or wireless network, and may be of various forms, such as a public network and a private network.
- the computing device 110 may include, without being limited to, a processor 112 , a memory 114 , and a storage device 116 .
- the computing device 110 may include other hardware components and software components (not shown) to perform its corresponding tasks. Examples of these hardware and software components may include, but are not limited to, other required memory, interfaces, buses, Input/Output (I/O) modules or devices, network interfaces, and peripheral devices.
- the processor 112 may be a central processing unit (CPU) which is configured to control operation of the computing device 110 .
- the processor 112 can execute an operating system (OS) or other applications of the computing device 110 .
- the computing device 110 may have more than one CPU as the processor, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs.
- the memory 114 can be a volatile memory, such as the random-access memory (RAM), for storing the data and information during the operation of the computing device 110 .
- the memory 114 may be a volatile memory array.
- the computing device 110 may run on more than one memory 114 .
- the storage device 116 is a non-volatile data storage media for storing the OS (not shown) and other applications of the computing device 110 .
- Examples of the storage device 116 may include non-volatile memory such as flash memory, memory cards, USB drives, hard drives, floppy disks, optical drives, solid-state drive (SSD) or any other types of data storage devices.
- the storage device 116 may be a local storage, a remote storage, or a cloud storage.
- the computing device 110 may have multiple storage devices 116 , which may be identical storage devices or different types of storage devices, and the applications of the computing device 110 may be stored in one or more of the storage devices 116 of the computing device 110 .
- the computing device 110 is a cloud computer, and the processor 112 , the memory 114 and the storage device 116 are shared resources provided over the Internet on-demand.
- the storage device 116 includes an ontology application 118 , and at least one of user generated data 190 , training data 192 , new theme database 194 , and ontology 196 .
- the ontology application 118 is configured to construct an ontology and update the ontology using the data.
- the ontology application 118 includes, among other things, an emerging theme detector 120 , a new concept verifier 140 , an ontology adjusting module 160 , an ontology updating module 170 , a tuning module 180 , and a management interface 185 .
- the ontology application 118 may include other applications or modules necessary for the operation of the ontology application 118 .
- the modules are each implemented by computer executable code or instructions, or data tables or databases, which collectively form one application.
- each of the modules may further include sub-modules. Alternatively, some of the modules may be combined as one stack. In other embodiments, certain modules may be implemented as a circuit instead of executable code.
- some or all of the modules of the ontology application 118 may be located at a remote computing device or distributed in a cloud.
- the emerging theme detector 120 is configured to, upon receiving or retrieving data entries from the user generated data 190 , score semantic distance between each pair of data entries and cluster entries based on topics, so as to generate themes of the user generated data 190 .
- the emerging theme detector 120 may retrieve data entries in a specified time range, such as the last week, the last month, or the last quarter (season), or may retrieve a certain number of the most recent data entries, such as the last 1,000 data entries, the last 10,000 data entries, or the last 100,000 data entries. In one example, the emerging theme detector 120 retrieves data entries for the last week, which is termed week 0. Referring to FIG.
- the emerging theme detector 120 includes a data cleaning and tokenizer 122 , a sentiment analyzer 124 , a similarity calculator 126 , a natural language parser (NLP) 128 , a semantic scorer 130 , and a cluster classifier 132 .
- the data entries from the user generated data 190 may include noises.
- the data cleaning and tokenizer 122 is configured to retrieve data entries from the user generated data 190 , clean and tokenize those data entries, and send those tokenized data entries to the sentiment analyzer 124 , the similarity calculator 126 , and the NLP 128 .
- the cleaning process refers to removing certain symbols or words that are irrelevant to the downstream work.
- scikit-learn may be used to perform the cleaning.
- the data cleaning and tokenizer 122 may use the class listed in http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
- one of the parameters of the above class is “stop_words”; the data cleaning and tokenizer 122 provides a list of stop words through this parameter, and the listed stop words are removed from the data entries.
- for example, if user 1 submitted the feedback “the color of this under armor T-shirt is cool,” and the words “the,” “of” and “this” are included in the list of stop words, then the output could be: identifier “user 1, feedback No. 123,” clean text “color, under, armor, T-shirt, cool.”
- the data cleaning and tokenizer 122 is further configured to, after cleaning of the data entries, tokenize the cleaned data entries based on a dictionary. For example, if the mapping between the token strings and their ids is {armor: 0, color: 1, cool: 2, T-shirt: 3, under: 4}, then the clean text “color, under, armor, T-shirt, cool,” after tokenization provides an output that is a list with token ids [1, 4, 0, 3, 2].
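The cleaning and tokenization example above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than the scikit-learn based implementation the disclosure mentions; the function names are hypothetical, and treating “is” as a stop word is an assumption made because it is also absent from the clean text in the example.

```python
from typing import Dict, List

# Stop words named in the example; "is" is an added assumption, because it is
# also absent from the clean text shown in the example.
STOP_WORDS = {"the", "of", "this", "is"}

# Token-to-id dictionary taken from the example.
VOCAB: Dict[str, int] = {"armor": 0, "color": 1, "cool": 2, "T-shirt": 3, "under": 4}


def clean(text: str) -> List[str]:
    """Split on whitespace and drop stop words (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in STOP_WORDS]


def tokenize(words: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map each cleaned word to its id; words outside the dictionary are skipped."""
    return [vocab[word] for word in words if word in vocab]


clean_text = clean("the color of this under armor T-shirt is cool")
token_ids = tokenize(clean_text, VOCAB)
print(clean_text)  # ['color', 'under', 'armor', 'T-shirt', 'cool']
print(token_ids)   # [1, 4, 0, 3, 2]
```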
- each tokenized data entry is represented by a user identifier, a feedback identifier and a token of clean text.
- the data cleaning and tokenizer 122 is further configured to, after cleaning and tokenization, send the tokenized data entries to the sentiment analyzer 124, the similarity calculator 126, and the NLP 128.
- the sentiment analyzer 124 is configured to, upon receiving the tokenized data entries, predict sentiment polarity of the cleaned text of each data entry.
- the sentiment is represented by a vector, with each dimension of the vector defining a sentiment.
- five different sentiments are defined: positive, neutral, negative, very negative, and internet abuse.
- the correlation between the tokenized data entry and the sentiments is represented by a number from 0 to 1. 1 indicates high correlation and 0 indicates no correlation.
- the representation values of the sentiments are normalized, such that the sum of the representation values of all the sentiments is 1.
- in one example, the result of analyzing a tokenized data entry by the sentiment analyzer 124 is [0.7, 0.2, 0.1, 0.0, 0.0], i.e., positive 0.7, neutral 0.2, negative 0.1, very negative 0.0, and internet abuse 0.0. Accordingly, the data entry is very likely positive feedback, possibly neutral, and has a very low possibility of being negative.
- the sentiment analyzer 124 uses certain techniques described by Pang, Bo et al. (Pang, Bo and Lee, Lillian, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, 2008, Vol. 2: No. 1-2, pp. 1-135), which is incorporated herein by reference in its entirety.
- the sentiment analyzer 124 is a convolutional neural network classifier. In certain embodiments, the sentiment analyzer 124 is trained in advance using a set of training data included in the training data 192, where each data entry in the set of training data includes tokenized values and is labeled with its corresponding sentiment attributes. In certain embodiments, the label of the training data entry may be 1 for one of the sentiments and 0 for the other sentiments. However, after training, the sentiment analyzer 124 may assign a number between 0 and 1 for one or more of the sentiments for each data entry, so as to more accurately represent the data entries in regard to the five different sentiments. The sentiment analyzer 124 is further configured to, after obtaining the sentiment vectors of the data entries, send the sentiment vectors to the semantic scorer 130.
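The sentiment vector described above, with five dimensions whose values sum to 1, and the one-hot training labels can be illustrated as follows. The sketch does not implement the convolutional neural network classifier; the raw scores and all function names are hypothetical.

```python
from typing import List, Sequence

SENTIMENTS = ("positive", "neutral", "negative", "very negative", "internet abuse")


def normalize(scores: Sequence[float]) -> List[float]:
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [s / total for s in scores]


def one_hot(label: str) -> List[float]:
    """Training label: 1 for the annotated sentiment, 0 for the others."""
    return [1.0 if name == label else 0.0 for name in SENTIMENTS]


def dominant(vector: Sequence[float]) -> str:
    """Sentiment with the largest representation value."""
    return SENTIMENTS[max(range(len(vector)), key=lambda i: vector[i])]


# Hypothetical raw classifier scores for one tokenized data entry.
vector = normalize([7, 2, 1, 0, 0])
label = dominant(vector)
```

With these raw scores the normalized vector matches the example above (positive 0.7, neutral 0.2, negative 0.1), and `dominant` reports “positive”.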
- the similarity calculator 126 is configured to, upon receiving the tokenized data entries, determine similarity between each pair of cleaned text based on sentence embedding.
- clean texts from any two of the data entries form a pair.
- Each clean text of a data entry is represented by a vector.
- the word representation in vector space uses the method described by Mikolov, Tomas et al. (Mikolov, Tomas et al., Efficient estimation of word representations in vector space, 2013, arXiv:1301.3781v3), which is incorporated herein by reference in its entirety. Through word embedding, the words in a cleaned text are mapped to vectors of real numbers.
- the vectors of one data entry text in the pair and the vector of the other data entry text in the pair are compared to determine a similarity or distance between them.
- the similarity calculator 126 uses the method described in Kusner et al. (Kusner et al., From word embeddings to document distances, Proceedings of Machine Learning Research, 2015, Vol. 37, pp. 957-966), which is incorporated herein by reference in its entirety.
- the similarity score between each pair of data entries is normalized to 0-1, wherein 0 indicates no similarity and 1 indicates substantially the same.
- the similarity between the clean texts from two data entries is 0.7, which indicates a high similarity between the two data entries or a close distance of the two data entries in the vector space.
- the similarity calculator 126 is further configured to, after obtaining the similarity score between any two of the data entries, send the similarity scores to the semantic scorer 130 .
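A pairwise text similarity normalized to the range 0-1 can be sketched as below. Note the substitution: instead of the document-distance method of Kusner et al. referenced above, this sketch uses the cosine similarity of averaged word vectors, which is a simpler stand-in. The embedding values and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional word embeddings, for illustration only.
EMBEDDINGS: Dict[str, List[float]] = {
    "color": [0.9, 0.1, 0.0],
    "cool": [0.7, 0.3, 0.1],
    "shipping": [0.0, 0.2, 0.9],
    "slow": [0.1, 0.1, 0.8],
}


def text_vector(words: Sequence[str]) -> List[float]:
    """Average the embeddings of the known words in a clean text."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    score = (cosine(text_vector(a), text_vector(b)) + 1.0) / 2.0
    return min(1.0, max(0.0, score))


close = similarity(["color", "cool"], ["color"])
far = similarity(["color", "cool"], ["shipping", "slow"])
```

Two texts about the same topic (`close`) score higher than two texts about unrelated topics (`far`), and a text compared with itself scores approximately 1.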
- the NLP 128 is configured to, upon receiving the tokenized data entries, determine the syntactic structure of the text by analyzing its constituent words based on an underlying grammar.
- the syntactic features are part-of-speech tags.
- a pretrained model, for example, the Stanford parser (https://nlp.stanford.edu/software/lex-parser.shtml), which is incorporated herein by reference in its entirety, is used.
- the NLP 128 is further configured to process the initial parser output to provide certain statistical results. For example, after syntactic parsing, the NLP 128 may further count the number of nouns and the number of verbs in the output.
- the NLP 128 is further configured to evaluate the syntactic or grammar complexity of the data entry, and represent the complexity as a real number.
- the NLP 128 is configured to calculate the complexity using the number of unique words, the number of verbs, the number of nouns, the number of verb phrases, and the number of noun phrases. For example, assuming the maximum number of the unique words of text in all datasets (such as all the training datasets) is $C_0$ (e.g.
- the maximum number of verb phrases of text in all datasets is $V_0$ (e.g. 10)
- the maximum number of noun phrases of text in all datasets is $N_0$ (e.g. 20).
- the complexity of text t can be calculated with the formula: $\frac{(c_1+1)\times(v_1+1)\times(n_1+1)}{(C_0+1)\times(V_0+1)\times(N_0+1)}$.
- the value of the complexity that is, a real number, is used as the result of the NLP 128 .
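The complexity formula can be checked with a short sketch. It assumes that the garbled operators in the formula are multiplications and that c1, v1 and n1 are the numbers of unique words, verb phrases and noun phrases in text t; the function name is hypothetical, as is the value 100 used for the maximum number of unique words in the note below, since the source does not give it.

```python
def syntactic_complexity(c1, v1, n1, C0, V0, N0):
    """Ratio of a text's (count + 1) product to the dataset-wide maximum.

    c1, v1, n1: assumed to be the unique-word, verb-phrase and noun-phrase
    counts of the text; C0, V0, N0: the corresponding maxima over all datasets.
    """
    numerator = (c1 + 1) * (v1 + 1) * (n1 + 1)
    denominator = (C0 + 1) * (V0 + 1) * (N0 + 1)
    return numerator / denominator
```

A text that reaches all three maxima scores exactly 1, and a text with no words still scores slightly above 0 because of the +1 terms, for example `syntactic_complexity(0, 0, 0, 100, 10, 20)`.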
- the NLP 128 is further configured to, after obtaining the result, send the result to the semantic scorer 130 .
- the semantic scorer 130 is configured to, upon receiving the different aspects of semantic information from the sentiment analyzer 124, the similarity calculator 126, and the NLP 128, that is, the sentiment vectors of the texts of the data entries, the similarity scores between each pair of texts of the data entries, and the parsing result of the texts of the data entries, calculate a semantic similarity score between each pair (any two) of the texts.
- the semantic similarity score between each pair of texts is calculated by the formula:
- n corresponds to the major types of features or the feature sources.
- $s_i$ for the sentiment features, the text similarity features, and the syntactic features are 0.10, 0.85 and 0.05 respectively.
- $f_j$ is a feature function that measures the similarity between two data entries.
- Each of the feature sources, sentiment, text similarity, syntactic may include one or more feature functions, and the total number of feature functions is k.
- $f_{sentiment}$ can be the cross entropy of two entries' sentiment distributions
- $f_{fact}$ can be the cosine similarity of two entries' noun-phrase TF-IDF vectors
- the method is expandable where a new feature function can be incorporated in the above formula easily by adding a new $f_j$ and a corresponding new $w_j$.
- the $w_j$ is a weight of $f_j$.
- the weights $w_j$ can be set at arbitrary values and will be optimized automatically during the training.
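The formula itself is not reproduced in this text. One form consistent with the symbol definitions above is a weighted double sum, shown here as an assumption rather than as the formula of record. The index set $F_i$ (the feature functions belonging to feature source i) and the arguments a and b (the two data entries) are notation introduced only for illustration:

```latex
\text{score}(a, b) = \sum_{i=1}^{n} s_i \sum_{j \in F_i} w_j \, f_j(a, b),
\qquad \sum_{i=1}^{n} |F_i| = k
```

Under this reading, and with the example weights 0.10, 0.85 and 0.05, the text similarity features dominate the score.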
- the semantic scorer 130 is configured to send the semantic similarity scores to the cluster classifier 132 .
- the parameters $s_i$, $f_j$, and $w_j$ are learned using training data entries retrieved from the training data 192, where semantic scores for each pair of training data entries are recorded to represent the relationship between the two training data entries.
- the feature functions are unsupervised models and are trained using training data.
- the training data are labeled with corresponding features before training the models.
- some of the models can be trained without any labeling of data.
- feature functions such as the sentiment prediction model require human annotations for training.
- feature functions such as the text similarity model do not require labeled data for training.
- the cluster classifier 132 is configured to, upon receiving the semantic similarity scores between each pair of the texts of the two data entries, classify those data entries into clusters.
- the semantic scores of any pair of the data entries in this given set is greater than a pre-determined threshold t.
- the threshold can be chosen according to the system requirements. In certain embodiments, if the system needs high recall on the novel theme detection, it can use a small number (such as 2) as the threshold. Then most of the possible novel themes will be detected. In contrast, if the system needs high precision, it can use a relatively large number.
- each cluster is defined as a new theme (concept candidate), and the clusters are stored in the new theme database 194 .
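The clustering criterion, in which every pair of data entries within a cluster scores above the threshold t, can be sketched as follows. The greedy, order-dependent assignment is an assumption chosen for brevity, not the method of the cluster classifier 132, and `score` is a hypothetical callable standing in for the semantic scorer 130.

```python
def cluster_entries(entries, score, threshold):
    """Greedily group entries so that every pair within a cluster scores
    above `threshold`. The result depends on the order of `entries`."""
    clusters = []
    for entry in entries:
        for cluster in clusters:
            # Join the first cluster whose every member is similar enough.
            if all(score(entry, member) > threshold for member in cluster):
                cluster.append(entry)
                break
        else:
            # No compatible cluster was found: start a new one.
            clusters.append([entry])
    return clusters


def toy_score(a, b):
    """Toy similarity on numbers, used only to exercise the sketch."""
    return 1.0 - abs(a - b)
```

For example, `cluster_entries([0.10, 0.15, 0.90, 0.12, 0.95], toy_score, 0.8)` yields two clusters, one around 0.1 and one around 0.9. Lowering the threshold merges more entries into each cluster, which corresponds to the recall-oriented setting described above.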
- the new theme database 194 stores the new themes by batches or time intervals, such as a week, bi-weeks, a month, or a quarter.
- the system may process a batch of data entries each week, and the new themes are stored weekly. Therefore, we have new themes of the current week, new themes of the week previous to the current week, new themes of the week before the previous week, and so on.
- the cluster classifier 132, in addition to storing the new themes, may also send a message to the new concept verifier 140, informing the new concept verifier 140 that a new batch of themes is available in the new theme database 194, so that the new concept verifier 140 can verify immediately whether any of the newly detected themes qualify as new concepts.
- the new concept verifier 140 is configured to retrieve new themes from the new theme database 194, and verify whether any of the new themes are new concepts.
- the new themes are defined as recognized topics detected from the recent data stream, that is, the clusters detected by the cluster classifier 132, while the new concepts are defined as verified new themes.
- the new themes are candidates for new concepts, and the new concepts are verified new themes.
- the verified new themes then can be used to update the ontology.
- the new concept verifier 140 includes a new theme retrieving module 142 , a near duplicate identification module 144 , a concept comparing module 146 , a concept proposing module 148 , and a concept verification module 150 .
- the new theme retrieving module 142 is configured to retrieve new themes from the new theme database 194.
- the new theme retrieving module 142 may retrieve those new themes in a pre-determined time interval such as weekly or monthly, or in response to a message from the cluster classifier 132 that new themes are stored in the new theme database 194 , or an instruction from a system manager managing the system 100 .
- the new theme database 194 stores the new themes by week, and the new theme retrieving module 142 retrieves new themes of the most recent four weeks, and sends the retrieved new themes to the near duplicate identification module 144.
- the new themes from the most recent four weeks include new themes from the current week and new themes from the previous three weeks, and are named week 0, week -1, week -2, and week -3 respectively.
- the near duplicate identification module 144 is configured to, upon receiving the new themes from the new theme retrieving module 142, remove duplicated themes from the retrieved themes, so as to obtain the most representative new themes.
- the near duplicate identification module 144 compares each data entry in the first theme with every data entry in the second theme to calculate semantic similarity scores; uses the semantic similarity scores to determine whether that data entry in the first theme belongs to the second theme; then computes the percentage of data entries in the first theme that belong to the second theme; and determines whether the first theme is a duplication of the second theme based on the percentage.
- the near duplicate identification module 144 may calculate the semantic similarity scores as described in relation to the semantic scorer 130, or call the semantic scorer 130 to calculate the semantic scores.
- the near duplicate identification module 144 may use the average semantic similarity score between each data entry in the first theme and the data entries in the second theme to determine whether that data entry in the first theme belongs to the second theme.
- the threshold of the average semantic score may be set in a range of 0.6-1.0, or preferably above 0.7, or more preferably above 0.8 or 0.9.
- the near duplicate identification module 144 may determine the first theme is a duplication of the second theme when the percentage of the data entries in the first theme that belong to the second theme is greater than a pre-determined threshold.
- the threshold is set at about 0.6, preferably at about 0.7, and more preferably at about 0.8 or 0.9.
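The near-duplicate test described above can be sketched in a few lines. The function names are hypothetical, `score` stands in for the semantic scorer 130, and the default thresholds of 0.8 are simply one of the example values mentioned in the text.

```python
def belongs_to_theme(entry, theme, score, entry_threshold=0.8):
    """An entry belongs to a theme when its average semantic similarity to
    the theme's entries exceeds the entry-level threshold."""
    average = sum(score(entry, other) for other in theme) / len(theme)
    return average > entry_threshold


def is_near_duplicate(first, second, score,
                      entry_threshold=0.8, share_threshold=0.8):
    """The first theme duplicates the second when a large enough share of
    its entries belong to the second theme."""
    matched = sum(
        1 for entry in first
        if belongs_to_theme(entry, second, score, entry_threshold)
    )
    return matched / len(first) > share_threshold


def toy_score(a, b):
    """Toy similarity on numbers, used only to exercise the sketch."""
    return 1.0 - abs(a - b)
```

Note that the test is directional: a small theme may be a duplicate of a larger one without the reverse being true.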
- the near duplicate identification module 144 is configured to compare the themes in the current week with the themes in the previous weeks to determine duplicates by the method described above.
- the current week 0 includes a number of $T_0$ themes
- week -1 includes a number of $T_1$ themes
- week -2 includes a number of $T_2$ themes
- week -3 includes a number of $T_3$ themes.
- Each of the themes in the $T_0$ themes is compared to the themes in the $T_1$, $T_2$, and $T_3$ themes, and the duplicated themes in the $T_0$ themes are defined as $T_0$-duplicate themes.
- the $T_0$, $T_1$, $T_2$, and $T_3$ themes are combined together, and those themes are compared with each other to determine and remove the duplicated themes; the nonduplicated themes from the $T_0$ themes or from all the $T_0$, $T_1$, $T_2$, and $T_3$ themes are used for further processing.
- the $T_1$, $T_2$, and $T_3$ themes are combined together, and the $T_0$ themes are compared to the combined themes, and the duplicated themes between the $T_0$ themes and the combined themes are removed from the $T_0$ themes.
- the new themes may be added as new concept directly to the ontology to initialize the ontology.
- the initial ontology may also be defined manually.
- the near duplicate identification module 144 is further configured to send those representative new themes to the concept comparing module 146 .
- the concept comparing module 146 is configured to, upon receiving the nonduplicated themes, calculate the possibilities of whether the representative new themes belong to existing concepts or not.
- the concept comparing module 146 uses classification models of the existing concepts in the ontology to determine whether a new theme belongs to a concept.
- a binary text classifier is provided and trained for each concept in the ontology.
- each concept in the ontology 196 has its text classifier model.
- the machine learning model of these classifiers can be binary classifier such as logistic regression, gradient boosting classifier, and convolutional neural network, etc.
- when a concept is created and added to the ontology, a collection of text documents is collected and may be verified, for example, by the system administrator.
- the documents are semantically similar and are used as positive samples of the model of the concept. Some other documents, which may be randomly selected from other existing categories or concepts, are used as negative samples.
- the new concept's corresponding text classifier will then be trained on the combination of positive and negative samples.
- the concept comparing module 146 performs the prediction whether a representative new theme (i.e., a nonduplicate theme) belongs to a concept in the ontology
- the concept comparing module 146 inputs each text content of the data entries in one representative new theme to the binary text classifier, and obtains a Boolean value for that text content.
- the Boolean value indicates if the given feedback (data entry) belongs to the concept.
- the percentage of the text contents that belong to the concept indicates the possibility that the representative new theme belongs to the concept. For example, if a representative new theme T contains 100 data entries, and the binary text classifier predicts that 90% of the data entries belong to a concept C, then the probability of the representative new theme T belonging to the concept C is 90%.
- the concept comparing module 146 is further configured to send the possibilities to the concept proposing module 148 .
- the concept proposing module 148 is configured to, upon receiving the possibilities that each representative new theme belongs to each of the available concepts, determine whether the representative new theme is a new concept candidate.
- the possibility of the representative new theme T belonging to each of the concepts is determined, and the highest possibility of the representative new theme T belonging to one of the concepts is regarded as the possibility that the representative new theme T belongs to a concept in the ontology. If the highest possibility for one concept is greater than a pre-determined number, such as about 90%, the new theme T is determined to belong to an existing concept.
- the probability may be respectively 91%, 85%, 81%, 80% and 70%, and then the new theme T is determined to belong to $C_1$ because the highest probability 91% is greater than the threshold of 90%.
- the probability may be respectively 89%, 83%, 69%, 69% and 65%, and then the new theme T does not belong to existing concepts because the highest possibility 89% is lower than the pre-determined threshold 90%.
- the threshold may be varied based on the characteristics of the data entries and the purpose of the project.
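The comparison and proposal steps can be sketched as follows. Each per-concept classifier is modeled as a hypothetical callable that returns a Boolean for one text, and the function names are illustrative only; the 90% default follows the example threshold above.

```python
def theme_probabilities(theme_entries, classifiers):
    """Share of a theme's entries that each concept classifier accepts."""
    return {
        concept: sum(1 for text in theme_entries if accepts(text))
        / len(theme_entries)
        for concept, accepts in classifiers.items()
    }


def propose(theme_entries, classifiers, threshold=0.90):
    """Return ("existing", concept) when the best-matching concept exceeds
    the threshold, otherwise ("candidate", None) for a new concept candidate."""
    probabilities = theme_probabilities(theme_entries, classifiers)
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return ("existing", best)
    return ("candidate", None)
```

With 100 entries of which 91 are accepted by the classifier of one concept, the theme is assigned to that concept; with 89 accepted entries it becomes a new concept candidate.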
- the concept verification module 150 is configured to, upon receiving the new concept candidates, verify the new concept candidates to obtain verified concepts. In certain embodiments, the concept verification module 150 verifies the new concept candidates automatically based on certain criteria. In certain embodiments, the concept verification module 150 provides an interface to show the new concept candidates to the system manager, and verifies the new concept candidates according to the instruction from the system manager via the interface. After verification, the concept verification module 150 discards the new concept candidates that fail the verification, and sends the verified new concepts to the ontology adjusting module 160. The verified new concepts are also simply termed verified concepts.
- the ontology adjusting module 160 is configured to, upon receiving the verified concepts, propose adjustments of the ontology.
- the ontology adjusting module 160 includes an ontology and new concept retrieving module 162 , a parent concepts detection module 164 , a sibling concepts similarity module 166 , and an adjustment proposing module 168 .
- the ontology and new concept retrieving module 162 is configured to retrieve the ontology from the ontology 196 and retrieve or receive the verified concepts from the concept verification module 150 of the new concept verifier 140, and send the retrieved or received ontology and verified concepts to the parent concepts detection module 164.
- the parent concepts detection module 164 is configured to, upon receiving the ontology and the verified concepts, detect a parent concept from the ontology for each of the verified concepts. In certain embodiments, the determination is similar to the function of the concept comparing module 146 and the concept proposing module 148. Specifically, for each verified concept, the parent concepts detection module 164 inputs each text content in the verified concept to the classifier of one of the concepts of the ontology, and obtains a value of that inputted text content. Once values of all the text content from the verified concept against the one concept of the ontology are available, the possibility of whether the concept of the ontology is the parent concept of the verified concept is obtained.
- the parent concepts detection module 164 is further configured to, after obtaining the correspondence between the verified concept and its parent concept in the ontology, send the verified concept and its parent concept to the sibling concepts similarity module 166 . In certain embodiments, the parent concepts detection module 164 is further configured to analyze each of the new concepts to obtain their respective parent concepts.
- the sibling concepts similarity module 166 is further configured to send the parent concept and the most similar one of the sibling concepts to the adjustment proposing module 168 .
- the adjustment proposing module 168 is configured to, upon receiving the parent concept and the most similar sibling concept of the verified concept, propose adjustments on the ontology based on the information.
- the adjustment proposing module 168 is configured to propose the adjustments of the ontology by performing insert, lift, shift and merge.
- FIG. 3A schematically shows a current ontology (partial) according to certain embodiments of the present disclosure. As shown in FIG. 3A , the nodes are concepts of the ontology.
- the nodes A11, A12 and A13 have a common parent node A1, the nodes A111, A112 and A113 have a common parent node A11, the nodes A121 and A122 have a common parent node A12, and the nodes A131, A132, A133 and A134 have a common parent node A13.
- for the new verified concept, it is calculated that the new theme has the highest possibility of belonging to the node A11, that is, A11 is the parent node of the new theme.
- when the new theme is compared with the sibling nodes A111, A112, and A113, the new theme is most similar to the sibling node A112.
- the new theme is added as a child concept of A11, and sibling concept of A111, A112, and A113. In other words, the concept A111, A112, A113 and the new concept are child concepts of the node A11.
- the adjustment proposing module 168 is configured to propose the adjustment by performing lift. Specifically, the node of the new theme is pointed to the node A1. In other words, after adjustment, the node A11 and the node new theme have a common parent node A1, and the node A11 and the node new concept are sibling nodes. As shown in FIG. 3C , the adjustment proposing module 168 is configured to propose the adjustment by performing shift. Specifically, the node of the new concept is pointed to the node A112. In other words, after adjustment, the node new concept is a child node of the node A112, and the node A112 is the parent node of the node new concept. As shown in FIG.
- the adjustment proposing module 168 is configured to propose the adjustment by performing merge. Specifically, the node of the new concept and the node of A112 are combined to form a new node A112/new concept, and the new node A112/new concept has the parent node A11. Now the nodes A111, A112/new concept, and A113 are children nodes of the node A11. Further, two children nodes are defined for the new node A112/new concept, and the two children nodes are respectively A112 and the new concept. That is, the node A112/new concept is the parent node of the node A112 and the node new concept. After proposing the types of adjustment, the adjustment proposing module 168 is further configured to send the proposed adjustments to the ontology updating module 170.
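The four proposed adjustments can be sketched on the partial ontology described for FIG. 3A. Storing the hierarchy as a child-to-parent dictionary and the function names below are assumptions made for illustration; each function returns a new candidate hierarchy and leaves the current one unchanged.

```python
def build_parent_map():
    """Partial ontology as described for FIG. 3A, stored as child -> parent."""
    return {
        "A11": "A1", "A12": "A1", "A13": "A1",
        "A111": "A11", "A112": "A11", "A113": "A11",
        "A121": "A12", "A122": "A12",
        "A131": "A13", "A132": "A13", "A133": "A13", "A134": "A13",
    }


def insert(parents, new, parent):
    """New concept becomes a child of the detected parent."""
    out = dict(parents)
    out[new] = parent
    return out


def lift(parents, new, parent):
    """New concept becomes a sibling of the detected parent.
    Assumes the detected parent is not the root."""
    out = dict(parents)
    out[new] = parents[parent]
    return out


def shift(parents, new, sibling):
    """New concept becomes a child of its most similar sibling."""
    out = dict(parents)
    out[new] = sibling
    return out


def merge(parents, new, sibling):
    """A merged node replaces the sibling under the parent and becomes the
    parent of both the sibling and the new concept."""
    merged = f"{sibling}/{new}"
    out = dict(parents)
    out[merged] = parents[sibling]
    out[sibling] = merged
    out[new] = merged
    return out


def children(parents, node):
    return sorted(c for c, p in parents.items() if p == node)
```

With parent A11 and most similar sibling A112, the four calls produce the four candidate hierarchies that are then passed on for verification.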
- the ontology updating module 170 is configured to, upon receiving the proposed adjustments from the adjustment proposing module 168 of the ontology adjusting module 160, verify the proposed adjustments, and choose the optimal proposal to update the ontology.
- the ontology updating module 170 includes a modification verification module 172 and an updating module 174 .
- the modification verification module 172 is configured to, upon receiving the proposed adjustments from the adjustment proposing module 168 , verify which of the proposed adjustments is the optimal adjustment, and send the optimal adjustment to the updating module 174 .
- the modification verification module 172 is configured to verify the adjustments by looking for the optimal hierarchy adjustment.
- an optimal hierarchy $H_{opt}$ is a hierarchy that:
- $H_{opt} = \arg\max_{H} \log p(D \mid H)$.
- the modification verification module 172 estimates the likelihood with classification performance of a hierarchical model.
- the modification verification module 172 uses macro-averaged recall of the whole classification system to estimate the conditional likelihood.
- the macro-averaged recall of the system is the average of all concept classifiers' recall on the test set.
- a hierarchy H comprises M concepts. For each concept, there are a training set $A_i$ and a test set $E_i$.
- the modification verification module 172 trains the binary concept classifier on $A_i$, evaluates it on $E_i$, and gets its recall $r_i$.
- the modification verification module 172 is further configured to send the optimal hierarchy to the updating module 174.
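Selecting the hierarchy by macro-averaged recall can be sketched as below. The sketch skips the training step and assumes that, for every concept of a candidate hierarchy, the test-set labels and the predictions of an already trained binary classifier are available; the function names and the data layout are hypothetical.

```python
def recall(true_labels, predicted_labels):
    """Recall of one binary concept classifier on its test set."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t and p)
    false_negatives = sum(1 for t, p in pairs if t and not p)
    positives = true_positives + false_negatives
    return true_positives / positives if positives else 0.0


def macro_averaged_recall(per_concept_results):
    """Unweighted mean of the per-concept recalls of one hierarchy."""
    recalls = [recall(t, p) for t, p in per_concept_results]
    return sum(recalls) / len(recalls)


def choose_hierarchy(candidates):
    """Pick the candidate whose classifiers give the highest macro-averaged
    recall, used here as the stand-in for log p(D | H)."""
    return max(
        candidates,
        key=lambda name: macro_averaged_recall(candidates[name]),
    )
```

Because the average is unweighted, a concept with few test entries influences the result as much as a concept with many.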
- the updating module 174 is configured to, upon receiving the optimal proposal of the adjustment, update the ontology stored in the ontology 196 using the optimal proposal.
- the tuning module 180 is configured to, when the ontology is updated by the updating module 174, use the updated ontology and the corresponding dataset to tune the classifiers of the concepts of the ontology.
- the tuning may be performed after each of the updating of the ontology, or be performed at a pre-determined time interval such as a month, or upon instruction by the system manager.
- the management interface 185 is configured to, when in operation, provide an interactive interface for presenting results and parameters to the system manager, and receiving instruction and revised parameters from the system manager.
- the management interface 185 includes verification and parameters mentioned above, which may include, among other things, threshold parameters for the cluster classifier 132, the semantic score threshold for the near duplicate identification module 144, the threshold value used by the concept proposing module 148, new concept verification, proposed adjustments verification, etc.
- the user generated data 190 includes the historical user generated data, such as the user feedbacks on an e-commerce platform.
- the user generated data 190 may be arranged by a predetermined time interval, such as by week or by month.
- the training data 192 includes data for training the classifiers in the system 100 .
- Each set of data in the training data 192 may correspond to a specific classifier or other types of models, and are labeled with corresponding features. For example, a set of data entries having text are labeled with sentiment, and the set of data are used to train the sentiment analyzer 124 .
- the new theme database 194 stores new themes detected by the emerging theme detector 120 .
- the new themes are stored by batch. Each batch of the new themes may correspond to the new themes detected from, for example, data entries from a week, a month, or a quarter, etc.
- the ontology 196 stores the ontology of the system, which can be updated automatically or with minimal supervision by the system manager.
- the ontology 196 includes, among other things, the concepts, the relationships between the concepts, and the classifiers corresponding to each concept.
- the system manager may initialize the ontology 196 manually, and the initialized ontology 196 is updated and expanded after receiving more data and after performing the function of the ontology application 118.
- the ontology application 118 may use a first batch of data entries, detect emerging themes using the emerging theme detector 120, and use the classified emerging themes as the initial ontology 196.
- FIG. 4 schematically depicts a flow chart to build and update an evolving ontology from user-generated content according to certain embodiments of the present disclosure.
- the building and updating the evolving ontology is implemented by the server computing device 110 shown in FIG. 1 .
- the user generated data 190 is provided to or retrieved by the emerging theme detector 120 .
- the user generated data 190 may include a large amount of historical data, and the emerging theme detector 120 may only process a batch of data at a time, such as the user feedbacks in an e-commerce website in the past week.
- the emerging theme detector 120 then processes the batch of the user generated data, which includes many data entries, to obtain relationships between any two of the data entries.
- the relationship may be represented by a semantic similarity score, where the higher the score, the more similar the two data entries.
- the emerging theme detector 120 clusters the data entries into different groups.
- the data entries in the same group have high semantic similarity score between each other.
- the groups are regarded as new emerging themes.
- the emerging theme detector 120 may also use a threshold to filter the groups, and only the groups that have a number of data entries greater than the threshold number, such as 50 or 60, are regarded as the new emerging themes. In certain embodiments, the emerging theme detector 120 further compares the detected new themes with the new themes detected at earlier times, such as in the three weeks previous to the past week, and keeps only the new themes that did not appear in those previous three weeks. The emerging theme detector 120 then sends the detected new themes to the new concept verifier 140.
- the new concept verifier 140, upon receiving the new themes, compares the new themes with the nodes in the ontology, where each node in the ontology represents a concept.
- the new concept verifier 140 calculates the novelty score of each new theme by comparing the similarity between each of the new themes and each of the concepts.
- the novelty score may be computed using a set of classification models.
- the new concept verifier 140 defines the new themes having the high novelty scores as verified new concepts or simply verified concepts.
- the new concept verifier 140 then sends the verified concepts to the ontology adjusting module 160 .
- the ontology adjusting module 160, upon receiving each of the verified concepts, calculates the similarity between the verified concept and the nodes in the ontology, and defines the node having the highest similarity as the parent node of the verified concept.
- the parent node may have multiple children nodes.
- the ontology adjusting module 160 compares the similarity between the verified concept with all the children nodes of the parent node (also termed sibling nodes of the verified concept), and determines the sibling node that has the highest similarity score with the verified concept, among those sibling nodes. That sibling node is termed determined sibling node. With the parent node and the determined sibling node at hand, the ontology adjusting module 160 then proposes several different adjustments.
- by performing insert, the ontology adjusting module 160 inserts the verified concept as a child node of the parent node. In certain embodiments, by performing lift, the ontology adjusting module 160 inserts the verified new concept as a sibling node of the parent node. In certain embodiments, by performing shift, the ontology adjusting module 160 inserts the verified new concept as a child node of the determined sibling node. In certain embodiments, by performing merge, the ontology adjusting module 160 merges the verified concept and the determined sibling node as a merged node. The merged node is a child node of the parent node, and the merged node is a parent node of the verified concept and the determined sibling node. The ontology adjusting module 160 then sends those proposed adjustments to the ontology updating module 170.
- the ontology updating module 170, upon receiving the proposed adjustments, evaluates which of the adjustments is optimal, and uses the optimal adjustment proposal to update the ontology.
- the tuning module 180 may further tune the whole system, and retain the related model according to the ontology changes.
- the models with high creditability are retained or defined with high weights, and the models with low creditability are discarded or defined with low weights.
- the system further includes a management interface 185 .
- the management interface 185 provides an interface, such as a graphic user interface (GUI) to the system manager, so that the manager can interact during the process with the application.
- the system manager can use the management interface 185 to visualize and demonstrate keywords, novelty threshold, occurring frequencies, and summaries of the new concepts, adjust novelty score threshold, verify new concepts, etc.
- the system may also include an initialization step to construct the ontology 196 from scratch.
- the initial ontology 196 is manually prepared by the system administrator.
- the ontology 196 is automatically constructed by: detecting emerging themes using a certain number of user generated data entries, classifying those emerging themes, and constructing the initial ontology 196 using the detected emerging themes as concepts of the ontology.
- the initialization of the ontology 196 is performed by supervising and revising the result of the above automatic method by the system manager.
- FIG. 5 schematically depicts a method for detecting emerging themes according to certain embodiments of the present disclosure.
- the method is implemented by the computing device 110 shown in FIG. 1 .
- the method shown in FIG. 5 corresponds to the function of the emerging theme detector 120 . It should be particularly noted that, unless otherwise stated in the present disclosure, the steps of the method may be arranged in a different sequential order, and are thus not limited to the sequential order as shown in FIG. 5 .
- the data cleaning and tokenizer 122 retrieves or receives a batch of data entries from the user generated data 190 .
- the batch of data entries may be, for example, user feedbacks on an e-commerce website in the last week.
- the number of data entries may vary, such as 10,000 data entries.
- after retrieving the data entries, at procedure 504, the data cleaning and tokenizer 122 cleans the data entries, and tokenizes the cleaned data entries into numbers.
- the data entries are generally text.
- the data cleaning and tokenizer may remove the image from the data entries or convert the image into texts.
- the data cleaning and tokenizer 122 then separates the text into words, and cleans the words by removing certain irrelevant symbols or words.
- the data cleaning and tokenizer 122 tokenizes each data entry into a numerical representation, and sends the tokenized text of the data entries to the sentiment analyzer 124, the similarity calculator 126, and the NLP 128.
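The cleaning and tokenizing procedure can be sketched with the standard library only. The stop-word list, the regular expression and the integer-id scheme are assumptions made for illustration; they are not taken from the data cleaning and tokenizer 122.

```python
import re

# Hypothetical list of words treated as irrelevant.
STOP_WORDS = {"the", "a", "an", "is", "and"}


def clean(text):
    """Lowercase the text, keep alphanumeric words, drop symbols and stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def tokenize(entries):
    """Map every cleaned word to an integer id; id 0 is left unused so it can
    serve as padding or as an unknown-word marker."""
    vocabulary = {}
    encoded = []
    for text in entries:
        ids = []
        for word in clean(text):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
            ids.append(vocabulary[word])
        encoded.append(ids)
    return encoded, vocabulary
```

The same vocabulary is shared across the batch, so identical words in different data entries receive the same number.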
- the sentiment analyzer 124 upon receiving the tokenized text of the data entries, predicts sentiment polarity for each of the tokenized text.
- the sentiment analyzer 124 defines five sentiments, and uses a pretrained model to give five corresponding values for each data entry.
- the five sentiments include positive, neutral, negative, very negative, and internet abuse.
- the pretrained model is a classification model such as gradient regression classifier, and the training data is retrieved from the training data 192 .
- the training data may be a set of data entries with sentiment labels, that is, positive, neutral, negative, very negative, and internet abuse features of the data entries. When the target data entries are different, the sentiment labeling may also be changed accordingly.
- the result of the sentiment analyzer 124 for a data entry may be [0.7, 0.2, 0.1, 0.0, 0.0], i.e., positive 0.7, neutral 0.2, negative 0.1, very negative 0.0, and internet abuse 0.0. Accordingly, the data entry is very likely a positive feedback, possibly neutral, and has a very low possibility of being negative.
- the sentiment analyzer 124 sends the result to the semantic scorer 130 .
- the similarity calculator 126, upon receiving the tokenized data entries, computes the text similarity between any two of the tokenized data entries based on sentence embedding. Specifically, the similarity calculator 126 represents the words in each text (i.e., each cleaned and tokenized data entry) by an n-dimensional vector space, where semantically similar or semantically related words come closer depending on the training model. After representation of the texts by vectors, the similarity calculator 126 calculates the similarity between any two of the texts. In certain embodiments, for calculating the similarity, the similarity calculator 126 not only considers the meaning of the words in the text, but also the relationship of the words in the texts, especially the sequence of the words in the text.
- the similarity score is represented by a number between 0 and 1, where 0 indicates that two data entries are distant in the vector space and have no similarity at all, and 1 indicates that the two data entries are close or overlapped in the vector space and are substantially the same.
- the two texts are regarded as very similar if the similarity score is greater than a threshold of about 0.6-0.8, and regarded as less similar if the similarity score is lower than about 0.6.
- the comparison between two tokenized texts results in multiple scores, each score corresponding to one word or multiple words having similar features. For example, words in the text that relate to color are chosen for comparison, so that the result of the comparison includes a similarity score that corresponds to color.
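- The text-similarity step can be sketched as follows, assuming each data entry has already been turned into an embedding vector. Cosine similarity clamped to the range 0 to 1 is an assumption made for illustration; the disclosure does not name the similarity measure, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector carries no direction; treat as dissimilar
    return dot / (norm_u * norm_v)


def similarity_score(u, v):
    """Clamp the cosine into [0, 1] so that 0 means no similarity at all."""
    return max(0.0, min(1.0, cosine_similarity(u, v)))


def is_very_similar(u, v, threshold=0.6):
    """Apply a threshold taken from the 0.6-0.8 range mentioned in the text."""
    return similarity_score(u, v) > threshold
```

- Identical vectors score 1, orthogonal or opposed vectors score 0, and the threshold argument can be raised toward 0.8 for a stricter notion of "very similar".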
- the NLP 128 upon receiving the cleaned and tokenized data entries (text), determines the syntactic structure of the text by analyzing its constituent words based on an underlying grammar. In certain embodiments, the NLP 128 uses part-of-speech tagging. In certain embodiments, the NLP 128 evaluates the syntactic or grammatical complexity of the data entry, and represents the complexity as a real number. After obtaining a number for each cleaned and tokenized data entry, the NLP 128 sends the numbers to the semantic scorer 130 .
- the procedures 506 , 508 and 510 are performed in parallel or independently.
- the semantic scorer 130 upon receiving the sentiment polarity of each of the data entries from the sentiment analyzer 124 , the similarity scores between any two of the data entries from the similarity calculator 126 , and the NLP score of each of the data entries, calculates the semantic similarity score for each pair of data entries, i.e., for any two of the data entries.
- the semantic scorer 130 calculates the semantic similarity score based on the above three types of features using the formula:
- in the formula, n is the number of major types of features, or feature sources: the sentiment features, the text similarity features, and the syntactic features; $s_i$ is the weight of the i-th feature source; $f_j$ is a feature function that measures the similarity between two data entries, and each of the feature sources (sentiment, text similarity, syntactic) may include one or more feature functions; k is the total number of feature functions; and $w_j$ is the weight of $f_j$.
- the parameters in the formula may be obtained using training data sets with a training model, or the parameters may be pre-determined values entered by the system manager.
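- The formula itself did not survive in this text. As a hedged reconstruction from the symbol definitions given for it, and not a quotation of the patent, one consistent reading is a weighted sum over the n feature sources, each contributing a weighted sum of its own feature functions. The set $F_i$ (the feature functions belonging to source i, with the sets together holding all k functions) and the arguments a and b (two data entries) are notation introduced here for illustration only:

```latex
\mathrm{score}(a, b) \;=\; \sum_{i=1}^{n} s_i \sum_{j \in F_i} w_j \, f_j(a, b),
\qquad \sum_{i=1}^{n} \lvert F_i \rvert = k
```

- Under this reading, raising $s_i$ increases the influence of an entire feature source (for example, sentiment), while $w_j$ tunes a single feature function within that source.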
- the semantic similarity scores are positive numbers. After obtaining the semantic similarity score between each pair of data entries (cleaned and tokenized texts) using the above formula, the semantic scorer 130 sends the semantic similarity scores to the cluster classifier 132 .
- the cluster classifier 132 upon receiving the semantic similarity scores between each pair (any two) of the data entries, classifies the data entries based on the semantic similarity scores. Specifically, the cluster classifier 132 groups the data entries into clusters such that the data entries in the same cluster have high semantic similarity scores.
- a threshold is defined for the clusters, which means that any two data entries in the same cluster have a semantic similarity score greater than the threshold score.
- the value of the threshold score may be determined based on the subject matter of the data entries, the required recall, and the required precision. In certain embodiments, a small threshold value is given when high recall is needed. In certain embodiments, a large threshold value is given when high precision is needed.
- the cluster classifier 132 stores the clusters into the new theme database 194 .
- each cluster includes one or more data entries, and the cluster classifier 132 may store only the clusters that have a large number of data entries.
- the threshold number of data entries in the clusters may be set at about 5-500. In certain embodiments, the threshold number is set in a range of 25-120. In certain embodiments, the threshold number is set in the range of about 50-60. In one example, the average cluster size within a week is about 50, the threshold number is set at 60, and the stored clusters are very likely real themes or topics. The stored clusters are also named emerging themes.
- the emerging theme detector 120 obtains a certain number of new themes, each new theme including some data entries.
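- The clustering and size-filtering steps above can be sketched as follows. The greedy assignment is an assumption chosen because it keeps the stated property that any two entries in one cluster score above the threshold; the disclosure does not name a clustering algorithm. All function and variable names are hypothetical, and the toy scores are made up for illustration.

```python
def cluster_by_threshold(entries, score, threshold):
    """Greedy grouping: an entry joins the first cluster in which its score
    against every current member exceeds the threshold; otherwise it starts
    a new cluster."""
    clusters = []
    for entry in entries:
        for cluster in clusters:
            if all(score(entry, member) > threshold for member in cluster):
                cluster.append(entry)
                break
        else:
            clusters.append([entry])
    return clusters


def emerging_themes(clusters, min_size):
    """Keep only clusters holding at least min_size data entries."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Toy pairwise semantic similarity scores (illustrative values only).
toy_scores = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"a", "c"}): 0.8,
    frozenset({"b", "c"}): 0.85,
    frozenset({"a", "d"}): 0.1,
    frozenset({"b", "d"}): 0.2,
    frozenset({"c", "d"}): 0.3,
}


def toy_score(x, y):
    return toy_scores[frozenset({x, y})]


clusters = cluster_by_threshold(["a", "b", "c", "d"], toy_score, threshold=0.5)
themes = emerging_themes(clusters, min_size=3)
```

- A lower `threshold` merges more entries into each cluster (favoring recall), while a higher one splits them (favoring precision), in line with the trade-off described for the threshold score.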
- the procedures may be performed repeatedly by batch in a predetermined time interval, such as weekly or monthly.
- the user generated entries are collected and stored by week, and the emerging theme detector 120 processes the data entries in a week when the data entries are available.
- the new theme database 194 includes different sets of new themes, each set corresponding to data entries from a specific week or a specific month.
- FIG. 6 schematically depicts a method for verifying new themes to obtain new concepts according to certain embodiments of the present disclosure.
- the new concepts are verified new themes.
- the method is implemented by the computing device 110 shown in FIG. 1 .
- the method shown in FIG. 6 corresponds to the function of the new concept verifier 140 .
- the steps of the method may be arranged in a different sequential order, and are thus not limited to the sequential order as shown in FIG. 6 .
- the procedures shown in FIG. 6 are performed sequentially after the procedures shown in FIG. 5 .
- the new theme retrieving module 142 retrieves new themes from the new theme database 194 .
- the retrieved new themes include a current batch of new themes for analysis and a few previous batches of new themes that have already been analyzed before.
- the new theme retrieving module 142 retrieves new themes from the most recent week (hereinafter referred to as week 0) and new themes from the three weeks previous to the most recent week (hereinafter referred to as week -1, week -2, and week -3).
- the batches of week 0, week -1, week -2, and week -3 respectively include, for example, 120, 130, 11, and 140 new themes.
- Each batch of new themes is obtained through the procedures shown in FIG. 5 by analyzing that week of data entries.
- After retrieving the new themes, the new theme retrieving module 142 sends the new themes to the near duplicate identification module 144 .
- the near duplicate identification module 144 identifies duplicated themes in the week 0 themes. Specifically, to determine whether one theme in week 0 is a duplicate of one of the themes in week -1, week -2 or week -3 (termed the target theme hereinafter), the near duplicate identification module 144 : first calculates semantic similarity scores between a data entry in the week 0 theme and the data entries in the target theme and, based on the semantic similarity scores, determines whether the week 0 data entry belongs to the target theme; then repeats the process to determine, for each of the week 0 data entries, whether it belongs to the target theme; and after that, computes the percentage of the week 0 data entries that belong to the target theme.
- when that percentage is sufficiently high, the near duplicate identification module 144 determines that the week 0 theme is a duplicate of the target theme. If not, the near duplicate identification module 144 continues to compare the week 0 theme with all the other week -1, week -2 and week -3 themes. If the week 0 theme is not a duplicate of any of the week -1, week -2 and week -3 themes, the near duplicate identification module 144 determines that the week 0 theme is a nonduplicate theme. The near duplicate identification module 144 repeats the above process for each of the week 0 themes, obtains the nonduplicate themes from the week 0 themes, and sends the nonduplicate themes to the concept comparing module 146 . In one example, among the 120 week 0 new themes, 90 of them have one or more duplicate themes in the week -1, week -2 or week -3 themes, and 30 of them are nonduplicate themes.
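- The near-duplicate check can be sketched as follows. Two points are assumptions made for illustration: an entry is treated as belonging to a target theme when its best score against any target entry exceeds `entry_threshold`, and a theme counts as a duplicate when the fraction of such entries reaches `theme_threshold`. The word-overlap scorer `jaccard` is a toy stand-in for the semantic similarity score, and all names are hypothetical.

```python
def fraction_belonging(theme, target, score, entry_threshold):
    """Fraction of entries in `theme` judged to belong to `target`."""
    if not theme or not target:
        return 0.0
    hits = sum(
        1
        for entry in theme
        if max(score(entry, other) for other in target) > entry_threshold
    )
    return hits / len(theme)


def nonduplicate_themes(current, previous, score, entry_threshold, theme_threshold):
    """Keep the current-week themes that duplicate none of the earlier themes."""
    return [
        theme
        for theme in current
        if all(
            fraction_belonging(theme, target, score, entry_threshold) < theme_threshold
            for target in previous
        )
    ]


def jaccard(a, b):
    """Toy similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


week0 = [
    ["battery drains fast", "battery drains quickly"],
    ["screen cracked on arrival"],
]
earlier = [["battery drains very fast", "phone battery drains fast"]]
fresh = nonduplicate_themes(
    week0, earlier, jaccard, entry_threshold=0.3, theme_threshold=0.5
)
```

- In this toy run the battery theme repeats an earlier theme and is dropped, while the screen theme is kept as a nonduplicate.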
- the concept comparing module 146 computes whether the nonduplicate themes belong to existing concepts. Specifically, for each concept in the ontology, a binary text classifier is constructed and trained, i.e., each concept in the ontology database has its own text classifier model.
- the classifier model is a logistic regression or gradient boosting classifier.
- Each theme of the nonduplicate themes (such as the 30 nonduplicate themes) includes a number of data entries. Each data entry in the nonduplicate theme is used as an input of the classifier of one concept (termed the target concept hereinafter), so as to obtain a Boolean value indicating whether the data entry belongs to the target concept.
- the percentage of the data entries in the nonduplicate theme that belong to the target concept can then be computed. For example, if a nonduplicate theme T contains 100 data entries, and 90 of the data entries belong to a given target concept C, then the probability of the nonduplicate theme T belonging to the target concept C is 90%.
- a highest probability is recorded corresponding to one of the concepts.
- each of the 30 nonduplicate themes is given a probability score against one of the concepts (the highest score when comparing with all the concepts).
- the concept comparing module 146 then sends those 30 probability scores, each corresponding to one of the concepts, to the concept proposing module 148 .
- the concept proposing module 148 ranks the 30 nonduplicate themes based on their probability scores, and proposes the new themes that have a low probability score as proposed concepts.
- the low probability score is defined as less than about 0.4. In certain embodiments, the low probability is defined as less than 0.25.
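- The comparison against existing concepts and the proposal of low-probability themes can be sketched as follows. The keyword-based classifiers stand in for the trained per-concept binary classifiers and are assumptions for illustration, as are all function names; themes are assumed to be non-empty.

```python
def theme_probability(theme, classifier):
    """Share of the theme's data entries that a concept's classifier accepts."""
    return sum(1 for entry in theme if classifier(entry)) / len(theme)


def best_concept(theme, classifiers):
    """Return (concept, probability) for the concept with the highest probability."""
    return max(
        ((name, theme_probability(theme, clf)) for name, clf in classifiers.items()),
        key=lambda pair: pair[1],
    )


def propose_concepts(themes, classifiers, low=0.4):
    """Propose themes whose highest probability is below `low`, lowest first."""
    proposals = []
    for theme_id, theme in themes.items():
        concept, probability = best_concept(theme, classifiers)
        if probability < low:
            proposals.append((theme_id, concept, probability))
    return sorted(proposals, key=lambda item: item[2])


# Toy stand-ins for trained binary classifiers (one per existing concept).
classifiers = {
    "battery": lambda text: "battery" in text,
    "shipping": lambda text: "delivery" in text or "shipping" in text,
}
themes = {
    "T1": ["battery died", "battery swells", "slow delivery", "battery hot"],
    "T2": ["app crashes", "login fails", "battery fine", "update broke it"],
}
proposed = propose_concepts(themes, classifiers)
```

- Theme T1 fits the existing battery concept well and is not proposed, whereas T2 fits no existing concept well and is proposed as a candidate new concept.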
- the number of new themes having a low probability score may be eight, and the concept proposing module 148 then sends the proposed concepts, such as the eight proposed concepts from the 30 nonduplicate new themes, to the concept verification module 150 .
- the concept verification module 150 upon receiving the proposed concepts, presents the proposed concepts, such as the eight proposed concepts, to the system administrator, and the system administrator verifies the proposed concepts, for example by selecting five of the eight proposed concepts as real concept candidates.
- the concept verification module 150 may further label the 120 week 0 new themes with “duplicated data entry,” “unverified concept,” or “verified concept” in the new theme database 194 , and sends the five verified concepts to the ontology adjusting module 160 .
- the concept verification may not be necessary, and the concept proposing module 148 sends the proposed concepts (such as the eight proposed concepts) directly to the ontology adjusting module 160 .
- the verification may also be performed automatically using certain criteria, such as the feature of the theme word.
- FIG. 7 schematically depicts a method for proposing ontology adjustments based on the verified concepts and updating the ontology using an optimal adjustment according to certain embodiments of the present disclosure.
- the method is implemented by the computing device 110 shown in FIG. 1 .
- the method shown in FIG. 7 corresponds to the function of the ontology adjustment module 160 and the ontology updating module 170 .
- the steps of the method may be arranged in a different sequential order, and are thus not limited to the sequential order as shown in FIG. 7 .
- the procedures shown in FIG. 7 are performed sequentially after the procedures shown in FIG. 6 .
- the ontology and new concept retrieving module 162 retrieves the ontology 196 and retrieves (or receives) the verified concepts from the concept verification module 150 , and sends the retrieved data to the parent concept detection module 164 .
- the following procedures are described in relation to one verified concept, and each of the new verified concepts should be processed similarly.
- the parent concept module 164 detects a parent concept from the ontology for each of the verified concepts.
- each of the existing concepts from the ontology has a classifier.
- the verified concept includes a plurality of data entries.
- the parent concept module 164 inputs the text content of each data entry of the verified concept to the classifier of the existing concept, so as to obtain a value. The value indicates whether the text of the new concept belongs to the ontology concept.
- the percentage of the data entries belonging to the existing concept is calculated and regarded as the possibility that the verified concept belongs to the existing concept.
- the parent concept module 164 compares the data entries of the verified concept to each of the existing concepts (nodes) in the ontology, and obtains the possibilities of the verified concept belonging to each of the existing concepts. The parent concept module 164 then selects the existing concept that corresponds to the highest possibility as the parent concept of the verified concept. The parent concept module 164 then sends the ontology, the selection of the parent concept, and the verified concept (or their specific identification) to the sibling concept similarity module 166 . In certain embodiments, the parent concept module 164 may provide not only the most relevant parent concept, but also a list of relevant parent concepts with the corresponding possibility values for the verified concept. The results may be presented and selected through the management interface 185 .
- the sibling concept similarity module 166 determines all child concepts of the parent concept, which are also termed sibling concepts of the verified concept; calculates the possibilities of the data entries in the verified concept belonging to one of the sibling concepts using the classifier of that sibling concept; calculates the percentage of data entries belonging to the sibling concept; repeats the process to calculate the percentages of the data entries against each of the sibling concepts; and selects the one sibling concept with the highest percentage. Then the sibling concept similarity module 166 sends the parent concept and the most closely related sibling concept (having the highest percentage) to the adjustment proposing module 168 .
- the sibling concept similarity module 166 may provide not only the most closely related sibling concept, but also a list of related sibling concepts with the corresponding possibility values for the verified concept. In certain embodiments, the sibling concept similarity module 166 may include more than one list of sibling concepts, each list corresponding to one relevant parent concept, and the system manager views and selects the parent concept and the sibling concept for the verified concept through the management interface 185 .
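- The parent and sibling selection described above can be sketched as follows. The ontology is modeled as a plain mapping from each concept to its child concepts, and the keyword classifiers stand in for the trained per-concept classifiers; both are assumptions for illustration, and all names are hypothetical.

```python
def membership(entries, classifier):
    """Percentage (as a fraction) of entries the concept's classifier accepts."""
    return sum(1 for entry in entries if classifier(entry)) / len(entries)


def pick_parent_and_sibling(entries, children, classifiers):
    """Pick the existing concept with the highest membership as parent, then the
    child of that parent with the highest membership as closest sibling."""
    parent = max(classifiers, key=lambda c: membership(entries, classifiers[c]))
    siblings = children.get(parent, [])
    sibling = max(
        siblings,
        key=lambda c: membership(entries, classifiers[c]),
        default=None,  # the parent may have no children yet
    )
    return parent, sibling


# Toy ontology: each concept maps to its list of child concepts.
children = {"electronics": ["phones", "laptops"], "phones": [], "laptops": []}
classifiers = {
    "electronics": lambda text: True,
    "phones": lambda text: "phone" in text,
    "laptops": lambda text: "laptop" in text,
}
entries = [
    "phone case cracked",
    "phone charger fails",
    "laptop charger fails",
    "earbuds hiss",
]
parent, sibling = pick_parent_and_sibling(entries, children, classifiers)
```

- Returning a ranked list instead of a single winner, as the text also allows, would only require replacing each `max` call with a `sorted` call over the same key.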
- the adjustment proposing module 168 upon receiving the most relevant parent concept and the most closely related sibling concept, proposes several ways of adjusting the hierarchy structure of the ontology.
- the adjustment proposing module 168 may insert the new concept candidate as a child node of the parent concept.
- the adjustment proposing module 168 may propose the elementary operations of lift, shift and merge as shown in FIGS. 3B-3D .
- the adjustment proposing module 168 then sends the proposed adjustments to the modification verification module 172 .
- the modification verification module 172 upon receiving the proposed adjustments, verifies the adjustments. Specifically, for a dataset D, each proposed adjustment has a corresponding hierarchy.
- the optimal hierarchy from the plurality of hierarchies can be determined by:
- $$H_{opt} = \arg\max_{H} \log p(D \mid H).$$
- the optimal hierarchy is then defined as the verified hierarchy.
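- Selecting the optimal hierarchy amounts to scoring each candidate and keeping the best one. The sketch below assumes, purely for illustration, that macro-averaged recall of held-out predictions made under each candidate hierarchy is used as the score in place of the log-likelihood; how the likelihood is estimated is not given in this text. All names and the toy labels are hypothetical.

```python
def macro_averaged_recall(true_labels, predicted_labels):
    """Mean of per-class recall over the classes present in true_labels."""
    recalls = []
    for label in sorted(set(true_labels)):
        total = sum(1 for t in true_labels if t == label)
        hits = sum(
            1
            for t, p in zip(true_labels, predicted_labels)
            if t == label and p == label
        )
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def select_optimal(candidates, score):
    """Return the candidate with the highest score (the arg max)."""
    return max(candidates, key=score)


# Toy held-out labels and the predictions obtained under two candidate adjustments.
truth = ["a", "a", "b", "b"]
predictions = {
    "insert": ["a", "a", "b", "a"],
    "merge": ["a", "b", "b", "a"],
}
best = select_optimal(
    predictions, lambda name: macro_averaged_recall(truth, predictions[name])
)
```

- Macro averaging weights every class equally, so a hierarchy is not rewarded merely for doing well on the largest concepts.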
- the manager interface 185 may provide means for the system manager to change the parameters, such as recall, so as to change the resulting optimal hierarchy and optimize the results.
- the modification verification module 172 presents the verification result through the manager interface 185 , which may include the list of the proposed adjustments and the numerical values indicating whether the proposed adjustments are optimal.
- the system manager may validate the verified adjustments by selecting one of the proposed adjustments through the manager interface 185 , and if the validation selection is yes, the validation is sent by the manager interface 185 to the updating module 174 . If the system manager determines that the adjustment(s) is not valid, the system manager may provide an instruction to the parent concept module 164 via the manager interface 185 , such that the parent concept module 164 detects a parent concept for another verified concept. In certain embodiments, the system manager may provide an instruction to the adjustment proposing module 168 via the manager interface 185 , such that the adjustment proposing module 168 proposes different adjustments for the hierarchy using different parameters. In certain embodiments, the validation step is not necessary, and the verified adjustment is sent directly to the updating module 174 .
- the updating module 174 updates the ontology using the validated adjustment.
- the method further includes a tuning mechanism, where the tuning module 180 analyzes the updated ontology, and retrains the related models according to the updated ontology.
- certain embodiments of the present disclosure provide a semantic analysis pipeline to automatically mine and detect emerging themes and new concepts from user-generated data. Further, a management interface is provided to present the detected themes along with statistical information, generated summarization, and sentiment distribution, and to receive instructions from the system manager to adjust parameters of the system.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
Description
wherein $s_i$ is the weight of the feature sources, $f_j$ is one of the feature similarities between the two of the data entries, $w_j$ is a weight of $f_j$, and j, k and n are positive integers.
The macro-averaged recall is
By comparing the recall for each of the hierarchies, the optimal hierarchy can be determined. The
- 1. Tomas Mikolov, Ilya Sutskever et al, Distributed representations of words and phrases and their compositionality, 2013, arXiv: 1310.4546 [cs.CL]
- 2. Quoc Le, Tomas Mikolov, Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, JMLR 32(2): 1188-1196.
- 3. Yoon Kim, Convolutional Neural Networks for Sentence Classification, arXiv:1408.5882 [cs.CL].
- 4. Lei Tang, Jianping Zhang, Huan Liu, Automatically adjusting content taxonomies for hierarchical classification, Proceedings, 2006.
- 5. Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, 2008 V2(1-2): 1-135.
- 6. David M Blei, Probabilistic topic models, Communications of the ACM, 2012, V55(4): 77-84.
- 7. Kunal Punera, Suju Rajan, Joydeep Ghosh, Automatically learning document taxonomies for hierarchical classification, Special interest tracks and posters of the 14th international conference on World Wide Web, 2005, pp. 1010-1011.
- 8. Mikolov, Tomas; et al. Efficient estimation of word representations in vector space, 2013, CoRR, abs/1301.3781.
- 9. Sanjeev Arora, Yingyu Liang, Tengyu Ma, A simple but tough-to-beat baseline for sentence embeddings, ICLR 2017.
- 10. Yiming Yang, Thomas Ault, Thomas Pierce and Charles W Lattimer, Improving text categorization methods for event tracking, SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2000, pp. 65-72.
- 11. Jian Zhang, Zoubin Ghahramani and Yiming Yang, A probabilistic model for online document clustering with application to novelty detection, NIPS'04 Proceedings of the 17th International Conference on Neural Information Processing Systems, 2004, pp 1617-1624.
- 12. Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, From word embeddings to document distances, Proceedings of the 32nd International Conference on Machine Learning, Lille, France, JMLR: W&CP, 2015, v37: 857-966.
- 13. Dingquan Wang, Weinan Zhang, Gui-Rong Xue, and Yong Yu, Deep classifier for large scale hierarchical text classification, Proceedings, 2009.
- 14. David M. Blei, John D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pa., 2006, pp 113-120.
- 15. Wilas Chamlertwat, Pattarasinee Bhattarakosol, Tippakorn Rungkasiri, Discovering consumer insight from twitter via sentiment analysis, Journal of Universal Computer Science, 2012, V18(8): 973-992.
- 16. https://github.com/dmlc/xgboost
- 17. Nagaraju Bandaru, Eric D. Moyer, Shrisha Radhakrishna, Method and system for analyzing user-generated content, U.S. Published Patent Application No. 2008/0133488 A1, 2008.
- 18. Rui Cai, Qiang Hao, Changhu Wang, Rong Xiao, Lei Zhang, Mining topic-related aspects from user generated content, U.S. Pat. No. 8,458,115 B2, 2013.
- 19. Rajeev Dadia, Vidya Sagar, Anisingaraju, Prashanth Talanki, Systems and methods for analyzing consumer sentiment with social perspective insight, U.S. Published Patent Application No. 2016/0196564A1, 2016.
Claims (18)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/174,140 US11003638B2 (en) | 2018-10-29 | 2018-10-29 | System and method for building an evolving ontology from user-generated content |
CN201911031161.6A CN111104518A (en) | 2018-10-29 | 2019-10-28 | System and method for building an evolving ontology from user-generated content |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/174,140 US11003638B2 (en) | 2018-10-29 | 2018-10-29 | System and method for building an evolving ontology from user-generated content |
Publications (2)
Publication Number | Publication Date |
---|---|
US20200134058A1 US20200134058A1 (en) | 2020-04-30 |
US11003638B2 true US11003638B2 (en) | 2021-05-11 |
Family
ID=70327191
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/174,140 Active 2039-05-29 US11003638B2 (en) | 2018-10-29 | 2018-10-29 | System and method for building an evolving ontology from user-generated content |
Country Status (2)
Country | Link |
---|---|
US (1) | US11003638B2 (en) |
CN (1) | CN111104518A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11238235B2 (en) * | 2019-09-18 | 2022-02-01 | International Business Machines Corporation | Automated novel concept extraction in natural language processing |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11422996B1 (en) * | 2018-04-26 | 2022-08-23 | Snap Inc. | Joint embedding content neural networks |
US11170169B2 (en) * | 2019-03-29 | 2021-11-09 | Innoplexus Ag | System and method for language-independent contextual embedding |
US12106240B2 (en) * | 2019-04-14 | 2024-10-01 | Jamil JADALLAH | Systems and methods for analyzing user projects |
US11030257B2 (en) | 2019-05-20 | 2021-06-08 | Adobe Inc. | Automatically generating theme-based folders by clustering media items in a semantic space |
US11227018B2 (en) * | 2019-06-27 | 2022-01-18 | International Business Machines Corporation | Auto generating reasoning query on a knowledge graph |
US11573995B2 (en) * | 2019-09-10 | 2023-02-07 | International Business Machines Corporation | Analyzing the tone of textual data |
US11487943B2 (en) * | 2020-06-17 | 2022-11-01 | Tableau Software, LLC | Automatic synonyms using word embedding and word similarity models |
US11501070B2 (en) * | 2020-07-01 | 2022-11-15 | International Business Machines Corporation | Taxonomy generation to insert out of vocabulary terms and hypernym-hyponym pair induction |
US11392769B2 (en) * | 2020-07-15 | 2022-07-19 | Fmr Llc | Systems and methods for expert driven document identification |
CN111930976B (en) * | 2020-07-16 | 2024-05-28 | 平安科技(深圳)有限公司 | Presentation generation method, device, equipment and storage medium |
CN111950264B (en) * | 2020-08-05 | 2024-04-26 | 广东工业大学 | Text data enhancement method and knowledge element extraction method |
US11550832B2 (en) | 2020-10-02 | 2023-01-10 | Birchhoover Llc | Systems and methods for micro-credential accreditation |
US11551001B2 (en) * | 2020-11-10 | 2023-01-10 | Discord Inc. | Detecting online contextual evolution of linguistic terms |
US12124495B2 (en) * | 2021-02-05 | 2024-10-22 | Mercari, Inc. | Generating hierarchical ontologies |
US20220366269A1 (en) * | 2021-05-11 | 2022-11-17 | International Business Machines Corporation | Interactive feature engineering in automatic machine learning with domain knowledge |
US11922122B2 (en) * | 2021-12-30 | 2024-03-05 | Calabrio, Inc. | Systems and methods for detecting emerging events |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110004588A1 (en) * | 2009-05-11 | 2011-01-06 | iMedix Inc. | Method for enhancing the performance of a medical search engine based on semantic analysis and user feedback |
US7930302B2 (en) * | 2006-11-22 | 2011-04-19 | Intuit Inc. | Method and system for analyzing user-generated content |
US20110264649A1 (en) * | 2008-04-28 | 2011-10-27 | Ruey-Lung Hsiao | Adaptive Knowledge Platform |
US20130138696A1 (en) * | 2011-11-30 | 2013-05-30 | The Institute for System Programming of the Russian Academy of Sciences | Method to build a document semantic model |
US8458115B2 (en) | 2010-06-08 | 2013-06-04 | Microsoft Corporation | Mining topic-related aspects from user generated content |
US20140195518A1 (en) * | 2013-01-04 | 2014-07-10 | Opera Solutions, Llc | System and Method for Data Mining Using Domain-Level Context |
US20140324750A1 (en) * | 2013-04-24 | 2014-10-30 | Alcatel-Lucent | Ontological concept expansion |
US10078843B2 (en) | 2015-01-05 | 2018-09-18 | Saama Technologies, Inc. | Systems and methods for analyzing consumer sentiment with social perspective insight |
US10303688B1 (en) * | 2018-06-13 | 2019-05-28 | Stardog Union | System and method for reducing data retrieval delays via prediction-based generation of data subgraphs |
US10878191B2 (en) * | 2016-05-10 | 2020-12-29 | Nuance Communications, Inc. | Iterative ontology discovery |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930462A (en) * | 2010-08-20 | 2010-12-29 | 华中科技大学 | Comprehensive body similarity detection method |
US8620964B2 (en) * | 2011-11-21 | 2013-12-31 | Motorola Mobility Llc | Ontology construction |
CN107895012B (en) * | 2017-11-10 | 2021-10-08 | 上海电机学院 | Ontology construction method based on Topic Model |
Non-Patent Citations (15)
Title |
---|
Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval, 2008 V2(1-2): 1-135. |
David M Blei, Probabilistic topic models, Communications of the ACM, 2012, V55(4): 77-84. |
David M. Blei, John D. Lafferty, Dynamic topic models, Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006, pp. 113-120. |
Dingquan Wang, Weinan Zhang, Gui-Rong Xue, and Yong Yu, Deep classifier for large scale hierarchical text classification, Proceedings, 2009. |
Jian Zhang, Zoubin Ghahramani and Yiming Yang, A probabilistic model for online document clustering with application to novelty detection, NIPS'04 Proceedings of the 17th International Conference on Neural Information Processing Systems, 2004, pp. 1617-1624. |
Kunal Punera, Suju Rajan, Joydeep Ghosh, Automatically learning document taxonomies for hierarchical classification, Special interest tracks and posters of the 14th international conference on World Wide Web, 2005, pp. 1010-1011. |
Lei Tang , Jianping Zhang, Huan Liu, Automatically adjusting content taxonomies for hierarchical classification, Proceedings, 2006. |
Matt J. Kusner , Yu Sun , Nicholas I. Kolkin , Kilian Q. Weinberger, From word embeddings to document distances, Proceedings of the 32nd International Conference on Machine Learning, Lille, France, JMLR: W&CP, 2015, v37: 857-966. |
Mikolov, Tomas; et al. Efficient estimation of word representations in vector space, 2013, CoRR, abs/1301.3781. |
Quoc Le, Tomas Mikolov, Distributed representations of sentences and documents, Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014, JMLR 32(2): 1188-1196. |
Sanjeev Arora, Yingyu Liang, Tengyu Ma, A simple but tough-to-beat baseline for sentence embeddings, ICLR 2017. |
Tomas Mikolov, Ilya Sutskever et al, Distributed representations of words and phrases and their compositionality, 2013, arXiv:1310.4546 [cs.CL]. |
Wilas Chamlertwat, Pattarasinee Bhattarakosol, Tippakorn Rungkasiri, Discovering consumer insight from twitter via sentiment analysis, Journal of Universal Computer Science, 2012, V18(8): 973-992. |
Yiming Yang, Thomas Ault, Thomas Pierce and Charles W Lattimer, Improving text categorization methods for event tracking, SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, 2000, pp. 65-72. |
Yoon Kim, Convolutional Neural Networks for Sentence Classification, arXiv:1408.5882 [cs.CL]. |
Also Published As
Publication number | Publication date |
---|---|
CN111104518A (en) | 2020-05-05 |
US20200134058A1 (en) | 2020-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11003638B2 (en) | System and method for building an evolving ontology from user-generated content | |
Bagheri et al. | Care more about customers: Unsupervised domain-independent aspect detection for sentiment analysis of customer reviews | |
Wang et al. | Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach | |
Neethu et al. | Sentiment analysis in twitter using machine learning techniques | |
US10853697B2 (en) | System and method for monitoring online retail platform using artificial intelligence and fixing malfunction | |
Aries et al. | Automatic text summarization: What has been done and what has to be done | |
Xiang et al. | Bridging domains using world wide knowledge for transfer learning | |
Kamal | Subjectivity classification using machine learning techniques for mining feature-opinion pairs from web opinion sources | |
Falke et al. | Concept-map-based multi-document summarization using concept coreference resolution and global importance optimization | |
Cheema et al. | Check_square at checkthat! 2020: Claim detection in social media via fusion of transformer and syntactic features | |
Kamal et al. | Mining feature-opinion pairs and their reliability scores from web opinion sources | |
Rodrigues et al. | Real‐Time Twitter Trend Analysis Using Big Data Analytics and Machine Learning Techniques | |
Bollegala et al. | ClassiNet--Predicting missing features for short-text classification | |
Thukral et al. | DiffQue: Estimating relative difficulty of questions in community question answering services | |
Ahmad et al. | Opinion mining using frequent pattern growth method from unstructured text | |
Chou et al. | Boosted web named entity recognition via tri-training | |
Srivastava et al. | Ensemble methods for sentiment analysis of on-line micro-texts | |
Makrynioti et al. | PaloPro: a platform for knowledge extraction from big social data and the news | |
Jafari Sadr et al. | Popular tag recommendation by neural network in social media | |
Joshi et al. | An Inventive Movie Suggestion System Using Machine Learning Techniques | |
Chaudhary et al. | Fake News Detection During 2016 US Elections Using Bootstrapped Metadata-Based Naïve Bayesian Classifier | |
Batista | Large-scale semantic relationship extraction for information discovery | |
Robinson | Disaster tweet classification using parts-of-speech tags: a domain adaptation approach | |
Shi et al. | Story disambiguation: Tracking evolving news stories across news and social streams | |
Tuarob et al. | Twittdict: Extracting social oriented keyphrase semantics from twitter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO., LTD., CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIU, SHIZHU;HUANG, KAINLIN;CHEN, LI;AND OTHERS;SIGNING DATES FROM 20181026 TO 20181028;REEL/FRAME:047345/0138 |
Owner name: JD.COM AMERICAN TECHNOLOGIES CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIU, SHIZHU;HUANG, KAINLIN;CHEN, LI;AND OTHERS;SIGNING DATES FROM 20181026 TO 20181028;REEL/FRAME:047345/0138 |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
Free format text: AWAITING TC RESP, ISSUE FEE PAYMENT VERIFIED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |