
US11397731B2 - Method and system for interactive keyword optimization for opaque search engines - Google Patents


Info

Publication number
US11397731B2
Authority
US
United States
Prior art keywords
posts
MRE
retrieved
document
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/840,538
Other versions
US20200327120A1 (en)
Inventor
Rami Puzis
Aviad Elyashar
Maor Reuben
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BG Negev Technologies and Applications Ltd
Original Assignee
BG Negev Technologies and Applications Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BG Negev Technologies and Applications Ltd filed Critical BG Negev Technologies and Applications Ltd
Priority to US16/840,538 priority Critical patent/US11397731B2/en
Assigned to B. G. NEGEV TECHNOLOGIES AND APPLICATIONS LTD., AT BEN-GURION UNIVERSITY. Assignment of assignors interest (see document for details). Assignors: PUZIS, RAMI; ELYASHAR, AVIAD; REUBEN, MAOR
Publication of US20200327120A1 publication Critical patent/US20200327120A1/en
Priority to US17/854,917 priority patent/US11809423B2/en
Application granted granted Critical
Publication of US11397731B2 publication Critical patent/US11397731B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2453: Query optimisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 16/9532: Query formulation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines

Definitions

  • the present invention relates to the field of data search engines. More particularly, the present invention relates to a method and system for interactive keyword optimization for opaque search engines.
  • Short keyword queries are one of the main milestones of any user or bot seeking information through the ubiquitous search engines available on the Web [Chirita et al., “Personalized query expansion for the web”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 7-14. ACM, 2007].
  • Automated Keyword optimization mostly relies on the analysis of data repositories for a small set of keywords that identify the discussed topic and relevant documents.
  • most search engines available today on the Web are opaque, providing little to no information about their operation methods and the searched repository.
  • opaque search engines provide a very limited level of interactivity and hide all activities that the search engine performs, including the repository itself [Koenemann et al., “A case for interaction: A study of interactive information retrieval behavior and effectiveness”, Proceeding of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 205-212, Citeseer, 1996].
  • Query performance prediction is used mainly for information retrieval domain [Zhou et al., “Ranking robustness: a novel framework to predict query performance”, Proceedings of the 15th ACM international conference on Information and knowledge management, pages 567-574. ACM, 2006] by estimating the relevance of retrieved documents to a query when no previous knowledge about the documents exists [Kurland et al., “Back to the roots: A probabilistic framework for query performance prediction”, Proceedings of the 21st ACM international conference on Information and knowledge management, pages 823-832. ACM, 2012].
  • the QPP task can be divided into two sub-domains: pre-retrieval and post-retrieval prediction.
  • In the first domain, researchers attempt to predict query performance based on data that does not contain the retrieved documents [He et al., “Inferring query performance using pre-retrieval predictors”, International symposium on string processing and information retrieval, pages 43-54. Springer, 2004].
  • the task is to predict query performance using the retrieved documents from the query [Kurland et al., 2012].
  • There are few well-known measures used for evaluating the performance of post-retrieval prediction methods such as Clarity [Cronen-Townsend et al., “Predicting query performance”, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 299-306. ACM, 2002.], and WIG [Zhou et al., “Query performance prediction in web search environments”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 543-550. ACM, 2007].
  • An automated interactive optimization method of short keyword queries for improving information retrieval from opaque (black box) search engines comprising the steps of:
  • Calculating the mean relevance error may be performed by estimating the minimal distance between vector representations of words in a retrieved post and the words in the given input document, including the following steps:
  • the MRE may be adapted to measure only one aspect of query performance, for representing the relevance of the results.
  • Each claim may include one or more of the following descriptive attributes:
  • the interactive greedy search process may include the following steps:
  • Every other score can be used as an optional relevance measure, instead of MRE.
  • a system for automated interactive keyword optimization for opaque search engines comprising:
  • FIG. 1 shows a pie graph with Website Distribution
  • FIG. 2 shows a graph of the number of retrieved tweets by keyword extraction methods
  • FIG. 3 shows a graph of the number of retrieved tweets per number of keywords in a given query
  • FIG. 4 shows a graph of a claim, relevant tweet, and irrelevant tweet embedded in 2D space
  • FIG. 5 shows a graph of ROC of the proposed measure on 1078 labeled posts.
  • FIG. 6 shows a graph of average tweets per claim versus mean relevance error of the TF-IDF keywords generator, the POS tagging keywords generator, and the proposed Bottom-Up search.
  • the left dots of TF-IDF and POS tagging are keywords with ten words and the right dots are keywords with one word.
  • the present invention provides an automated interactive optimization method of short keyword queries in order to improve information retrieval from opaque (“black box”) search engines.
  • the task for which the present invention is directed may be for example, the retrieval of relevant posts from an online social media (OSM) given a news article or a document being discussed online (referred to as a “claim”).
  • OSM online social media
  • the proposed algorithm iteratively selects keywords while querying the search engine and comparing a small set of retrieved posts to the given news article through a mean relevance error based on word embedding.
  • the proposed algorithm is demonstrated while building a Fake News data set from claims (collected from fact-checking websites) and their tweets.
  • the mean relevance error was found to be accurate for differentiating between relevant and irrelevant posts (0.9 Area Under the Curve (AUC)).
  • TF-IDF Term Frequency-Inverse Document Frequency
  • POS tagging the process of marking up a word in a text (corpus) as corresponding to a particular part of speech.
  • the proposed solution is two-fold: (1) The relevance of posts to the claim is estimated by comparing the vector representations of words contained in both documents; (2) A greedy algorithm is used to build the set of keywords for the above task by iteratively querying the OSM for the first page of relevant posts and choosing the best keyword to add to the set.
  • the present invention proposes a novel interactive method for optimizing keyword extraction given a document while querying a search engine. This is done by evaluating the similarity between a given claim (document) to a collection of posts (documents) associated with the given claim.
  • the proposed method includes two complementing steps:
  • the first step is finding the mean relevance error, a short-document comparison method for determining the relevance of query results to a given document based on estimating the minimal distance between words comprising both the retrieved posts and the input document.
  • the second step is a novel interactive greedy search for finding the most appropriate keywords in order to retrieve the maximal number of relevant posts using an opaque search engine. Since there is no knowledge about the inner mechanisms of the search engine and the data stored there, a series of limited interactions were performed with the search engine in order to optimize the set of keywords comprising the query. In every step of the greedy search, the next best keyword to add to the query has been chosen. The quality of the incumbent queries is computed on a few of the top results using the proposed mean relevance error (MRE).
  • MRE mean relevance error
  • the present invention proposes a method for estimating the relevance of posts retrieved from a search engine to a given input document.
  • the method estimates the minimal distance between vector representations of words in a retrieved post and the words in the given input document.
  • the mean relevance error is defined as a function, which receives as an input a document d and a collection of posts P retrieved from the search engine and outputs a number. The lower the MRE is, the more relevant are the retrieved posts P to the underlying document d.
  • Vector representations of words can be derived using any word embedding model, such as GloVe [Pennington et al., “Glove: Global vectors for word representation”, Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532-1543, 2014], Word2vec [Mikolov et al., “Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781, 2013], fastText [Bojanowski et al., “Enriching word vectors with subword information”, arXiv preprint arXiv:1607.04606, 2016], etc.
  • the distance between a word w_i and a document d is the minimal distance between the word w_i and all the words in the set W_d of words in d:
  • dist(w_i, d) = min_{w_j ∈ W_d} dist(w_i, w_j)
  • the distance of a post p from the document d is the average of these word-to-document distances over the k words in W_p, the set of words in p:
  • dist(p, d) = (1/k) Σ_{w_i ∈ W_p} dist(w_i, d)
  • the mean relevance error (MRE) of the collection P to the document d was defined as the average distance of all posts in P from d:
  • MRE(P, d) = (1/|P|) Σ_{p ∈ P} dist(p, d)
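The MRE definitions above can be sketched in Python. This is an illustrative sketch, not the patented implementation: the `embed` mapping stands in for any word embedding model (e.g. GloVe or Word2vec), tokenization and stop-word removal are assumed to have been done already, and the function names are hypothetical.

```python
import numpy as np

def word_document_distance(w_vec, doc_vecs):
    # dist(w_i, d): minimal Euclidean distance between the word's vector
    # and the vectors of all words in the document
    return min(np.linalg.norm(w_vec - v) for v in doc_vecs)

def post_document_distance(post_vecs, doc_vecs):
    # dist(p, d): average word-to-document distance over the post's k words
    return sum(word_document_distance(w, doc_vecs) for w in post_vecs) / len(post_vecs)

def mean_relevance_error(posts, doc_tokens, embed):
    # MRE(P, d): average distance of all posts in P from the document d;
    # `embed` maps a token to its vector, out-of-vocabulary tokens are skipped
    doc_vecs = [embed[t] for t in doc_tokens if t in embed]
    dists = [post_document_distance([embed[t] for t in post if t in embed], doc_vecs)
             for post in posts
             if any(t in embed for t in post)]
    return sum(dists) / len(dists)
```

A lower value indicates that the retrieved posts are, on average, closer to the given document in the embedding space.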
  • the MRE defined above is designed to measure only one aspect of query performance, namely the relevance of the results. Other important aspects, such as the number of results are intentionally not captured by MRE.
  • the quality of MRE is affected by the quality of the underlying word embedding model. For general purpose query evaluation, it is recommended to use word embedding models trained globally on large data sets.
  • the present invention proposes a novel automatic method for finding the most appropriate keywords in order to retrieve the maximal number of relevant documents using an opaque search engine.
  • the proposed method is based on an interactive greedy search for the best word that should be added to the input query in order to maximize the corresponding posts retrieved by the search engine.
  • the given document's text is split into a set of words and stop words are removed.
  • the process starts from queries with a single word. Each query is sent to the opaque search engine and posts are received as a response. Each keyword receives an aggregated mean relevance error (MRE), which reflects the relevance of the retrieved collection of posts to the given document. At the end of the iteration, the keyword whose MRE most improves the retrieved results is added. The process finishes when the error no longer improves, or when the query includes all the document's keywords. The algorithm returns the query that yields the best MRE (as shown in Algorithm 1 below).
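A minimal sketch of this greedy loop, under the assumption that `search_engine` returns a small sample of posts (e.g. the first results page) for a query and `relevance_error` computes the MRE of those posts against the given document; both are hypothetical stand-ins supplied by the caller:

```python
def greedy_keyword_search(doc_words, search_engine, relevance_error):
    # Bottom-Up greedy query construction: at each iteration, try adding each
    # remaining candidate keyword and keep the one that yields the lowest MRE.
    candidates = set(doc_words)          # stop words assumed already removed
    query, best_query, best_error = [], [], float("inf")
    while candidates:
        scored = []
        for kw in candidates:
            posts = search_engine(query + [kw])   # query the opaque engine
            if posts:
                scored.append((relevance_error(posts), kw))
        if not scored:
            break
        error, kw = min(scored)
        if error >= best_error:          # no keyword improves the error: stop
            break
        best_error = error
        query = best_query = query + [kw]
        candidates.remove(kw)
    return best_query, best_error        # the query with the best (lowest) MRE
```

Each iteration costs one search-engine call per remaining candidate, so only a few of the top results per call should be scored, as the MRE step above suggests.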
  • MRE mean relevance error
  • the present invention suggests and evaluates the proposed MRE; however, any other score can be used as an alternative relevance measure.
  • FIG. 1 shows a pie graph with Website Distribution. These claims were collected manually from July until December 2018. The claims were published from June 1997 to December 2018. Each claim includes descriptive attributes, such as title, description, verdict date (the date in which a fact checker published the claim), a link to the analysis report of a fact checker, and verdict (the true label).
  • The Twitter search engine was used in order to collect tweets that are relevant to these claims. Twitter is one of the largest and most popular online social networks, with more than 321 million monthly active users worldwide as of the fourth quarter of 2018 [twi,]. In total, 1,186,334 tweets published by 772,940 users were retrieved, an average of 2,981 posts per claim. All the tweets were published from April 2007 until February 2019. These tweets were crawled by four different methods: the proposed Bottom-Up greedy search (280,261 tweets), keywords defined manually (75,263 tweets), TF-IDF (423,868 tweets), and part of speech (POS) tagging (489,598 tweets).
  • FIG. 2 shows a graph of the number of retrieved tweets by keyword extraction methods. For the keywords defined manually, TF-IDF and POS tagging methods, tweets were collected by querying a different number of unique words (from one to ten).
  • FIG. 3 shows a graph of the number of retrieved tweets per number of keywords in a given query.
  • Tweet A includes the next text: “Rihanna Might Have Just Cost Snapchat $600 Million With a Single Instagram Story”.
  • Tweet B includes the text: “Legends And Pop Stars As Social Media Lady Gaga Is Twitter Madonna Is Vine Rihanna Is Instagram Katy Perry Is Tumblr Cher Is Facebook Miley Cyrus Is Snapchat”.
  • Tweet A was labeled as relevant, whereas Tweet B as irrelevant. Stop words were removed and the MRE was calculated for both tweets. Tweet A got an error of 0.948, as opposed to Tweet B, which reached 1.177.
  • Words composing Tweet B, such as “gaga”, “miley”, and “stars”, are far from the claim's words, in contrast to the words composing the relevant Tweet A, such as “story”, which is placed next to “message”, or “millions”, which is close to “hundreds”.
  • FIG. 4 shows a graph of a claim, relevant tweet, and irrelevant tweet embedded in 2D space.
  • AUC Area Under the Curve
  • FIG. 5 shows a graph of the ROC of the proposed measure on 1078 labeled posts. Therefore, one can conclude that the proposed MRE is very useful for differentiating between relevant and irrelevant posts associated with a given claim.
  • the user is required to read the given claim in order to understand the subject of the claim.
  • the user should assign keywords, which can express the meaning of the given claim.
  • the method proposed by the present invention starts by removing stop words. Similar to [Zhang, 2008], the present invention extracts 3-5 keywords from the title and description of a given claim.
  • annotators use synonyms in order to expand the context and also retrieve posts that are written differently but convey the same message [Voorhees, 1994].
  • the user is also required to use synonyms in order to retrieve a high number of posts relevant to the given claim.
  • POS part of speech
  • the next step was picking the K first words from the candidates as input keywords.
  • TF-IDF and POS tagging keywords were generated with a fixed size of one to ten words.
  • the keywords defined manually were created using the news article's title and description.
  • the keywords were used to query Twitter for collecting the top 600 posts and the MRE was computed on the received posts for each claim and keyword expansion method. It can be seen that there is a trade-off between the number of posts retrieved per claim and their relevance. Longer queries are less beneficial than shorter queries due to the low number of retrieved posts [Voorhees, 1994].
  • the proposed Bottom Up search outperforms the automatic baseline methods (TF-IDF and POS tagging) and performed similarly to non-automatic keywords defined manually.
  • FIG. 6 shows a graph of average tweets per claim versus mean relevance error of TF-IDF keywords generator, POS tagging keywords generator, and the proposed Bottom-Up search.
  • the left dots of TF-IDF and POS tagging are keywords with ten words and the right dots are keywords with one word. The Bottom-Up search retrieved more relevant posts, compared to the average posts retrieved by TF-IDF and POS tagging.
  • For minimizing the potential risks that may arise from activities like collecting information from OSM, the present invention follows the recommendations presented by [Elovici et al., 2014], which deal with ethical challenges regarding OSM and Internet communities. Given a news article, the present invention proposes a method which suggests the optimal keywords for retrieving the maximal number of relevant documents. To evaluate the proposed method, the Twitter search engine was used to retrieve tweets associated with the given news article.
  • the present invention proposed a novel automatic interactive method in order to improve information retrieval from opaque search engines.
  • This method is focused on the task of retrieval of relevant posts from Twitter OSM platform given a news article.
  • the mean relevance error has been proposed, which estimates the relevance of posts to a given news article based on the mean distance between vector representations of the article words and the post words. This estimation, based on word embedding, was found to be accurate for distinguishing between relevant and irrelevant posts. It can be very helpful for automatically collecting relevant posts associated with a given claim.
  • the proposed Bottom-Up greedy algorithm attempts to construct a set of keywords by adding a keyword that improves the relevance of the retrieved posts in each iteration.
  • the collected Fake News data set (claims and tweets) has been presented for evaluation, as well as guidelines for manual labeling of tweets. The guidelines for manual keyword assignment were also presented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automated interactive optimization method of short keyword queries for improving information retrieval from opaque (black box) search engines, according to which data including labeled claims from several fact-checking websites is collected for creating a dataset which is used for evaluation. The relevance of posts/query results retrieved from a search engine to a given input document is estimated by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document. A subset of claims is labeled for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE). The most appropriate keywords are found in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine.

Description

FIELD OF THE INVENTION
The present invention relates to the field of data search engines. More particularly, the present invention relates to a method and system for interactive keyword optimization for opaque search engines.
BACKGROUND OF THE INVENTION
Short keyword queries are one of the main milestones of any user or bot seeking information through the ubiquitous search engines available on the Web [Chirita et al., “Personalized query expansion for the web”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 7-14. ACM, 2007]. Automated keyword optimization mostly relies on the analysis of data repositories for a small set of keywords that identify the discussed topic and relevant documents. However, most search engines available today on the Web are opaque, providing little to no information about their operation methods and the searched repository.
Searching (retrieving posts) within Online Social Media (OSM) can help with box office revenues prediction [Liu et al., “Predicting movie box-office revenues by exploiting large-scale social media content. Multimedia Tools and Applications”, 75(3):1509-1528, 2016], product reviews [Jansen et al., 2009], and many other problems, where the intelligence of the crowd can be utilized. However, in many cases, the ambiguity of short keyword queries causes poor performance [Cronen-Townsend et al., “Predicting query performance”, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 299-306. ACM, 2002].
The problem of ambiguity is more emphasized when working with opaque (“black box”) search engines. In contrast to transparent search engines, where the repository and the algorithms are visible to the query writer, opaque search engines provide a very limited level of interactivity and hide all activities that the search engine performs, including the repository itself [Koenemann et al., “A case for interaction: A study of interactive information retrieval behavior and effectiveness”, Proceeding of the ACM SIGCHI Conference on Human Factors in Computing Systems, pages 205-212, Citeseer, 1996]. Nowadays, most conventional search engines, including OSM, are opaque.
In recent years, one of the OSM search use cases is related to fake news. There is a huge growth of fake news, disinformation, and propaganda within the OSM, leading to the erosion of public trust in media outlets and OSM [Zhou et al., “Fake news: Fundamental theories, detection strategies and challenges”, Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 836-837, ACM, 2019]. Some methods for locating posts related to falsehood or truth disseminated through OSM include tracking specific sources, the behavior of which is extreme to either end [Tacchini et al., “Some like it hoax: Automated fake news detection in social networks”, arXiv preprint arXiv:1704.07506, 2017]. Several researchers suggest investigating and determining the trustworthiness of a Claim made in public media by looking into online discussions extracted from the OSM platforms.
Such investigations require collecting posts associated with (presumably fake) claims that appear in news articles. In all these methods, the set of keywords for querying the OSM is defined manually for each and every Claim. Determining search keywords manually significantly limits the number of Claims that can be processed using the techniques described above.
Query Performance Prediction
Query performance prediction (QPP) is used mainly in the information retrieval domain [Zhou et al., “Ranking robustness: a novel framework to predict query performance”, Proceedings of the 15th ACM international conference on Information and knowledge management, pages 567-574. ACM, 2006] by estimating the relevance of retrieved documents to a query when no previous knowledge about the documents exists [Kurland et al., “Back to the roots: A probabilistic framework for query performance prediction”, Proceedings of the 21st ACM international conference on Information and knowledge management, pages 823-832. ACM, 2012].
The QPP task can be divided into two sub domains: pre-retrieval and post-retrieval prediction. In the first domain, researchers attempt to predict query performance based on data that does not contain the retrieved documents [He et al., “Inferring query performance using pre-retrieval predictors”, International symposium on string processing and information retrieval, pages 43-54. Springer, 2004].
In the post-retrieval prediction domain, the task is to predict query performance using the retrieved documents from the query [Kurland et al., 2012]. There are few well-known measures used for evaluating the performance of post-retrieval prediction methods, such as Clarity [Cronen-Townsend et al., “Predicting query performance”, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 299-306. ACM, 2002.], and WIG [Zhou et al., “Query performance prediction in web search environments”, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 543-550. ACM, 2007].
Document Similarity
Studies in this domain evaluate the semantic similarity between two given documents. Several methods used word vector representations for this problem; [Kusner et al., “From word embeddings to document distances”, International Conference on Machine Learning, pages 957-966, 2015] calculated the minimal distance between each word in one text and all the words in the other. Kenter et al. [“Short text similarity with word embeddings”, Proceedings of the 24th ACM international on conference on information and knowledge management, pages 1411-1420. ACM, 2015] also used these vectors for calculating the distances between words of documents. Based on these distances, it can be determined whether the documents are similar or not.
Keyword Expansion
In recent decades, the ambiguity of short keyword queries has created the need for improved solutions for the Web retrieval task [Chirita et al., 2007]. One of the common methods for keyword expansion takes given keywords and adds more related words to them for better representation. [Wang et al., 2009; Voorhees, 1994] added synonyms from WordNet (https://wordnet.princeton.edu/) for improving keyword representation over the text. [Banerjee et al., 2007] showed that Wikipedia can be a source for keyword expansion. [Liu et al., 2014] presented novel part of speech (POS) patterns that can be used for choosing candidate keywords. Similarly, [Wang et al., 2009] used the TF-IDF measure for keyword expansion, by choosing the K best terms based on the TF-IDF score. The present invention chooses the POS tagging and TF-IDF keyword expansion methods as the baselines for the proposed method. [Kuzi et al., 2016; Roy et al., 2016] proposed a method for choosing a term for query expansion using a word embedding representation of terms. Their idea is to choose terms that yield the highest probability of being related to the current query.
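The TF-IDF baseline of choosing the K best terms can be sketched as follows. This is an illustrative implementation using a common smoothed-IDF variant, not necessarily the exact formula used in the cited works, and the function name is hypothetical:

```python
import math
from collections import Counter

def top_k_tfidf_keywords(doc_tokens, corpus, k):
    # Choose the K best candidate keywords of a document by TF-IDF score.
    # `corpus` is a list of documents, each given as a set of tokens.
    n_docs = len(corpus)
    tf = Counter(doc_tokens)

    def idf(term):
        df = sum(1 for d in corpus if term in d)       # document frequency
        return math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF

    scores = {t: (count / len(doc_tokens)) * idf(t) for t, count in tf.items()}
    # highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

Terms frequent in the document but rare in the corpus score highest, which is what makes them useful as topical query keywords.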
It is therefore an object of the present invention to provide a method and system for interactive keyword optimization for opaque search engines, for improving information retrieval from opaque search engines.
Other objects and advantages of the invention will become apparent as the description proceeds.
SUMMARY OF THE INVENTION
An automated interactive optimization method of short keyword queries for improving information retrieval from opaque (black box) search engines, comprising the steps of:
    • a) collecting data including labeled claims from several fact-checking websites, for creating dataset which is used for evaluation;
    • b) estimating the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
    • c) labeling a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
    • d) finding the most appropriate keywords in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine.
Calculating the mean relevance error (MRE) may be performed by estimating the minimal distance between vector representations of words in a retrieved post and the words in the given input document, including the following steps:
    • a) removing stop-words from the input document and the retrieved posts;
    • b) defining the mean relevance error (MRE) as a function, which receives as an input a document d and a collection of posts P retrieved from the search engine and outputs a number, where the lower the MRE, the more relevant are the retrieved posts P to the underlying document d;
    • c) calculating the Euclidean distance between vector representations of two words as a measure of similarity between them, wherein vector representations of words are derived using a word embedding model;
    • d) defining the distance between a word wi and a document d as the minimal distance between a word wi and all the words in the set of words in the input document d, defined as Wd;
    • e) averaging the distances of all words wi∈Wp, where Wp is defined as the set of words in p∈P, to the document d, for calculating the distance of a post p from document d; and
    • f) defining the mean relevance error (MRE) of the collection P to the document d as the average distance of all posts in P from document d and calculating said MRE.
The MRE may be adapted to measure only one aspect of query performance, for representing the relevance of the results.
Each claim may include one or more of the following descriptive attributes:
    • title;
    • description;
    • verdict date;
    • a link to the analysis report of a fact checker and verdict, being the true label.
The labeling process may include the following steps:
    • a) using annotators that are required to read the claim's title and description and the retrieved posts associated with said title;
    • b) labeling each post by each annotator with one of the optional labels: Relevant in case the given post is associated with the given claim, Irrelevant in case the given post is not associated with the given claim, and Unknown in case the annotator is not sure whether the post is related or not; and
    • c) using only the posts that the majority among the annotators agreed on.
The interactive greedy search process may include the following steps:
    • a) splitting the given document's text into a set of words and removing stop words;
    • b) at the first iteration, starting from queries with a single word, sending each query to the opaque search engine and receiving posts as a response;
    • c) receiving for each keyword an aggregated mean relevance error (MRE), which reflects the relevance of the retrieved collection of posts to the given document;
    • d) adding the keyword whose MRE improves the retrieved results, wherein the process is finished in case the MRE is not decreased, or in case the query includes all the document's keywords; and
    • e) returning and implementing the algorithm on the query that yields the best MRE.
Any other score can be used as an alternative relevance measure, instead of MRE.
A system for automated interactive keyword optimization for opaque search engines, comprising:
    • a) A database for storing data for evaluation, including labeled claims collected from several fact-checking websites;
    • b) At least one processor adapted to:
      • b.1) estimate the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
      • b.2) label a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
      • b.3) find the most appropriate keywords in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other characteristics and advantages of the invention will be better understood through the following illustrative and non-limitative detailed description of preferred embodiments thereof, with reference to the appended drawings, wherein:
FIG. 1 shows a pie graph with Website Distribution;
FIG. 2 shows a graph of the number of retrieved tweets by keyword extraction methods;
FIG. 3 shows a graph of the number of retrieved tweets per number of keywords in a given query;
FIG. 4 shows a graph of a claim, relevant tweet, and irrelevant tweet embedded in 2D space;
FIG. 5 shows a graph of ROC of the proposed measure on 1078 labeled posts; and
FIG. 6 shows a graph of average tweets per claim versus mean relevance error of TF-IDF keywords generator, POS tagging keywords generator, and our proposed Bottom-Up search. The left dots of TF-IDF and POS tagging are keywords with ten words and the right dots are keywords with one word.
DETAILED DESCRIPTION OF THE INVENTION
The present invention provides an automated interactive optimization method of short keyword queries in order to improve information retrieval from opaque (“black box”) search engines. The task for which the present invention is directed may be for example, the retrieval of relevant posts from an online social media (OSM) given a news article or a document being discussed online (referred to as a “claim”).
The proposed algorithm iteratively selects keywords while querying the search engine and comparing a small set of retrieved posts to the given news article through a mean relevance error based on word embedding. The algorithm is demonstrated while building a Fake News data set from claims (collected from fact-checking websites) and their tweets. The mean relevance error was found to be accurate for differentiating between relevant and irrelevant posts (0.9 Area Under the Curve (AUC)). The optimized queries produce results similar to manually extracted keywords, outperforming methods based on Term Frequency-Inverse Document Frequency (TF-IDF, a numerical statistic intended to reflect how important a word is to a document in a collection or corpus) and on POS tagging (the process of marking up a word in a text (corpus) as corresponding to a particular part of speech).
The proposed solution is two-fold: (1) The relevance of posts to the claim is estimated by comparing the vector representations of words contained in both documents; (2) A greedy algorithm is used to build the set of keywords for the above task by iteratively querying the OSM for the first page of relevant posts and choosing the best keyword to add to the set.
The proposed method has been demonstrated on Twitter, presenting a Fake News dataset of 398 claims collected from fact-checking websites, as well as word embedding of 1,186,334 posts relevant to those claims. 1,078 of the posts were manually classified as relevant or irrelevant to a given claim.
The present invention proposes a novel interactive method for optimizing keyword extraction given a document while querying a search engine. This is done by evaluating the similarity between a given claim (document) to a collection of posts (documents) associated with the given claim. The proposed method includes two complementing steps:
The first step is finding the mean relevance error, a short-document comparison method for determining the relevance of query results to a given document based on estimating the minimal distance between words comprising both the retrieved posts and the input document.
The second step is a novel interactive greedy search for finding the most appropriate keywords in order to retrieve the maximal number of relevant posts using an opaque search engine. Since there is no knowledge about the inner mechanisms of the search engine and the data stored there, a series of limited interactions is performed with the search engine in order to optimize the set of keywords comprising the query. In every step of the greedy search, the next best keyword to add to the query is chosen. The quality of the candidate queries is computed on a few of the top results using the proposed mean relevance error (MRE).
Mean Relevance Error (MRE)
The present invention proposes a method for estimating the relevance of posts retrieved from a search engine to a given input document. The method estimates the minimal distance between vector representations of words in a retrieved post and the words in the given input document.
At the first stage, stop-words are removed from the document and the retrieved posts. The mean relevance error (MRE) is defined as a function, which receives as an input a document d and a collection of posts P retrieved from the search engine and outputs a number. The lower the MRE is, the more relevant are the retrieved posts P to the underlying document d.
Wp={w1, w2, . . . , wk} is the set of words in p∈P and Wd={w1, w2, . . . , wl} denotes the set of words in the input document d. Since the important aspect is the retrieval of microblog posts which are relevant to some online discussion (such as a news article), it is assumed that l>>k. The Euclidean distance between vector representations of two words is used as a measure of similarity between them, denoted by dist(wi,wj). Vector representations of words can be derived using any word embedding model, such as GloVe [Pennington et al., "Glove: Global vectors for word representation", Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532-1543, 2014], Word2vec [Mikolov et al., "Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781, 2013], fastText [Bojanowski et al., "Enriching word vectors with subword information", arXiv preprint arXiv:1607.04606, 2016], etc. The distance between a word wi and a document d is the minimal distance between the word wi and all the words in Wd:
dist(wi, d) = min{ dist(wi, wj) : wj ∈ Wd }
At the next stage, in order to calculate the distance of a post p from a document d, the distances of all words wi∈Wp to the document d are averaged:
dist(p, d) = (1/k) Σwi∈Wp dist(wi, d)
At the next stage, given a collection of posts P, the mean relevance error (MRE) of the collection P to the document d is defined as the average distance of all posts in P from d:
MRE(P, d) = (1/|P|) Σp∈P dist(p, d)
The MRE defined above is designed to measure only one aspect of query performance, namely the relevance of the results. Other important aspects, such as the number of results are intentionally not captured by MRE. The quality of MRE is affected by the quality of the underlying word embedding model. For general purpose query evaluation, it is recommended to use word embedding models trained globally on large data sets.
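The three definitions above reduce to a few lines of NumPy. The sketch below assumes the word vectors for the document and for each post have already been looked up in a word embedding model and that stop-words were removed beforehand; the function and variable names are illustrative, not part of the invention:

```python
import numpy as np

def mre(post_word_vectors, doc_word_vectors):
    """Mean relevance error of a collection of posts P against a document d.

    post_word_vectors: list of arrays, one per post, each of shape (k, dim),
        holding the embedding vectors of that post's words.
    doc_word_vectors: array of shape (l, dim), the embedding vectors of the
        document's words.
    """
    def word_to_doc(w):
        # dist(w_i, d): minimal Euclidean distance to any document word
        return np.min(np.linalg.norm(doc_word_vectors - w, axis=1))

    def post_to_doc(post):
        # dist(p, d): average of the per-word minimal distances
        return np.mean([word_to_doc(w) for w in post])

    # MRE(P, d): average distance of all posts from the document
    return np.mean([post_to_doc(p) for p in post_word_vectors])
```

A lower returned value indicates that the retrieved posts are, on average, semantically closer to the document's vocabulary.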
Keyword Optimization
The present invention proposes a novel automatic method for finding the most appropriate keywords in order to retrieve the maximal number of relevant documents using an opaque search engine. The proposed method is based on an interactive greedy search for the best word that should be added to the input query in order to maximize the corresponding posts retrieved by the search engine.
Bottom-Up Search
First, the given document's text is split into a set of words and stop words are removed. In the first iteration, the process starts from queries with a single word. Each query is sent to the opaque search engine and posts are received as a response. Each keyword receives an aggregated mean relevance error (MRE), which reflects the relevance of the retrieved collection of posts to the given document. At the end of the iteration, the keyword whose addition most improves (lowers) the MRE of the retrieved results is added to the query. The process is finished in case the error no longer decreases, or in case the query includes all the document's keywords. The algorithm returns the query that yields the best MRE (as shown in Algorithm 1 below).
Algorithm 1
Bottom-Up Search
 1: procedure BOTTOM-UP(document, minPosts)
 2:   walkedQueryList ← [ ]
 3:   baseQuery ← ""
 4:   keywords ← set(splitToKeywords(document))
 5:   se ← SearchEngine
 6:   while keywords.size > 0 do
 7:     bestWord ← ""
 8:     bestRelevance ← ∞
 9:     for all keyword ∈ keywords do
10:       query ← baseQuery + keyword
11:       posts ← se.getPosts(query)
12:       MRE ← calculateRelevance(document, posts)
13:       if posts.size( ) > minPosts then
14:         walkedQueryList.add(query, MRE)
15:         if MRE < bestRelevance then
16:           bestRelevance ← MRE
17:           bestWord ← keyword
18:     baseQuery.add(bestWord)
19:     keywords.remove(bestWord)
20:   bestQuery ← MinMRE(walkedQueryList)
21:   return bestQuery
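Algorithm 1 can be sketched as a self-contained Python function, with the opaque search engine and the relevance measure abstracted as callables so the same skeleton works with the MRE of the previous section or any substitute score. All names here are illustrative; one deviation from the listing, noted in the comments, is that the sketch stops early when no candidate keyword yields enough posts:

```python
def bottom_up_search(document_words, get_posts, relevance, min_posts=10):
    """Greedy keyword selection against an opaque search engine.

    document_words: candidate keywords (document split, stop words removed).
    get_posts: callable, query string -> list of posts (the opaque engine).
    relevance: callable, list of posts -> float (e.g. MRE; lower is better).
    """
    walked = []                      # (query, score) pairs visited so far
    base_query = []                  # keywords chosen in earlier iterations
    remaining = set(document_words)
    while remaining:
        best_word, best_score = None, float("inf")
        for kw in remaining:
            query = " ".join(base_query + [kw])
            posts = get_posts(query)
            if len(posts) < min_posts:
                continue             # too few results: skip this expansion
            score = relevance(posts)
            walked.append((query, score))
            if score < best_score:
                best_score, best_word = score, kw
        if best_word is None:
            break                    # no candidate yielded enough posts
        base_query.append(best_word)
        remaining.remove(best_word)
    # return the visited query with the lowest relevance error
    return min(walked, key=lambda qs: qs[1])[0] if walked else ""
```

Because every visited query is recorded, the function returns the global best over the walk rather than only the final expanded query.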
The present invention suggests and evaluates the proposed MRE. However, any other score can be used as an alternative relevance measure.
The Dataset Used for the Evaluation Process
398 labeled claims were collected from several fact-checking websites, mostly from Snopes (https://www.snopes.com/). FIG. 1 shows a pie graph with Website Distribution. These claims were collected manually from July until December 2018. The claims were published from June 1997 to December 2018. Each claim includes descriptive attributes, such as title, description, verdict date (the date in which a fact checker published the claim), a link to the analysis report of a fact checker, and verdict (the true label).
The Twitter search engine was used in order to collect tweets that are relevant to these claims. Twitter is one of the largest and most popular online social networks worldwide, with more than 321 million monthly active users as of the fourth quarter of 2018. In total, 1,186,334 tweets published by 772,940 users were retrieved, an average of 2,981 posts per claim. All the tweets were published from April 2007 until February 2019. These tweets were crawled by four different methods: the proposed Bottom-Up greedy search (280,261 tweets), keywords defined manually (75,263 tweets), TF-IDF (423,868 tweets), and part of speech (POS) tagging (489,598 tweets).
FIG. 2 shows a graph of the number of retrieved tweets by keyword extraction methods. For the keywords defined manually, TF-IDF and POS tagging methods, tweets were collected by querying a different number of unique words (from one to ten).
FIG. 3 shows a graph of the number of retrieved tweets per number of keywords in a given query.
Manual Labeling of Tweet Relevance
After retrieving tweets according to a few keyword suggestions, a subset of claims was labeled for evaluation. Twenty claims that gained the maximal and the minimal mean relevance error were chosen. In total, for the twenty claims, 1,173 related tweets were collected. For the labeling process, three annotators (students) were used, who were required to read the claim's title and description and the retrieved tweets associated with it. Each annotator labeled each tweet with one of the optional labels: Relevant in case the given tweet is associated with the given claim, Irrelevant in the opposite case, and Unknown in case the annotator is not sure whether the tweet is related or not. Among the 1,173 retrieved tweets, only the tweets that the majority of the annotators agreed on were used (1,078 tweets). Table 1 below shows an example of a claim, a relevant tweet, and an irrelevant tweet.
TABLE 1
Example of labeling tweets associated with a given online discussion

Claim (Fake): "The rapper DMX (Earl Simmons) died in February 2018."
Relevant Tweet: "Juan is just think DMX died so good time!"
Irrelevant Tweet: "I liked a @YouTube . . . video DMX - I Just Died in your arms Tonight. [Remix]"

Mean Relevance Error
In order to evaluate the proposed method, the following experimental setup was defined: For word embedding, the word vector representations were obtained from a pre-trained word embedding model of fastText [Mikolov et al., "Advances in pre-training distributed word representations", Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018]. The model was trained on Common Crawl (http://commoncrawl.org/) and Wikipedia (https://www.wikipedia.org/) using the fastText library (https://fasttext.cc/). For the distance measure, the simple Euclidean distance was used.
The proposed mean relevance error was evaluated on the Fake News data set, which includes the claims and the labeled tweets (excluding the unknown tweets). The minimal distance for each tweet from the given claim was calculated. A full demonstration of the proposed method is presented in the next example. For the given claim: "Rihanna's Instagram message to followers to throw away the Snapchat app caused the company's share value to fall by hundreds of millions of dollars in one day.", Tweet A includes the text: "Rihanna Might Have Just Cost Snapchat $600 Million With a Single Instagram Story". Tweet B includes the text: "Legends And Pop Stars As Social Media Lady Gaga Is Twitter Madonna Is Vine Rihanna Is Instagram Katy Perry Is Tumblr Cher Is Facebook Miley Cyrus Is Snapchat". Tweet A was labeled as relevant, whereas Tweet B was labeled as irrelevant. Stop words were removed and the MRE was calculated for both tweets. Tweet A got an error of 0.948, as opposed to Tweet B, which reached 1.177. One can notice that according to the proposed distance-based method, the words in Tweet A are closer to the words in the claim than are the words in Tweet B. For example, words composing Tweet B, such as "gaga", "miley", and "stars", are far from the claim's words, in contrast to the words composing the relevant Tweet A, such as "story", which is placed next to "message", or "millions", which is close to "hundreds". FIG. 4 shows a graph of a claim, relevant tweet, and irrelevant tweet embedded in 2D space.
It can be seen that the semantics of words according to the word embedding is preserved (e.g., the words that represent OSM platforms, such as “twitter”, “Facebook”, “Instagram”, and “snapchat” are very close to each other). Generally, the lower the MRE, the higher the probability for the retrieved tweet to be relevant to the given claim.
The Area Under the Receiver Operating Characteristic curve (AUC) was computed based on the relevant and irrelevant tweets. An AUC of 0.9 was reached, based on 1,078 labeled tweets related to twenty claims.
FIG. 5 shows a graph of ROC of the proposed measure on 1078 labeled posts. Therefore, one can conclude that the proposed MRE is found very useful for differentiating between relevant and irrelevant posts associated with a given claim.
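The reported AUC can be read as the probability that a randomly chosen relevant post receives a lower MRE than a randomly chosen irrelevant one. A minimal sketch of that pairwise computation follows; the function name and inputs are illustrative, and the original evaluation may well have used a standard ROC implementation instead:

```python
def auc_from_mre(relevant_mre, irrelevant_mre):
    """AUC of the rule 'lower MRE implies relevant': the fraction of
    (relevant, irrelevant) score pairs ranked correctly, ties counting 1/2."""
    pairs = len(relevant_mre) * len(irrelevant_mre)
    wins = sum((r < i) + 0.5 * (r == i)
               for r in relevant_mre
               for i in irrelevant_mre)
    return wins / pairs
```

An AUC of 1.0 means every relevant post scored below every irrelevant one; 0.5 is chance level.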
Keyword Optimization
In order to evaluate the proposed method, the results have been compared with three baseline methods for keyword selection: Keywords defined manually, TF-IDF, and POS tagging.
Keywords Defined Manually
One of the methods used for collecting online discussions related to a given claim was a manual selection of high-quality keywords.
At the first step, the user is required to read the given claim in order to understand its subject. At the next step, the user should assign keywords which can express the meaning of the given claim. The method proposed by the present invention starts by removing stop words. Similar to [Zhang, 2008], the present invention extracts 3-5 keywords from the title and description of a given claim. At the next step, in many cases, annotators use synonyms in order to expand the context to retrieved posts that are written differently but convey the same message [Voorhees, 1994]. The user is also required to use synonyms in order to retrieve a high number of posts relevant to the given claim. For example, for the claim: "Did Donald Trump Scare a Group of Schoolchildren?", several synonyms can be used: Donald Trump—President of the U.S., scare—frighten, schoolchildren—youngsters, etc. Fourth, after determining the assigned keywords, the user should run them manually as a query for the search engine. The user should watch the corresponding posts and read a few of them in order to understand whether they are relevant to the given claim. The number of retrieved posts is important: in case there are only a few tens of posts, it is a good idea to use more synonyms as keywords, as shown in Algorithm 2 below.
Algorithm 2
Manual Keyword Assignment
1: Read claim's title and description
2: If it is necessary, read the full report
3: Assign keywords that express the meaning of the claim.
4: Provide 3-5 alternative sets of keywords
5: Use synonyms
6: Query the OSM using the different sets of keywords
7: Read a few of the retrieved posts.
8:  Check relevance.
9: Record the number of retrieved posts.

TF-IDF Keyword Generator
The text of the claims was used as the targeted corpus. Stop words were removed and the TF-IDF score was computed for each word. For each claim, the K words with the highest TF-IDF score were picked, where K is the number of required words.
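This baseline can be sketched with the standard library only, treating the claim texts themselves as the corpus. The exact TF and IDF normalizations used in the original experiments are not specified, so the plain tf·log(N/df) variant below is an assumption, as are the function and variable names:

```python
import math
from collections import Counter

def tfidf_keywords(claims, k):
    """Top-k TF-IDF words per claim.

    claims: list of claims, each a list of words (stop words already removed);
        the claim texts together form the corpus.
    Returns a list of keyword lists, one per claim.
    """
    n = len(claims)
    # document frequency: in how many claims each word appears
    df = Counter()
    for words in claims:
        df.update(set(words))
    keywords = []
    for words in claims:
        tf = Counter(words)
        # tf * idf with the plain log(N/df) inverse document frequency
        score = {w: (tf[w] / len(words)) * math.log(n / df[w]) for w in tf}
        keywords.append(sorted(score, key=score.get, reverse=True)[:k])
    return keywords
```

Note that a word appearing in every claim gets an IDF of zero and is therefore never selected, which matches the intent of down-weighting corpus-wide terms.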
POS Tagging Keyword Generator
A part of speech (POS) tagging has been used for generating keywords for each given claim. According to this method, the text has been narrowed down to the following candidates: nouns, adjectives, adverbs, and numbers, based on the heuristics suggested by [Liu et al., 2014]. The words were prioritized by their POS tagging as follows:
number ≤ adverb ≤ adjective ≤ noun
The next step was picking the first K words from the candidates as input keywords. TF-IDF and POS tagging keywords were generated with a fixed size of one to ten words. The keywords defined manually were created using the news article's title and description. Then, the keywords were used to query Twitter for collecting the top 600 posts, and the MRE was computed on the received posts for each claim and keyword expansion method. It can be seen that there is a trade-off between the number of posts retrieved per claim and their relevance. Longer queries are less beneficial than shorter queries due to the low number of retrieved posts [Voorhees, 1994]. However, the proposed Bottom-Up search outperforms the automatic baseline methods (TF-IDF and POS tagging) and performs similarly to the non-automatic keywords defined manually.
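The priority-based selection can be sketched as follows. The input is assumed to be already POS-tagged (word, tag) pairs; the coarse universal tagset (NOUN, ADJ, ADV, NUM) is an assumption, since the tagset used in the original experiments is not stated, and the function name is illustrative:

```python
# Priority from the heuristic: number <= adverb <= adjective <= noun
PRIORITY = {"NOUN": 3, "ADJ": 2, "ADV": 1, "NUM": 0}

def pos_keywords(tagged_words, k):
    """Pick the k highest-priority candidate words from (word, tag) pairs.

    Words whose tag is not among the four candidate categories are dropped.
    """
    candidates = [(w, t) for w, t in tagged_words if t in PRIORITY]
    # stable sort: equal-priority words keep their order of appearance
    candidates.sort(key=lambda wt: PRIORITY[wt[1]], reverse=True)
    return [w for w, _ in candidates[:k]]
```

Because the sort is stable, nouns are exhausted before adjectives, adjectives before adverbs, and so on, matching the stated ordering.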
FIG. 6 shows a graph of average tweets per claim versus mean relevance error of the TF-IDF keywords generator, the POS tagging keywords generator, and the proposed Bottom-Up search. The left dots of TF-IDF and POS tagging are keywords with ten words and the right dots are keywords with one word. The Bottom-Up search retrieved more relevant posts compared to the average posts received by TF-IDF and POS tagging.
For minimizing the potential risks that may arise from activities like collecting information from OSM, the present invention follows recommendations presented by [Elovici et al., 2014], which deal with ethical challenges regarding OSM and Internet communities. Given a news article, the present invention proposes a method which suggests the optimal keywords for retrieving the maximal number of relevant documents. To evaluate the proposed method, the Twitter search engine has been used in order to retrieve tweets associated with the given news article.
The present invention proposes a novel automatic interactive method to improve information retrieval from opaque search engines. This method is focused on the task of retrieving relevant posts from the Twitter OSM platform given a news article. For this purpose, the mean relevance error has been proposed, which estimates the relevance of posts to a given news article based on the mean distance between vector representations of the article's words and the post's words. This estimation, based on word embedding, was found to be accurate for distinguishing between relevant and irrelevant posts, and can be very helpful for automatically collecting relevant posts associated with a given claim. For example, the proposed Bottom-Up greedy algorithm attempts to construct a set of keywords by adding, in each iteration, a keyword that improves the relevance of the retrieved posts.
This algorithm was found to perform better than baseline methods, such as TF-IDF and POS tagging. The performance of the automatic Bottom-Up method was very similar to the keywords defined manually.
The collected Fake News data set (claims and tweets) has been presented for evaluation, as well as guidelines for manual labeling of tweets. The guidelines for manual keyword assignment were also presented.

Claims (8)

The invention claimed is:
1. An automated interactive optimization method of short keyword queries for improving information retrieval from opaque (black box) search engines, comprising:
a) collecting data including labeled claims from several fact-checking websites, for creating a dataset which is used for evaluation;
b) estimating the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
c) labeling a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
d) finding the most appropriate queries in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine,
wherein calculating the mean relevance error (MRE) is performed by estimating the minimal distance between vector representations of the retrieved posts and the input document, according to the following steps:
i) removing stop-words from the input document and the retrieved posts;
ii) defining the mean relevance error (MRE) as a function, which receives as an input a document d and a collection of posts P retrieved from the search engine and outputs a number, where the lower the MRE, the more relevant are the retrieved posts P to the underlying document d;
iii) calculating the Euclidean distance between vector representations of two words as a measure of similarity between them, wherein vector representations of words are derived using a word embedding model;
iv) defining the distance between a word wi and a document d as the minimal distance between a word wi and all the words in the set of words in the input document d, defined as Wd;
v) averaging the distances of all words wi ∈ Wp, defined as the set of words in p∈P, to the document d, for calculating the distance of a post p from document d;
vi) defining the mean relevance error (MRE) of the collection P to the document d as the average distance of all posts in P from document d and calculating said MRE.
2. The method according to claim 1, wherein the MRE is used as a measure of relevance.
3. The method according to claim 1, wherein each claim includes one or more of the following descriptive attributes:
title;
description;
verdict date;
a link to the analysis report of a fact checker and verdict, being the true label.
4. The method according to claim 1, wherein the labeling process includes the following steps:
a) using annotators that are required to read the claim's title and description and the retrieved posts associated with said title;
b) labeling each post by each annotator as one of relevant, irrelevant or unknown labels: Relevant in case the given post is associated with the given claim, Irrelevant in case the given post is not associated with the given claim, and Unknown in case the annotator is not sure whether the post is related or not; and
c) using only the posts that the majority among the annotators agreed on.
5. The method according to claim 1, wherein a score, being different from MRE, is used as a relevance measure, instead of MRE.
6. A system for automated interactive keyword optimization for opaque search engines, comprising:
a) a database for storing data for evaluation, including labeled claims collected from several fact-checking websites;
b) at least one processor, adapted to:
b.1) estimate the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
b.2) label a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
b.3) find the most appropriate queries in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine,
wherein calculating the mean relevance error (MRE) is performed by said at least one processor, by estimating the minimal distance between vector representations of the retrieved posts and the input document, according to the following steps:
removing stop-words from the input document and the retrieved posts;
defining the mean relevance error (MRE) as a function, which receives as an input a document d and a collection of posts P retrieved from the search engine and outputs a number, where the lower the MRE, the more relevant are the retrieved posts P to the underlying document d;
calculating the Euclidean distance between vector representations of two words as a measure of similarity between them, wherein vector representations of words are derived using a word embedding model;
defining the distance between a word wi and a document d as the minimal distance between a word wi and all the words in the set of words in the input document d, defined as Wd,
averaging the distances of all words wi ∈ Wp, defined as the set of words in p∈P, to the document d, for calculating the distance of a post p from document d;
defining the mean relevance error (MRE) of the collection P to the document d as the average distance of all posts in P from document d and calculating said MRE.
7. An automated interactive optimization method of short queries for improving information retrieval from opaque (black box) search engines, comprising:
a) collecting data including labeled claims from several fact-checking websites, for creating a dataset which is used for evaluation;
b) estimating the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
c) labeling a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
d) finding the most appropriate queries in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine,
wherein the interactive greedy search process includes the following steps, performed by said at least one processor:
splitting the given document's text into a set of words and removing stop words;
during an interactive greedy search process, starting from queries with a single word, sending each query to the opaque search engine and receiving posts as a response;
calculating the mean relevance error (MRE) of the retrieved posts, which reflects the relevance of the retrieved collection of posts to the given document;
adding a keyword that improves the retrieved results MRE, wherein the process is finished in case the MRE is not decreased, or in case the query includes all the document's key-words; and
returning and implementing the algorithm on the query that yields the best MRE.
8. A system for automated interactive keyword optimization for opaque search engines, comprising:
a) a database for storing data for evaluation, including labeled claims collected from several fact-checking websites;
b) at least one processor, adapted to:
b.1) estimate the relevance of posts/query results retrieved from a search engine to a given input document, by calculating the mean relevance error (MRE), based on estimating the minimal distance between words comprising both the retrieved posts and the input document;
b.2) label a subset of claims for evaluation, by choosing a number of claims that gained the maximal and the minimal mean relevance error (MRE); and
b.3) find the most appropriate queries in order to retrieve the maximal number of relevant posts using an opaque search engine, by performing an interactive greedy search for the best word that should be added to the input query, for maximizing the corresponding posts retrieved by the search engine,
wherein the interactive greedy search process, performed by said at least one processor, includes the following steps:
splitting the given document's text into a set of words and removing stop words;
during an interactive greedy search process, starting from queries with a single word, sending each query to the opaque search engine and receiving posts as a response;
calculate the mean relevance error (MRE) of the retrieved posts, which reflects the relevance of the retrieved collection of posts to the given document;
adding a word that improves the retrieved results MRE, wherein the process is finished in case the MRE is not decreased, or in case the query includes all the document's key-words; and
returning and implementing the algorithm on the query that yields the best MRE.
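The interactive greedy search recited in the claims above can be sketched in a few dozen lines of Python. In this sketch, `search` stands in for the opaque search engine (any callable mapping a query string to a list of posts), and `dist` is a word-to-word distance; the description contemplates word-embedding distances, while the exact-match distance used in the usage note below is a toy stand-in. All names, the tie-breaking behavior, and the stopping rule's details are illustrative assumptions, not the claimed implementation.

```python
from typing import Callable, FrozenSet, List, Tuple

def tokenize(text: str, stop_words: FrozenSet[str] = frozenset()) -> List[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w and w not in stop_words]

def mean_relevance_error(doc_words: List[str],
                         posts: List[str],
                         dist: Callable[[str, str], float]) -> float:
    """Average, over the retrieved posts, of the mean minimal distance
    from each post word to its closest document word (lower = more
    relevant). An empty result set is maximally irrelevant (+inf)."""
    if not doc_words:
        return float("inf")
    per_post = []
    for post in posts:
        pw = tokenize(post)
        if not pw:
            continue
        per_post.append(
            sum(min(dist(w, d) for d in doc_words) for w in pw) / len(pw))
    return sum(per_post) / len(per_post) if per_post else float("inf")

def greedy_query(doc_text: str,
                 search: Callable[[str], List[str]],
                 dist: Callable[[str, str], float],
                 stop_words: FrozenSet[str] = frozenset()) -> Tuple[str, float]:
    """Grow the query one word at a time, keeping the candidate word
    whose addition yields the lowest MRE; stop when no word improves
    the MRE or all document words have been used."""
    doc_words = tokenize(doc_text, stop_words)
    candidates = list(dict.fromkeys(doc_words))  # unique, order-preserving
    query: List[str] = []
    best_mre = float("inf")
    while candidates:
        scored = []
        for w in candidates:
            q = " ".join(query + [w])
            scored.append((mean_relevance_error(doc_words, search(q), dist), w))
        mre, w = min(scored)           # ties broken alphabetically
        if mre >= best_mre:            # no improvement -> stop
            break
        best_mre, query = mre, query + [w]
        candidates.remove(w)
    return " ".join(query), best_mre
```

As a usage example, with a toy three-post corpus, a `search` that returns posts containing every query word, and `dist(a, b) = 0.0 if a == b else 1.0`, the document "vaccine autism claim" converges to a single-word query once no additional word lowers the MRE.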
US16/840,538 2019-04-07 2020-04-06 Method and system for interactive keyword optimization for opaque search engines Active 2040-06-10 US11397731B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/840,538 US11397731B2 (en) 2019-04-07 2020-04-06 Method and system for interactive keyword optimization for opaque search engines
US17/854,917 US11809423B2 (en) 2019-04-07 2022-06-30 Method and system for interactive keyword optimization for opaque search engines

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962830474P 2019-04-07 2019-04-07
US16/840,538 US11397731B2 (en) 2019-04-07 2020-04-06 Method and system for interactive keyword optimization for opaque search engines

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/854,917 Continuation US11809423B2 (en) 2019-04-07 2022-06-30 Method and system for interactive keyword optimization for opaque search engines

Publications (2)

Publication Number Publication Date
US20200327120A1 US20200327120A1 (en) 2020-10-15
US11397731B2 true US11397731B2 (en) 2022-07-26

Family

ID=72749297

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/840,538 Active 2040-06-10 US11397731B2 (en) 2019-04-07 2020-04-06 Method and system for interactive keyword optimization for opaque search engines
US17/854,917 Active US11809423B2 (en) 2019-04-07 2022-06-30 Method and system for interactive keyword optimization for opaque search engines

Family Applications After (1)

Application Number Title Priority Date Filing Date
US17/854,917 Active US11809423B2 (en) 2019-04-07 2022-06-30 Method and system for interactive keyword optimization for opaque search engines

Country Status (1)

Country Link
US (2) US11397731B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112765348B (en) * 2021-01-08 2023-04-07 重庆创通联智物联网有限公司 Short text classification model training method and device
CN112989197A (en) * 2021-03-30 2021-06-18 北京工业大学 Responder recommendation method for community question-answering platform
CN114662474B (en) * 2022-04-13 2024-06-11 马上消费金融股份有限公司 Keyword determination method and device, electronic equipment and storage medium
TWI849585B (en) * 2022-11-17 2024-07-21 大鐸資訊股份有限公司 Pure text analysis and calculation - text-based text search system and method

Citations (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6553372B1 (en) * 1998-07-13 2003-04-22 Microsoft Corporation Natural language information retrieval system
US20030182274A1 (en) * 2000-07-27 2003-09-25 Young-June Oh Navigable search engine
US20040243356A1 (en) * 2001-05-31 2004-12-02 Duffy Dominic Gavan Data processing apparatus and method
US20050027666A1 (en) * 2003-07-15 2005-02-03 Vente, Inc Interactive online research system and method
US20060064411A1 (en) * 2004-09-22 2006-03-23 William Gross Search engine using user intent
US20060212413A1 (en) * 1999-04-28 2006-09-21 Pal Rujan Classification method and apparatus
US20070038601A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Aggregating context data for programmable search engines
US20070038600A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Detecting spam related and biased contexts for programmable search engines
US20070092917A1 (en) * 1998-05-01 2007-04-26 Isabelle Guyon Biomarkers for screening, predicting, and monitoring prostate disease
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US7401057B2 (en) * 2002-12-10 2008-07-15 Asset Trust, Inc. Entity centric computer system
US20080177994A1 (en) * 2003-01-12 2008-07-24 Yaron Mayer System and method for improving the efficiency, comfort, and/or reliability in Operating Systems, such as for example Windows
US20090077033A1 (en) * 2007-04-03 2009-03-19 Mcgary Faith System and method for customized search engine and search result optimization
US20090106202A1 (en) * 2007-10-05 2009-04-23 Aharon Mizrahi System And Method For Enabling Search Of Content
US7668812B1 (en) * 2006-05-09 2010-02-23 Google Inc. Filtering search results using annotations
US20100088428A1 (en) * 2008-10-03 2010-04-08 Seomoz, Inc. Index rank optimization system and method
US20100332431A1 (en) * 2007-11-09 2010-12-30 Motorola, Inc. Method and apparatus for modifying a user preference profile
US20110093452A1 (en) * 2009-10-20 2011-04-21 Yahoo! Inc. Automatic comparative analysis
US8516048B2 (en) * 2008-01-24 2013-08-20 International Business Machines Corporation Method for facilitating a real-time virtual interaction
US20130238356A1 (en) * 2010-11-05 2013-09-12 Georgetown University System and method for detecting, collecting, analyzing, and communicating emerging event- related information
US20140075004A1 (en) * 2012-08-29 2014-03-13 Dennis A. Van Dusen System And Method For Fuzzy Concept Mapping, Voting Ontology Crowd Sourcing, And Technology Prediction
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
US20140282586A1 (en) * 2013-03-15 2014-09-18 Advanced Elemental Technologies Purposeful computing
US20140280952A1 (en) * 2013-03-15 2014-09-18 Advanced Elemental Technologies Purposeful computing
US8860717B1 (en) * 2011-03-29 2014-10-14 Google Inc. Web browser for viewing a three-dimensional object responsive to a search query
US8903811B2 (en) * 2008-04-01 2014-12-02 Certona Corporation System and method for personalized search
US20150045713A1 (en) * 2013-08-07 2015-02-12 B. Braun Avitum Ag Device and method for predicting intradialytic parameters
US9002678B1 (en) * 2014-01-10 2015-04-07 King Fahd University Of Petroleum And Minerals Unified approach to detection and isolation of parametric faults using a kalman filter residual-based approach
US20160078057A1 (en) * 2013-09-04 2016-03-17 Shazura, Inc. Content based image retrieval
US20160170814A1 (en) * 2008-02-25 2016-06-16 Georgetown University System and method for detecting, collecting, analyzing, and communicating event-related information
US20160180434A1 (en) * 2014-12-18 2016-06-23 Expedia, Inc. Persona for opaque travel item selection
US20160357731A1 (en) * 2014-01-28 2016-12-08 Somol Zorzin Gmbh Method for Automatically Detecting Meaning and Measuring the Univocality of Text
US9582618B1 (en) * 2016-01-19 2017-02-28 King Fahd University Of Petroleum And Minerals Apparatus and method for monitoring electro- and/or mechanical systems
US20170193112A1 (en) * 2015-12-31 2017-07-06 Quixey, Inc. Transformation And Presentation Of On-Demand Native Application Crawling Results
US20180053207A1 (en) * 2016-08-16 2018-02-22 Adobe Systems Incorporated Providing personalized alerts and anomaly summarization
US20180061459A1 (en) * 2016-08-30 2018-03-01 Yahoo Holdings, Inc. Computerized system and method for automatically generating high-quality digital content thumbnails from digital video
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20180349347A1 (en) * 2017-05-30 2018-12-06 Facebook, Inc. Measuring Phrase Association on Online Social Networks
US20190156253A1 (en) * 2017-11-22 2019-05-23 United Parcel Service Of America, Inc. Automatically generating volume forecasts for different hierarchical levels via machine learning models
US20190279102A1 (en) * 2018-03-06 2019-09-12 Tazi AI Systems, Inc. Continuously learning, stable and robust online machine learning system
US20190311301A1 (en) * 2018-04-10 2019-10-10 Ebay Inc. Dynamically generated machine learning models and visualization thereof
US20200160966A1 (en) * 2018-11-21 2020-05-21 Enlitic, Inc. Triage routing system
US20200252651A1 (en) * 2019-02-06 2020-08-06 Jared Cohn Accelerated video exportation to multiple destinations

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792101B2 (en) * 2015-11-10 2017-10-17 Wesley John Boudville Capacity and automated de-install of linket mobile apps with deep links

Patent Citations (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070092917A1 (en) * 1998-05-01 2007-04-26 Isabelle Guyon Biomarkers for screening, predicting, and monitoring prostate disease
US6553372B1 (en) * 1998-07-13 2003-04-22 Microsoft Corporation Natural language information retrieval system
US20060212413A1 (en) * 1999-04-28 2006-09-21 Pal Rujan Classification method and apparatus
US20030182274A1 (en) * 2000-07-27 2003-09-25 Young-June Oh Navigable search engine
US20040243356A1 (en) * 2001-05-31 2004-12-02 Duffy Dominic Gavan Data processing apparatus and method
US7401057B2 (en) * 2002-12-10 2008-07-15 Asset Trust, Inc. Entity centric computer system
US20080177994A1 (en) * 2003-01-12 2008-07-24 Yaron Mayer System and method for improving the efficiency, comfort, and/or reliability in Operating Systems, such as for example Windows
US20050027666A1 (en) * 2003-07-15 2005-02-03 Vente, Inc Interactive online research system and method
US20060064411A1 (en) * 2004-09-22 2006-03-23 William Gross Search engine using user intent
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US20070038600A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Detecting spam related and biased contexts for programmable search engines
US20070038601A1 (en) * 2005-08-10 2007-02-15 Guha Ramanathan V Aggregating context data for programmable search engines
US7668812B1 (en) * 2006-05-09 2010-02-23 Google Inc. Filtering search results using annotations
US20090077033A1 (en) * 2007-04-03 2009-03-19 Mcgary Faith System and method for customized search engine and search result optimization
US20090106202A1 (en) * 2007-10-05 2009-04-23 Aharon Mizrahi System And Method For Enabling Search Of Content
US20100332431A1 (en) * 2007-11-09 2010-12-30 Motorola, Inc. Method and apparatus for modifying a user preference profile
US8516048B2 (en) * 2008-01-24 2013-08-20 International Business Machines Corporation Method for facilitating a real-time virtual interaction
US20160170814A1 (en) * 2008-02-25 2016-06-16 Georgetown University System and method for detecting, collecting, analyzing, and communicating event-related information
US8903811B2 (en) * 2008-04-01 2014-12-02 Certona Corporation System and method for personalized search
US20100088428A1 (en) * 2008-10-03 2010-04-08 Seomoz, Inc. Index rank optimization system and method
US20110093452A1 (en) * 2009-10-20 2011-04-21 Yahoo! Inc. Automatic comparative analysis
US20130238356A1 (en) * 2010-11-05 2013-09-12 Georgetown University System and method for detecting, collecting, analyzing, and communicating emerging event- related information
US8860717B1 (en) * 2011-03-29 2014-10-14 Google Inc. Web browser for viewing a three-dimensional object responsive to a search query
US20140075004A1 (en) * 2012-08-29 2014-03-13 Dennis A. Van Dusen System And Method For Fuzzy Concept Mapping, Voting Ontology Crowd Sourcing, And Technology Prediction
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US20140280952A1 (en) * 2013-03-15 2014-09-18 Advanced Elemental Technologies Purposeful computing
US20140282586A1 (en) * 2013-03-15 2014-09-18 Advanced Elemental Technologies Purposeful computing
US20160266939A1 (en) * 2013-03-15 2016-09-15 Advanced Elemental Technologies, Inc. Purposeful computing
US20150045713A1 (en) * 2013-08-07 2015-02-12 B. Braun Avitum Ag Device and method for predicting intradialytic parameters
US20160078057A1 (en) * 2013-09-04 2016-03-17 Shazura, Inc. Content based image retrieval
US9002678B1 (en) * 2014-01-10 2015-04-07 King Fahd University Of Petroleum And Minerals Unified approach to detection and isolation of parametric faults using a kalman filter residual-based approach
US20160357731A1 (en) * 2014-01-28 2016-12-08 Somol Zorzin Gmbh Method for Automatically Detecting Meaning and Measuring the Univocality of Text
US20160180434A1 (en) * 2014-12-18 2016-06-23 Expedia, Inc. Persona for opaque travel item selection
US20170193112A1 (en) * 2015-12-31 2017-07-06 Quixey, Inc. Transformation And Presentation Of On-Demand Native Application Crawling Results
US9582618B1 (en) * 2016-01-19 2017-02-28 King Fahd University Of Petroleum And Minerals Apparatus and method for monitoring electro- and/or mechanical systems
US20180053207A1 (en) * 2016-08-16 2018-02-22 Adobe Systems Incorporated Providing personalized alerts and anomaly summarization
US20180061459A1 (en) * 2016-08-30 2018-03-01 Yahoo Holdings, Inc. Computerized system and method for automatically generating high-quality digital content thumbnails from digital video
US20180165554A1 (en) * 2016-12-09 2018-06-14 The Research Foundation For The State University Of New York Semisupervised autoencoder for sentiment analysis
US20180349347A1 (en) * 2017-05-30 2018-12-06 Facebook, Inc. Measuring Phrase Association on Online Social Networks
US20190156253A1 (en) * 2017-11-22 2019-05-23 United Parcel Service Of America, Inc. Automatically generating volume forecasts for different hierarchical levels via machine learning models
US20190279102A1 (en) * 2018-03-06 2019-09-12 Tazi AI Systems, Inc. Continuously learning, stable and robust online machine learning system
US20190311301A1 (en) * 2018-04-10 2019-10-10 Ebay Inc. Dynamically generated machine learning models and visualization thereof
US20200160966A1 (en) * 2018-11-21 2020-05-21 Enlitic, Inc. Triage routing system
US20200252651A1 (en) * 2019-02-06 2020-08-06 Jared Cohn Accelerated video exportation to multiple destinations

Non-Patent Citations (38)

* Cited by examiner, † Cited by third party
Title
Albishre, K. (Jan. 2017). "Effective pseudo-relevance for microblog retrieval". In Proceedings of the Australasian Computer Science Week Multiconference (pp. 1-6). (6 pages).
Al-Khateeb (Mar. 2017). "Query reformulation using WordNet and genetic algorithm". In 2017 Annual Conference on New Trends in Information & Communications Technology Applications (NTICT) (pp. 91-96). IEEE. (6 pages).
Bernard J Jansen et al., "Micro-blogging as online word of mouth branding". CHI 2009 extended abstracts on human factors in computing systems, pp. 3859-3864. ACM, 2009 (6 pages).
C. Zhang,"Automatic keyword extraction from documents using conditional random fields". Journal of Computational Information Systems 4 (3) (2008) 1169-1180. (11 pages).
Chirita et al., "Personalized query expansion for the web", Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 7-14. ACM, 2007 (8 pages).
Christina Boididou et al. "Verifying multimedia use at mediaeval 2016". In MediaEval, 2016 (4 pages).
Chy, A. et al., (2019). Query Expansion for Microblog Retrieval Focusing on an Ensemble of Features. Journal of Information Processing, 27, 61-76. (16 pages).
Cronen-Townsend et al., "Predicting query performance", Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 299-306. ACM, 2002 (8 pages).
Dwaipayan Roy et al., "Using word embeddings for automatic query expansion". arXiv preprint arXiv:1606.07608, 2016. (5 pages).
Ellen M Voorhees. "Query expansion using lexical-semantic relations". In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 61-69. Springer-Verlag New York, Inc., 1994. (10 pages).
He et al., "Inferring query performance using pre-retrieval predictors", International symposium on string processing and information retrieval, pp. 43-54. Springer, 2004 (12 pages).
Jeffrey Pennington et al., "Glove: Global vectors for word representation". In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, 2014 (12 pages).
Jun Wang et al., "Improving short text clustering performance with keyword expansion". In the Sixth International Symposium on Neural Networks (ISNN 2009), pp. 291-298. Springer, 2009 (877 pages).
Kai Shu et al., "Understanding user profiles on social media for fake news detection". In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430-435. IEEE, 2018. (6 pages).
Kenter et al., "Short text similarity with word embeddings", Proceedings of the 24th ACM international conference on information and knowledge management, pp. 1411-1420. ACM, 2015 (10 pages).
Koenemann et al., "A case for interaction: A study of interactive information retrieval behavior and effectiveness", Proceeding of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp. 205-212, Citeseer, 1996 (8 pages).
Kurland et al., "Back to the roots: A probabilistic framework for query performance prediction", Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 823-832. ACM, 2012 (10 pages).
Kusner et al., "From word embeddings to document distances", International Conference on Machine Learning, pp. 957-966, 2015 (10 pages).
Li, C. (Jul. 2014). "Req-rec: High recall retrieval with query pooling and interactive classification". In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 163-172). (10 pages).
Liu et al., "Predicting movie box-office revenues by exploiting large-scale social media content". Multimedia Tools and Applications, 75(3):1509-1528, 2016 (20 pages).
Makki, R. et al., (2018) "ATR-Vis: Visual and interactive information retrieval for parliamentary discussions in twitter". ACM Transactions on Knowledge Discovery from Data (TKDD), 12(1), 1-33. (33 pages).
Mikolov et al., "Advances in pre-training distributed word representations", Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018 (4 pages).
Nogueira, R., & Cho, K. (2017). "Task-oriented query reformulation with reinforcement learning". arXiv preprint arXiv:1704.04572. (10 pages).
Pang, W., & Du, J. (2019). "Query Expansion and Query Fuzzy with Large-Scale Click-through Data for Microblog Retrieval". International Journal of Machine Learning and Computing, 9(3). (9 pages).
Pengqi Liu et al., "Automatic keywords generation for contextual advertising". In Proceedings of the 23rd International Conference on World Wide Web, pp. 345-346. ACM, 2014. (2 pages).
Piotr Bojanowski et al., "Enriching word vectors with subword information". arXiv preprint arXiv:1607.04606, 2016. (13 pages).
Saar Kuzi et al., "Query expansion using word embeddings". In Proceedings of the 25th ACM international conference on information and knowledge management, pp. 1929-1932. ACM, 2016. (4 pages).
Somnath Banerjee et al., "Clustering short texts using Wikipedia". In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 787-788. ACM, 2007. (2 pages).
Svitlana Volkova et al., "Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on twitter". In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (vol. 2: Short Papers), vol. 2, pp. 647-653, 2017 (7 pages).
Tacchini et al., "Some like it hoax: Automated fake news detection in social networks", arXiv preprint arXiv:1704.07506, 2017 (12 pages).
Tomas Mikolov et al., "Efficient estimation of word representations in vector space". arXiv preprint arXiv:1301.3781, 2013. (12 pages).
Wu, P. et al., (Apr. 2006). Query selection techniques for efficient crawling of structured web sources. In 22nd International Conference on Data Engineering (ICDE'06) (pp. 47-47). IEEE. (10 pages).
Xu, B. et al., (2018) "Improving pseudo-relevance feedback with neural network-based word representations". IEEE Access, 6, 62152-62165. (14 pages).
Y. Elovici et al., "Ethical considerations when employing fake identities in online social networks for research". Science and engineering ethics 20 (4) (2014) 1027-1043 (17 pages).
Zamani, H. et al., (Oct. 2016). "Pseudo-relevance feedback based on matrix factorization". In Proceedings of the 25th ACM international conference on information and knowledge management (pp. 1483-1492). (10 pages).
Zhou et al., "Fake news: Fundamental theories, detection strategies and challenges", Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 836-837, ACM, 2019 (2 pages).
Zhou et al., "Query performance prediction in web search environments", Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 543-550. ACM, 2007 (8 pages).
Zhou et al., "Ranking robustness: a novel framework to predict query performance", Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 567-574. ACM, 2006 (8 pages).

Also Published As

Publication number Publication date
US20200327120A1 (en) 2020-10-15
US20220358122A1 (en) 2022-11-10
US11809423B2 (en) 2023-11-07


Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

AS Assignment

Owner name: B. G. NEGEV TECHNOLOGIES AND APPLICATIONS LTD., AT BEN-GURION UNIVERSITY, ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PUZIS, RAMI;ELYASHAR, AVIAD;REUBEN, MAOR;SIGNING DATES FROM 20200423 TO 20200425;REEL/FRAME:052626/0861

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STCF Information on status: patent grant

Free format text: PATENTED CASE