IR

Information Retrieval
Database System Concepts, 6th Ed.

©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Information Retrieval
 Relevance Ranking Using Terms

 Relevance Using Hyperlinks
 Synonyms., Homonyms, and Ontologies
 Web Search Engines
 Information Retrieval and Structured Data
 Directories
Database System Concepts - 6th Edition 21.2 ©Silberschatz, Korth and Sudarshan
Information Retrieval Systems
 Information retrieval (IR) systems use a simpler data
model than database systems
 Information organized as a collection of documents
 Documents are unstructured, no schema
 Information retrieval locates relevant documents, on the
basis of user input such as keywords or example
documents
 e.g., find documents containing the words “database
systems”
 Can be used even on textual descriptions provided with
non-textual data such as images
 Web search engines are the most familiar example of IR
systems
Information Retrieval Systems
(Cont.)
 Differences from database systems
 IR systems don’t deal with transactional updates
(including concurrency control and recovery)
 Database systems deal with structured data, with
schemas that define the data organization
 IR systems deal with some querying issues not
generally addressed by database systems
 Approximate searching by keywords
 Ranking of retrieved answers by estimated degree
of relevance
Keyword Search
 In full text retrieval, all the words in each document are
considered to be keywords.
 We use the word term to refer to the words in a document
 Information-retrieval systems typically allow query expressions
formed using keywords and the logical connectives and, or, and
not
 Ands are implicit, even if not explicitly specified
 Ranking of documents on the basis of estimated relevance to a
query is critical
 Relevance ranking is based on factors such as
 Term frequency
– Frequency of occurrence of query keyword in document

 Inverse document frequency
– How many documents the query keyword occurs in

» Fewer  give more importance to keyword
 Hyperlinks to documents
– More links to a document  document is more important
Relevance Ranking Using Terms
 TF-IDF (Term frequency/Inverse Document frequency)
ranking:
 Let n(d) = number of terms in the document d
 n(d, t) = number of occurrences of term t in the
document d.
 Relevance of a document dn(to
d, at)term t
TF (d, t) = log 1+
n(d)
 The log factor is to avoid excessive weight to

frequent terms
r ( d , Q) =  TF (d, t)
 Relevance of document to query Q
tQ n(t)
Relevance Ranking Using Terms
(Cont.)
 Most systems add to the above model
 Words that occur in title, author list, section
headings, etc. are given greater importance
 Words whose first occurrence is late in the document
are given lower importance
 Very common words such as “a”, “an”, “the”, “it” etc.
are eliminated
 Called stop words
 Proximity: if keywords in query occur close together
in the document, the document has higher
importance than if they occur far apart
 Documents are returned in decreasing order of
relevance score
 Usually only top few documents are returned, not all
Similarity Based Retrieval
 Similarity based retrieval - retrieve documents similar to
a given document
 Similarity may be defined on the basis of common
words
 E.g., find k terms in A with highest TF (d, t ) / n (t )
and use these terms to find relevance of other
documents.
 Relevance feedback: Similarity can be used to refine
answer set to keyword query
 User selects a few relevant documents from those
retrieved by keyword query, and system finds other
documents similar to these
Relevance Using Hyperlinks
 Number of documents relevant to a query can be enormous
if only term frequencies are taken into account
 Using term frequencies makes “spamming” easy
 E.g., a travel agency can add many occurrences of
the words “travel” to its page to make its rank very
high
 Most of the time people are looking for pages from popular
sites
 Idea: use popularity of Web site (e.g., how many people
visit it) to rank site pages that match given keywords
 Problem: hard to find actual popularity of site
 Solution: next slide
Relevance Using Hyperlinks (Cont.)
 Solution: use number of hyperlinks to a site as a measure of the
popularity or prestige of the site
 Count only one hyperlink from each site (why? - see previous
slide)
 Popularity measure is for site, not for individual page
 But, most hyperlinks are to root of site
 Also, concept of “site” difficult to define since a URL
prefix like cs.yale.edu contains many unrelated pages of
varying popularity
 Refinements
 When computing prestige based on links to a site, give more
weight to links from sites that themselves have higher
prestige
 Definition is circular
 Set up and solve system of simultaneous linear equations
 Above idea is basis of the Google PageRank ranking
mechanism
Relevance Using Hyperlinks (Cont.)
 Connections to social networking theories that ranked
prestige of people
 E.g., the president of the U.S.A has a high prestige
since many people know him
 Someone known by multiple prestigious people has high
prestige
Synonyms and Homonyms
 Synonyms
 E.g., document: “motorcycle repair”, query: “motorcycle
maintenance”
 Need to realize that “maintenance” and “repair” are
synonyms
 System can extend query as “motorcycle and (repair or
maintenance)”
 Homonyms
 E.g., “object” has different meanings as noun/verb
 Can disambiguate meanings (to some extent) from the context
 Extending queries automatically using synonyms can be
problematic
 Need to understand intended meaning in order to infer
synonyms
 Or verify synonyms with user
 Synonyms may have other meanings as well
Concept-Based Querying
 Approach
 For each word, determine the concept it represents
from context
 Use one or more ontologies:
 Hierarchical structure showing relationship
between concepts
 E.g., the ISA relationship that we saw in the E-R
model
 This approach can be used to standardize terminology in
a specific field
 Ontologies can link multiple languages
Web Search Engines
 Web crawlers are programs that locate and gather
information on the Web
 Recursively follow hyperlinks present in known
documents, to find other documents
 Starting from a seed set of documents
 Fetched documents
 Handed over to an indexing system
 Can be discarded after indexing, or store as a
cached copy
 Crawling the entire Web would take a very large amount
of time
 Search engines typically cover only a part of the
Web, not all of it
 Take months to perform a single crawl
Web Crawling (Cont.)
 Crawling is done by multiple processes on multiple
machines, running in parallel
 Set of links to be crawled stored in a database
 New links found in crawled pages added to this set,
to be crawled later
Information Retrieval and
Structured Data
 Information retrieval systems originally treated
documents as a collection of words
 Information extraction systems infer structure from
documents, e.g.:
 Extraction of house attributes (size, address, number
of bedrooms, etc.) from a text advertisement
 Extraction of topic and people named from a new
article
 Relations or XML structures used to store extracted
data
 System seeks connections among data to answer
queries
 Question answering systems
Directories
 Storing related documents together in a library

facilitates browsing
 Users can see not only requested document but also
related ones.
 Browsing is facilitated by classification system that
organizes logically related documents together.
 Organization is hierarchical: classification hierarchy
A Classification Hierarchy For A Library
System
Classification DAG
 Documents can reside in multiple places in a hierarchy in

an information retrieval system, since physical location
is not important.
 Classification hierarchy is thus Directed Acyclic Graph
(DAG)
A Classification DAG For A Library
Information Retrieval System
Web Directories
 A Web directory is just a classification directory on Web
pages
 E.g., Yahoo! Directory, Open Directory project
 Issues:
 What should the directory hierarchy be?
 Given a document, which nodes of the directory
are categories relevant to the document
 Often done manually
 Classification of documents into a hierarchy may
be done based on term similarity

IR

Uploaded by

Copyright:

Available Formats

IR

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

IR

Uploaded by

Copyright:

Available Formats

Information Retrieval

Database System Concepts, 6th Ed.

 Relevance Ranking Using Terms

– Frequency of occurrence of query keyword in document

– How many documents the query keyword occurs in

– More links to a document  document is more important

 The log factor is to avoid excessive weight to

 Storing related documents together in a library

 Documents can reside in multiple places in a hierarchy in

You might also like