Information Retrieval: Dr. Bassel ALKHATIB

Information Retrieval
Dr. Bassel ALKHATIB

What is this course about?
• Processing
• Indexing
• Retrieving
• … textual data
• Fits in four lines, but much more complex and interesting than that
Slide 2
Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.
Slide 3
IR System
[Figure: a document corpus and a query string go into the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]
Slide 4
Information Retrieval and Web Search (IR)
• The indexing and retrieval of textual documents.
• Searching for pages on the World Wide Web is the most
recent “killer app.”
• Concerned firstly with retrieving relevant documents to
a query.
• Concerned secondly with retrieving from large sets of
documents efficiently.
Slide 6
Need for IR
• With the growth of the WWW, more than 8 billion documents are indexed by Yahoo and Google.
• Various needs for information:
– Search for documents that fall under a given topic
– Search for a specific piece of information
– Search for an answer to a question
– Search for information in a different language
–…
– Search for images
– Search for music
– Search for a (candidate) friend
Slide 7
Definitions: Information Retrieval
• IR is a branch of applied computer science focusing on
– the acquisition,
– organization,
– storage,
– retrieval, and distribution of information.
• IR involves helping users find information that matches their information needs.
• IR has become a center of focus in the web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
Slide 8
Examples of IR systems
• Conventional (library catalog)
Search by keyword, title, author, etc.
• Text-based (Google, Lexis-Nexis, FAST)
Search by keywords. Limited search using queries in natural language.
• Multimedia (Google, WebSeek, SaFe)
Search by visual appearance (shapes, colors,… ).
• Question answering systems (AskJeeves, Answerbus)
Search in (restricted) natural language
• Other:
cross language information retrieval, music retrieval
Slide 9
IR systems on the Web
• Search for Web pages
http://www.google.com
• Search for images
http://www.picsearch.com
• Search for image content
http://images.google.com
• Search for answers to questions
http://www.ask.com
• Music retrieval
http://mixturtle.com
Slide 10
Relevance
• Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the
information (information need).
Slide 16
Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
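The two notions of relevance above can be sketched in a few lines of Python. This is only an illustration: the function names, the tokenization rule, and the sample document are assumptions made for the example, not part of the course material.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: count how often the query words occur, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in tokenize(query))

# Hypothetical sample document, chosen so the two notions disagree.
doc = "Retrieval of information is hard; information needs vary."
print(verbatim_match("information retrieval", doc))      # False
print(bag_of_words_score("information retrieval", doc))  # 3
```

The verbatim test fails because the words appear in a different order, while the bag-of-words score still credits the document for containing them.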
Slide 17
Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
– “Trojan Horse” (History vs. Computer)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)
Slide 18
Beyond Keywords
• We will cover the basics of keyword-based IR, but…
• We will focus on extensions and recent developments that go beyond keywords.
• We will cover the basics of building an efficient IR system, but…
• We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial-size databases.
Slide 19
Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.
Slide 20
Unstructured (text) vs. structured (database) data in 1996
[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap; vertical axis from 0 to 160.]
Slide 21
Unstructured (text) vs. structured (database) data in 2006
[Figure: bar chart comparing unstructured and structured data on two measures, data volume and market cap; vertical axis from 0 to 160.]
Slide 22
Structured vs unstructured data
• Structured data: information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
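As a rough illustration of such a query, the table above can be held as a list of records and filtered with a range condition plus an exact match. The `select` helper and the field names are assumptions for this sketch; a real database would evaluate the same condition in SQL.

```python
# The three rows of the example table, as plain Python records.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

def select(rows, predicate):
    """Return the rows for which the predicate holds (hypothetical helper)."""
    return [row for row in rows if predicate(row)]

# Salary < 60000 AND Manager = Smith
matches = select(
    employees,
    lambda row: row["salary"] < 60000 and row["manager"] == "Smith",
)
print([row["employee"] for row in matches])  # ['Ivy']
```

Every row either satisfies the condition or does not; there is no notion of a partially matching or "more relevant" row.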
Slide 23
Unstructured data
• Typically refers to free text
• Allows
– Keyword-based queries including operators
– More sophisticated “concept” queries, e.g.,
• find all web pages dealing with drug abuse
Slide 24
Information Retrieval vs. Data Retrieval
• Documents
– Logical unit of text
• articles, books,
• links, web pages
– Other components that come with the text
• figures, charts, graphics
• multimedia
Slide 25
Information Retrieval vs. Data Retrieval
• Textual Data
– Repository of human intellectual work
• Rich and diverse resources for all answers.
• Meaningful and understandable (to users).
– Free of pre-formatted structures
• continuous
• separated into documents
– Easy for the computer to process
Slide 26
Information Retrieval vs. Data Retrieval
• Textual Data
– Massive
• Any IR system needs the capability for large-scale data processing.
• The use of indexes and various representations is required.
– Inconsistent
• It’s a human language
- The same information is expressed in different ways.
- Different information is expressed in similar ways.
– Incomplete (It’s an open system)
Slide 27
Information Retrieval vs. Data Retrieval
• Retrieval
– Text retrieval
– Document retrieval
– Information retrieval
• We can’t retrieve information!
– We can only retrieve documents that contain text that carries information.
– Information can be anywhere
• in the text, in the links, in the processing of text.
Slide 28
Information Retrieval vs. Data Retrieval
• Conceptually, information retrieval is used to cover all related
problems in finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing document as the basic unit
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
Slide 29
Information Retrieval vs. Data Retrieval
• Information Retrieval Systems
– The goal of IR systems is to help users find information that satisfies
their information needs.
– The process of IR systems is to match two abstractions:
• data abstracted in the system
• queries abstracted from user’s information needs
– Information retrieval is much more difficult than data retrieval
Slide 30
Comparison of data retrieval and information retrieval

                      Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Model                 Deterministic       Probabilistic
Structure             Highly structured   Less structured
Slide 31
Text-Based Information Retrieval
• Fundamental Techniques
– Document and Query Representation
– Term weighting schemes based on Corpus Statistics
– Retrieval Models
– Document Clustering/Classification
– Data Structures and Search Techniques
– Evaluation Measures
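One widely used example of a term-weighting scheme based on corpus statistics is TF-IDF, combined with cosine similarity for matching. The sketch below is an assumption about what such a scheme might look like, not the specific formulation used later in the course; the toy documents and function names are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical three-document corpus, already tokenized.
docs = [
    "apple pie recipe".split(),
    "apple computer review".split(),
    "banana bread recipe".split(),
]
vectors = tf_idf_vectors(docs)
# Documents 0 and 1 share "apple"; documents 1 and 2 share no terms.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[1], vectors[2]))  # True
```

With this weighting, a term that occurs in every document gets weight zero, which reflects the idea that it does not help distinguish one document from another.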
Slide 32
Text-Based Information Retrieval
• New Challenges
– Statistical methods & Machine Learning Techniques applied to Text
Retrieval
– Text Categorization
– Text Summarization
– Cross-Language IR
– Knowledge Representation and use of Knowledge Bases
Slide 33
Multimedia Information Retrieval
• Multimedia IR in general
– Multimedia Data Types
– Challenges of MM IR Systems
– Retrieval Process
• Queries
• Indexing of Documents
• Matching Documents and Query Representation
• Spoken Document Retrieval (SDR)
Slide 34
Search Engines on WWW
• Web Search Engines
– Crawling Agents
– Indexes of the Web pages
– Query Interface
– Answer Interface
– Retrieval Models
– Specific Purpose Search Engines
Slide 35
IR System Architecture
[Figure: architecture diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Labeled flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]
Slide 36
IR System Components
• Text Operations form index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of word-to-document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
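The four components above can be sketched end to end in Python. Everything here is a simplifying assumption for illustration: the stopword list is tiny, `stem` is a crude suffix stripper standing in for a real stemmer, and ranking uses summed term frequency as a placeholder relevance metric.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}  # tiny illustrative list

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, remove stopwords, and stem to form index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Indexing: map each term to the documents containing it, with counts."""
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Searching and ranking: score documents by summed term frequency."""
    scores = defaultdict(int)
    for term in text_operations(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical three-document corpus.
docs = {
    "Doc1": "Indexing the documents",
    "Doc2": "Searching documents and ranking documents",
    "Doc3": "The history of libraries",
}
index = build_index(docs)
print(search(index, "ranking documents"))  # ['Doc2', 'Doc1']
```

Because the query passes through the same text operations as the documents, "ranking" and "documents" are reduced to the same index terms that were stored at indexing time.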
Slide 37
IR System Components (continued)
• User Interface manages interaction with the user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to improve retrieval:
– Query expansion using a thesaurus
– Query transformation using relevance feedback.
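Query expansion with a thesaurus can be sketched as a simple lookup. The `THESAURUS` dictionary below is a hand-made assumption built from the synonym pairs mentioned under "Problems with Keywords" (with "cafe" written without the accent); a real system would draw on a much larger resource.

```python
# Hypothetical hand-built thesaurus: term -> set of synonyms.
THESAURUS = {
    "restaurant": {"cafe"},
    "cafe": {"restaurant"},
    "prc": {"china"},
    "china": {"prc"},
}

def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus any synonyms listed in the thesaurus."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("restaurant in China"))
# ['restaurant', 'cafe', 'in', 'china', 'prc']
```

The expanded term list can then be passed to the searching component, so that a document mentioning only "cafe" is still retrieved for a query about restaurants.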
Slide 38
Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
– Must assemble document corpus by spidering the web.
– Can exploit the structural layout information in HTML
(XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
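Two of these differences, using HTML structure and following links, can be illustrated with Python's standard `html.parser` module. This is a minimal sketch, assuming the page is already in memory; the `LinkExtractor` class and the example.com URLs are hypothetical, and a real spider would also fetch pages, respect robots.txt, and avoid revisiting URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the page title and the absolute URLs of outgoing links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical page content, supplied as a string instead of being downloaded.
page = """<html><head><title>IR course</title></head>
<body><a href="/slides.html">Slides</a> <a href="http://example.org/">Other</a></body></html>"""
parser = LinkExtractor("http://example.com/index.html")
parser.feed(page)
print(parser.title)  # IR course
print(parser.links)  # ['http://example.com/slides.html', 'http://example.org/']
```

The extracted links tell a spider which pages to visit next, and the same link data can later feed link-based ranking.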
Slide 39
Web Search System
[Figure: a web spider assembles the document corpus; the corpus and a query string go into the IR system, which returns ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]
Slide 40
Other IR-Related Tasks
• Automated document categorization
• Information filtering (spam filtering)
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering
Slide 41
History of IR
• 1960s-70s:
– Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of retrieval.
– Prof. Salton and his students at Cornell University were the leading researchers in the area.
Slide 42
IR History Continued
• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
Slide 43
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
Slide 44
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
Slide 45
Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
Slide 46
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
Slide 47
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Slide 48
Database Management
• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries
in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML)
brings it closer to IR.
Slide 49
Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).
• Concerned with effective categorization of human
knowledge.
• Concerned with citation analysis and bibliometrics
(structure of information).
• Recent work on digital libraries brings it closer to CS
& IR.
Slide 50
Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to IR.
Slide 51
Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse.
• Ability to analyze syntax (phrase structure) and
semantics could allow retrieval based on meaning rather
than keywords.
Slide 52
Natural Language Processing:
IR Directions
• Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).
• Methods for identifying specific pieces of information in
a document (information extraction).
• Methods for answering specific NL questions from
document corpora.
Slide 53
Machine Learning
• Focused on the development of computational systems that improve their performance with experience.
• Automated classification of examples based on
learning concepts from labeled training examples
(supervised learning).
• Automated methods for clustering unlabeled
examples into meaningful groups (unsupervised
learning).
Slide 54
Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
Slide 55
