Information Retrieval: Dr. Bassel ALKHATIB
• Processing
• Indexing
• Retrieving
• … textual data
Slide 2
Typical IR Task
• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
• Find:
– A ranked set of documents that are relevant to the query.
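The task above can be sketched in a few lines. This is only an illustrative toy, not a method from the slides: it assumes relevance can be approximated by counting how often query words occur in each document, and the names `rank_documents` and `corpus` are hypothetical.

```python
from collections import Counter


def rank_documents(corpus, query):
    """Return (doc_id, score) pairs, best match first.

    Assumption: the score is the total number of occurrences of the
    query words in the document. Documents scoring zero are dropped.
    """
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Sort by descending score; break ties by document id.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Given: a corpus of documents and a query string.
corpus = {
    "Doc1": "information retrieval finds relevant documents",
    "Doc2": "databases store structured data in tables",
    "Doc3": "web search is information retrieval on the web",
}
# Find: a ranked set of documents relevant to the query.
ranking = rank_documents(corpus, "information retrieval web")
print(ranking)
```

Real systems replace the raw word count with a retrieval model and term weights, but the input and output have the same shape.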
Slide 3
IR System
[Figure: a document corpus and a query string are input to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Slide 4
Information Retrieval
and Web Search
Information Retrieval
(IR)
Slide 6
Need for IR
Slide 7
Definitions: Information Retrieval
• IR is a branch of applied computer science focusing on
– the acquisition,
– organization,
– storage,
– retrieval, and distribution of information.
• Other:
cross-language information retrieval, music retrieval
Slide 9
IR systems on the Web
Slide 10
Relevance
Slide 16
Keyword Search
Slide 17
Problems with Keywords
Slide 18
Beyond Keywords
Slide 19
Intelligent IR
Slide 20
Unstructured (text) vs. structured
(database) data in 1996
[Bar chart: unstructured vs. structured data, compared by data volume and market cap; vertical axis from 0 to 160]
Slide 21
Unstructured (text) vs. structured
(database) data in 2006
[Bar chart: unstructured vs. structured data, compared by data volume and market cap; vertical axis from 0 to 160]
Slide 22
Structured vs unstructured data
• Structured data : information in “tables”
Slide 23
Unstructured data
Slide 24
Information Retrieval
Vs Data Retrieval
• Documents
– Logical unit of text
• articles, books,
• links, web pages
– Other components that come with the text
• figures, charts, graphics
• multimedia
Slide 25
Information Retrieval
Vs Data Retrieval
• Textual Data
– Repository of human intellect
• Rich and diverse resources for all answers.
• Meaningful and understandable (to users).
– Free of pre-formatted structures
• continuous
• separated into documents
– Easy to process by the computer
Slide 26
Information Retrieval
Vs Data Retrieval
• Textual Data
– Massive
• Any IR system must be capable of large-scale data processing.
• Indexes and various representations are required.
– Inconsistent
• It is human language
- The same information is expressed in different ways.
- Different information is expressed in similar ways.
– Incomplete (It’s an open system)
Slide 27
Information Retrieval
Vs Data Retrieval
• Retrieval
– Text retrieval
– Document retrieval
– Information retrieval
Slide 28
Information Retrieval
Vs Data Retrieval
Information Retrieval
• Conceptually, information retrieval is used to cover all related
problems in finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing the document as the basic unit
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
Slide 29
Information Retrieval
Vs Data Retrieval
Slide 30
Comparison of data retrieval and
information retrieval
Slide 31
Text-Based Information Retrieval
• Fundamental Techniques
– Document and Query Representation
– Term weighting schemes based on Corpus Statistics
– Retrieval Models
– Document Clustering/Classification
– Data Structures and Search Techniques
– Evaluation Measures
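As a sketch of what "term weighting based on corpus statistics" can look like, the block below computes TF-IDF weights. The slides do not name a specific scheme, so TF-IDF and the exact formula (raw term frequency times log(N / document frequency)) are assumptions, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: a list of token lists. Returns one {term: weight} dict per document.

    Assumed formula: weight = tf * log(N / df), where tf is the raw count
    of the term in the document, N is the number of documents, and df is
    the number of documents that contain the term.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({term: tf[term] * math.log(n / df[term]) for term in tf})
    return weights


docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

A term that appears in every document ("the") gets weight zero, while a term found in only one document ("dog") gets the highest weight, which is the intuition behind corpus-based weighting.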
Slide 32
Text-Based Information Retrieval
• New Challenges
– Statistical methods & Machine Learning Techniques applied to Text
Retrieval
– Text Categorization
– Text Summarization
– Cross-Language IR
– Knowledge Representation and use of Knowledge Bases
Slide 33
Multimedia Information Retrieval
• Multimedia IR in general
– Multimedia Data Types
– Challenges of MM IR Systems
– Retrieval Process
• Queries
• Indexation of Documents
• Matching Documents and Query Representation
• Spoken Document Retrieval (SDR)
Slide 34
Search Engines on WWW
• Web Search Engines
– Crawling Agents
– Indexes of the Web pages
– Query Interface
– Answer Interface
– Retrieval Models
– Specific Purpose Search Engines
Slide 35
IR System Architecture
[Figure: IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Labeled data: User Need, Text, Logical View, User Feedback, Query, Retrieved Docs, Ranked Docs.]
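The inverted file at the center of the architecture can be illustrated with a minimal sketch. It assumes whitespace tokenization and a simple Boolean AND search; the function names `build_inverted_index` and `search_all` are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "web search engines crawl the web",
    2: "search engines build an index",
    3: "an inverted index maps terms to documents",
}
index = build_inverted_index(docs)
hits = search_all(index, "search index")
print(index["index"], hits)
```

Searching consults only the postings of the query terms rather than scanning every document, which is why indexing happens ahead of query time.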
Slide 36
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
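The two text operations listed above can be sketched as follows. The stopword list and the suffix-stripping rule are deliberately tiny assumptions for illustration; a real system would use a fuller stopword list and an established stemmer such as Porter's algorithm.

```python
import re

# Assumed, deliberately short stopword list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Assumed suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase and tokenize, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOPWORDS]


terms = index_terms("The systems are indexing and retrieving documents")
print(terms)  # ['system', 'index', 'retriev', 'document']
```

Note that a stem such as "retriev" need not be a dictionary word; it only has to be the same for all variants of a term.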
Slide 37
IR System Components (continued)
Slide 38
Web Search
Slide 39
Web Search System
[Figure: a query string is input to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
Slide 40
Other IR-Related Tasks
Slide 41
History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of
scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of
retrieval.
– Prof. Salton and his students at Cornell University are the
leading researchers in the area.
Slide 42
IR History Continued
• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE
Slide 43
IR History Continued
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
Slide 44
IR History Continued
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
Slide 45
Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
Slide 46
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
Slide 47
Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Slide 48
Database Management
Slide 49
Library and Information Science
Slide 50
Artificial Intelligence
Slide 51
Natural Language Processing
Slide 52
Natural Language Processing:
IR Directions
Slide 53
Machine Learning
Slide 54
Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
Slide 55