Information Retrieval: Dr. Bassel ALKHATIB

Information Retrieval

Dr. Bassel ALKHATIB


What is this course about?

• Processing
• Indexing
• Retrieving
• … textual data

• Fits in four lines, but is much more complex and interesting than that

Slide 2
Typical IR Task

• Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.

• Find:
– A ranked set of documents that are relevant to the query.

Slide 3
IR System

[Diagram: a document corpus and a query string are the inputs to the IR system, which outputs a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

Slide 4
Information Retrieval and Web Search

Slide 5
Information Retrieval (IR)

• The indexing and retrieval of textual documents.


• Searching for pages on the World Wide Web is the most
recent “killer app.”
• Concerned firstly with retrieving relevant documents to
a query.
• Concerned secondly with retrieving from large sets of
documents efficiently.

Slide 6
Need for IR

• With the growth of the WWW, more than 8 billion documents are indexed by Yahoo and Google

• Various needs for information:


– Search for documents that fall in a given topic
– Search for a specific piece of information
– Search for an answer to a question
– Search for information in a different language
–…
– Search for images
– Search for music
– Search for a (candidate) friend

Slide 7
Definitions: Information Retrieval
• IR is a branch of applied computer science focusing on
– the acquisition,
– organization,
– storage,
– retrieval, and distribution of information.

• IR involves helping users find information that matches their information needs.

• IR has become a center of attention in the web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
Slide 8
Examples of IR systems

• Conventional (library catalog)
Search by keyword, title, author, etc.
• Text-based (Google, Lexis-Nexis, FAST).
Search by keywords. Limited search using queries in natural language.
• Multimedia (Google, WebSeek, SaFe)
Search by visual appearance (shapes, colors,… ).
• Question answering systems (AskJeeves, Answerbus)
Search in (restricted) natural language

• Other:
cross language information retrieval, music retrieval

Slide 9
IR systems on the Web

• Search for Web pages
http://www.google.com
• Search for images
http://www.picsearch.com
• Search for image content
http://images.google.com
• Search for answers to questions
http://www.ask.com
• Music retrieval
http://mixturtle.com

Slide 10
Relevance

• Relevance is a subjective judgment and may include:


– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the
information (information need).

Slide 16
Keyword Search

• Simplest notion of relevance is that the query string appears verbatim in the document.
• Slightly less strict notion is that the words in the query
appear frequently in the document, in any order (bag of
words).
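
The two notions of relevance above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (`tokenize`, `verbatim_match`, `bag_of_words_score`) and the sample document are hypothetical, matching is case-insensitive, and the bag-of-words score is a plain count of query-word occurrences.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim (ignoring case)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

# Hypothetical sample document.
doc = "Information retrieval systems rank documents. Retrieval of information is hard."

print(verbatim_match("information retrieval", doc))      # True
print(verbatim_match("retrieval information", doc))      # False: word order differs
print(bag_of_words_score("retrieval information", doc))  # 4
```

The reordered query fails the verbatim test but still scores under the bag-of-words view, which is exactly the relaxation the second bullet describes.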

Slide 17
Problems with Keywords

• May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”

• May retrieve irrelevant documents that include ambiguous terms.
– “Trojan Horse” (History vs. Computer)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)

Slide 18
Beyond Keywords

• We will cover the basics of keyword-based IR, but…


• We will focus on extensions and recent developments
that go beyond keywords.
• We will cover the basics of building an efficient IR
system, but…
• We will focus on basic capabilities and algorithms rather
than systems issues that allow scaling to industrial size
databases.

Slide 19
Intelligent IR

• Taking into account the meaning of the words used.


• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.

Slide 20
Unstructured (text) vs. structured (database) data in 1996

[Bar chart: unstructured vs. structured data, compared by data volume and by market cap; vertical axis from 0 to 160.]

Slide 21
Unstructured (text) vs. structured (database) data in 2006

[Bar chart: unstructured vs. structured data, compared by data volume and by market cap; vertical axis from 0 to 160.]
Slide 22
Structured vs unstructured data
• Structured data: information in “tables”

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
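
As a rough illustration, the example query can be evaluated over the table above with a simple filter. The in-memory list of dictionaries and the function name `structured_query` are assumptions made for this sketch; a real database would run the equivalent SQL.

```python
# The example table from the slide, held in memory as a list of records.
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

def structured_query(rows, max_salary, manager):
    """Numerical range test on salary AND exact text match on manager."""
    return [
        row["employee"]
        for row in rows
        if row["salary"] < max_salary and row["manager"] == manager
    ]

# Salary < 60000 AND Manager = Smith
print(structured_query(employees, 60000, "Smith"))  # ['Ivy']
```

Every row either satisfies the condition or does not, so the answer is an exact set with no ranking involved.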

Slide 23
Unstructured data

• Typically refers to free text


• Allows
– Keyword-based queries including operators
– More sophisticated “concept” queries, e.g.,
• find all web pages dealing with drug abuse

Slide 24
Information Retrieval
Vs Data Retrieval

• Documents
– Logical unit of text
• articles, books,
• links, web pages
– Other components that come with the text
• figures, charts, graphics
• multimedia

Slide 25
Information Retrieval
Vs Data Retrieval

• Textual Data
– Repository of human intellectual output
• Rich and diverse resources for all answers.
• Meaningful and understandable (to users).
– Free of pre-formatted structures
• continuous
• separated into documents
– Easy to process by the computer

Slide 26
Information Retrieval
Vs Data Retrieval

• Textual Data
– Massive
• Any IR system needs the capability of large scale data
processing.
• Use of indexes and various representations is required.
– Inconsistent
• It’s a human language
- Same information expressed in different ways
- Different information expressed in similar ways.
– Incomplete (It’s an open system)

Slide 27
Information Retrieval
Vs Data Retrieval

• Retrieval
– Text retrieval
– Document retrieval
– Information retrieval

• We can’t retrieve information!
– We can only retrieve documents that contain text which carries information.
– Information can be anywhere
• in the text, in the links, in the process of text.

Slide 28
Information Retrieval
Vs Data Retrieval

Information Retrieval
• Conceptually, information retrieval is used to cover all related
problems in finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing document as the basic unit
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.

Slide 29
Information Retrieval
Vs Data Retrieval

• Information Retrieval Systems


– The goal of IR systems is to help users find information that satisfies
their information needs.
– The process of IR systems is to match two abstractions:
• data abstracted in the system
• queries abstracted from user’s information needs
– Information retrieval is much more difficult than data retrieval

Slide 30
Comparison of data retrieval and
information retrieval

                      Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Model                 Deterministic       Probabilistic
Structure             Highly structured   Less structured

Slide 31
Text-Based Information Retrieval

• Fundamental Techniques
– Document and Query Representation
– Term weighting schemes based on Corpus Statistics
– Retrieval Models
– Document Clustering/Classification
– Data Structures and Search Techniques
– Evaluation Measures

Slide 32
Text-Based Information Retrieval
• New Challenges
– Statistical methods & Machine Learning Techniques applied to Text
Retrieval
– Text Categorization
– Text Summarization
– Cross-Language IR
– Knowledge Representation and use of Knowledge Bases

Slide 33
Multimedia Information Retrieval

• Multimedia IR in general
– Multimedia Data Types
– Challenges of MM IR Systems
– Retrieval Process
• Queries
• Indexing of Documents
• Matching Documents and Query Representation
• Spoken Document Retrieval (SDR)
Slide 34
Search Engines on WWW
• Web Search Engines
– Crawling Agents
– Indexes of the Web pages
– Query Interface
– Answer Interface
– Retrieval Models
– Specific Purpose Search Engines

Slide 35
IR System Architecture

[Diagram of the IR system architecture. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, Index (inverted file). Labeled flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]
Slide 36
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming

• Indexing constructs an inverted index of word to document pointers.
• Searching retrieves documents that contain a given
query token from the inverted index.
• Ranking scores all retrieved documents according to a
relevance metric.
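
The four components above can be put together in a minimal sketch. Everything here is a simplifying assumption rather than a reference design: the stopword list is tiny, `crude_stem` is a naive suffix stripper standing in for a real stemmer, and the relevance metric is just the summed term frequency of the query terms.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # tiny illustrative list

def crude_stem(token):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, remove stopwords, and stem to form index terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Searching and ranking: look up each query term, sum term
    frequencies per document, and sort by descending score."""
    scores = defaultdict(int)
    for term in text_operations(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical three-document corpus.
docs = {
    "doc1": "Indexing and searching documents",
    "doc2": "Ranking the retrieved documents by relevance",
    "doc3": "Searching the web and ranking web pages",
}
index = build_index(docs)
print(search(index, "ranking documents"))
# [('doc2', 2), ('doc1', 1), ('doc3', 1)]
```

Only documents that appear in the postings of at least one query term are scored, which is why the inverted index makes searching cheap compared with scanning every document.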

Slide 37
IR System Components (continued)

• User Interface manages interaction with the user:


– Query input and document output.
– Relevance feedback.
– Visualization of results.

• Query Operations transform the query to improve retrieval:
– Query expansion using a thesaurus
– Query transformation using relevance feedback.
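
Thesaurus-based query expansion can be sketched as follows. The `THESAURUS` dictionary is a hypothetical hand-made example (reusing the synonym pairs from the "Problems with Keywords" slide), and `expand_query` is an illustrative name, not part of any particular system.

```python
# Hypothetical hand-made thesaurus: term -> list of synonyms.
THESAURUS = {
    "restaurant": ["café"],
    "china": ["prc"],
}

def expand_query(query, thesaurus):
    """Add the synonyms of each query term, keeping order and dropping duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("restaurant China", THESAURUS))
# ['restaurant', 'café', 'china', 'prc']
```

The expanded term list can then be passed to the searching component, so that documents using only the synonym are still retrieved.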

Slide 38
Web Search

• Application of IR to HTML documents on the World Wide Web.
• Differences:
– Must assemble document corpus by spidering the web.
– Can exploit the structural layout information in HTML
(XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
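
Both spidering and link analysis start from the hyperlinks in a page. The sketch below extracts them using only the Python standard library; the class name `LinkExtractor`, the sample HTML, and the `example.com` / `example.org` URLs are placeholders chosen for illustration, and a real spider would also fetch pages, respect robots.txt, and avoid revisiting URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Placeholder page content and URLs.
page = (
    '<html><body><a href="/about">About</a> '
    '<a href="http://example.org/x">X</a></body></html>'
)
print(extract_links(page, "http://example.com/index.html"))
# ['http://example.com/about', 'http://example.org/x']
```

A spider would add the extracted URLs to its queue of pages to fetch, while link analysis would record them as edges in a graph of pages.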

Slide 39
Web Search System

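[Diagram: a spider crawls the Web to assemble the document corpus. The corpus and a query string are the inputs to the IR system, which outputs a ranked list of documents (1. Page1, 2. Page2, 3. Page3, ...).]

Slide 40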
Other IR-Related Tasks

• Automated document categorization


• Information filtering (spam filtering)
• Automated document clustering
• Recommending information or products
• Information extraction
• Information integration
• Question answering

Slide 41
History of IR

• 1960-70’s:
– Initial exploration of text retrieval systems for “small” corpora of
scientific abstracts, and law and business documents.
– Development of the basic Boolean and vector-space models of
retrieval.
– Prof. Salton and his students at Cornell University are the
leading researchers in the area.

Slide 42
IR History Continued

• 1980’s:
– Large document database systems, many run by companies:
• Lexis-Nexis
• Dialog
• MEDLINE

Slide 43
IR History Continued

• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista

Slide 44
IR History Continued

• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering

Slide 45
Recent IR History

• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track

Slide 46
Recent IR History

• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization

Slide 47
Related Areas

• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning

Slide 48
Database Management

• Focused on structured data stored in relational tables rather than free-form text.
• Focused on efficient processing of well-defined queries
in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML)
brings it closer to IR.

Slide 49
Library and Information Science

• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).
• Concerned with effective categorization of human
knowledge.
• Concerned with citation analysis and bibliometrics
(structure of information).
• Recent work on digital libraries brings it closer to CS
& IR.

Slide 50
Artificial Intelligence

• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
– First-order Predicate Logic
– Bayesian Networks

• Recent work on web ontologies and intelligent information agents brings it closer to IR.

Slide 51
Natural Language Processing

• Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse.
• Ability to analyze syntax (phrase structure) and
semantics could allow retrieval based on meaning rather
than keywords.

Slide 52
Natural Language Processing:
IR Directions

• Methods for determining the sense of an ambiguous word based on context (word sense disambiguation).
• Methods for identifying specific pieces of information in
a document (information extraction).
• Methods for answering specific NL questions from
document corpora.

Slide 53
Machine Learning

• Focused on the development of computational systems that improve their performance with experience.
• Automated classification of examples based on
learning concepts from labeled training examples
(supervised learning).
• Automated methods for clustering unlabeled
examples into meaningful groups (unsupervised
learning).

Slide 54
Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.

• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).

• Learning for Information Extraction


• Text Mining

Slide 55