A Theme-Based Search Technique PDF
A Theme-Based Search Technique PDF
A Theme-Based Search Technique PDF
367
Abstract: - This work presents an intelligent search engine, called ORCA, that returns the most relevant
documents for user’s queries. This search engine analyses the queries and builds themes (models) to be used
when the engine is confronted with similar queries. The intelligent component is used for constructing a model
of the user behavior and using that model to fetch and even pre-fetch information and documents considered of
interest to the user. It uses both latent semantic analysis and web page feature selection for clustering web
pages. Latent semantic analysis is used to find the semantic relations between keywords, and between
documents.
Keywords: Information Filtering, User’s Model, Search Engines, Latent Search Analysis.
followed to get to the desired page or they have been accessing the Theme Database where the results of
mistakenly visited. Unfortunately, both of the above the search are stored, organized and managed by the
engines lack efficient techniques to deal with this last Processing Manager.
issue.
In short, intelligence claimed to be embodied in
these engines lie in their ability to model the user’s
access behavior and use that model to anticipate and
predict future access patterns. It is worth noting that Look-Up
these intelligent search tools are still in their Table
experimental stages. It will be interesting to see how
effective these tools are after some performance Query
evaluations are reported. It goes without saying that
any attempt to include some sort of intelligence, as
the one described above, would require incorporating
Query
complicated machine learning and automatic user Dispos
modeling techniques in the design of search engines. e Theme Database
Proc Manager
Theme Manager
2 Themes – User’s Behavior Models
In this approach, a search engine should come up
with the models that represent best of the user access
patterns and various areas of interest. We use explicit Doc. Database
planning mechanism to model the user behavior on
the Web. In this mechanism, the user is required to
explicitly spell out his/her interests that which
externalize and represent his access patterns. In this
way, the information about the user overall access HOTBOT 37.Com
Yahoo AltaVista
behavior on the Web is analyzed by the search
engine. This approach greatly simplifies the
problems associated with designing systems for
intelligent or automatic deduction of the user access
patterns. WWW
2.1 The Architecture Figure 1: System Architecture
Here we would like to lay down an architecture of
the User’ Model (Theme) component of the system, 2.2 The Operations
which is placed on top of the existing search engines. After retrieving or creating themes for the current
The system is written in Visual Basic and is designed user, the system displays the Current Themes and
to help the users in avoiding seeing undesired Search Criteria window on the screen and allows the
information when carrying out the search. user to add, modify, or delete his search preferences.
At the architectural level (shown in Figure 1), For example, the user can enter four search criteria,
the system consists of two main components, namely but decide to submit only the first and the last ones
the Theme Manager and the Processing Manager. for search. The search engines selected can be Yahoo
The Theme Manager serves several purposes for the first criteria and 37.Com for the last criteria.
besides being the interface which communicates with After the user submits the searches, the system
the users. As an interface, the Theme Manager processes the user’s query using the noise-disposal
allows the user to establish his/her personal profile, procedure in order to produce a query consisting of
which includes such items as (1) set of keywords keywords only. It, then, uses these keywords to look
(search criteria) which captures the topic or concept up for a theme from a table called look-up table,
of interest to the user, (2) the desired search engine which contains all the themes in the database. Using
and (3) the options for live update. The Theme a partial matching process, the system tries to match
Manager maintains individual user models in the the first keyword in the user’s query with the first
Theme Database. The Manager is also responsible column in the table. If a match is found, it checks the
for submitting the queries to the search engines and next field that contains the themes or tables related to
the first keyword and once again it tries to find
Proceedings of the 11th WSEAS International Conference on SYSTEMS, Agios Nikolaos, Crete Island, Greece, July 23-25, 2007 369
another match. This process will continue until the information in the document. At the same time, we
system finds the theme that contains all the URLs. also extract text features from web page content.
The process continues in the same manner with the
other keywords in the user’s query. 3 System Implementation
The system also deals with empty queries, the
case of having only one word and this word is final
sub theme, and the case of none of the given 3.1 Query Dispose
keywords is a final sub-theme and the case when the At this stage, the query is modified by removing all
query doesn’t exist in the database. For the last case, the noise and all the unnecessary words. The system,
the system creates a new table with its description as then, converts the query into individual keywords.
the query keywords and sends the queries to the What we mean by noise is the words that are helpless
search engines selected and the results produced are in the searching process such as find, about and
fed into the Processing Manager. The Manager without. For example, if the user query is “training
compiles the lists of URLs from the engines, merges courses of diploma in computer science” so the pure
them, deletes duplicates, and finally produces a list query is “training courses diploma computer
of URLs which satisfy the queries submitted. As science”. These keywords are stored in a vector
documents arrive, the Processing Manager classifies called terms-vector.
URLs and stores these results into the Document This sub routine is written using Visual Basic
Database. Script and ASP.
The Process Manager uses two types of features
extraction, latent semantic analysis and web page
features selection, for web page clustering using our 3.2 Look-Up Table
conceptual clustering technique (see section 5). The This table is a two-dimensional array in which the
latent semantic analysis (see section 4) extracts first column consists of all the tables (or themes)
common semantic relations between keywords and a available in the system’s database. The cells in each
document. The web page feature selection extracts row contain the tables (themes) related to the first
the text features from a given web page for inputting entry in each row. The last filled entry in each row is
to our clustering technique. Latent semantic analysis indicated by “*” to show that this is the last entry.
is used to find the semantic relations between This is due to the various numbers of sub themes
keywords, and between documents. The latent related to each theme. We have used “$” to indicate
semantic analysis method projects terms and a that this theme is a final sub theme.
document into a vector space to find latent For an experiment, we created a database
containing three main themes: education, business
Var and computer. The database is created in Microsoft
x, i : integer
Access. There are 59 tables in total for all the three
word, query : string
find : Boolean
themes.
begin Now consider the business theme along with its
x=0 only three related sub themes, which are economics,
Size = 0 //size of the array of keywords of query. finance and marketing. If we trace the finance sub
query = bring the user’s query from the HTML theme, we get the following sub themes: banking,
interface “project.asp” accounting, planning, investment services, etc. If we
while (query is not “ “) trace the theme investment service, the related sub
begin themes beneath it are: brokerages and socially
x = find the position of the first responsible investment. If we continue with
white space “ “ in query
brokerages, we get bonds and money as a final sub
word=starting from left of query
take the sub string from the first index till (x-1)
theme. This technique is used for all the tables to get
find=check if word is a noise ”false if it’s a noise” to the final sub themes. The following shows the
if find then scheme that was stated above.
begin
keywordArray (Size) = word Business theme
increment Size by 1 Lookup (9, 1) = "business"
end // if Lookup (9, 2) = "economics"
query=update query to start from Lookup (9, 3) = "finance_and_investment"
x till length of query Lookup (9, 4) = "marketing_and_advertising"
end //while .
end.
Proceedings of the 11th WSEAS International Conference on SYSTEMS, Agios Nikolaos, Crete Island, Greece, July 23-25, 2007 370
The search_row () sub routine accepts an integer 4 Latent Semantic Analysis (LSA)
representing the row number and a string. It returns a Latent semantic analysis applies a vector space
Boolean value true if the string is found in the concept [5]. All keywords and documents form a
indicated row, otherwise it returns false. two-dimension term-document matrix; singular value
decomposition is used to decompose the term-
While (lookup (row number, count) != "*") && document matrix to obtain the semantic features.
(string not found)
Suppose that X is a term-document matrix, which is
begin
if some letters in lookup (row number, count) which is an i×j matrix; where i is the number of
match some letters from string (partial matching) then keywords and j the number of documents. Each
return true element X[i,j] is the number of occurrences of
Else keyword i in document j. The SVD computes the
Increment count matrix singular value decomposition of X. It uses the
end eigenvalue and eigenvector to reduce the dimensions
of the original data, filtering irrelevant information.
Finally, the findtheme () is a search algorithm The original matrix has a high dimension. An SVD
takes each keyword from the vector of keywords one can reduce the original high term-document matrix
at a time. It searches the lookup table by passing dimensions to a low term-document matrix. The
each keyword to first_col (). The findtheme() returns Matlab command [U,S,V] = svd(X) produces a
the row number. If it is a final sub-theme, the system diagonal matrix S of the same dimension as X, with
retrieves the required fields (Description and URL). nonnegative diagonal elements in decreasing order,
Otherwise, the system should carry out a further and unitary matrices U and V so that X = U*S*V'.
check. If the keyword cannot be found in the
database, the system should expand the database
using initially this keyword as a title. If the keyword 5 Clustering Technique
exists in the database, the system will check if the Our clustering technique is computationally efficient
next keyword in the vector of keywords in the row of and has high classification accuracy [6]. The basic
the previous keyword. If the keyword is there, the idea of the method is simple yet effective. It uses a
process is repeated using the remaining keywords. control-generate-test strategy, which involves the
But if the next keyword doesn’t exist in that row, generation of a set of dynamic filters (concepts) and
then the system has to create a new theme. then testing these concepts on given data vectors
(components) for detecting clusters of these
3.5 Expanding the Database components.
The database should be expanded when the query The clustering technique is based on using the
entered by the user has not found in the local identity matrix (I), i.e. a matrix that contains only the
database. The system should, therefore, adds a new base vectors, rather than using a random matrix
table (theme) with query as an initial name of the (RM). These vectors are, therefore, considered as the
new table. The system seeds for producing the required filtering concepts.
starts with empty new theme (no records). The Using this method, the algorithm operates in O (N)
records will be added manually. The following code time; where N is the number of base vectors. It
is responsible for this issue: provides great speed and efficiency improvement
Proceedings of the 11th WSEAS International Conference on SYSTEMS, Agios Nikolaos, Crete Island, Greece, July 23-25, 2007 372
over the first method and over the methods currently Clusters generation
used for clustering. 6. At the start of iteration i, select a concept
vector (cvi) and its associated distance (bdi) at a
Concepts construction: random.
Prepare a suitable identity matrix I of base 7. Find the pattern vectors that satisfy the
vectors and the data matrix that is required to be inequality, given below, and place them in the
divided into groups. The number of the elements concept group i (CGi)
Let adi= ||pvk – cvi||; k=0, 1, 2, …, n be the
(patterns) of the data matrix should constitute distance after the alignment and bdi is the
the dimensions of the identity matrix I. Thus, if distance before the alignment. For a pattern
the data matrix has n elements, then the vector k to be placed in group i, it must satisfy
dimensions of the identity matrix I should be the following inequality:
n×n. ||bdi – adi||<Ti where Ti=adi*∝;
1. Use the data matrix to produce a category 0<∝<1.
matrix or pattern matrix PM=[pvi]T of pattern 8. Stop when you reach an empty pattern
vectors (pvi). This requires a transformation of matrix (PM). Otherwise, increase ∝ slightly.
data or smoothing. We have used log10 for the 9. Reset PM and CGi for all i, and repeat from
transformation. This step is not necessary but it step 1.
provides further performance enhancement and,
as stated before, it can only be carried out if we
know in advance the categories of each attribute 6 Conclusion and Further Work
in the data matrix. This paper has presented a prototype system to
2. If the number of columns of the pattern improve the efficiency of the Web search engines.
matrix (PM) is less than the number of columns
The main goal of the search tool described here is to
of the identity matrix I, then fill the missing data
of PM with averages to produce a new matrix, overcome the drawbacks found in the current search
namely IM. This step is required in order to have engines, which are related mainly to their being one-
the same dimensionality for both PM and I. time, unintelligent search engines. The novelty of the
3. Use the identity matrix I (or IM), and the design presented lies in its ability to support
pattern matrix (PM), and go throughout the intelligent (automatic) approaches. The retrieved
following steps to generate a concept matrix documents are grouped into clusters using our
CM=[cvi] T of concept vectors along with its clustering technique that accepts as input the web
associated distance vector DV= [dvi] T. features and the semantic features produced by LSA.
Through this approach, the users are always
For i =0 to the last pattern in PM do
presented with updated information and are never
Begin
For j=0 to the last random vector in IM do flooded with duplicate information every time a
Begin query is submitted.
a. Find the closest base vector bvj in IM to pvi in PM.
b. Calculate the distance before alignment (bdi): References:
bdi = ||bvj – pvi||, and save the Result in BD=[bdi]T [1] http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/
c. Align bvj and pvi as follows: [2] http://www.netscape.com/
bvj = bvj + α ||bvj – pvi||; 0<α<1. [3] Armstrong et al. WebWatcher: A Learning
d. Replace bvi with the result of the alignment to
Apprentice for the World Wide Web.
produce a candidate concept.
End;
[4] H. Lieberman. Letizia: An Agent that Assista
End; Web Browsing , IJCAI'95, 1995.
4. Remove the base vectors that are not used [5] H. Ramadhan and K. Shihab. Improving the
from IM and save the result as a concept matrix Engineering of Search Engines for the Web,
(CM) along with the distance vector (BD). Proceedings of the Intl. Conference on Internet
5. Eliminate the redundant concepts; concepts Computing, 2002, pp 188-193, USA.
are considered redundant if the same clustering [6] Landauer, T. K., Foltz, P. W., & Laham, D.
result can be reached with using only the Introduction to Latent Semantic Analysis,
remaining concepts. The elimination should be Discourse Processes, 25, 1998, pp. 259-284.
carried out as follows:
[7] Shihab, K. Improving Clustering Performance by
a. Let adi= ||pvk – cvi||; k=0, 1, 2, …, n be the
distance after the alignment.
Using Feature Selection and Extraction
Techniques, Journal of Intelligent Systems, Vol.
b. For all i and j such that i ≠ j if ||cvi - cvji|| ≤ adi, 13, No. 3, 2004, pp 249-273, 2004.
then eliminate cvji