Mining interesting knowledge from weblogs: a survey

Published: 01 June 2005


Web Usage Mining is that area of Web Mining which deals with the extraction of interesting knowledge from logging information produced by Web servers. In this paper we present a survey of the recent developments in this area that is receiving increasing attention from the Data Mining community.


Michael G. Murphy

This paper surveys the field of Web usage mining, a sub-area of Web mining, which is in turn a sub-area of data mining. Web usage mining deals with the extraction of knowledge from server log files. Such Web logs, or weblogs, are mostly textual logs, collected when users access Web servers, and are stored in one of several commonly used formats. (Note that weblogs in this context are not blogs, as that term has come to be known recently.) The paper opens with an introduction, followed by sections on data sources, data preprocessing, knowledge discovery techniques, applications, software support, moving from techniques to applications, privacy issues, and future trends, and closes with a brief summary. There are also 112 references. There are no figures, but there is one helpful table that provides references to carefully selected papers, with representative applications, techniques, and data sources.

Data sources for Web usage mining are Web servers, proxy servers, and Web clients. Preprocessing includes data cleaning, identifying and reconstructing users' sessions, retrieving information about page content and structure, and data formatting. Knowledge discovery techniques for research in Web usage mining, as opposed to the statistical analysis typical of commercial applications, focus on association rules, sequential patterns, and clustering. Since the general goal of Web usage mining is to gather useful information about Web users' navigation patterns, the results produced by mining Web logs can be used to personalize the delivery of Web content, improve user navigation by means of prefetching and caching, improve Web design, and improve customer satisfaction in e-commerce.
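To make the preprocessing steps concrete, here is a minimal sketch of data cleaning and session reconstruction from server log lines. It is an illustration only, not a method taken from the surveyed paper. It assumes the logs are in Common Log Format, that a user can be approximated by the client host field, and that a gap of more than 30 minutes between requests starts a new session. The names `parse`, `sessionize`, and `IGNORED_SUFFIXES` are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')

# Assumption: requests for embedded resources are noise for usage mining.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def parse(line):
    """Return (host, timestamp, url, status), or None if the line is malformed."""
    m = CLF.match(line)
    if m is None:
        return None
    host, ts, _method, url, status = m.groups()
    when = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, url, int(status)


def sessionize(lines, timeout=timedelta(minutes=30)):
    """Clean the log, then split each host's requests into sessions.

    Assumption: one host equals one user, and a gap longer than
    `timeout` between consecutive requests starts a new session.
    """
    by_host = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue  # data cleaning: drop unparseable lines
        host, when, url, status = rec
        if status >= 400 or url.lower().endswith(IGNORED_SUFFIXES):
            continue  # data cleaning: drop errors and embedded resources
        by_host[host].append((when, url))

    sessions = []
    for host, hits in by_host.items():
        hits.sort(key=lambda hit: hit[0])  # order requests by time
        current, last = [], None
        for when, url in hits:
            if last is not None and when - last > timeout:
                sessions.append((host, current))
                current = []
            current.append(url)
            last = when
        if current:
            sessions.append((host, current))
    return sessions
```

Each resulting session is a list of page URLs in visit order, which is the kind of input that association rule, sequential pattern, and clustering techniques typically start from. Identifying users by host alone is a simplification: proxies and shared addresses can merge several users into one.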
Software support has evolved over the last several years: e-commerce Web usage mining has become part of integrated customer relationship management (CRM) solutions, simple Web log analyzers serve general usage, and an open source tool, the Web Utilization Miner (WUM), serves the research community. The privacy issue, in general, is still being considered by the Web usage mining community. Future trends appear to be tied to the emergence and proliferation of the semantic Web concept. The paper serves as a good survey of Web usage mining. It is recommended for anyone wanting to understand the essentials of this rapidly emerging field.

Online Computing Reviews Service

