The end of the 20th century saw an explosion of investment in connecting computer systems--within organizations, between organizations, and between organizations and individuals--and a corresponding explosion in the on-line collection of data. Now that we have entered the 21st century we face the problem of extracting useful knowledge from these data, which is becoming increasingly difficult as volume and complexity push traditional analysis methods beyond their limits. Knowledge Discovery and Data Mining (KDD) techniques address this problem. The annual ACM SIGKDD International Conference on Knowledge Discovery and Data Mining brings together researchers and practitioners focusing on new developments and challenges in KDD. KDD-2001, the seventh conference in the series, was held in San Francisco, on August 26-29, 2001.
We received 203 research-paper submissions from twenty-three countries. Each submitted research paper was reviewed by at least three members of the program committee. This period of independent review was followed by discussion among the reviewers, and when necessary we requested additional reviews from other experts. Twenty papers were selected to appear in the program as full papers (10%), and another thirty-two were selected to appear in the program as poster papers (16%). The Industry Track received thirty-four submissions, from which eleven (32%) were selected. In addition, the Program Committee referred three research papers to the Industry Track, of which one was selected. For the Industry Track, papers were selected because they presented useful knowledge for practitioners, or because they bridged a gap between industry and research.
The program for KDD-2001 also included three keynote lectures, five invited talks by well-known practitioners (as part of the Industry Track), and three panel discussions on topics of current interest. There were six tutorials, geared both for novices and for experts, plus six specialized workshops on cutting-edge research issues. The 2001 KDDCUP competition focused on problems of bioinformatics and drug design. And, finally, the program included dozens of exhibits of products from vendors and from research projects.
Challenges for knowledge discovery in biology
Bioinformatics is the study of information flow in biology. Interest in the field has exploded in the last 10 years with the emergence of techniques for large-scale experimental data collection, including genome sequencing, gene expression analysis, ...
Extracting targeted data from the web
Tom M. Mitchell is author of the textbook "Machine Learning" (McGraw Hill, 1997), President of the American Association for Artificial Intelligence and a member of the National Research Council's Computer Science and Telecommunications Board. He is Vice ...
Mass collaboration and data mining
Mass Collaboration is a new "P2P"-style approach to large-scale knowledge sharing, with applications in customer support, focused community development, and capturing knowledge distributed within large organizations. Effectively supporting this paradigm ...
Applications of generalized support vector machines to predictive modeling
The work of the Russian mathematician Vladimir Vapnik (AT&T Labs) enables us to go back to the roots of theoretical statistics, leaving behind Fisher's parameters in favor of the general approaches started in the 1930s by Glivenko-Cantelli-Kolmogorov. ...
Data mining: are we there yet?
Data mining started its move out of the statistics and machine learning ghettos and into the mainstream almost 10 years ago. With great fanfare and a large influx of venture capital, data mining was going to change the very nature of business. Yet data ...
Mining e-commerce data: the good, the bad, and the ugly
Organizations conducting Electronic Commerce (e-commerce) can greatly benefit from the insight that data mining of transactional and clickstream data provides. Such insight helps not only to improve the electronic channel (e.g., a web site), but it is ...
Recommender systems in commerce and community
Recommender systems have been revolutionizing the way shoppers and information seekers find what they want. We will study some of the tremendous successes and spectacular failures of recommenders in E-commerce to understand the causes of the success or ...
The "DGX" distribution for mining massive, skewed data
Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent ...
Data mining criteria for tree-based regression and classification
This paper is concerned with the construction of regression and classification trees that are more adapted to data mining applications than conventional trees. To this end, we propose new splitting criteria for growing trees. Conventional splitting ...
Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction
Transaction data is ubiquitous in data mining applications. Examples include market basket data in retail commerce, telephone call records in telecommunications, and Web logs of individual page-requests at Web sites. Profiling consists of using ...
GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces
The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance ε. One of the most efficient algorithms for processing similarity-...
Mining the network value of customers
One of the major applications of data mining is in helping companies determine which potential customers to market to. If the expected profit from a customer is greater than the cost of marketing to her, the marketing action for that customer is ...
Empirical Bayes screening for multi-item associations
This paper considers the framework of the so-called "market basket problem", in which a database of transactions is mined for the occurrence of unusually frequent item sets. In our case, "unusually frequent" involves estimates of the frequency of each ...
Proximal support vector machine classifiers
Instead of a standard support vector machine (SVM) that classifies points by assigning them to one of two disjoint half-spaces, points are classified by assigning them to the closest of two parallel planes (in input or feature space) that are pushed ...
Data mining with sparse grids using simplicial basis functions
Recently we presented a new approach [18] to the classification problem arising in data mining. It is based on the regularization network approach but, in contrast to other methods which employ ansatz functions associated to data points, we use a grid ...
Mining time-changing data streams
Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months ...
Visualizing multi-dimensional clusters, trends, and outliers using star coordinates
Interactive visualizations are effective tools in mining scientific, engineering, and business data to support decision-making activities. Star Coordinates is proposed as a new multi-dimensional visualization technique, which supports various ...
Ensemble-index: a new approach to indexing large databases
The problem of similarity search (query-by-content) has attracted much research interest. It is a difficult problem because of the inherently high dimensionality of the data. The most promising solutions involve performing dimensionality reduction on ...
Robust space transformations for distance-based operations
For many KDD operations, such as nearest neighbor search, distance-based clustering, and outlier detection, there is an underlying κ-D data space in which each tuple/object is represented as a point in the space. In the presence of differing scales, ...
Molecular feature mining in HIV data
We present the application of Feature Mining techniques to the Developmental Therapeutics Program's AIDS antiviral screen database. The database consists of 43576 compounds, which were measured for their capability to protect human cells from HIV-1 ...
Discovering unexpected information from your competitors' web sites
Ever since the beginning of the Web, finding useful information from the Web has been an important problem. Existing approaches include keyword-based search, wrapper-based information extraction, Web query and user preferences. These approaches ...
Personalization from incomplete data: what you don't know can hurt
Clickstream data collected at any web site (site-centric data) is inherently incomplete, since it does not capture users' browsing behavior across sites (user-centric data). Hence, models learned from such data may be subject to limitations, the nature ...
Probabilistic query models for transaction data
We investigate the application of Bayesian networks, Markov random fields, and mixture models to the problem of query answering for transaction data sets. We formulate two versions of the querying problem: the query selectivity estimation (i.e., finding ...
Extracting collective probabilistic forecasts from web games
Game sites on the World Wide Web draw people from around the world with specialized interests, skills, and knowledge. Data from the games often reflects the players' expertise and will to win. We extract probabilistic forecasts from data obtained from ...
Tri-plots: scalable tools for multidimensional data mining
We focus on the problem of finding patterns across two large, multidimensional datasets. For example, given feature vectors of healthy and of non-healthy patients, we want to answer the following questions: Are the two clouds of points separable? What ...
Efficient discovery of error-tolerant frequent itemsets in high dimensions
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (...
Learning and making decisions when costs and probabilities are both unknown
In many data mining domains, misclassification costs are different for different examples, in the same way that class membership probabilities are example-dependent. In these domains, both costs and probabilities are unknown for test examples, so both ...
Data mining case study: modeling the behavior of offenders who commit serious sexual assaults
This paper looks at the use of a Self Organizing Map (SOM) to link records of crimes of serious sexual attacks. Once linked, a profile can be derived of the offender(s) responsible. The data was drawn from the major crimes database at the National ...
A human-computer cooperative system for effective high dimensional clustering
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Therefore, techniques have recently been proposed to find clusters in hidden subspaces of the data. However, since the behavior ...