Data Mining Tools
The development and application of data mining algorithms requires the use of
powerful software tools. As the number of available tools continues to grow, the
choice of the most suitable tool becomes increasingly difficult. This paper attempts
to support the decision-making process by discussing the historical development
and presenting a range of existing state-of-the-art data mining and related tools.
Furthermore, we propose criteria for the tool categorization based on different
user groups, data structures, data mining tasks and methods, visualization and
interaction styles, import and export options for data and models, platforms, and
license policies. These criteria are then used to classify data mining tools into
nine different types. The typical characteristics of these types are explained and
a selection of the most important tools is categorized. This paper is organized
as follows: the first section Historical Development and State-of-the-Art highlights
the historical development of data mining software until present; the criteria to
compare data mining software are explained in the second section Criteria for
Comparing Data Mining Software. The last section Categorization of Data Mining
Software into Different Types proposes a categorization of data mining software
and introduces typical software tools for the different types.
© 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 00 1–13 DOI: 10.1002/widm.24
∗Correspondence to: ralf.mikut.kit.edu
Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, GERMANY
HISTORICAL DEVELOPMENT AND STATE-OF-THE-ART
Data mining has a long history, with strong roots in statistics, artificial intelligence, machine learning, and database research.1, 2 The term ‘data mining’ can be found relatively early, as in the article of Lovell,3 published in the 1980s. Advancements in this field were accompanied by the development of related software tools, starting with mainframe programs for statistical analysis in the early 1950s, and leading to today's large variety of stand-alone, client/server, and web-based software-as-a-service solutions.
Following the original definition given in Ref 1, data mining is a step in the knowledge discovery from databases (KDD) process that consists of applying data analysis and discovery algorithms to produce a particular enumeration of patterns (or models) across the data. In that same article, KDD is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Sometimes, the wider KDD definition is used synonymously for data mining. This wider interpretation is especially popular in the context of software tools because most such tools support the complete KDD process and not just a single step.
Today, a large number of standard data mining methods are available (see Refs 4 and 5 for detailed descriptions). From a historical perspective, these methods have different roots. One early group of methods was adopted from classical statistics: the focus was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include methods from Bayesian decision theory, regression theory, and principal component analysis. Another group of methods, such as decision trees and rule-based systems, stemmed from artificial intelligence. The term ‘machine learning’ includes methods such as support vector machines and artificial neural networks. There are several different and sometimes overlapping categorizations; for example, fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as computational intelligence.6
The typical life cycle of new data mining methods begins with theoretical papers based on in-house software prototypes, followed by public or on-demand software distribution of successful algorithms as research prototypes. Then, either special commercial or open-source packages containing a family of similar algorithms are developed or the algorithms are integrated into existing open-source or commercial packages. Many companies have tried to promote their own stand-alone packages, but only
few have reached notable market shares. The life cycle of some data mining tools is remarkably short. Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger ones, leading to a renaming and integration of product lines.
The largest commercial success stories resulted from the step-wise integration of data mining methods into established commercial statistical tools. Companies such as SPSS, founded in 1975 with precursors from 1968, or SAS, founded in 1976, have been offering statistical tools for mainframe computers since the 1970s. These tools were later adapted to personal computers and client/server solutions for larger customers. With the increasing popularity of data mining, algorithms such as artificial neural networks or decision trees were integrated in the main products, and specialized data mining companies such as Integral Solutions Ltd. (acquired in 1998 by SPSS) were acquired to obtain access to data mining tools such as Clementine. During these periods, renaming of tools and company mergers played an important role; for example, the tool Clementine (SPSS) was renamed PASW Modeler, and is now available as IBM SPSS Modeler after the acquisition of SPSS by IBM in 2009. In general, tools of this statistical branch are now very popular with the user groups in business application and applied research.
Concurrently, many companies offering business intelligence products have integrated data mining solutions into their database products; one example is Oracle Data Mining (established in 2002). Many of these products also result from the acquisition and integration of specialized data mining companies.
In 2008, the worldwide market for business intelligence (i.e., software and maintenance fees) was 7.8 billion USD, including 1.5 billion USD in so-called ‘advanced analytics’, containing data mining and statistics.7 This sector grew by 12.1% between 2007 and 2008, with large players including companies such as SAS (33.2%, tool: SAS Enterprise Miner), SPSS (14.3%, since 2009 an IBM company; tool: IBM SPSS Modeler), Microsoft (1.7%, tool: SQL Server Analysis Services), Teradata (1.5%, tool: Teradata Database, former name TeraMiner), and TIBCO (1.4%, tool: TIBCO Spotfire).
Open-source libraries have also become very popular since the 1990s. The most prominent example is the Waikato Environment for Knowledge Analysis (WEKA), see Ref 8. WEKA started in 1994 as a C++ library, with its first public release in 1996. In 1999, it was completely rebuilt as a Java package; since that time, it has been regularly updated. In addition, WEKA components have been integrated in many other open-source tools such as Pentaho, RapidMiner, and KNIME.
A large group of research prototypes is based on script-oriented mathematical programs such as MATLAB (commercial) and R (open source). Such mathematical programs were not originally focused on data mining, but contain many useful mathematical and visualization functions that support the implementation of data mining algorithms. Recently, graphical user interfaces such as those for R (e.g., Rattle) and MATLAB (e.g., Gait-CAD, established in 2006) can be used as integration packages (INT) for many single, open-source algorithms.
As the number of available tools continues to grow, the choice of one special tool becomes increasingly difficult for each potential user. This decision-making process can be supported by criteria for the categorization of data mining tools. Different categorizations of tools were proposed in Refs 9–12. The last two comprehensive criteria-based surveys date back to 1999, covering 43 software packages in Ref 9, and 2003, with 33 tools in Ref 12 (a regularly updated Excel table is available on request from the same author, with 63 tools in 2009). In addition, smaller reviews have been published, containing 12 open-source tools,13 eight noncommercial tools,14 nine commercial tools,10 and five commercial tools using benchmark datasets.15
In the past 10–15 years, data mining has become a technology in its own right, is also well established in business intelligence (BI), and continues to exhibit steadily increasing importance in the technology and life sciences sectors. For example, data mining was a key factor supporting methodological breakthroughs in genetics.16 It is a promising technology for future fields such as text mining and semantic search engines,17 learning in autonomous systems such as humanoid robots18 and cars, chemoinformatics,19 and others.
Various standardization initiatives have been introduced for data mining processes and for data and model interfaces, such as the Cross Industry Standard Process for Data Mining (CRISP-DM) for industrial data mining,20 and approaches focused on clinical and biological applications.21 A survey of such initiatives is provided in Ref 22, and a large variety of standard data mining methods are described in comprehensive standard textbooks;4, 5 however, new methods, especially for data streams,23 extremely large datasets, graph mining,24, 25 text mining,17 and others have been proposed in the last few years. In the near future, methods for high-dimensional problems such as image retrieval26 and video mining27 will also be optimized and embedded into powerful tools.
© 2011 John Wiley & Sons, Inc. Volume 00, January/February 2011
WIREs Data Mining and Knowledge Discovery
Dim., maximum dimensionality; s, number of features; N, number of examples; K, number of samples in a time series. Lower dimensions of the dataset can occur for problems with only one feature (s = 1) or only one example (N = 1).
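The dimensionalities abbreviated in this note (N examples, s features, K samples per time series) can be made concrete with a small sketch. The Python snippet below is only an illustration with made-up values; the helper names `shape`, `effective_dimensionality`, and `to_sparse` are hypothetical and do not belong to any tool discussed in this paper. It shows how a dataset with s = 1 or N = 1 ends up with a lower dimensionality, and how a sparse table can be stored by keeping only its nonzero cells.

```python
from typing import Dict, List, Tuple


def shape(data) -> Tuple[int, ...]:
    """Return the nested-list shape, e.g. (N, s) or (N, s, K)."""
    dims = []
    while isinstance(data, list) and data:
        dims.append(len(data))
        data = data[0]
    return tuple(dims)


def effective_dimensionality(dims: Tuple[int, ...]) -> int:
    """Count the axes with more than one entry (s = 1 or N = 1 lowers the count)."""
    return sum(1 for d in dims if d > 1)


def to_sparse(table: List[List[float]]) -> Dict[Tuple[int, int], float]:
    """Keep only nonzero cells of an N x s table as {(example, feature): value}."""
    return {
        (i, j): value
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value != 0
    }


# Classical feature table: N = 2 examples, s = 3 features (made-up values).
feature_table = [[52000.0, 41.0, 2.0], [38000.0, 29.0, 1.0]]

# One scalar time series with K = 5 samples, stored as N = 1, s = 1, K = 5.
single_series = [[[0.1, 0.4, 0.3, 0.8, 0.5]]]

# N = 2 examples of s = 2-dimensional vector time series with K = 4 samples.
vector_series = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 0.5, 1.0, 1.5]],
    [[2.0, 2.0, 1.0, 0.0], [1.0, 1.0, 0.5, 0.0]],
]

# Sparse shopping-basket table: N = 3 customers, s = 4 items, mostly zeros.
baskets = [[0, 1, 0, 0], [0, 0, 0, 2], [0, 0, 0, 0]]

print(shape(feature_table), effective_dimensionality(shape(feature_table)))  # (2, 3) 2
print(shape(single_series), effective_dimensionality(shape(single_series)))  # (1, 1, 5) 1
print(shape(vector_series), effective_dimensionality(shape(vector_series)))  # (2, 2, 4) 3
print(to_sparse(baskets))  # {(0, 1): 1, (1, 3): 2}
```

In this sketch the sparse form stores 2 cells instead of 12, which is the kind of saving in memory and computation a tool can gain by exploiting sparsity.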
CRITERIA FOR COMPARING DATA MINING SOFTWARE
Survey
In the following, different criteria for the comparison of data mining software are introduced. These criteria are based on user groups, data structures, data mining tasks and methods, import and export options, and license models. A detailed overview of the different tools is given later in this paper and as an Excel table in the additional material; however, some specific information about a tool is discussed here if that tool is unique with respect to some aspect of the proposed criteria. The complete list of tools is provided toward the end of this paper.
User Groups
There are many different data mining tools available, which fit the needs of quite different user groups:
• Business applications: This group uses data mining as a tool for solving commercially relevant business applications such as customer relationship management, fraud detection, and so on. This field is mainly covered by a variety of commercial tools providing support for databases with large datasets, and deep integration in the company’s workflow.
• Applied research: This group applies data mining to research problems, for example, in technology and life sciences. Here, users are mainly interested in tools with well-proven methods, a graphical user interface (GUI), and interfaces to domain-related data formats or databases.
• Algorithm development: This group develops new data mining algorithms and requires tools both to integrate its own methods and to compare these with existing methods. The necessary tools should contain many competing algorithms.
• Education: For education at universities, data mining tools should be very intuitive, with a comfortable interactive user interface, and inexpensive. In addition, they should allow the integration of in-house methods during programming seminars.
Data Structures
An important criterion is the dimensionality of the underlying raw data in the processed dataset (Table 1). The first data mining applications were focused on handling datasets represented as two-dimensional feature tables. In this classical format, a dataset consists of a set of N examples (e.g., clients of an insurance company) with s features containing real values or (usually integer-coded) classes or symbols (e.g., income, age, number of contracts, and the like). This format is supported by nearly all existing tools. In some cases, the dataset can be sparse, with only a few nonzero features, such as a list of s shopping items for N different customers. The computational and memory effort can be reduced if a tool exploits this sparse structure.
Some structured datasets are characterized by the same dimensionality. As an example, sample documents in most text mining problems are represented by the frequency of words or so-called n-grams (a group of n subsequent characters in a document).28
The most prominent format having a higher dimensionality contains time series as elements, leading to dataset dimensions between one (i.e., only one example of a time series with K samples) and three (i.e., N different examples of s-dimensional vector time series with K samples). Typical tasks are forecasting of
future values, finding typical patterns in a time series, or finding similar time series by clustering. The analysis of time series plays an important role in many different applications, including prediction of stock markets, forecasting of energy consumption and other markets, and quality supervision in production, and is also supported by most data mining tools.
With a similar dimensionality, different kinds of structured data exist such as gene sequences (spatial structure), spectrograms or mass spectrograms (structured by frequencies or masses), and others. Only a few tools support these types of structured data explicitly, but some tools for time series analysis can be rearranged to cope with these problems.
A more recent trend is the application of data mining methods to images and videos.26, 27 The main challenge is the handling of extremely large raw datasets, up to gigabytes and terabytes, caused by the high dimensionality of the examples. Typical applications are microscopic images in biology and medicine, camera-based sensors in quality control and robotics, biometrics, and security. Such datasets must be split into metadata (with links to image and video files), handled in a main dataset, and files that contain the main part of the data. Until now, these problems were normally solved using a combination of tools: the initial tool (e.g., ImageJ or ITK) would process the images or videos, resulting in segmented images and extracted features describing the segments; a second tool would solve data mining problems handling the extracted features as a classical table or time series.
Another format leading to image-like dimensions includes graphs that can be represented as adjacency matrices, describing the connections between different nodes of a graph. Graph mining has powerful applications,24, 25 such as characterizing social networks and chemical structures; however, only a few such tools exist, including Pegasus and Proximity.
Tasks and Methods
The most important tasks in data mining are
• supervised learning, with a known output variable in the dataset, including
   (a) classification: class prediction, with the variable typically coded as an integer output;
   (b) fuzzy classification: with gradual memberships with values between 0 and 1 applied to the different classes;
   (c) regression: prediction of a real-valued output variable, including special cases of predicting future values in a time series out of recent or past values;
• unsupervised learning, without a known output variable in the dataset, including
   (a) clustering: finds and describes groups of similar examples in the data using crisp or fuzzy clustering algorithms;
   (b) association learning: finds typical groups of items that occur frequently together in examples;
• semisupervised learning, whereby the output variable is known only for some examples.
Each of these tasks consists of a chain of low-level tasks. Furthermore, some low-level tasks can act as stand-alone tasks; for example, by identifying in a large dataset elements that possess a high similarity to a given example. Examples of such low-level tasks are:
• data cleaning (e.g., outlier detection);
• data filtering (e.g., smoothing of time series);
• feature extraction from time series, images, videos, and graphs (e.g., consisting of segmentation and segment description for images, characteristic values such as community structures in graphs);
• feature transformation (e.g., mathematical operations such as logarithms, or dimension reduction by linear or nonlinear combinations using principal component analysis, factor analysis, or independent component analysis);
• feature evaluation and selection (e.g., by filter or wrapper methods);
• computation of similarities and detection of the most similar elements in terms of examples or features (e.g., by k-nearest-neighbor methods and correlation analysis);
• model validation (cross validation, bootstrapping, statistical relevance tests, and complexity measures);
• model fusion (mixture of experts); and
• model optimization (e.g., by evolutionary algorithms).
For almost all of these tasks, a large variety of classical statistical methods, including classifiers
using estimated probability density functions, factor analysis, and others, and newer machine learning methods, such as artificial neural networks, fuzzy models, rough sets, support vector machines, decision trees, and random forests, are available. In addition, optimization methods such as evolutionary algorithms can assist with the identification of model structures and parameters. The related methods are described in survey articles29 or textbooks4, 5 and are not summarized in this paper.
Not all of the data mining methods are available in all software tools. The following list contains a subjective evaluation of the frequency with which specific methods are incorporated in the different tools:
• Frequent: classifiers using estimated probability density functions, correlation analysis, statistical feature selection, and relevance tests;
• In many tools: decision trees, clustering, regression, data cleaning, data filtering, feature extraction, principal component analysis, factor analysis, advanced feature evaluation and selection, computation of similarities, artificial neural networks, model cross validation, and statistical relevance tests;
• In some tools: fuzzy classification, association learning and mining frequent item sets, independent component analysis, bootstrapping, complexity measures, model fusion, support vector machines, k-nearest-neighbor methods, Bayesian networks, and learning of crisp rules;
• Infrequent: random forests30 (contained in Waffles, Random Forests, WEKA, and all of its derivatives), learning of fuzzy systems (contained in KnowledgeMiner, See5, and Gait-CAD), rough sets31 (in ROSETTA and Rseslibs), and model optimization by evolutionary algorithms14 (in KEEL, ADaM, and D2K).
Interaction and Visualization
There are three main types of interaction between a user and a data mining tool:
• pure textual interface using a programming language: difficult to handle, but easily automated;
• graphical interface with a menu structure: easy to handle, but not so easily automated; and
• graphical user interface where the user selects ‘function blocks’ or algorithms from a palette of choices, defines parameters, places them in a work area, and connects them to create complete data mining streams or workflows: a good compromise, but difficult to handle for large workflows.
Mixtures of these forms arise if macros of menu items can be recorded for workflows or if additional blocks in a workflow can be implemented using a programming language. Automation (scripting) is extremely important for routine tasks, especially with large datasets, because the workload of the user is reduced. Almost all tools provide powerful visualization techniques for the presentation of data mining results; in particular, tools for business application and applied research are able to generate complete reports containing the most important results in a readable form for users lacking explicit data mining skills. Interactive methods can support an explorative data analysis. An example is a method called brushing that enables the user to select specific data points in a figure or subsets of data (e.g., nodes of a decision tree) and highlight these data points in other plots.
Import and Export of Data and Models
The ease with which data and models can be imported and exported among different software tools plays a crucial role in the functionality of data mining tools. First, the data are normally generated and hosted by different sources such as databases or software associated with measurement devices. In business applications, interfaces to databases such as Oracle or any database supporting the Structured Query Language (SQL) standard are the most common means of importing data. Because almost all other non-data-mining tools support export as text or Excel files, formats such as CSV (comma-separated values) are frequently used as import formats for data mining tools. In addition, almost all software packages have proprietary binary or textual file and exchange formats for data and models, e.g., the Attribute-Relation File Format in WEKA (WEKA standard).
In order to import and export developed models as components in other processes and systems, the XML-based standard PMML32 was developed by the Data Mining Group (http://www.dmg.org) and is supported by many companies such as IBM and SAS. Another standard initiative is the Object Linking and Embedding Database (OLE DB, sometimes written as OLEDB or OLE-DB) for data mining, an API (Application Programming Interface) designed
by Microsoft to access different types of data stored in a uniform manner (http://msdn.microsoft.com/en-us/library/ms146608.aspx). OLE DB is a set of interfaces implemented using the Component Object Model (COM). For data exchange among different tools, another initiative deals with Java Specification Requests for data mining: versions 1.0 (JSR 73, final release in 2004: http://www.jcp.org/en/jsr/detail?id=73) and 2.0 (JSR 247, public review as last activity in 2006: http://www.jcp.org/en/jsr/detail?id=247) define an extensible Java API for data mining systems. The consortium includes many related companies, such as Oracle, SAS, SPSS (now IBM), SAP, and others; recent overviews can be found in Refs 33 and 34. Another interesting feature is the export of an executable runtime version of developed models. Often, such runtime versions do not require a more expensive development license and can be run free of charge, or at least with a cheaper runtime license.
Platforms
Data mining tools can be subdivided into stand-alone and client/server solutions. Client/server solutions dominate, especially in products designed for business users. They are available for different platforms, including Windows, MAC OS, Linux, or special mainframe supercomputers. There is a growing number of Java-based systems that are platform-independent for users in research and applied research.
Further expected trends are an increasing number of web interfaces providing data mining as SAAS (software as a service, with tools like Data Applied) and a stronger support of client/server-based data mining solutions on grids (e.g., the tool ADaM; see the steps toward standardization in Ref 35); however, both trends carry the risk of violating privacy policies because the protection of data is difficult and many companies are very careful with sensitive data.
Licenses
There exists a wide variety of data mining tools with commercial and open-source licenses. This is particularly true in the business application user group, where commercial software is very attractive due to high software stability, good coupling with other commercial tools for data warehouses, included software maintenance, and the possibility of user training for sophisticated topics. For all other user groups, there is a strong trend toward open-source software, but different types of licenses exist for this (see, e.g., the survey in Ref 36). The main advantages of open-source software are faster bug fixes and methodological improvements, potential for integration with other tools, the existence of developer and user communities, faster adoption of methods to other innovative applications, and the fair comparison of new data mining algorithms with alternative ones. These advantages attract mainly users of applied research, development, and education; however, open-source tools are beginning to migrate even into business user groups,37 particularly when additional commercial services such as training or maintenance are offered (e.g., Pentaho).
The most popular type of open-source license is the GNU General Public License of the Free Software Foundation (GNU-GPL or GPL: http://www.fsf.org). It permits free redistribution, integration in other packages, and modification of the software as long as all subsequent users receive the same level of freedom (so-called ‘copyleft’). This restriction guarantees that all software containing GNU-GPL components must be licensed under GNU-GPL. Weaker forms are licenses that are free for academic use, but not for business users.
Mixed forms of licenses occur especially if open-source software is used to expand commercial tools such as MATLAB.
The Excel table (see Section Supplementary Information) lists 195 recent tools (119 commercial tools, 67 open-source tools, and nine tools with mixed license models).
CATEGORIZATION OF DATA MINING SOFTWARE INTO DIFFERENT TYPES
Following the criteria from the previous section, different types of similar data mining tools can be found. The typical characteristics of these types are explained in this section. Matching of the different types and user groups and the number of recent tools are summarized in Table 2. In addition, related tools and their group membership are summarized in separate tables for commercial (Tables 3 and 4) and for free and open-source data mining tools (Table 5). In these tables, very popular tools are marked in bold. The popularity was measured by
• the 20 most frequently used tools for real projects from ‘Data Mining/Analytic Tools Used Poll 2010’ of KDnuggets with 912 voters (http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html); [top 10 tools were RapidMiner, R, Excel (here ignored), KNIME, Pentaho/WEKA, SAS, MATLAB,
[...] Microsoft SQL Server];
• all main products of vendors with more than 1% market share in the section ‘Advanced Analytics Tools’ from Ref 7; and
• the most popular image processing tools (ITK and ImageJ) from the author’s own experience to cover this field.
[TABLE 2: Matching Between Different User Groups and Tool Types with Number of Recent Tools in the Excel Table (see Section Supplementary Information; tools belonging to two [...]). Surviving fragments: tool types Business Intelligence, Data Mining, Specialties (56), Research Prototypes (17), Solutions (19); user group Algorithm development. The individual +/0/− ratings cannot be assigned to cells.]
In this paper, the following nine types are proposed:
• Data mining suites (DMS) focus largely on data mining and include numerous methods. They support feature tables and time series [...] Spotfire.
• Business intelligence packages (BIs) have no [...] taho).
[...]
large and extendable set of algorithms and visualization routines. They support feature tables and time series, and have at least import formats for images. The user interaction often requires programming skills in a scripting language. MATs are attractive to users in algorithm development and applied research because data mining algorithms can be rapidly implemented, mostly in the form of extensions (EXT) and research prototypes (RES). MAT packages exist as commercial (MATLAB and R-PLUS) or open-source tools (R, Kepler). In principle, spreadsheet software such as Excel may also be categorized here, but it is not included in this paper. Most tools are available for different platforms but have weaknesses in database coupling.
• Integration packages (INTs) are extendable bundles of many different open-source algorithms, either as stand-alone software (mostly based on Java, such as KNIME, the GUI version of WEKA, KEEL, and TANAGRA) or as a kind of larger extension package for tools from the MAT type (such as Gait-CAD, PRTools for MATLAB, and RWEKA for R). Import and export support standard formats, but database support is quite weak. Most tools are available for different platforms and include a GUI. Mixtures of license models occur if open-source integration packages are based on commercial tools from the MAT type. With these characteristics, the tools are attractive to algorithm developers and users in applied research due to expandability and rapid comparison with alternative tools, and due to easy integration of application-specific methods and import options.
• EXT are smaller add-ons for other tools such as Excel, MATLAB, R, and so forth, with limited but quite useful functionality. Here, only a few data mining algorithms are implemented
such as artificial neural networks for Excel (Forecaster XL and XLMiner) or MATLAB (Matlab Neural Networks Toolbox). There are commercial or open-source versions, but licenses for the basic tools must also be available. The user interaction is the same as for the basic tool, for example, by using a programming language (MATLAB) or by embedding the extension in the menu (Excel).
• Data mining libraries (LIBs) implement data mining methods as a bundle of functions. These functions can be embedded in other software tools using an Application Programming Interface (API) for the interaction between the software tool and the data mining functions. A graphical user interface is missing, but some functions can support the integration of specific visualization tools. They are often written in Java or C++ and the solutions are platform independent. Open-source examples are WEKA (Java-based), MLC++ (C++ based), Java Data Mining Package, and LibSVM (C++ and Java-based) for support vector machines. A commercial example is Neurofusion for C++, whereas XELOPES (Java, C++, and C) uses different license models. LIB tools are mainly attractive to users in algorithm development and applied research, for embedding data mining software into larger data mining software tools or specific solutions for narrow applications.
• Specialties (SPECs) are similar to DMS tools, but implement only one special family of methods such as artificial neural networks. They contain many elaborate visualization techniques for such methods. SPECs are rather simple to handle as compared with other tools, which eases the use of such tools in education. Examples are CART for decision trees, Bayesia Lab for Bayesian networks, C5.0, WizRule, Rule Discovery System for rule-based systems, MagnumOpus for association analysis, and JavaNNS, Neuroshell,
NeuralWorks Predict, RapAnalyst for artificial neural networks.
• RES are usually the first (and not always stable) implementations of new and innovative algorithms. They contain only one or a few algorithms with restricted graphical support and without automation support. Import and export functionality is rather restricted and database coupling is missing or weak. RES tools are mostly open source. They are mainly attractive to users in algorithm development and applied research, specifically in very innovative fields. Examples are GIFT for content-based image retrieval, Himalaya for mining maximal frequent item sets, sequential pattern mining and scalable linear regression trees, Rseslibs for rough sets, and Pegasus for graph mining. Early versions of today’s popular tools such as WEKA and RapidMiner started in this category and shifted later to other categories such as DMS.
• Solutions (SOLs) describe a group of tools that are customized to narrow application fields such as text mining (GATE), image
processing (ITK, ImageJ), drug discovery (Molegro Data Modeler), image analysis in
microscopy (CellProfilerAnalyst), or mining gene expression profiles (Partek Genomics
Suite, MEGA). The advantage of these solutions is the excellent support of
domain-specific feature extraction techniques, evaluation measures, visualizations, and
import formats. The level of data mining methods ranges from rather weak support
(particularly in image processing) to highly developed algorithms. In some cases, more
general tools from types DMS or INT also support specific domains (KNIME, Gait-CAD for
peptide chemoinformatics). There are many commercial and open-source solutions.

A large variety of tools actually requires a fuzzy categorization with gradual
memberships to different types. Examples include tools that bundle a set of different
algorithms (LIB) with an additional GUI acting as an INT, and DMS tools that include
special methods for narrow application fields. In these cases, a main type was assigned
and the other fuzzy memberships are discussed in the Excel table in the additional
material section.
The following kinds of tools were not included in the comparison:
• unavailable software (e.g., owing to company mergers or discontinued development),
which is only listed in the Excel table in the additional material,
• software for the handling of data warehouses without explicit focus on data mining,
• software for the manual design and application of rule-based systems,
• software for spreadsheet calculation with a focus on office users, and
• customized solutions for very narrow fields.

[...] different types of tools are presented: DMS, BIs, MATs, INT, EXT, SPECs, RES,
LIBs, and SOLs. They vary in many characteristics, such as intended user groups,
possible data structures, implemented tasks and methods, interaction styles, import and
export capabilities, platforms, and license policies. Recent tools are able to handle
large datasets with single features, time series, and even unstructured data such as
texts; however, there is a lack of powerful and generalized mining tools for
multidimensional datasets such as images and videos.

SUPPLEMENTARY INFORMATION
An additional Excel table contains a list of 269 tools (195 recent and 74 historical
tools, version from July 22, 2010). For each tool, the following information is
available:
• toolbox name,
• company or group (with the term ‘various’ for open-source projects without an
explicit developer),
• categorization into types with abbreviations for Research Prototypes (RES), Data
Mining Libraries (LIB), Business Intelligence Packages (BI), Data Mining Software (DMS),
Specialties (SPEC), Mathematical Packages (MAT), Extensions (EXT), Integration Packages
(INT), Solutions (SOL),
• Giraud-Carrier: coverage by the Excel table in Ref 12 (as of February 3, 2010) with
the values 1 (included in a detailed categorization), −1 (excluded), empty field (not
mentioned),
• remarks,
• web link,
• activity: 1 (relevant tool, included in the comparison), 0 (less relevant), −1 (not
available),
• license: OS, open source; CO, commercial; CO/OS, different versions available.
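
The column list above can be read as a simple record schema. The following Python
sketch is only an illustration of that reading: the attribute names, the validation
rules, and the helper functions compared_tools and count_by_type are assumptions made
for this example and are not taken from the Excel table itself. The rows in the usage
part are made-up placeholders, not entries of the table.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Type and license abbreviations as listed in the text.
TOOL_TYPES = {"RES", "LIB", "BI", "DMS", "SPEC", "MAT", "EXT", "INT", "SOL"}
LICENSES = {"OS", "CO", "CO/OS"}
ACTIVITY_VALUES = {1, 0, -1}


@dataclass(frozen=True)
class ToolRecord:
    """One row of the tool table (attribute names are hypothetical)."""
    name: str                      # toolbox name
    group: str                     # company or group, 'various' if none
    tool_type: str                 # main type, e.g. "LIB" or "DMS"
    giraud_carrier: Optional[int]  # 1, -1, or None for an empty field
    remarks: str
    web_link: str
    activity: int                  # 1 relevant, 0 less relevant, -1 not available
    license: str                   # "OS", "CO", or "CO/OS"

    def __post_init__(self) -> None:
        # Reject values outside the code lists described in the text.
        if self.tool_type not in TOOL_TYPES:
            raise ValueError(f"unknown tool type: {self.tool_type!r}")
        if self.giraud_carrier not in (1, -1, None):
            raise ValueError(f"invalid Giraud-Carrier mark: {self.giraud_carrier!r}")
        if self.activity not in ACTIVITY_VALUES:
            raise ValueError(f"invalid activity value: {self.activity!r}")
        if self.license not in LICENSES:
            raise ValueError(f"unknown license code: {self.license!r}")


def compared_tools(records: Iterable[ToolRecord]) -> List[ToolRecord]:
    """Keep only tools marked as included in the comparison (activity == 1)."""
    return [r for r in records if r.activity == 1]


def count_by_type(records: Iterable[ToolRecord]) -> Counter:
    """Count how many records fall into each tool type."""
    return Counter(r.tool_type for r in records)


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    rows = [
        ToolRecord("Tool A", "various", "LIB", None, "", "http://example.org/a", 1, "OS"),
        ToolRecord("Tool B", "Vendor X", "DMS", 1, "", "http://example.org/b", 0, "CO"),
    ]
    print(count_by_type(compared_tools(rows)))
```

Validating the codes at construction time keeps later filtering simple, because every
record that exists is already known to use one of the listed abbreviations.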
• The Data Mine: http://www.the-data-mine.com/bin/view/software
• The Open Directory Project:
http://www.dmoz.org/Computers/Software/Databases/Data_Mining
• Sourceforge (very popular platform for open-source solutions, search for ‘data
mining’ to find data mining tools hosted at Sourceforge): http://sourceforge.net/
• Kernel Machines (especially to get a list of software for support vector machines):
http://www.kernel-machines.org/software
• Tools for Bayesian Networks: www.cs.helsinki.fi/research/cosco/Bnets.
ACKNOWLEDGEMENTS
The authors thank C. Giraud-Carrier for a copy of an Excel table containing a large set of data
mining tools, the anonymous reviewers for many comments and suggestions, and R. A. Klady
for the critical proofreading of the manuscript.
REFERENCES
1. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in
databases. AI Mag 1996, 17:37–54.
2. Smyth P. Data mining: Data analysis on a grand scale? Stat Methods Med Res 2000,
9:309–327.
3. Lovell MC. Data mining. Rev Econ Stat 1983, 65:1–11.
4. Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco: Morgan
Kaufmann; 2006.
5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New York: Springer; 2008.
6. Engelbrecht AP. Computational Intelligence - An Introduction. Chichester: John
Wiley; 2007.
7. Vesset D, McDonough B. Worldwide business intelligence tools 2008 vendor shares,
IDC Competitive Analysis Report (2009).
8. Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B, Witten I. Weka: A machine
learning workbench for data mining. Data Mining and Knowledge Discovery Handbook: A
Complete Guide for Practitioners and Researchers. New York: Springer; 2005, 1305–1314.
9. Goebel M. A survey of data mining and knowledge discovery software tools, ACM
SIGKDD Explorations. Newsletter 1999, 1:20–33.
10. Wang J, Hu X, Hollister K, Zhu D. A comparison and scenario analysis of leading
data mining software. Int J Knowl Manage 2008, 4:17–34.
11. Wang J, Chen Q, Yao J. Data mining software. In: Tomei L, ed., Encyclopedia of
Information Technology Curriculum Integration. Hershey, PA: Information Science
Publishing; 2008, 173–178.
12. Giraud-Carrier C, Povel O. Characterising data mining software. Intell Data Anal
2003, 7:181–192.
13. Chen X, Ye Y, Williams G, Xu X. A survey of open source data mining systems,
Lecture Notes in Computer Science 2007, 4819:3–14.
14. Alcalá-Fdez J, Sánchez L, García S, del Jesus M, Ventura S, Garrell J, Otero J,
Romero C, Bacardit J, Rivas V, et al. KEEL: A software tool to assess evolutionary
algorithms for data mining problems. Soft Comput 2009, 13:307–318.
15. Haughton D, Deichmann J, Eshghi A, Sayek S, Teebagy N, Topi H. A review of software
packages for data mining. Am Stat 2003, 57:290–310.
16. Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D, Evangelista C, Kim I, Soboleva
A, Tomashevsky M, Edgar R. NCBI GEO: Mining tens of millions of expression
profiles–database and tools update. Nucleic Acids Res 2007, D760.
17. Weiss S. Text mining: predictive methods for analyzing unstructured information.
New York: Springer-Verlag; 2005.
18. Dillmann R. Teaching and learning of robot tasks via observation of human
performance. Rob Auton Syst 2004, 47:109–116.
19. Leach A, Gillet V. An Introduction to Chemoinformatics. Springer; 2007.
20. Shearer C. The CRISP-DM model: The new blueprint for data mining. J Data
Warehousing 2000, 5:13–22.
21. Mikut R, Reischl M, Burmeister O, Loose T. Data mining in medical time series.
Biomed Tech 2006, 51:288–293.
22. Grossman R, Hornick M, Meyer G. Data mining standards initiatives. Commun ACM 2002,
45:61.
23. Muthukrishnan S. Data Streams: Algorithms and Applications. Hanover, MA: Now
Publishers Inc.; 2005.
24. Chakrabarti D, Faloutsos C. Graph mining: laws, generators, and algorithms. ACM
Comput Surv (CSUR) 2006, 38:1–69.
25. Borgelt C. Graph mining: An overview. Proc., 19. Workshop Computational
Intelligence. Karlsruhe, Germany: KIT Scientific Publishing; 2009, 189–203.
26. Datta R, Joshi D, Li J, Wang J. Image retrieval: Ideas, influences, and trends of
the new age. ACM Comput Surv (CSUR) 2008, 40:1–60.
27. Zhu X, Wu X, Elmagarmid A, Feng Z, Wu L. Video data mining: Semantic indexing and
event detection from the association perspective. IEEE Trans Knowl Data Eng 2005,
17:665–677.
28. Damashek M. Gauging similarity with n-Grams: Language-independent categorization of
text. Science 1995, 267:843–848.
29. Jain AK, Duin RPW, Mao J. Statistical pattern recognition: A review. IEEE Trans
Pattern Anal Mach Intell 2000, 22:4–36.
30. Breiman L. Random forests. Mach Learn 2001, 45:5–32.
31. Pawlak Z. Rough sets and intelligent data analysis. Inf Sci 2002, 147:1–12.
32. Pechter R. What’s PMML and what’s new in PMML 4.0?, ACM SIGKDD Explorations.
Newsletter 2009, 11:19–25.
33. Hornick M, Marcadé E, Venkayala S. Java Data Mining: Strategy, Standard, and
Practice: A Practical Guide for Architecture, Design, and Implementation. San
Francisco: Morgan Kaufmann Publishers Inc.; 2006.
34. Anand S, Grobelnik M, Herrmann F, Hornick M, Lingenfelder C, Rooney N, Wettschereck
D. Knowledge discovery standards. Artificial Intelligence Review 2007, 27:21–56.
35. Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining
on grids: Services, tools, and applications. IEEE Trans Syst Man Cybern B Cybern 2004,
34:2451–2465.
36. Sonnenburg S, Braun M, Ong C, Bengio S, Bottou L, Holmes G, LeCun Y, Müller K,
Pereira F, Rasmussen C, et al. The need for open source software in machine learning.
J Mach Learn Res 2007, 8:2443–2466.
37. Bitterer A. Open-source business intelligence tool production deployments will grow
five-fold through 2010, Gartner RAS Research Note G00171189 (2009).