Data Mining Tools

Advanced Review

Ralf Mikut∗ and Markus Reischl
The development and application of data mining algorithms requires the use of
powerful software tools. As the number of available tools continues to grow, the
choice of the most suitable tool becomes increasingly difficult. This paper attempts
to support the decision-making process by discussing the historical development
and presenting a range of existing state-of-the-art data mining and related tools.
Furthermore, we propose criteria for the tool categorization based on different
user groups, data structures, data mining tasks and methods, visualization and
interaction styles, import and export options for data and models, platforms, and
license policies. These criteria are then used to classify data mining tools into
nine different types. The typical characteristics of these types are explained and
a selection of the most important tools is categorized. This paper is organized
as follows: the first section Historical Development and State-of-the-Art highlights
the historical development of data mining software until present; the criteria to
compare data mining software are explained in the second section Criteria for
Comparing Data Mining Software. The last section Categorization of Data Mining
Software into Different Types proposes a categorization of data mining software
and introduces typical software tools for the different types. © 2011 John Wiley & Sons,
Inc. WIREs Data Mining Knowl Discov 2011 00 1–13 DOI: 10.1002/widm.24

∗Correspondence to: ralf.mikut.kit.edu
Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, GERMANY
wires.wiley.com/widm, Volume 00, January/February 2011

HISTORICAL DEVELOPMENT AND STATE-OF-THE-ART

Data mining has a long history, with strong roots in statistics, artificial intelligence, machine learning, and database research.1, 2 The term ‘data mining’ can be found relatively early, as in the article of Lovell,3 published in the 1980s. Advancements in this field were accompanied by development of related software tools, starting with mainframe programs for statistical analysis in the early 1950s, and leading to a large variety of stand-alone, client/server, and web-based software as today’s service solution.

Following the original definition given in Ref 1, data mining is a step in the knowledge discovery from databases (KDD) process that consists of applying data analysis and discovery algorithms to produce a particular enumeration of patterns (or models) across the data. In that same article, KDD is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Sometimes, the wider KDD definition is used synonymously for data mining. This wider interpretation is especially popular in the context of software tools because most such tools support the complete KDD process and not just a single step.

Today, a large number of standard data mining methods are available (see Refs 4 and 5 for detailed descriptions). From a historical perspective, these methods have different roots. One early group of methods was adopted from classical statistics: the focus was changed from the proof of known hypotheses to the generation of new hypotheses. Examples include methods from Bayesian decision theory, regression theory, and principal component analysis. Another group of methods stemmed from artificial intelligence, such as decision trees, rule-based systems, and others. The term ‘machine learning’ includes methods such as support vector machines and artificial neural networks. There are several different and sometimes overlapping categorizations; for example, fuzzy logic, artificial neural networks, and evolutionary algorithms are summarized as computational intelligence.6

The typical life cycle of new data mining methods begins with theoretical papers based on in-house software prototypes, followed by public or on-demand software distribution of successful algorithms as research prototypes. Then, either special commercial or open source packages containing a family of similar algorithms are developed or the algorithms are integrated into existing open source or commercial packages. Many companies have tried to promote their own stand-alone packages, but only
few have reached notable market shares. The life cycle of some data mining tools is remarkably short. Typical reasons include internal marketing decisions and acquisitions of specialized companies by larger ones, leading to a renaming and integration of product lines.

The largest commercial success stories resulted from the step-wise integration of data mining methods into established commercial statistical tools. Companies such as SPSS, founded in 1975 with precursors from 1968, or SAS, founded in 1976, have been offering statistical tools for mainframe computers since the 1970s. These tools were later adapted to personal computers and client/server solutions for larger customers. With the increasing popularity of data mining, algorithms such as artificial neural networks or decision trees were integrated in the main products and specialized data mining companies such as Integrated Solutions Ltd. (acquired in 1998 by SPSS) were acquired to obtain access to data mining tools such as Clementine. During these periods, renaming of tools and company mergers played an important role; for example, the tool Clementine (SPSS) was renamed as PASW Modeler, and is now available as IBM SPSS Modeler after the acquisition of SPSS by IBM in 2009. In general, tools of this statistical branch are now very popular for the user groups in business application and applied research.

Concurrently, many companies offering business intelligence products have integrated data mining solutions into their database products; one example is Oracle Data Mining (established in 2002). Many of these products also result from the acquisition and integration of specialized data mining companies.

In 2008, the worldwide market for business intelligence (i.e., software and maintenance fees) was 7.8 billion USD, including 1.5 billion USD in so-called ‘advanced analytics’, containing data mining and statistics.7 This sector grew 12.1% between 2007 and 2008, with large players including companies such as SAS (33.2%, tool: SAS Enterprise Miner), SPSS (14.3%, since 2009, an IBM company; tool: IBM SPSS Modeler), Microsoft (1.7%, tool: SQL Server Analysis Services), Teradata (1.5%, tool: Teradata Database, former name TeraMiner), and TIBCO (1.4%, tool: TIBCO Spotfire).

Open-source libraries have also become very popular since the 1990s. The most prominent example is Waikato Environment for Knowledge Analysis (WEKA), see Ref 8. WEKA started in 1994 as a C++ library, with its first public release in 1996. In 1999, it was completely rebuilt as a JAVA package; since that time, it has been regularly updated. In addition, WEKA components have been integrated in many other open-source tools such as Pentaho, RapidMiner, and KNIME.

A large group of research prototypes are based on script-oriented mathematical programs such as MATLAB (commercial) and R (open source). Such mathematical programs were not originally focused on data mining, but contain many useful mathematical and visualization functions that support the implementation of data mining algorithms. Recently, graphical user interfaces such as those utilized for R (e.g., Rattle) and MATLAB (e.g., Gait-CAD, established in 2006) can be used as integration packages (INT) for many single, open-source algorithms.

As the number of available tools continues to grow, the choice of one special tool becomes increasingly difficult for each potential user. This decision-making process can be supported by criteria for the categorization of data mining tools. Different categorizations of tools were proposed in Refs 9–12. The last two comprehensive criteria-based surveys date back to 1999, covering 43 software packages in Ref 9, and 2003, with 33 tools in Ref 12 (a regularly updated Excel table is available on request from the same author with 63 tools in 2009). In addition, smaller reviews have been published, containing 12 open-source tools,13 eight noncommercial tools,14 nine commercial tools,10 and five commercial tools using benchmark datasets.15

In the past 10–15 years, data mining has become a technology in its own right, is also well established in business intelligence (BI), and continues to exhibit steadily increasing importance in technology and life sciences sectors. For example, data mining was a key factor supporting methodological breakthroughs in genetics.16 It is a promising technology for future fields such as text mining and semantic search engines,17 learning in autonomous systems (as with humanoid robots18 and cars), chemoinformatics,19 and others.

Various standardization initiatives have been introduced for data mining processes and for data and model interfaces, such as the Cross Industry Standard Process for Data Mining for industrial data mining20 and approaches focused on clinical and biological applications.21 A survey of such initiatives is provided in Ref 22, and a large variety of standard data mining methods are described in comprehensive standard text books;4, 5 however, new methods, especially for data streams,23 extremely large datasets, graph mining,24, 25 text mining,17 and others have been proposed in the last few years. In the near future, methods for high-dimensional problems such as image retrieval26 and video mining27 will also be optimized and embedded into powerful tools.

TABLE 1 Maximum Dimensions of Datasets for Different Types of Problems

Data            Dim.   Structure for Each of the N Examples
Feature table   2      s features (e.g., age and income)
Texts           2      frequency of words or n-grams (vector-space approach)
Time series     3      s time series with K time samples
Sequences       3      s sequences of length L (e.g., mass spectrograms and genes)
Images          4      s images with pixels
Graphs          4      s graphs with adjacency matrices
3D images       5      s images with pixels and slices
Videos          5      s videos containing images with pixels and K time samples
3D videos       6      like videos, but with additional slices

Dim., maximum dimensionality; s, number of features; N, number of examples; K, number of samples in a time series. Lower dimensions of the dataset can occur for problems with only one feature (s = 1) or one example (N = 1).
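
To make the dimension counts in Table 1 concrete, the following minimal sketch writes each data type as the shape of its raw-data array and counts the axes. It uses only the Python standard library. All sizes, the axis order, and the names SHAPES, max_dim, and effective_dim are illustrative assumptions and are not taken from the table; the sketch only mirrors the rule that a dimension disappears when s = 1 or N = 1.

```python
# Illustrative sizes only (assumptions, not values from the paper).
N = 100              # number of examples
S = 3                # number of features (s in Table 1)
K = 50               # time samples per time series
SEQ_LEN = 200        # sequence length (L in Table 1)
ROWS, COLS = 64, 48  # pixels per image
SLICES = 10          # slices of a 3D image
NODES = 30           # nodes per graph (adjacency matrix is NODES x NODES)
VOCAB = 1000         # number of distinct words or n-grams

# Shape of the raw-data array for each data type in Table 1.
# The axis order is an assumption; only the number of axes matters here.
SHAPES = {
    "feature table": (N, S),
    "texts": (N, VOCAB),
    "time series": (N, S, K),
    "sequences": (N, S, SEQ_LEN),
    "images": (N, S, ROWS, COLS),
    "graphs": (N, S, NODES, NODES),
    "3D images": (N, S, ROWS, COLS, SLICES),
    "videos": (N, S, K, ROWS, COLS),
    "3D videos": (N, S, K, ROWS, COLS, SLICES),
}


def max_dim(shape):
    """Maximum dimensionality: the number of axes of the raw-data array."""
    return len(shape)


def effective_dim(shape):
    """Dimensionality after dropping axes of length one (e.g., s = 1 or N = 1)."""
    return sum(1 for size in shape if size > 1)


if __name__ == "__main__":
    for name, shape in SHAPES.items():
        print(f"{name:14s} dim = {max_dim(shape)}  shape = {shape}")
    # A single univariate time series (N = 1, s = 1) keeps only the time axis.
    print(effective_dim((1, 1, K)))  # prints 1
```

With these assumed shapes, max_dim reproduces the Dim. column of Table 1, and effective_dim shows how a dataset with only one example or one feature falls below the maximum.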

CRITERIA FOR COMPARING DATA MINING SOFTWARE

Survey
In the following, different criteria for comparison of data mining software are introduced. These criteria are based on user groups, data structures, data mining tasks and methods, import and export options, and license models. A detailed overview of the different tools is given later in this paper and as an Excel table in the additional material; however, some specific information about tools is discussed if a specific tool is unique to some aspects of the proposed criteria. The complete list of tools is provided toward the end of this paper.

User Groups
There are many different data mining tools available, which fit the needs of quite different user groups:

• Business applications: This group uses data mining as a tool for solving commercially relevant business applications such as customer relationship management, fraud detection, and so on. This field is mainly covered by a variety of commercial tools providing support for databases with large datasets, and deep integration in the company’s workflow.
• Applied research: A user group that applies data mining to research problems, for example, in technology and life sciences. Here, users are mainly interested in tools with well-proven methods, a graphical user interface (GUI), and interfaces to domain-related data formats or databases.
• Algorithm development: This group develops new data mining algorithms and requires tools both to integrate its own methods and to compare these with existing methods. The necessary tools should contain many competing algorithms.
• Education: For education at universities, data mining tools should be very intuitive, with a comfortable interactive user interface, and inexpensive. In addition, they should allow the integration of in-house methods during programming seminars.

Data Structures
An important criterion is the dimensionality of the underlying raw data in the processed dataset (Table 1). The first data mining applications were focused on handling datasets represented as two-dimensional feature tables. In this classical format, a dataset consists of a set of N examples (e.g., clients of an insurance company) with s features containing real values or, usually integer-coded, classes or symbols (e.g., income, age, number of contracts, and the like). This format is supported by nearly all existing tools. In some cases, the dataset can be sparse, with only a few nonzero features, such as a list of s shopping items for N different customers. The computational and memory effort can be reduced if a tool exploits this sparse structure.

Some structured datasets are characterized by the same dimensionality. As an example, sample documents in most text mining problems are represented by the frequency of words or so-called n-grams (a group of n subsequent characters in a document).28

The most prominent format having a higher dimensionality contains time series as elements, leading to dataset dimensions between one (i.e., only one example of a time series with K samples) and three (i.e., N different examples of s-dimensional vector time series with K samples). Typical tasks are forecasting of
future values, finding typical patterns in a time series or finding similar time series by clustering. The analysis of time series plays an important role in many different applications, including prediction of stock markets, forecasting of energy consumption and other markets, and quality supervision in production, and is also supported by most data mining tools.

With a similar dimensionality, different kinds of structured data exist, such as gene sequences (spatial structure), spectrograms or mass spectrograms (structured by frequencies or masses), and others. Only a few tools support these types of structured data explicitly, but some tools for time series analysis can be adapted to cope with these problems.

A more recent trend is the application of data mining methods for images and videos.26, 27 The main challenge is the handling of extremely large raw datasets, up to gigabytes and terabytes, caused by the high dimensionality of the examples. Typical applications are microscopic images in biology and medicine, camera-based sensors in quality control and robotics, biometrics, and security. Such datasets must be split into metadata, handled in a main dataset with links to the image and video files, and the files themselves, which contain the main part of the data. Until now, these problems were normally solved using a combination of tools: the initial tool (e.g., ImageJ and ITK) would process the images or videos, resulting in segmented images and extracted features describing the segments; a second tool would solve data mining problems handling the extracted features as a classical table or time series.

Another format leading to image-like dimensions includes graphs that can be represented as adjacency matrices, describing the connection between different nodes of a graph. Graph mining has powerful applications,24, 25 such as characterizing social networks and chemical structures; however, only a few such tools exist, including Pegasus and Proximity.

Tasks and Methods
The most important tasks in data mining are

• supervised learning, with a known output variable in the dataset, including
  (a) classification: class prediction, with the variable typically coded as an integer output;
  (b) fuzzy classification: with gradual memberships with values in-between 0 and 1 applied to the different classes;
  (c) regression: prediction of a real-valued output variable, including special cases of predicting future values in a time series out of recent or past values;
• unsupervised learning, without a known output variable in the dataset, including
  (a) clustering: finds and describes groups of similar examples in the data using crisp or fuzzy clustering algorithms;
  (b) association learning: finds typical groups of items that occur frequently together in examples;
• semisupervised learning, whereby the output variable is known only for some examples.

Each of these tasks consists of a chain of low-level tasks. Furthermore, some low-level tasks can act as stand-alone tasks; for example, identifying elements in a large dataset that possess a high similarity to a given example. Examples of such low-level tasks are:

• data cleaning (e.g., outlier detection);
• data filtering (e.g., smoothing of time series);
• feature extraction from time series, images, videos, and graphs (e.g., consisting of segmentation and segment description for images, characteristic values such as community structures in graphs);
• feature transformation (e.g., mathematical operations, including logarithms, dimension reduction by linear or nonlinear combinations by a principal component analysis, factor analysis or independent component analysis);
• feature evaluation and selection (e.g., by filter or wrapper methods);
• computation of similarities and detection of the most similar elements in terms of examples or features (e.g., by k-nearest-neighbor methods and correlation analysis);
• model validation (cross validation, bootstrapping, statistical relevance tests and complexity measures);
• model fusion (mixture of experts); and
• model optimization (e.g., by evolutionary algorithms).

For almost all of these tasks, a large variety of classical statistical methods, including classifiers
using estimated probability density functions, factor analysis and others, and newer machine learning methods, such as artificial neural networks, fuzzy models, rough sets, support vector machines, decision trees, and random forests, are available. In addition, optimization models such as evolutionary algorithms can assist with the identification of model structures and parameters. The related methods are described in survey articles29 or textbooks4, 5 and are not summarized in this paper.

Not all of the data mining methods are available in all software tools. The following list contains a subjective evaluation of the frequency with which specific methods are incorporated in the different tools:

• Frequent: classifiers using estimated probability density functions, correlation analysis, statistical feature selection, and relevance tests;
• In many tools: decision trees, clustering, regression, data cleaning, data filtering, feature extraction, principal component analysis, factor analysis, advanced feature evaluation and selection, computation of similarities, artificial neural networks, model cross validation, and statistical relevance tests;
• In some tools: fuzzy classification, association learning and mining frequent item sets, independent component analysis, bootstrapping, complexity measures, model fusion, support vector machines, k-nearest-neighbor methods, Bayesian networks, and learning of crisp rules;
• Infrequent: random forests30 (contained in Waffles, Random Forests, WEKA, and all of its derivatives), learning of fuzzy systems (contained in KnowledgeMiner, See5, and Gait-CAD), rough sets31 (in ROSETTA and Rseslibs), and model optimization by evolutionary algorithms14 (in KEEL, ADaM, and D2K).

Interaction and Visualization
There are three main types of interaction between a user and a data mining tool:

• pure textual interface using a programming language: difficult to handle, but easily automated;
• graphical interface with a menu structure: easy to handle, but not so easily automated; and
• graphical user interface where the user selects ‘function blocks’ or algorithms from a palette of choices, defines parameters, places them in a work area, and connects them to create complete data mining streams or workflows: a good compromise, but difficult to handle for large workflows.

Mixtures of these forms arise if macros of menu items can be recorded for workflows or if additional blocks in a workflow can be implemented using a programming language. Automation (scripting) is extremely important for routine tasks, especially with large datasets, because the workload of the user is reduced. Almost all tools provide powerful visualization techniques for the presentation of data mining results; this applies particularly to tools for business application and applied research, which are able to generate complete reports containing the most important results in a readable form for users lacking explicit data mining skills. Interactive methods can support an explorative data analysis. An example is a method called brushing that enables the user to select specific data points in a figure or subsets of data (e.g., nodes of a decision tree) and highlight these data points in other plots.

Import and Export of Data and Models
The ease with which data and models can be imported and exported among different software tools plays a crucial role in the functionality of data mining tools. First, the data are normally generated and hosted from different sources such as databases or software associated with measurement devices. In business applications, interfaces to databases such as Oracle or any database supporting the Structured Query Language (SQL) standard are the most common means of importing data. Because almost all other non-data-mining tools support export as text or Excel files, formats such as CSV (comma separated values) are frequently used to import data into data mining tools. In addition, almost all software has proprietary binary or textual file and exchange formats for data and models, e.g., the Attribute-Relation File Format in WEKA (WEKA standard).

In order to import and export developed models as components in other processes and systems, the XML-based standard PMML32 was developed by the Data Mining Group (http://www.dmg.org) and is supported by many companies such as IBM and SAS. Another standard initiative is the Object Linking and Embedding Database (OLE DB, sometimes written as OLEDB or OLE-DB) for data mining, an API (Application Programming Interface) designed
by Microsoft to access different types of stored data in a uniform manner (http://msdn.microsoft.com/en-us/library/ms146608.aspx). OLEDB is a set of interfaces implemented using the Component Object Model (COM). For data exchange among different tools, another initiative deals with Java Specification Requests for data mining: versions 1.0 (JSR 73, final release in 2004: http://www.jcp.org/en/jsr/detail?id=73) and 2.0 (JSR 247, public review as last activity in 2006: http://www.jcp.org/en/jsr/detail?id=247) define an extensible Java API for data mining systems. The consortium includes many related companies, such as Oracle, SAS, SPSS (now IBM), SAP, and others; recent overviews can be found in Refs 33 and 34. Another interesting feature is the export of an executable runtime version of developed models. Often, such runtime versions do not require a more expensive development license and can be run free of charge, or at least with a cheaper runtime license.

Platforms
Data mining tools can be subdivided into stand-alone and client/server solutions. Client/server solutions dominate, especially in products designed for business users. They are available for different platforms, including Windows, MAC OS, Linux, or special mainframe supercomputers. There is a growing number of JAVA-based systems that are platform-independent for users in research and applied research.

Further expected trends are an increasing number of web interfaces providing data mining as SAAS (software as a service, with tools like Data Applied) and a stronger support of client/server-based data mining solutions on grids (e.g., the tool ADaM; see steps toward a standardization in Ref 35); however, both trends carry the potential risk of violating privacy policies because the protection of data is difficult and many companies are very careful with sensitive data.

Licenses
There exists a wide variety of data mining tools with commercial and open-source licenses. This is particularly true in the business application user group, where commercial software is very attractive due to high software stability, good coupling with other commercial tools for data warehouses, included software maintenance, and the possibility of user training for sophisticated topics. For all other user groups, there is a strong trend toward open-source software, but different types of licenses exist for this (see, e.g., the survey in Ref 36). The main advantages of open-source software are faster bug fixes and methodological improvements, potential for integration with other tools, the existence of developer and user communities, faster adoption of methods to other innovative applications, and the fair comparison of new data mining algorithms with alternative ones. These advantages attract mainly users of applied research, development, and education; however, open-source tools are beginning to migrate even into business user groups,37 particularly when additional commercial services such as training or maintenance are offered (e.g., Pentaho).

The most popular type of open-source license is the GNU General Public License of the Free Software Foundation (GNU-GPL or GPL: http://www.fsf.org). It permits free redistribution, integration in other packages, and modification of the software as long as all subsequent users receive the same level of freedom (so-called ‘copyleft’). This restriction guarantees that all software containing GNU-GPL components must be licensed under GNU-GPL. Weaker forms are licenses that are free for academic use, but not for business users.

Mixed forms of licenses occur especially if open-source software is used to expand commercial tools such as MATLAB.

The Excel table (see Section Supplementary Information) lists 195 recent tools (119 commercial tools, 67 open source tools, and nine tools with mixed license models).

CATEGORIZATION OF DATA MINING SOFTWARE INTO DIFFERENT TYPES

Following the criteria from the previous section, different types of similar data mining tools can be found. The typical characteristics of these types are explained in this section. Matching of the different types and user groups and the number of recent tools are summarized in Table 2. In addition, related tools and their group membership are summarized in separate tables for commercial tools (Tables 3 and 4) and for free and open-source data mining tools (Table 5). In these tables, very popular tools are marked in bold. The popularity was measured by

• the 20 most frequently used tools for real projects from ‘Data Mining/Analytic Tools Used Poll 2010’ of KDnuggets with 912 voters (http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html); [top 10 tools were RapidMiner, R, Excel (here ignored), KNIME, Pentaho/WEKA, SAS, MATLAB,
IBM SPSS Statistics, IBM SPSS Modeler, and Microsoft SQL Server];
• all main products of vendors with more than 1% market share in the section ‘Advanced Analytics Tools’ from Ref 7; and
• the most popular image processing tools (ITK and ImageJ) from the authors’ own experience to cover this field.

TABLE 2 Matching Between Different User Groups and Tool Types with Number of Recent Tools in the Excel Table (see Section Supplementary Information; tools belonging to two categories are counted twice)

Types                            Number of Recent Tools
Data Mining Suites               46
Business Intelligence Packages   16
Mathematical Packages            5
Integration Packages             8
Extensions                       10
Data Mining Libraries            20
Specialities                     56
Research Prototypes              17
Solutions                        19

Each type is rated for the user groups business applications, applied research, algorithm development, and education. Evaluation, +: especially useful, 0: less useful, −: not useful.

TABLE 3 List of Commercial Tools (Part 1)

Tool                             Type   Link
ADAPA (Zementis)                 DMS    www.zementis.com
Alice (d’Isoft)                  DMS    www.alice-soft.com
Bayesia Lab                      SPEC   www.bayesia.com
C5.0                             SPEC   www.rulequest.com
CART                             SPEC   www.salford-systems.com
Data Applied                     DMS    data-applied.com
DataDetective                    DMS    www.sentient.nl/?dden
DataEngine                       DMS    www.dataengine.de
Datascope                        DMS    www.cygron.hu
DB2 Data Warehouse               BI     www.ibm.com/software/data/infosphere/warehouse
DeltaMaster                      BI     www.bissantz.com/deltamaster
Forecaster XL                    EXT    www.alyuda.com
GhostMiner                       DMS    www.fqs.pl/business intelligence/products/ghostminer
IBM Cognos 8 BI                  BI     www.ibm.com/software/data/cognos/data-mining-tools.html
IBM SPSS Modeler                 DMS    www.spss.com/software/modeling/modeler
IBM SPSS Statistics              MAT    www.spss.com/software/statistics
iModel                           DMS    www.biocompsystems.com/products/imodel
InfoSphere Warehouse             BI     www.ibm.com/software/data/infosphere/warehouse
JMP                              DMS    www.jmpdiscovery.com
KnowledgeMiner                   SPEC   www.knowledgeminer.net
KnowledgeStudio                  DMS    www.angoss.com
KXEN                             DMS    www.kxen.com
Magnum Opus                      SPEC   www.giwebb.com
MATLAB                           MAT    www.mathworks.com
MATLAB Neural Network Toolbox    EXT    www.mathworks.com
Model Builder                    DMS    www.fico.com
ModelMAX                         SOL    www.asacorp.com/products/mmxover.jsp

Very popular tools are marked in bold letters.

In this paper, the following nine types are proposed:

• Data mining suites (DMS) focus largely on data mining and include numerous methods. They support feature tables and time series, while additional tools for text mining are sometimes available. The application focus is wide and not restricted to a special application field, such as business applications; however, coupling to business solutions, import and export of models, reporting, and a variety of different platforms are nonetheless supported. In addition, the producers provide services for adaptation of the tools to the workflows and data structures of the customer. DMS are mostly commercial and rather expensive, but some open-source tools such as RapidMiner exist. Typical examples include IBM SPSS Modeler, SAS Enterprise Miner, Alice d’Isoft, DataEngine, DataDetective, GhostMiner, Knowledge Studio, KXEN, NAG Data Mining Components, Partek Discovery Suite, STATISTICA, and TIBCO Spotfire.
• Business intelligence packages (BIs) have no special focus on data mining, but include basic data mining functionality, especially for statistical methods in business applications. BIs are often restricted to feature tables and time series; large feature tables are supported. They have a highly developed reporting functionality and good support for education, handling, and adaptation to the workflows of the customer. They are characterized by a strong focus on database coupling, and are implemented via a client/server architecture. Most BI software is commercial (IBM Cognos 8 BI, Oracle Data Mining, SAP Netweaver Business Warehouse, Teradata Database, DB2 Data Warehouse from IBM, and PolyVista), but a few open-source solutions exist (Pentaho).
• Mathematical packages (MATs) have no special focus on data mining, but provide a
large and extendable set of algorithms and visualization routines. They support feature tables, time series, and have at least import formats for images. The user interaction often requires programming skills in a scripting language. MATs are attractive to users in algorithm development and applied research because data mining algorithms can be rapidly implemented, mostly in the form of extensions (EXT) and research prototypes (RES). MAT packages exist as commercial (MATLAB and R-PLUS) or open-source tools (R, Kepler). In principle, table calculation software such as Excel may also be categorized here, but it is not included in this paper. Most tools are available for different platforms but have weaknesses in database coupling.
• Integration packages (INTs) are extendable bundles of many different open-source algorithms, either as stand-alone software (mostly based on Java, such as KNIME, the GUI-version of WEKA, KEEL, and TANAGRA) or as a kind of larger extension package for tools from the MAT type (such as Gait-CAD, PRTools for MATLAB, and RWEKA for R). Import and export support standard formats, but database support is quite weak. Most tools are available for different platforms and include a GUI. Mixtures of license models occur if open-source integration packages are based on commercial tools from the MAT type. With these characteristics, the tools are attractive to algorithm developers and users in applied research due to expandability and rapid comparison with alternative tools, and due to easy integration of application-specific methods and import options.
• EXT are smaller add-ons for other tools such as Excel, MATLAB, R, and so forth, with limited but quite useful functionality. Here, only a few data mining algorithms are implemented
TABLE 4 List of Commercial Tools (Part 2)
Tool                                    Type  Link
Molegro Data Modeler                    SOL   www.molegro.com
NAG Data Mining Components              LIB   www.nag.co.uk/numeric/DR/DRdescription.asp
NeuralWorks Predict                     SPEC  www.neuralware.com/products.jsp
Neurofusion                             LIB   www.alyuda.com
Neuroshell                              SPEC  www.neuroshell.com
Oracle Data Mining (ODM)                DMS   www.oracle.com/technology/products/bi/odm/index.html
Partek Discovery Suite                  DMS   www.partek.com/software
Partek Genomics Suite                   SOL   www.partek.com/software
PolyAnalyst                             DMS   www.megaputer.com/polyanalyst.php
PolyVista                               BI    www.polyvista.com
Random Forests                          SPEC  www.salford-systems.com
RapAnalyst                              SPEC  www.raptorinternational.com/rapanalyst.html
R-PLUS                                  MAT   www.experience-rplus.com
SAP Netweaver Business Warehouse (BW)   BI    www.sap.com/platform/netweaver/components/businesswarehouse
SAS Enterprise Miner                    DMS   www.sas.com/products/miner
See5                                    SPEC  www.rulequest.com
SPAD Data Mining                        DMS   eng.spadsoft.com
SQL Server Analysis Services            DMS   www.microsoft.com/sql
STATISTICA                              DMS   www.statsoft.com/products/data-mining-solutions/G259
SuperQuery                              DMS   www.azmy.com
Teradata Database                       BI    www.teradata.com
Think Enterprise Data Miner (EDM)       DMS   www.thinkanalytics.com
TIBCO Spotfire                          DMS   spotfire.tibco.com
Unica PredictiveInsight                 DMS   www.unica.com
WizRule and WizWhy                      SPEC  www.wizsoft.com
XAffinity                               SPEC  www.exclusiveore.com
such as artificial neural networks for Excel (Forecaster XL and XLMiner) or MATLAB (MATLAB Neural Network Toolbox). There are commercial or open-source versions, but licenses for the basic tools must also be available. The user interaction is the same as for the basic tool, for example, by using a programming language (MATLAB) or by embedding the extension in the menu (Excel).
• Data mining libraries (LIBs) implement data mining methods as a bundle of functions. These functions can be embedded in other software tools using an Application Programming Interface (API) for the interaction between the software tool and the data mining functions. A graphical user interface is missing, but some functions can support the integration of specific visualization tools. They are often written in Java or C++ and the solutions are platform independent. Open-source examples are WEKA (Java-based), MLC++ (C++-based), JAVA Data Mining Package, and LibSVM (C++ and Java-based) for support vector machines. A commercial example is Neurofusion for C++, whereas XELOPES (Java, C++, and C) uses different license models. LIB tools are mainly attractive to users in algorithm development and applied research, for embedding data mining software into larger data mining software tools or specific solutions for narrow applications.
• Specialties (SPECs) are similar to DMS tools, but implement only one special family of methods such as artificial neural networks. They contain many elaborate visualization techniques for such methods. SPECs are rather simple to handle as compared with other tools, which eases the use of such tools in education. Examples are CART for decision trees, Bayesia Lab for Bayesian networks, C5.0, WizRule, and Rule Discovery System for rule-based systems, Magnum Opus for association analysis, and JavaNNS, Neuroshell,

TABLE 5 List of Free and Open-Source Tools
Tool                      Type      Link
ADaM∗                     LIB       datamining.itsc.uah.edu/adam
CellProfilerAnalyst       SOL       www.cellprofiler.org/index.htm
D2K∗                      DMS       alg.ncsa.uiuc.edu
Gait-CAD                  INT       sourceforge.net/projects/gait-cad
GATE                      SOL       gate.ac.uk/download
GIFT                      RES       www.gnu.org/software/gift
Gnome Data Mine Tools     DMS       www.togaware.com/datamining/gdatamine
Himalaya                  RES       himalaya-tools.sourceforge.net
ImageJ                    SOL       rsbweb.nih.gov/ij
ITK                       SOL       www.itk.org
JAVA Data Mining Package  LIB       sourceforge.net/projects/jdmp
JavaNNS                   SPEC      www.ra.cs.uni-tuebingen.de/software/JavaNNS/welcome_e.html
KEEL                      INT       www.keel.es
Kepler                    MAT       kepler-project.org
KNIME                     INT       www.knime.org
LibSVM                    LIB       www.csie.ntu.edu.tw/~cjlin/libsvm
MEGA                      SOL       www.megasoftware.net/m_distance.html
MLC++                     LIB       www.sgi.com/tech/mlc
Orange                    LIB       www.ailab.si/orange
Pegasus                   RES       www.cs.cmu.edu/~pegasus
Pentaho                   BI        sourceforge.net/projects/pentaho
Proximity                 SPEC      kdl.cs.umass.edu/proximity/index.html
PRTools                   EXT       www.prtools.org
R                         MAT       www.r-project.org
RapidMiner                DMS       www.rapidminer.com
Rattle                    INT       rattle.togaware.com
ROOT                      LIB       root.cern.ch/root
ROSETTA                   SPEC      www.lcb.uu.se/tools/rosetta/index.php
Rseslibs                  RES       logic.mimuw.edu.pl/~rses
Rule Discovery System∗    SPEC      www.compumine.com
RWEKA                     INT       cran.r-project.org/web/packages/RWeka/index.html
TANAGRA                   INT       eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
Waffles                   LIB       waffles.sourceforge.net
WEKA                      DMS, LIB  sourceforge.net/projects/weka
XELOPES Library∗          LIB       www.prudsys.de/en/technology/xelopes
XLMiner∗                  EXT       www.resample.com/xlminer
∗ Commercial tools with free licenses for academic use.
NeuralWorks Predict, and RapAnalyst for artificial neural networks.
• RES are usually the first (and not always stable) implementations of new and innovative algorithms. They contain only one or a few algorithms with restricted graphical support and without automation support. Import and export functionality is rather restricted and database coupling is missing or weak. RES tools are mostly open source. They are mainly attractive to users in algorithm development and applied research, specifically in very innovative fields. Examples are GIFT for content-based image retrieval, Himalaya for mining maximal frequent item sets, sequential pattern mining and scalable linear regression trees, Rseslibs for rough sets, and Pegasus for graph mining. Early versions of today’s popular tools such as WEKA and RapidMiner started in this category and shifted later to other categories such as DMS.
• Solutions (SOLs) describe a group of tools that are customized to narrow application fields such as text mining (GATE), image
processing (ITK, ImageJ), drug discovery (Molegro Data Modeler), image analysis in microscopy (CellProfilerAnalyst), or mining gene expression profiles (Partek Genomics Suite, MEGA). The advantage of these solutions is the excellent support of domain-specific feature extraction techniques, evaluation measures, visualizations, and import formats. The level of data mining methods ranges from rather weak support (particularly in image processing) to highly developed algorithms. In some cases, more general tools from types DMS or INT also support specific domains (KNIME, Gait-CAD for peptide chemoinformatics). There are many commercial and open-source solutions.

A large variety of tools actually requires a fuzzy categorization with gradual memberships to different types. Examples are tools that include a set of different algorithms (LIB) with an additional GUI and thus act as an INT, DMS tools that include special methods for narrow application fields, and others. In these cases, a main type was assigned and the other fuzzy memberships are discussed in the Excel table in the additional material section.
The following kinds of tools were not included in the comparison:
• nonavailable software (e.g., owing to company mergers or stopped developments), which is only listed in the Excel table in the additional material,
• software for the handling of data warehouses without explicit focus on data mining,
• software for the manual design and application of rule-based systems,
• software for table calculation with a focus on office users, and
• customized solutions for very narrow fields.

CONCLUSION

Many advanced tools for data mining are available either as open-source or commercial software. They cover a wide range of software products, from comfortable problem-independent data mining suites, to business-centered data warehouses with integrated data mining capabilities, to early research prototypes for newly developed methods. In this paper, nine different types of tools are presented: DMS, BIs, MATs, INT, EXT, SPECs, RES, LIBs, and SOLs. They vary in many characteristics, such as intended user groups, possible data structures, implemented tasks and methods, interaction styles, import and export capabilities, platforms, and license policies. Recent tools are able to handle large datasets with single features, time series, and even unstructured data such as texts; however, there is a lack of powerful and generalized mining tools for multidimensional datasets such as images and videos.

SUPPLEMENTARY INFORMATION

An additional Excel table contains a list of 269 tools (195 recent and 74 historical tools, version from July 22, 2010). For each tool, the following information is available:
• toolbox name,
• company or group (with the term ‘various’ for open-source projects without an explicit developer),
• categorization into types with abbreviations for Research Prototypes (RES), Data Mining Libraries (LIB), Business Intelligence Packages (BI), Data Mining Software (DMS), Specialties (SPEC), Mathematical Packages (MAT), Extensions (EXT), Integration Packages (INT), Solutions (SOL),
• Giraud-Carrier: marking the coverage by the Excel table in Ref 12 (as of February 3, 2010) with the values 1 (included in a detailed categorization), −1 (excluded), empty field: not mentioned,
• remarks,
• web link,
• activity: 1 (relevant tool, included in the comparison), 0 (less relevant), −1 (not available),
• license: OS, open source; CO, commercial; CO/OS, different versions available.

There are a number of regularly updated web resources with link lists, but they lack a criteria-based comparison of the tools. The most important web resources are:
• KDnuggets: http://www.kdnuggets.com/software/suites.html, including regular polls to identify the most frequently used tools,
• The Data Mine: http://www.the-data-mine.com/bin/view/software
• The Open Directory Project: http://www.dmoz.org/Computers/Software/Databases/Data_Mining
• Sourceforge (very popular platform for open-source solutions, search for ‘data mining’ to find data mining tools hosted at Sourceforge): http://sourceforge.net/
• Kernel Machines (especially to get a list of software for support vector machines): http://www.kernel-machines.org/software
• Tools for Bayesian Networks: www.cs.helsinki.fi/research/cosco/Bnets.
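
The per-tool fields of the supplementary table (type abbreviations, activity, and license codes) lend themselves to simple filtering. The following Python sketch illustrates one possible way to represent such rows and to select the tools included in the comparison. It is not the authors' tooling: the names `ToolRecord`, `relevant_tools`, and `names_by_main_type` are hypothetical, the first entry of `types` is assumed to be the main type with any further entries standing for additional memberships, and the sample rows are illustrative values based on Tables 3 and 5 rather than copied from the Excel table.

```python
from collections import defaultdict
from dataclasses import dataclass

# The nine type abbreviations used in the paper.
TOOL_TYPES = {"DMS", "BI", "MAT", "INT", "EXT", "LIB", "SPEC", "RES", "SOL"}
# License codes of the supplementary table.
LICENSES = {"OS", "CO", "CO/OS"}


@dataclass(frozen=True)
class ToolRecord:
    """One row of the tool list (hypothetical structure, not the Excel schema)."""
    name: str
    types: tuple    # first entry: main type; further entries: additional memberships
    license: str    # "OS", "CO", or "CO/OS"
    activity: int   # 1 = relevant, 0 = less relevant, -1 = not available

    def __post_init__(self):
        # Reject rows that use codes outside the vocabulary described in the text.
        if not self.types or not set(self.types) <= TOOL_TYPES:
            raise ValueError(f"unknown or missing type for {self.name}: {self.types}")
        if self.license not in LICENSES:
            raise ValueError(f"unknown license code for {self.name}: {self.license}")
        if self.activity not in (1, 0, -1):
            raise ValueError(f"unknown activity value for {self.name}: {self.activity}")

    @property
    def main_type(self):
        return self.types[0]


def relevant_tools(records, licenses=("OS", "CO/OS")):
    """Keep tools with activity 1 whose license code is in `licenses`."""
    return [r for r in records if r.activity == 1 and r.license in licenses]


def names_by_main_type(records):
    """Group tool names by their main type, preserving input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record.main_type].append(record.name)
    return dict(groups)


# Illustrative rows only; license and activity values are assumptions.
SAMPLE = [
    ToolRecord("WEKA", ("DMS", "LIB"), "OS", 1),
    ToolRecord("R", ("MAT",), "OS", 1),
    ToolRecord("KNIME", ("INT",), "OS", 1),
    ToolRecord("MATLAB", ("MAT",), "CO", 1),
    ToolRecord("LegacyPrototype", ("RES",), "OS", -1),  # hypothetical unavailable tool
]

if __name__ == "__main__":
    print(names_by_main_type(relevant_tools(SAMPLE)))
```

Storing the types as an ordered tuple keeps the assigned main type explicit while leaving room for the additional memberships that the fuzzy categorization calls for; numeric membership degrees are deliberately left out because the text does not specify them.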
ACKNOWLEDGEMENTS
The authors thank C. Giraud-Carrier for a copy of an Excel table containing a large set of data
mining tools, the anonymous reviewers for many comments and suggestions, and R. A. Klady
for the critical proofreading of the manuscript.
REFERENCES
1. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag 1996, 17:37–54.
2. Smyth P. Data mining: Data analysis on a grand scale? Stat Methods Med Res 2000, 9:309–327.
3. Lovell MC. Data mining. Rev Econ Stat 1983, 65:1–11.
4. Han J, Kamber M. Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann; 2006.
5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer; 2008.
6. Engelbrecht AP. Computational Intelligence - An Introduction. Chichester: John Wiley; 2007.
7. Vesset D, McDonough B. Worldwide business intelligence tools 2008 vendor shares, IDC Competitive Analysis Report (2009).
8. Frank E, Hall M, Holmes G, Kirkby R, Pfahringer B, Witten I. Weka: A machine learning workbench for data mining. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. New York: Springer; 2005, 1305–1314.
9. Goebel M. A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explorations Newsletter 1999, 1:20–33.
10. Wang J, Hu X, Hollister K, Zhu D. A comparison and scenario analysis of leading data mining software. Int J Knowl Manage 2008, 4:17–34.
11. Wang J, Chen Q, Yao J. Data mining software. In: Tomei L, ed., Encyclopedia of Information Technology Curriculum Integration. Hershey, PA: Information Science Publishing; 2008, 173–178.
12. Giraud-Carrier C, Povel O. Characterising data mining software. Intell Data Anal 2003, 7:181–192.
13. Chen X, Ye Y, Williams G, Xu X. A survey of open source data mining systems, Lecture Notes in Computer Science 2007, 4819:3–14.
14. Alcalá-Fdez J, Sánchez L, García S, del Jesus M, Ventura S, Garrell J, Otero J, Romero C, Bacardit J, Rivas V, et al. KEEL: A software tool to assess evolutionary algorithms for data mining problems. Soft Comput 2009, 13:307–318.
15. Haughton D, Deichmann J, Eshghi A, Sayek S, Teebagy N, Topi H. A review of software packages for data mining. Am Stat 2003, 57:290–310.
16. Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D, Evangelista C, Kim I, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: Mining tens of millions of expression profiles–database and tools update. Nucleic Acids Res 2007, D760.
17. Weiss S. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer-Verlag; 2005.
18. Dillmann R. Teaching and learning of robot tasks via observation of human performance. Rob Auton Syst 2004, 47:109–116.
19. Leach A, Gillet V. An Introduction to Chemoinformatics. Springer; 2007.
20. Shearer C. The CRISP-DM model: The new blueprint for data mining. J Data Warehousing 2000, 5:13–22.
21. Mikut R, Reischl M, Burmeister O, Loose T. Data mining in medical time series. Biomed Tech 2006, 51:288–293.
22. Grossman R, Hornick M, Meyer G. Data mining standards initiatives. Commun ACM 2002, 45:61.
23. Muthukrishnan S. Data Streams: Algorithms and Applications. Hanover, MA: Now Publishers Inc.; 2005.
24. Chakrabarti D, Faloutsos C. Graph mining: laws, generators, and algorithms. ACM Comput Surv (CSUR) 2006, 38:1–69.
25. Borgelt C. Graph mining: An overview. Proc., 19. Workshop Computational Intelligence. Karlsruhe, Germany: KIT Scientific Publishing; 2009, 189–203.
26. Datta R, Joshi D, Li J, Wang J. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput Surv (CSUR) 2008, 40:1–60.
27. Zhu X, Wu X, Elmagarmid A, Feng Z, Wu L. Video data mining: Semantic indexing and event detection from the association perspective. IEEE Trans Knowl Data Eng 2005, 17:665–677.
28. Damashek M. Gauging similarity with n-Grams: Language-independent categorization of text. Science 1995, 267:843–848.
29. Jain AK, Duin RPW, Mao J. Statistical pattern recognition: A review. IEEE Trans Pattern Anal Mach Intell 2000, 22:4–36.
30. Breiman L. Random forests. Mach Learn 2001, 45:5–32.
31. Pawlak Z. Rough sets and intelligent data analysis. Inf Sci 2002, 147:1–12.
32. Pechter R. What’s PMML and what’s new in PMML 4.0? ACM SIGKDD Explorations Newsletter 2009, 11:19–25.
33. Hornick M, Marcadé E, Venkayala S. Java Data Mining: Strategy, Standard, and Practice: A Practical Guide for Architecture, Design, and Implementation. San Francisco: Morgan Kaufmann Publishers Inc.; 2006.
34. Anand S, Grobelnik M, Herrmann F, Hornick M, Lingenfelder C, Rooney N, Wettschereck D. Knowledge discovery standards. Artificial Intelligence Review 2007, 27:21–56.
35. Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: Services, tools, and applications. IEEE Trans Syst Man Cybern B Cybern 2004, 34:2451–2465.
36. Sonnenburg S, Braun M, Ong C, Bengio S, Bottou L, Holmes G, LeCun Y, Müller K, Pereira F, Rasmussen C, et al. The need for open source software in machine learning. J Mach Learn Res 2007, 8:2443–2466.
37. Bitterer A. Open-source business intelligence tool production deployments will grow five-fold through 2010, Gartner RAS Research Note G00171189 (2009).

