
Inducer: a Rule Induction Workbench for Data Mining

Max Bramer
Faculty of Technology, University of Portsmouth, Portsmouth, UK
Email: Max.Bramer@port.ac.uk
Fax: +44-2392-843030

Abstract

One of the key technologies of data mining is the automatic induction of classification rules from examples. A variety of methods have been proposed and many comparative studies of methods have been carried out. However, the absence of a widely available common platform has made it difficult to carry out sufficiently in-depth studies to establish the superiority of one method over another. This paper describes Inducer, a rule induction workbench which aims to provide a common platform with a graphical user interface for analysing a variety of rule induction algorithms and related strategies with a range of datasets, without the need for any programming by the user. The principal features of Inducer are described with reference to two well-known rule induction algorithms. The paper concludes with an example of its use to compare these algorithms' performance when there are missing data values.

1 Introduction

The growing commercial importance of knowledge discovery and data mining techniques has stimulated new interest in the automatic induction of classification rules from examples, a field in which research can be traced back at least as far as the mid-1960s [1]. Many practical decision-making tasks can be formulated as classification problems, for example:

• Customers who are likely to buy or not buy a particular product in a supermarket
• People who are at high, medium or low risk of acquiring a certain illness
• Student projects worthy of a distinction, merit, pass or fail grade
• Objects on a radar display which correspond to vehicles, people, buildings or trees.

In many fields, large collections of examples (often collected for other purposes) are readily available, and automatically generating classification rules for such tasks has proved to be a realistic alternative to the standard Expert System approach of eliciting decision rules from experts. Reference [2] reports two large applications of 2,800 and 30,000+ rules, developed using automatic rule induction techniques in only one and nine man-years respectively, compared with the estimated 100 and 180 man-years needed to develop the celebrated 'conventional' Expert Systems Mycin and XCON.

Most work in this field to date has concentrated on generating classification rules in the intermediate form of a decision tree using variants of the TDIDT (Top-Down Induction of Decision Trees) algorithm [3]. Although the decision tree representation is widely used, it suffers from the fundamental problem that the corresponding classification rules can be excessively complex, owing to the need for the rules invariably to 'link together' to form a tree structure. As rule induction techniques are applied to larger and larger commercial datasets this weakness is likely to become increasingly important. One approach to overcoming this problem is the rule induction algorithm Prism [4,5], which generates classification rules directly, term by term, from the training set without using the intermediate representation of a decision tree. Algorithms for generating classification rules using other approaches (such as neural networks) have also been developed but will not be considered further in this paper.
Although there are many comparative studies of different algorithms reported in the research literature, the absence of a widely available common platform for this work has inevitably made it difficult to carry out sufficiently in-depth studies to establish the relative strengths and weaknesses of different methods for particular types of application domain. Two systems which seek to provide a common platform for the experimenter are MLC++ [6], a library of C++ classes for supervised machine learning which can be incorporated into a user's own programs, and WEKA [7], a library of machine learning algorithms written in Java which can be run from a Java command line. The Inducer rule induction workbench is concerned specifically with the generation of classification rules and is aimed at investigators who prefer to use a graphical interface without the need for any programming or use of Unix commands. The package incorporates a wide range of user options and also permits the creation of a number of output files to facilitate further analysis.

The current version of Inducer includes implementations of TDIDT and N-Prism (a revised version of Prism, as described in [5]), as representatives of two of the most important classes of algorithm for automatic induction of classification rules. Further algorithms and additional user facilities will be added in future versions.

Following a brief introduction to the basic TDIDT and Prism algorithms, this paper goes on to describe the principal features of Inducer. A brief description of a series of experiments to compare TDIDT and N-Prism within the common framework of Inducer is also given.

2 Automatic Induction of Classification Rules

2.1 Example and Basic Terminology

The following example is taken from [3]. Table 1 records a golf player's decisions on whether or not to play each day based on that day's weather conditions.

Outlook   Temp (°F)  Humidity (%)  Windy  Class
sunny     75         70            true   play
sunny     80         90            true   don't play
sunny     85         85            false  don't play
sunny     72         95            false  don't play
sunny     69         70            false  play
overcast  72         90            true   play
overcast  83         78            false  play
overcast  64         65            true   play
overcast  81         75            false  play
rain      71         80            true   don't play
rain      65         70            true   don't play
rain      75         80            false  play
rain      68         80            false  play
rain      70         96            false  play

Table 1. The Golf Training Set

What combination of weather conditions determines whether the decision is to play or not to play? The standard terminology is to call Table 1 a training set and each of its rows an instance. Each instance comprises the values of a number of attributes (variables) and the value of a corresponding classification. Attributes can be either categorical (such as outlook) or continuous (such as humidity). However, continuous attributes need to be split into a discrete set of ranges (e.g. humidity <= 75 or > 75) before use. One possible set of classification rules that can be derived from Table 1 is as follows:

1. IF outlook = sunny AND humidity <= 75 THEN Class = play
2. IF outlook = sunny AND humidity > 75 THEN Class = don't play
3. IF outlook = overcast THEN Class = play
4. IF outlook = rain AND windy = true THEN Class = don't play
5. IF outlook = rain AND windy = false THEN Class = play

2.2 Inducing Decision Trees: The TDIDT Algorithm

The TDIDT algorithm constructs a set of classification rules via the intermediate representation of a decision tree. The algorithm begins by choosing an attribute and then divides the training set into subsets corresponding to each of its possible values. Each resulting branch of the tree then leads either to a leaf node, if all the instances in the corresponding subset have the same classification, or otherwise to a subtree constructed using the same algorithm recursively. A possible decision tree corresponding to the above table is shown in Figure 1.

[Figure 1. Decision Tree Corresponding to Table 1 ('To Play or Not to Play?'): the root tests outlook; the sunny branch tests humidity (<= 75: play, > 75: don't play), the overcast branch is a 'play' leaf, and the rain branch tests windy (true: don't play, false: play).]

This tree corresponds to the five classification rules given in Section 2.1, one rule per branch. In general there are many possible decision trees and thus many different sets of rules corresponding to any given training set. The aim is to generate the set of classification rules that gives the best possible level of predictive accuracy when applied to a table of previously unseen instances, known as a test set.

The TDIDT algorithm can be summarised as follows:

    IF all cases in the training set belong to the same class
    THEN return the value of the class
    ELSE (a) select an attribute to split on *
         (b) sort the instances in the training set into non-empty
             subsets, one for each attribute value
         (c) return a tree with one branch for each subset, each branch
             having a descendant subtree or a class value produced by
             applying the algorithm recursively for each subset in turn.

* No attribute may be selected more than once in any branch.
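To make the recursion concrete, the sketch below renders the summary above in Java. It is a minimal illustration only, not the Inducer implementation: the Instance and Node types and the take-first selection placeholder are assumptions introduced here (the selection criteria actually offered are described in Section 3.4.1).

    import java.util.*;

    // Minimal sketch of the TDIDT summary above; illustrative types only.
    class TdidtSketch {
        static class Instance {
            Map<String, String> attrs; String label;
            Instance(Map<String, String> a, String l) { attrs = a; label = l; }
        }
        static class Node {
            String label;                          // non-null for a leaf
            String attribute;                      // attribute tested at this node
            Map<String, Node> branches = new HashMap<>();
        }

        static Node build(List<Instance> data, Set<String> unused) {
            Node node = new Node();
            Set<String> labels = new HashSet<>();
            for (Instance i : data) labels.add(i.label);
            // IF all cases belong to the same class THEN return that class.
            // (If no attributes remain, a clash has occurred; see Section 3.4.4.)
            if (labels.size() == 1 || unused.isEmpty()) {
                node.label = majority(data);
                return node;
            }
            // (a) select an attribute to split on (placeholder: take-first).
            node.attribute = unused.iterator().next();
            Set<String> remaining = new HashSet<>(unused);
            remaining.remove(node.attribute);      // * never reuse in a branch
            // (b) sort instances into non-empty subsets, one per attribute value.
            Map<String, List<Instance>> subsets = new HashMap<>();
            for (Instance i : data)
                subsets.computeIfAbsent(i.attrs.get(node.attribute),
                                        k -> new ArrayList<>()).add(i);
            // (c) one branch per subset, built by recursive application.
            for (Map.Entry<String, List<Instance>> e : subsets.entrySet())
                node.branches.put(e.getKey(), build(e.getValue(), remaining));
            return node;
        }

        static String majority(List<Instance> data) {
            Map<String, Integer> freq = new HashMap<>();
            for (Instance i : data) freq.merge(i.label, 1, Integer::sum);
            return Collections.max(freq.entrySet(),
                                   Map.Entry.comparingByValue()).getKey();
        }
    }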
Provided that no two instances with the same values of all the attributes belong to different classes, any method of selecting attributes at step (a) will suffice to produce a decision tree. However, the predictive value of the corresponding set of classification rules on unseen instances in a test set may depend critically on how it is done. The most common attribute selection criterion is probably Information Gain [3], which uses the information-theoretic measure entropy at each stage to split on the attribute which maximises the expected gain of information from applying the additional test. This is the approach adopted in the well-known system C4.5 [3].

2.3 Inducing Decision Rules: The Prism Algorithm

The basic Prism algorithm can be summarised as follows, assuming that there are n (> 1) possible classes. Further details are given in [5].

    For each class i from 1 to n inclusive:
    (1) Calculate the probability that class = i for each
        attribute-value pair
    (2) Select the attribute-value pair with the maximum probability and
        create a subset of the training set comprising all instances with
        the selected combination (for all classes)
    (3) Repeat 1 and 2 for this subset until it contains only instances
        of class i. The induced rule is then the conjunction of all the
        attribute-value pairs selected in creating this subset
    (4) Remove all instances covered by this rule from the training set
    Repeat 1-4 until all instances of class i have been removed.
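The loop structure of steps (1)-(4) can be sketched in Java as follows. This is an illustration of the basic algorithm only, using an assumed row representation (String arrays whose last field is the class); it is not the N-Prism code inside Inducer and it omits clash handling (Section 3.4.4).

    import java.util.*;

    // Sketch of basic Prism steps (1)-(4); a rule is a list of
    // (attribute index, value) terms. Illustrative only.
    class PrismSketch {
        static class Term {
            int attr; String value;
            Term(int a, String v) { attr = a; value = v; }
        }

        static List<List<Term>> prism(List<String[]> train, int nAttrs,
                                      List<String> classes) {
            List<List<Term>> rules = new ArrayList<>();
            for (String cls : classes) {                    // for each class i
                List<String[]> remaining = new ArrayList<>(train);
                while (count(remaining, nAttrs, cls) > 0) { // repeat 1-4
                    List<Term> rule = new ArrayList<>();
                    List<String[]> subset = remaining;
                    // (1)-(3): add the pair with maximum P(class = cls | pair)
                    // until the subset contains only instances of class cls.
                    while (count(subset, nAttrs, cls) < subset.size()) {
                        Term best = null; double bestP = -1;
                        for (int a = 0; a < nAttrs; a++) {
                            if (used(rule, a)) continue;    // one term per attribute
                            for (String[] r : subset) {
                                int match = 0, hit = 0;
                                for (String[] x : subset)
                                    if (x[a].equals(r[a])) {
                                        match++;
                                        if (x[nAttrs].equals(cls)) hit++;
                                    }
                                double p = (double) hit / match;
                                if (p > bestP) { bestP = p; best = new Term(a, r[a]); }
                            }
                        }
                        if (best == null) break;            // no attributes left: clash
                        rule.add(best);
                        // (2) keep all instances (of every class) matching the pair.
                        List<String[]> next = new ArrayList<>();
                        for (String[] x : subset)
                            if (x[best.attr].equals(best.value)) next.add(x);
                        subset = next;
                    }
                    rules.add(rule);
                    // (4) remove every instance covered by the finished rule.
                    remaining.removeIf(r -> covers(rule, r));
                }
            }
            return rules;
        }

        static boolean used(List<Term> rule, int attr) {
            for (Term t : rule) if (t.attr == attr) return true;
            return false;
        }
        static boolean covers(List<Term> rule, String[] row) {
            for (Term t : rule) if (!row[t.attr].equals(t.value)) return false;
            return true;
        }
        static int count(List<String[]> rows, int nAttrs, String cls) {
            int n = 0;
            for (String[] r : rows) if (r[nAttrs].equals(cls)) n++;
            return n;
        }
    }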
3 The Inducer Rule Induction Workbench

3.1 Introduction to Inducer

The Inducer rule induction workbench [8] is one of a suite of packages developed to facilitate experiments with different techniques for generating classification rules. Inducer is implemented in Java (version 1.1) in the interests of portability and is available both as a standalone application and as an applet. The package was implemented partly for teaching purposes but principally to facilitate practical experimentation with a range of rule induction algorithms and associated strategies. It is written in a modular fashion to enable further algorithms and strategies to be added relatively easily in the future. Figure 2 shows a screen image of Inducer running the hypothyroid dataset ('hypo') from the repository of machine learning datasets at the University of California at Irvine (UCI) [9].

[Figure 2. Inducer Screen Image]

3.2 Basic Use

Inducer is designed to be easy to use. It has a graphical user interface which enables the expert user to select from a wide range of options, but for the inexperienced user it requires only two mouse clicks to obtain a set of classification rules: one to select a dataset from a menu in the top middle of the screen and another to press the go button in the top left-hand corner. The default rule generation algorithm used is TDIDT, with entropy (or information gain) as the attribute selection criterion, as described in Section 2.2 above. Other parameters all have reasonable default values compatible with these.

There are limitations on the maximum size of training set and test set that can be processed, the maximum number of rules that can be generated, etc. These are platform dependent and can be obtained by the user pressing the info (information) button, which displays the contents of the file inducer.inf. Execution times are typically no more than a few seconds for datasets of the size of most of those in the UCI Repository.

Having generated a set of classification rules, Inducer then runs the rules first against the original training set and then against a test set of previously unseen data (provided that such a test set exists and that the Use Test Set option has been selected). For each training and test set examined, Inducer calculates and displays a confusion matrix with one row and column per classification and one additional column corresponding to unclassified instances. If there are N possible classifications the matrix is thus of N rows by (N+1) columns. The entry for row i, column j corresponds to the number of instances in the dataset with a correct classification of i and a computed classification of j. Entries in column N+1 occur when the induced rule set is unable to make a classification of a given instance.

In the case of a perfect classification all non-zero entries in the confusion matrix will occur on the leading diagonal of the main N x N matrix, with none in the rightmost ('unclassified') column. This will generally occur when the classification rules are run against the original training set which was used to generate them, but this is not invariably the case, e.g. if cutoffs (as described in Section 3.4.2) were applied during the rule generation process. As well as the confusion matrix, Inducer displays the number of correct matches, incorrect matches and unclassified instances and the percentage of correct matches.

The options available for the experienced user are described in Sections 3.3 to 3.11 below. The current values of all system parameters, including those not displayed on the screen, can be obtained by the user by pressing the settings button at any time.

3.3 Data Input Parameters

3.3.1 Dataset Selection

There are currently 24 datasets available for selection from the default input file directory. These are mainly taken from the UCI Repository [9]. Alternative input file directories can be specified if preferred. The format of input files is essentially that specified in [3]. Each dataset comprises a name file, a training set and in most cases also a test set. The first line of the name file gives a list of all possible classifications. Subsequent lines give details of each of the attributes in turn, with the name of the attribute followed by either a list of its possible values, in the case of a categorical attribute, or the word continuous, denoting a numerical attribute, or ignore. The facility to specify an attribute as ignore is a valuable one, enabling the user easily to experiment with the effect of 'turning off' one or more attributes without having to change the data.

Training sets and test sets have the same format. Each record corresponds to one instance and comprises the values of each of the attributes in the order in which they are listed in the name file, separated by commas as delimiters, followed by the corresponding classification as the last field in the record.
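As an illustration of this format, a name file for the golf training set of Table 1 might look as follows. The file name and details of punctuation are illustrative assumptions; the exact conventions are those of [3].

    golf.names:

        play, don't play
        outlook: sunny, overcast, rain
        temp: continuous
        humidity: continuous
        windy: true, false

The first records of the corresponding training set file would then be:

    sunny,75,70,true,play
    sunny,80,90,true,don't play
    sunny,85,85,false,don't play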
3.3.2 Ignore Continuous Attributes

The TDIDT algorithm as implemented in Inducer has a facility for local discretization of continuous attributes, i.e. dividing the values of an attribute X into two parts, X < a and X >= a, say, at each stage of the tree generation process. However, many rule induction algorithms, including N-Prism, have no facilities for dealing (directly) with continuous attributes, and it is sometimes helpful for the user to be able to 'turn off' such attributes, effectively treating them as if they were specified as ignore attributes in the name file.

3.3.3 Missing Value Strategy

For many practical applications the value of some attributes (or even in some cases the correct classification) may not be available for some or perhaps even all of the instances. An important practical requirement of a rule induction algorithm in many domains is the ability to make an accurate prediction of the classification of unseen test data even when there are missing values in the training set, the test set or both. Missing values in both training and test sets are conventionally identified by the symbol '?'. Two missing value strategies are available in Inducer:

(a) Ignore any instance in either the training or the test set that contains one or more missing values.
(b) Replace any missing values for a categorical attribute or for the classification by the most frequently occurring (non-missing) value for that attribute or classification, and any missing values for a continuous attribute by the average of the (non-missing) values of that attribute. This is the default setting.

3.4 Rule Generation Parameters

3.4.1 Attribute Selection Method

There are five attribute selection criteria available when the TDIDT algorithm is selected, the last three being provided for teaching and comparative purposes only:

(1) entropy (or information gain) - the default (sketched in code after this list)
(2) gain ratio - a measure devised by Quinlan [3] aimed at overcoming the bias inherent in the use of information gain towards selecting attributes with a large number of values
(3) takefirst - work through the attributes specified in the name file in order, from left to right
(4) takelast - as (3) but work from right to left
(5) random - select at random.
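For criterion (1), the information gain of a split on a categorical attribute can be computed as sketched below, using the same assumed row representation as the earlier sketches (this is an illustration, not Inducer's own code). On the training set of Table 1 this calculation selects outlook as the first attribute to split on, since it yields the largest gain of the categorical attributes.

    import java.util.*;

    // Entropy-based selection: choose the attribute maximising information
    // gain, i.e. minimising the weighted average entropy after the split.
    class EntropySketch {
        static double entropy(List<String[]> rows, int classIdx) {
            Map<String, Integer> freq = new HashMap<>();
            for (String[] r : rows) freq.merge(r[classIdx], 1, Integer::sum);
            double e = 0;
            for (int n : freq.values()) {
                double p = (double) n / rows.size();
                e -= p * (Math.log(p) / Math.log(2));   // entropy in bits
            }
            return e;
        }

        static double infoGain(List<String[]> rows, int attr, int classIdx) {
            // Weighted average entropy of the subsets produced by the split...
            Map<String, List<String[]>> subsets = new HashMap<>();
            for (String[] r : rows)
                subsets.computeIfAbsent(r[attr], k -> new ArrayList<>()).add(r);
            double after = 0;
            for (List<String[]> s : subsets.values())
                after += (double) s.size() / rows.size() * entropy(s, classIdx);
            // ...subtracted from the entropy before the split.
            return entropy(rows, classIdx) - after;
        }
    }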
3.4.2 Cutoffs During Rule Generation

A problem that often arises during rule generation is the over-fitting of rules to data. A rule such as

    IF a = 1 AND b = 1 AND c = 3 AND d = 2 THEN Class = 3

which is correct but corresponds to only a small number of instances in the training set may be of little value in predicting the classification for unseen data. Generalising such a rule by stopping the rule generation process before the left-hand side is complete, e.g.

    IF a = 1 AND b = 1 THEN Class = 3

may be less accurate as far as the training set is concerned but of considerably more value when classifying data in an unseen test set.

Inducer has facilities for the user to specify two different types of cutoff during rule generation:

(a) a depth cutoff after either 1, 2 or 5 terms have been generated for a given rule (the default is 'unlimited')
(b) a size cutoff if the number of instances in the training set currently under consideration is below either 5 or 10 (the default is zero).

For both types of cutoff the result is a partly complete left-hand side of a rule, together with a subset of the training set containing instances with more than one classification, for which further subdivision has been inhibited. The rule is then either discarded or a single classification is taken, depending on the clash handling strategy selected, as described in Section 3.4.4 below.

3.4.3 Discarding Rules on the Basis of 'Interestingness'

A topic of growing importance in recent years is that of rule 'interestingness' [10], the aim of which is to identify those rules in a generated ruleset which are likely to be of value in classifying unseen instances. There are many measures of rule interestingness available, a basic one being Piatetsky-Shapiro's RI measure [11], which gives the difference between the actual number of instances matched by the rule in the training set and the number that would be expected by chance. Inducer allows the user to specify that a rule should be discarded (and not displayed) if the value of RI is less than zero, one or five. The default is 'never discard'.
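Expressed as a formula, for a rule with left-hand side L and class C in a training set of N instances, RI as characterised above is the observed count of instances matching both L and C minus the count expected by chance: RI = n(L and C) - n(L) * n(C) / N. A one-method Java sketch (parameter names are illustrative):

    // Piatetsky-Shapiro's RI: observed rule support minus the support
    // expected by chance. Illustrative sketch only.
    class RiSketch {
        static double ri(int lhsMatches, int classMatches, int bothMatch,
                         int total) {
            double expected = (double) lhsMatches * classMatches / total;
            return bothMatch - expected;  // near or below zero: chance-level rule
        }
    }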
3.4.4 Handling Clashes During Rule Generation

A clash occurs when a rule induction algorithm encounters a subset of the training set which has more than one classification but which cannot be further processed, either because there are no more attributes available or because of a cutoff. Such a subset is called a clash set. Two strategies are currently implemented in Inducer for dealing with clashes encountered during rule generation:

(a) discard all instances in the clash set
(b) (default setting) treat the clash set as if all the instances in it had the same classification, i.e. the one that occurs most frequently. In the case of TDIDT a new rule is created. For N-Prism the rule is created only if the most frequently occurring classification is the one for which rules are currently being generated; otherwise the instances are discarded.

3.4.5 Dealing with Continuous Attributes

Inducer has a parameter that specifies how continuous attributes are to be handled during rule generation (TDIDT only):

(a) the attribute may be used once only in any given rule, as for categorical attributes
(b) (the default setting) the attribute may be used, i.e. subdivided into ranges, more than once in a rule if necessary, e.g.

    IF a = yes AND x < 6.4 AND b = 3 AND x < 3.2 THEN Class = 2

Inducer automatically combines multiple ranges of a continuous attribute; e.g. the above rule would be treated as

    IF a = yes AND x < 3.2 AND b = 3 THEN Class = 2

3.5 Rule Execution Parameters

A 'default to majority class' facility is provided to force Inducer to classify all instances in the test set. Instances that would otherwise be unclassified are assigned to the most commonly occurring class. This is likely to be of practical value in domains where it is important to maximise the number of correct classifications and there is little or no penalty for incorrect ones. The default setting is false.

3.6 Display Parameters

Options are available (all defaulting to 'false') to display:

(a) the frequency of occurrence of each value of each categorical attribute, together with the minimum, maximum and average values of each continuous attribute
(b) the value of the RI measure for each rule
(c) details of clashes encountered during rule generation.

There is also an option to suppress the display of rules to the screen. This can speed up execution considerably. The total number of rules and terms and the average number of terms per rule are still displayed.

3.7 Saved Rule Parameters

Two parameters are available in connection with the saving of rules (the default for both is false):

(a) Save Rules - save rules in a saved rule file in a coded form, together with the values of the main system parameter settings. Rule files are output to the saved rules directory, which is set by the Inducer initialisation file (see Section 3.10).
(b) Use Saved Rules - read in rules from a saved rule file for the specified dataset. No processing of the training set takes place and the values of all rule generation parameters are ignored.

3.8 Log File

Every time the go button is pressed, details of the main system parameter settings are recorded in a log file inducer.log, together with the number of rules and terms generated and the number of instances correctly classified, incorrectly classified or unclassified in the training set and (if applicable) the test set.

3.9 Other Output Files

Four further selections, all defaulting to false, cause additional output files to be written. All output files are saved to the output files directory, which is set by the Inducer initialisation file.

(a) A number of interestingness measures associated with each rule, including RI, are output to a file.
(b) Statistics giving the number of correctly classified, incorrectly classified and unclassified instances and the confusion matrix are output for the training set and (if selected) the test set for the chosen dataset.
(c) Detailed information about any misclassified instances can be output to an 'exceptions' file.
(d) Selecting the 'rule output to file' option enables the user to output the rules for possible use by another language or package. Three output formats are available:

• a Java method
• a set of Prolog clauses
• (TDIDT only) a script file suitable for input to the SPSS flowcharting package AllClear.

3.10 Initialisation File

The initialisation file inducer.ini is invoked automatically when the application version of Inducer starts executing. It is used to specify the location of the input, output and rule file directories and the file extensions for the name, training and test sets, plus the saved rules file, the 'output rules' file and the rule interestingness measures, exceptions and statistics files. For the applet version an equivalent effect to the initialisation file is achieved by embedding the information in the web page used to invoke the applet using the <PARAM> tag.

3.11 Batch Mode

An option is available (default setting false) to run Inducer in batch mode. When this is selected, details of the input file, output file and saved rule file directories are taken from the file inducer.bat. These are followed by an unlimited number of triples specifying a name file, a training set and a test set. Pressing the go button causes Inducer to run on each of these triples of files in turn. All other parameter settings are taken from the graphical interface. Running Inducer in batch mode enables a substantial series of experiments, say with a fixed name file and training set and a range of different test sets, to be run in a simple and rapid fashion, with the output automatically recorded in the log file.
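The exact layout of inducer.bat is defined in the user guide [8]. Purely as an illustration of the idea (directory names, file extensions and spacing below are assumptions, not the documented syntax), a batch file for three runs with a fixed name file and training set but different test sets might contain three directory lines followed by one triple per run:

    /inducer/data
    /inducer/output
    /inducer/rules
    vote.nam vote.dat vote10.tst
    vote.nam vote.dat vote20.tst
    vote.nam vote.dat vote30.tst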
4 Experiments in Rule Induction

The availability of the Inducer package makes it straightforward to conduct even very extensive comparisons of different rule induction algorithms and related strategies. A number of comparisons of TDIDT and N-Prism are reported in [5]. As a further example, an experiment was conducted to compare the sensitivity of the two algorithms to missing values. The dataset used for this experiment was the Vote dataset from the UCI Repository. The dataset has 16 attributes (all categorical) and 2 classes (Democrat and Republican), with 300 instances in the training set and 135 in the test set.

Using Datagen, another of the packages in the Inducer suite, missing values were systematically introduced into the training and test sets in a random fashion with frequencies from 10% up to 70%. Using the batch mode facility of Inducer, the classification accuracy of the two algorithms was then computed for missing value levels of 0%, 10%, 20%, 30%, 50% and 70% in each of the training and test sets, as a single batch run. Both algorithms used the same strategy for dealing with missing values, with each missing value being replaced by the most frequently occurring value when generating or applying the classification rules.

The two algorithms produced virtually identical numbers of rules for each level of missing values in the training set. With 70% missing values there is some advantage to N-Prism (65 rules compared with 73). However, Figure 3 shows that N-Prism is considerably better than TDIDT when measured by the total number of terms generated and thus the average number of terms per rule. With no missing values N-Prism generates a total of only 89 terms compared with 156 for TDIDT. Most strikingly, the total number of terms generated by N-Prism is not much more with 70% missing values (306 terms) than with 20% (296). By contrast, TDIDT generates 462 terms with 20% missing values, and this rises to 596 as the level of missing values increases to 70%.

[Figure 3. Number of Terms Generated for Varying Levels of Missing Values in the 'Vote' Training Set: total terms for TDIDT and Prism plotted against the percentage of missing values in the training set.]

Figure 4 shows the comparative levels of classification accuracy of the two algorithms for missing value levels of 20% and 50% in the training set. Both algorithms perform well overall, even with high levels of missing values in both sets.

[Figure 4. Effects of Introducing Missing Values in the 'Vote' Training and Test Sets: correct classifications (out of 135) for TDIDT and Prism plotted against the percentage of missing values in the test set, shown for 20% and 50% missing values in the training set.]

One way to extend these experiments would be to examine the effect of pre-pruning the rules, e.g. by means of a depth cutoff during the rule generation process, or of post-pruning them, say by discarding any rules with too low a value of the RI rule interestingness measure. In general, a range of such experiments would need to be carried out to determine the most appropriate rule induction technique for a given application.

5 Conclusions

The Inducer rule induction workbench provides a powerful framework for in-depth experiments with alternative rule induction algorithms and related strategies. One such experiment has been reported briefly above. The package has been developed in a modular fashion to facilitate the addition of further algorithms and strategies as required.
References

1. Hunt, E.B., Marin, J. and Stone, P.J. Experiments in Induction. Academic Press, 1966.
2. Michie, D. Machine Executable Skills from "Silent" Brains. In: Research and Development in Expert Systems VII. Cambridge University Press, 1990.
3. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
4. Cendrowska, J. PRISM: an Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies, 1987; 27: 349-370.
5. Bramer, M.A. Automatic Induction of Classification Rules from Examples Using N-Prism. In: Research and Development in Intelligent Systems XVI. Springer-Verlag, 2000, pp. 99-121.
6. MLC++ Machine Learning Library in C++. [A library of C++ classes, downloadable from http://www.sgi.com/Technology/mlc]
7. Witten, I.H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
8. Bramer, M.A. The Inducer User Guide and Reference Manual. Technical Report, University of Portsmouth, Faculty of Technology, 1999.
9. Blake, C.L. and Merz, C.J. UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
10. Freitas, A.A. On Rule Interestingness Measures. In: Research and Development in Expert Systems XV. Springer-Verlag, 1999, pp. 147-158.
11. Piatetsky-Shapiro, G. Discovery, Analysis and Presentation of Strong Rules. In: Piatetsky-Shapiro, G. and Frawley, W.J. (eds.), Knowledge Discovery in Databases. AAAI Press, 1991, pp. 229-248.