CN101401100B

CN101401100B - Data mining by determining patterns in input data

Info

Publication number: CN101401100B
Application number: CN2007800089386A
Authority: CN
Inventors: A·多奈希; T·博林格; C·林根费尔德
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-03-14
Filing date: 2007-02-02
Publication date: 2012-10-10
Anticipated expiration: 2027-02-02
Also published as: CN101401100A

Abstract

Patterns detection in input data containing a plurality of transactions, each transaction having at least one item, is carried out in the following way. Filter conditions for interesting patterns are received, and a first set of filter conditions applicable in connection with generation of candidate patterns is determined. An evaluated candidate pattern is selected as a parent candidate pattern, and evaluation information about the parent candidate pattern is maintained. Child candidate patterns are generated by extending the parent candidate pattern and taking into account the first set of filter conditions. The child candidate patterns are evaluated with respect to the input data together in sets of similar candidate patterns and based on the evaluation information about the parent candidate pattern. At least one child candidate pattern successfully passing the evaluation step is recursively used as a parent candidate pattern.

Description

Carry out data mining through confirming the pattern in the input data

Technical field

Relate generally to data mining of the present invention.Specifically, the present invention relates to confirm the pattern in the input data.

Background technology

Data mining refers generally to be used for the method from the data-driven of input data extract information.Other are used for from the method for input data extract information (the hypothesis driven) of hypothesis driven normally, wherein prove that from the angle of input data one group of hypothesis is correctly or mistake.

The input data volume maybe be very huge, so data mining technology need consider how effectively handle mass data usually.Consider to manufacture a product as an instance.Wherein, the input data can comprise origin and function, processing, the assembly of assembly in manufacturing plant of various and the assembly relevant data that how to be assembled together.In making context, the purpose of data mining can be the solution problem relevant with quality analysis and quality assurance.Data mining can for example be used for the early warning system in root cause analysis, the manufacturing plant, and reduces claim.As second instance, consider various information technology systems.Wherein, data mining can be further used for intrusion detection, system monitoring and case study.Data mining also has various other purposes, for example can analyze the typical customers behavior in retail and service field, and is used for finding the cause-effect relationship of clinical research at medical science and life science.

Mode detection is the data mining subject, wherein import data and comprise a plurality of affairs set, and each affairs comprises a project set.Affairs can also be sorted.Can still alternatively can define any ordering according to time-sequencing.For example, can serial number be provided for each affairs.Correlation rule is the pattern how the description project occurs in affairs.On the other hand, Cahn-Ingold-Prelog sequence rule refers to the particular order of order affairs item set.

Consider a project set I={I ₁, I ₂... I _m.Suppose that D is an affairs set, wherein each affairs T is a project set that belongs to I,

T &SubsetEqual; I

Therefore, if

A &SubsetEqual; T

, then affairs T comprises project set A in I.Correlation rule contains form

A &DoubleRightArrow; B,

Wherein

A &Subset; I

,

B &Subset; I

, and A is called as rule body and B is called as regular head.Also comprise B, correlation rule if comprise the c% affairs of A among the D

A &DoubleRightArrow; B

In the affairs set D of degree of confidence c is true.In other words, degree of confidence c is conditional probability p (B|A), and wherein p (S) is the probability of in D, finding as the S of the subclass of affairs T.When the affairs of s% among the D comprise A ∪ B, rule

A &DoubleRightArrow; B

In affairs set D, has support (support) s.In other words, support s is the probability that the union of the project in set A and the set B in affairs, occurs.

Usually, the purpose of data mining is correlation rule and a Cahn-Ingold-Prelog sequence rule of finding that accurately all satisfy user-defined criterion.The user can be directed against rule definition minimum support or degree of confidence, because few or loose relevant incident maybe be unimportant as far as some application.The user is also maybe be only interested in and only hope that search comprises the pattern of at least one project in these items of interest the specific project.

There is the multiple technology that is used for confirming correlation rule and Cahn-Ingold-Prelog sequence rule according to the input data.Usually, search for correlation rule and Cahn-Ingold-Prelog sequence rule, then with respect to input data assessment candidate pattern based on the generation of candidate pattern.Expand these suitable candidate pattern that comes to light through adding new projects then, thereby produce new more complicated candidate pattern to rule.

Since the input data volume maybe be very huge and pattern maybe be very complicated, therefore need be organized in the search in whole candidate pattern space effectively and from the candidate pattern assessment of data angle.According to the mode of prior art, can these technology be divided into two types of algorithms through the candidate pattern space.

For example, if do not reach the minimum support that has defined, will use some filter criteria immediately, because these filter criteria are inherited by subpattern.Other criterions such as min confidence only can be applied to complete rule, and this has stoped its early stage application.

The algorithm of first kind is BFS.In these algorithms, begin search in whole candidate pattern space from simple mode with two projects.At first produce and assess two all project modes with respect to the input data.Produce and assess three all project modes with respect to the input data then.Usually, according to each candidate pattern of input data transactions assessment.The candidate pattern of assessment is not stored in the storer usually.On the other hand, the input data are not stored in usually in the storer but read from data source.The instance of BFS can (be used to excavate the fast algorithm of correlation rule at " Fast Algorithms for MiningAssociation Rules "; Author Rakesh Agrawal and Ramakrishnan Srikant; The minutes of the 20th very-large database (VLDB) international conference, 1994) in find.

The algorithm of second kind is depth-first search.In these algorithms, the assessment of candidate pattern set is since the first seed candidate pattern, and before forwarding other candidate pattern to, assesses its all compatriot.An instance as the depth-first search algorithm; Consider that " Sequential pattern miningusing a bitmap Representation " (uses bitmap to represent to excavate ordered mode; Author JayAyres etc.; About the minutes of the ACM SIGKDD international conference of Knowledge Discovery and data mining,, 429-435 page or leaf in 2002) in the algorithm described.In this algorithm, the input data are converted into binary format and are stored in the storer.Activity data log history (it is used for safeguarding about which data recording (affairs) information relevant with AD HOC) also is stored in storer.

Under specific circumstances, these known data mining algorithms have shortcoming.Depend on the size in input data volume (the especially mean size of affairs) and candidate pattern space, BFS maybe be slower, because need scanning many raw data source and because need assess each candidate pattern according to all affairs.On the other hand, depth-first search may exhaust storer for a large amount of input data, or because assess in a large number according to the input data, when the input data were switched to dish, depth-first search maybe be slower.

Therefore, need a kind of effective ways that are used for the pattern of definite input data, said method is with the subproblem at least that overcomes in the relevant problem of above-mentioned and known data mining technology.

Summary of the invention

According to a first aspect of the invention, a kind of Computerized method that is used for detecting the pattern of the input data that comprise a plurality of affairs is provided, each affairs has at least one project, said method comprising the steps of:

Reception is used for the filter condition of pattern interested,

According to the filter condition that is received, confirm the first suitable filter condition set relevant with generating candidate pattern,

The candidate pattern of selecting to have assessed is also safeguarded the appreciation information about said father's candidate pattern as father's candidate pattern,

Through expanding said father's candidate pattern and consider that first filter condition set generates sub-candidate pattern,

With respect to the input data in a plurality of similar candidate pattern set and the said appreciation information of the relevant said father's candidate pattern of basis are assessed said sub-candidate pattern jointly; Similar candidate pattern and at least one set that each set has maximum predetermined quantities have at least two similar candidate pattern, and

Recursively use at least one sub-candidate pattern of successfully passing through appraisal procedure as father's candidate pattern.

Candidate pattern difference each other in each similar candidate pattern set is to be added to a respective item of public father's candidate pattern.Can generate sub-candidate pattern through following steps: first project set that new projects is added to father's candidate pattern; New projects are added to last project set of father's candidate pattern; And/or new projects' set that will comprise a project appends to father's candidate pattern.The predetermined quantity of candidate pattern can depend on the characteristic of the computing system of carrying out said Computerized method.

Can come counting statistics tolerance so that in generation and appraisal procedure, use according to said input data, said statistical measures comprises at least one in following: project be to statistical information and weight statistical information.When said first filter condition of application is gathered, the search volume of limiting said candidate pattern according to said statistical measures.Can confirm the order that to expand which sub-candidate pattern and expand sub-candidate pattern according to said statistical measures.

Said filter condition can comprise that at least one is based at least one condition in following: the weight of weight, the total weight about the input data, the average weight of supporting affairs, rule body, the weight of rule head, about total weight of the rule head of input data, about total weight of the rule body of input data, and addressable other total weights.

Can in the compressed data structure of the affairs set in the expression input data, said input data be provided.Said compressed data structure has: the information of the identifier quantity in the tabulation of the identifier of disparity items in the affairs set, the said tabulation of indication; And the bit field information of indicating the existence of disparity items described in the said affairs set, organize said bit field information so that gather to come evaluation profile according to said tabulation with respect to said affairs.Can use bitmap operation to assess said candidate pattern to said bit field information.

The appreciation information of the candidate pattern of can safeguard the affairs of expression in the said input data, the candidate pattern of having assessed, having assessed, the candidate pattern that will assess; And the data structure of resulting schema; And can be according to the available total memory and the use of said data structure, dynamically confirm during generating and assessing sub-candidate pattern and will which data structure is kept in the storer and will which data structure be placed into dish.Whether can be at least the indication of first data structure should be with said first data structure by prioritized when confirming to be kept at which data structure in the storer.In addition, can be at least second data structure indication, so that confirm and will which data structure be kept in the storer according to the said time of fetching recently from the fetching the time recently of dish.

The appreciation information of one of can following column format safeguarding said father's candidate pattern: be designated as first bit field of the input data event that the support of said father's candidate pattern contributes, the length of said first bit field equals to import the quantity of data event; Be designated as second bit field of the input data event that the support of said father's candidate pattern contributes, the length of said second bit field equals the quantity of the input data event that the support for another father's pattern of said father's candidate pattern contributes; And with two that contribute for the support of said father's candidate pattern continuous relevant information of input data event quantity between the input data events.At this, the input data event is one of following item: affairs and one group of affairs.Can select the form of the appreciation information of said father's candidate pattern according to the support of said father's candidate pattern.

Can in said appraisal procedure, be designated as the appreciation information of safeguarding said sub-candidate pattern set in the bit field of the input data event that the support of corresponding sub-candidate pattern contributes, wherein import data event and be one of following: affairs and one group of affairs.The length of said bit field equals to import the quantity of data event.The quantity of the bit field of each sub-candidate pattern set is the quantity of the input data event that contributes of the support for corresponding father's pattern.

Can confirm the second suitable filtrator set relevant according to the filter condition that is received, and consider the said second filter condition set relevant with the said sub-candidate pattern of assessment with the said sub-candidate pattern of assessment.In addition; Can be according to the filter condition that is received; Confirm the 3rd filter condition set suitable during confirming resulting schema; Consider the said three filter condition set relevant, and will be output as resulting schema through the candidate pattern that the existing effect of said the 3rd filter condition set is assessed with the said sub-candidate pattern of assessment.

According to a second aspect of the invention; A kind of computer program that comprises computer-readable program instructions that is stored on the computer usable medium is provided; When said computer program is loaded into computing machine, said instruction will cause said computing machine to carry out aforesaid method.

According to a third aspect of the invention we, a kind of computer system that is used for detecting the pattern of the input data that comprise a plurality of affairs is provided, each affairs has at least one project, and said computer system is configured to:

Reception is used for the filter condition of pattern interested,

Recursively use at least one sub-candidate pattern of successfully passing through assessment as father's candidate pattern.

It being understood that and for example to carry out the assessment of the candidate pattern in the similar candidate pattern set with respect to original input data, pre-service input data (wherein for example original item identifier has been replaced by unique integer) or compression input data.Except assessing similar candidate pattern set, below specifying provides relevant compression input data and relevant some details that is fit to the dynamic memory management of data mining.

Description of drawings

In the accompanying drawings, the mode unrestricted through instance shows the present invention, and these accompanying drawings are:

Fig. 1 schematically shows the computing system that can be used for data mining according to of the present invention;

Fig. 2 show as an example wherein with respect to the process flow diagram of the method for the input data assessment candidate pattern in the set of similar candidate pattern;

Fig. 3 a, 3b and 3c show some instance that generates similar candidate pattern according to public father's pattern;

Fig. 4 show as an example be the process flow diagram of the method for subpattern with father's mode expansion;

Fig. 5 shows the more detailed process flow diagram of the method for the candidate pattern in the similar candidate pattern set of assessment as an example;

Fig. 6 a, 6b and 6c show the process of the appreciation information of storage father candidate pattern as an example;

Fig. 7 a, 7b, 7c and 7d show as an example with binary format compress effectively the input data data structure;

The data compression with comprising in the affairs set that Fig. 8 shows as an example is the process flow diagram of the method for specific data structure;

Fig. 9 a and 9b show the process flow diagram of the further method of the data that comprise in the compression affairs set as an example;

Figure 10 a, 10b and 10c show process flow diagram and the details of said method with respect to the method for the input data detection correlation rule after the compression as an example;

Figure 11 shows the process flow diagram with respect to the method for the correlation rule in the input data detection similar rule set after the compression as an example;

Figure 12 a, 12b and 12c show as an example with respect to the more detail flowchart of the method for the correlation rule in the similar correlation rule of the input data detection set after the compression and the details of said method;

Figure 13 a, 13b and 13c show process flow diagram and the details of said method with respect to the method for the input data detection Cahn-Ingold-Prelog sequence rule after the compression as an example; And

Figure 14 schematically shows the dynamic memory management that suitable data mining is as an example used.

Embodiment

Hereinafter, discussed through confirming that the pattern in the input data carries out data mining.In the explanation below, term " pattern " is meant order, correlation rule and/or Cahn-Ingold-Prelog sequence rule.

Embodiments of the invention adopt at least a function in the following function: assess the candidate pattern in the similar candidate pattern set; Store the appreciation information of father's candidate pattern efficiently; To import data efficient ground boil down to binary format; With respect to the input data assessment candidate pattern after the compression; Handle weight information; And in data mining is used diode-capacitor storage dynamically.

Fig. 1 shows the computer system 10 that can be used for mining data storehouse or other input information sources.Specifically, computer system 10 can be used for carrying out data mining and/or handling the input data according to embodiments of the invention.In the specific calculation machine architecture shown in Fig. 1, system 10 can comprise one or more digital information processing devices, for example client computer 12 and server computer 14.Said server computer can be the principal computer that is positioned at IBM Corporation's manufacturing of New York A Mengke, and uses the multiple task operating system of selling with trade mark z/OS.Server computer 14 can alternatively be UNIX computing machine, Windows NT Server or the IBM RS/6000 workstation that uses AIX operating system.Server computer 14 can comprise Database Systems, and for example DB2 or ORACLE perhaps can have the data in the file on other data storage mediums.Obviously, can use other architectures shown in above-mentioned or Fig. 1.For example, the function of client computer 12 can be incorporated in the server computer 14 and vice versa.Server computer 14 itself can be that server computer is trooped.

As shown in the figure, the operating system of server computer 14 can be excavated program/function 16 by hosted data, and the latter can be used as series of computation machine executable instruction and carried out by the processor in the server computer 14.These computer executable instructions can reside in the storer, for example, reside in the hard disk of RAM (RAS) or server computer 14 of server computer 14.Alternatively, said instruction can be included in the data storage device that has computer-readable medium (like computer disk, CD) or be called in the equipment of memory stick.Said instruction can computer program form provide, said computer program is generally comprised within on the computer-readable medium.

Fig. 2-14 relates to the various functions that can in data mining, use.It will be apparent to one skilled in the art that and in data mining program 16 or input data compression program 17 that maybe be independent, to realize method and data structure with reference to these pattern descriptions.Alternatively, some function of the present invention can realize in hardware.Can also use the appropriate combination of software and hardware to provide and be suitable for carrying out the equipment of method according to an embodiment of the invention.Input data compression program 17 can reside in the alone server computing machine that separates with data mining program 16.

Return Fig. 1, data access program and utility 18 can make the input data 21 after the one or more database of data mining program 16 visits 20, the compression usually and/or comprise the flat file (being text) 22 of the data of relevant one or more affairs.Alternatively, data access program and utility 18 can never reside in the remote database server retrieve data on the server computer 14, perhaps excavate program 16 and can embed in the database 20.Input data 21 after the compression can reside in the storer and/or dish of server computer 14.Data mining program 16 is exported the pattern that finds usually, or exports those user's interest patterns at least.The pattern that finds can be stored in can be by in the database 20 of client computer 12 visit or the independent thesaurus as a result 24.

As shown in Figure 1, client computer 12 can comprise data mining interface 26, and this interface and data mining program are similar, can realize with suitable computer program code.Wherein, said interface is as the input mechanism of setting up pattern criterion (for details, seeing also the hereinafter discussion of relevant these criterions).In addition, client computer 12 preferably includes output module 28, and it is used for the result of thesaurus 24 storages is as a result exported/be shown to graphic alphanumeric display 30, printing equipment 32 or data storage medium 34.

Hereinafter, with the functional character of introducing data mining program 16 in detail.At first introduce conventional sign and notion in the data mining.

Describe to association area as top,, have the set D of a project set I and an affairs T for the input data.There are two nested grouping ranks usually in these projects.The all items that occurs simultaneously will form a single affairs T (perhaps, in other words, forming a project set).Usually, each affairs T has unique transaction identifiers TA_ID.The project that therefore the TA_ID identifier will belong to single affairs T binds together.Several affairs or project set can form a transaction set by identifier TA_Group_ID sign.Therefore TA_Group_ID binds together some affairs TA.If do not have interim or other affairs ordering, then lose the second nested of transaction set usually.

Therefore, the input data recording that is used for data mining comprises TA_ID, one or more ITEM value and an optional TA_Group_ID.As some instance, TA_ID can represent that the client buys the time that the date and time of time buying, certain production step or failure in the analysis, particular webpage are clicked, or apart from the distance of starting point.As some instance of TA_Group_ID, this identifier can be represented the product identifiers in voip identifiers, production and/or the quality control in the customer analysis, or Web uses the user identifier in following the tracks of.For example, the ITEM field can be represented identifier, parts or the production stage identifier or the web page address of the commodity of purchasing respectively.

Except affairs input data, further defined item purpose classification (taxonomy).Classification refers to other ranks of hierarchical structure.Classified information statement specific project (subclassification) belongs to particular items group (father's classification).

The particular community of pattern is that institute is interested, so that measure their statistics or business importance.The pattern of a particular category especially relevant with purpose with practical problems is a mode of rule.Rule is the predicate on type " left side " → " right side ".The left side is called preceding item parts (antecedent part) or rule body, and the right side is called back item parts (consequent part) or regular head.The semanteme of " → " depends on the type of use-case and data available.When searching related between some projects, the fact or the incident, the meaning of " → " is " related with it ".When the looked-up sequence rule, the meaning of " → " is " taking place after this ".The support s and the degree of confidence c of correlation rule in the part of above association area, have been defined.Hereinafter, use b to come the marking convention body, use h to come the marking convention head.Correlation rule ρ: b (ρ)=> expenditure (lift) l that does of h (ρ) is calculated as the ratio of regular degree of confidence and rule expection degree of confidence e with its statistical significance, can be e (ρ)=s (h (ρ)) with said expection confidence calculations according to project independent hypothesis that takes place on statistics.Therefore the expenditure l that does of correlation rule is l (ρ)=c (ρ)/s (h (ρ)).

Can also define support, degree of confidence and make expenditure for Cahn-Ingold-Prelog sequence rule.The transaction set number of σ and the ratio between the transaction set sum are supported in the support value indication of Cahn-Ingold-Prelog sequence rule σ.The confidence value of Cahn-Ingold-Prelog sequence rule σ is defined as the ratio between the transaction set number of supporting σ fully and the transaction set number of the supporting the σ body.At last, Cahn-Ingold-Prelog sequence rule σ's makes the ratio between actual support that expenditure l is σ and its expection support e (σ).Therefore Cahn-Ingold-Prelog sequence rule is l (σ)=c (σ)/s (h (σ)) as expenditure.

From actual purpose, be necessary usually set of modes is restricted to manageable size.For this reason, filter the complete models set according to specified criteria traditionally.These criterions are based on number attribute and mode contents.The user possibly hope through the definition count constraints resulting schema number to be restricted to maximum number.As selection criterion, can use any number attribute, for example degree of confidence or support.Range constraint only allows to be positioned at the pattern of the set-point scope of the number attribute such as support or degree of confidence.The usable range constraint, the resulting schema number still depends on data.At last, the user maybe be only interested in the pattern that in pattern body or pattern head, has (or not having) specific project.This type of constraint is called the project constraint.If the support of project/project set/pattern surpasses the minimum support criterion of user's appointment, then said project, project set or pattern are called " frequent ".

As an instance, the problem from the possible cause of the wrong output of computer chip production line is confirmed in consideration.In manufacturing process, measure quantity of parameters.These parameters can comprise temperature, every cubic metre dust granule number and the exabyte that semiconductor material is provided.In the test to the chip produced subsequently, with whether there being any logic error in the measured chip.In numerous resulting schemas, possibly have following rule

(1) if 50 ° < < 60 ° and material are provided by A company T, logic error then occurs.

(2) if logic error and some dust granules, then material is provided by B company.

Suppose that rule (1) support=0.02% and degree of confidence are 28%, the support of rule (2) be 0.9% and degree of confidence be 85%.Can be that 0.1% range constraint comes exclusionary rules (1) by specifying minimum support now.Can exclusionary rules (2), because the user is provided with the project constraint of the rule that only allows to have " logic error " in the rule head.

Can be input data definition weight information.For example, can exist and the related independent weight of each project (or with some project, for example, default-weight and all the other item associations).Weight can be represented cost, price, physical weight, risk or any other continuous number attribute of project.

For the project (that is to say) of polymerization for project set, rule, order or Cahn-Ingold-Prelog sequence rule, the associated weight information related with said polymerization can be (based on) weight of project or (based on) weight of affairs.

Support polymerization affairs or transaction set (based on) weight of project is the weight sum of those projects that in polymerization, occur in affairs or the transaction set in essence.Those projects that in polymerization, do not occur will can not contribute to the project weight of affairs (group) in the affairs (group).Use the project weight of affairs (transaction set), can calculate the average weight that all support affairs.(based on) weight of affairs is to support the weight sum of all items in the affairs (group) of polymerization.In other words, all single weights of affairs (group) item all will contribute to the affairs weight.

Use weight information, can define various filter criteria.For single project or project polymerization; At least the filter criteria below is correlated with: weight, support multiply by weight (all importing total weight of data), support the average weight of the affairs of project polymerization, and the average weight of the transaction set of support project or polymerization.For mode of rule with rule body (left side) and rule head (right side); Following extraly filter criteria can be correlated with: the support that the weight of the weight of rule body, rule head, the support of rule head multiply by weight, rule body multiply by weight, and addressable extra total weight.The support of rule head multiply by total weight that weight is a rule head in the input data.This total weight can for example represent that the overall traffic of rule is worth.Same, the support of rule body multiply by total weight that weight is a rule body in the input data.This total weight can for example represent that the overall traffic of the preceding item parts of rule is worth.To this understanding can for example help to determine whether should be from sell the project in the item parts after the revocation rule: continuing to sell certain, itself not have the commodity of profit can be favourable, and it can attract to buy a large amount of other large profits commodity " high-quality " client of (promptly referring to the commodity in the item parts before the rule).Addressable extra weight is calculated in the following manner: (1-confidence) ^*AbsoluteBodySupport ^*Confidence ^*WeightOfHead.Wherein confidence refers to the degree of confidence of " rule body → rule head " rule, and absoluteBodySupport refers to import total support of rule body in the data.For example, can in target marketing, use this filter criteria: it provides the estimation to extra returns, can pass through to rule body (but not rule head) and post for genuine all clients' target for it and obtain this extra returns.The client's number that meets associated condition is (1-confidence) ^*AbsoluteBodySupport.Can expect that the sub-fraction of " degree of confidence " of maximum institutes contact customer will make positive response, because this is the general degree of confidence of " rule body → rule head ".Therefore, addressable extra weight is for example represented the upper limit of the obtainable additional income of target marketing activity.

Can be for expansion below the PMML3.1 standard definition of mining model so that combine weight information.Can be at least PMML element < Item >, < Itemset >, < AssociationRule >, < Sequence>and < SequenceRule>and define new mark < weight >.Mark < weight>can have following attribute at least: " mean ", " standardDeviation ", " min ", " max " and " type ".The meaning of " type " attribute will be described below." Subset " expression weight statistical information refers to comprise the weight of the polymerization (< Item >, < Itemset >, < AssociationRule >, < Sequence >, < SequenceRule >) of < weight>mark itself." Total " expression weight statistical information refers to support to comprise affairs (correlation model) or average total weight of transaction set (order model) of the object of < weight>mark.

Hereinafter, except when whether checking mode is suitable as resulting schema, with the order of using project set as pattern.Under the situation of association analysis, the length of these orders can only be length 2.Fig. 2 shows the process flow diagram of assessing the method for candidate pattern with respect to the input data in the set, and each set comprises similar candidate data pattern.Therefore, assess the candidate pattern set jointly with respect to the input data.In addition, will consider the filter condition of pattern at the commitment that generates and assess candidate pattern.

Preferably, the difference each other of the candidate pattern in the particular candidate set of modes only is a project, promptly is added to the project of their public father's pattern.Each candidate pattern set all comprises the pattern of maximum predetermined numbers.This predetermined number N depends on the computing hardware of carrying out the data mining program therein.The representative instance of this predetermined number comprises 32,64 and 128.Whether these numerals allow between quick check candidate pattern and the relevant input transaction (with the binary format storage) consistent.Optimal value depends on hardware architecture.Preferably, digital N is chosen as the maximum number of digits that can in a clock period, be handled by an integer operation on the computer system of carrying out the excavation program on it.This enables in single signless integer variable (it is re-interpreted as the bit field that length is N), to revise efficiently and store all by the information of the candidate pattern of assessment simultaneously (that is similar candidate pattern set).For example, for given affairs (TA) or transaction set (TAG) in the input data, single this type of bit field can be indicated for which candidate pattern in N the candidate pattern, and given TA or TAG will make relation to the support of this candidate pattern.

Through beginning to generate candidate pattern from public father's candidate pattern, said father's candidate pattern is expanded through a project in a different manner.The form of initial father's pattern is " { item1}, { item2} ".Can or add a new set jointly and expand these initial father's patterns through interpolation project in set, thereby will generate the candidate pattern of following form:

“{item1，newItem}，{item2}”

“{item1}，{item2，newItem}”

“{item1}，{item2}，{newItem}”

Fig. 3 a, 3b and 3c relate to the sub-candidate pattern of generation.Generally speaking, adopt following mode to generate sub-candidate pattern usually according to father's candidate pattern (it can be the seed candidate pattern).Can be to creating three kinds of basic spread step of sub-candidate pattern definition.

First possible spread step is a new project to be added to last project set of father's pattern.The project that is added is Already in last project set of father's pattern.In correlation rule, the project that is added can not appear at any position in father's rule.Fig. 3 a shows the instance of the subpattern 301 of uncle's mode 3 00a expansion.

Second possible spread step is that the new projects' set that comprises a project is appended to father's pattern.Can only use this step to Cahn-Ingold-Prelog sequence rule; The project of being added is any project.Fig. 3 b shows the instance of the subpattern 311 of uncle's mode 3 00a expansion.

The 3rd possible spread step is a new project to be added to first project set of father's pattern.The project that is added is Already in first project set of father's pattern.In correlation rule, the project that is added can not appear at any position in father's rule.Only when father's candidate pattern has only two project sets and second project set to have only a project, just allow to use the 3rd spread step.If there is not this constraint, the build path of given pattern will not unique, and can make up identical pattern and verify at the diverse location place that candidate pattern expansion is set.Fig. 3 c shows the instance of the subpattern 321 of uncle's mode 3 00b expansion.

As further instance, consider father's pattern { A}, { B}.As the instance of spread step 1, can make up following subpattern: { A}, { B, C}.As the instance of spread step 2, can make up following subpattern: { A}, { B}, { C}.As the instance of spread step 3, can make up following subpattern: { A, C}, { B}.

To combine Fig. 4 to specify the expansion of father's pattern below.

In some embodiment at least of the present invention, in a step, gather with respect to the similar candidate pattern of input data assessment.Because belonging to the similar candidate pattern difference each other of a set is a project, therefore most of evaluation work only need be carried out once to set, rather than all carries out once to each single candidate pattern.This is a tangible advantage than depth-first search, in depth-first search, must carry out whole evaluation work again to each single candidate pattern.

Return Fig. 2, in step 201, the filter condition that supplies a pattern for data mining program 16.Usually, the user specifies these conditions through data mining interface 26.In step 202; Filter condition analysis is confirmed the set of first filter condition, the set of second filter condition and the set of the 3rd filter condition; To when generating candidate pattern, consider said first filter condition set; Consider said second filter condition set when confirming whether they are suitable as following father's candidate pattern in the assessment candidate pattern, consider said the 3rd filter condition set when confirming whether they are effective resulting schema in the assessment candidate pattern.Can not limit the quantity of filter condition by any way.

The possible mode of extension set of said first filtrator set restriction.As an instance, consider from resulting schema, to get rid of the filtrator of specific project E.Obviously, the pattern that comprises E never through further expansion and effectively.The potential candidate pattern set of said second filtrator set restriction.As an instance, consider the item number in the pattern is restricted to the filtrator of m.The pattern itself that comprises m project possibly be effectively, but further just can not remain valid after the expansion again.The set of said the 3rd filtrator set restriction effective model.As an instance, consider to specify the filtrator of lowest confidence.Possibly not have enough degree of confidence from the rule of mode producing itself, but can increase degree of confidence, make its subpattern can generate potential resulting schema through adding more project.

Can the project weight criterion of introducing hereinbefore and affairs/transaction set weight criterion be applied to three kinds of arbitrary collections in the filtrator set.When project weight criterion and affairs (group) weight criterion are applied to the set of first filtrator, will make maximizing performance.For example, if having minimum project weight filtrator or minimum affairs weight filtrator, then can, candidate rule from the input data, remove single project or single affairs before generating and check beginning.Can in the second and the 3rd filtrator set, use above-mentioned other all criterions, that is to say, according to input data variation after the candidate rule.Following code extracts to show how in the set of the 3rd filtrator, to use the weight filter criteria.

bool?passesWeightFilter＝true；

double?bodyWeight＝bodyWeightStats.getMean()；

double?headWeight＝headWeightStats.getMean()；

double?ruleWeight＝headWeight+bodyWeight；

double?TAWeight＝TAWeightStats.getMean()；

f(

filter.isInAllowedRange(RuleFilter::BODYWEIGHT，

bodyWeight)

filter.isInAllowedRange(RuleFilter::HEADWEIGHT，

headWeight)

Ifilter.isInAll?owedRange(RuleFilter::RULEWEIGHT，

ruleWeight)

filter.isInAllowedRange(RuleFilter::SUPPTIMESWEIGHT，

support＊ruleWeight)

filter.isInAllowedRange(RuleFilter::TOTALWEIGHT，

TAWeight)

filter.isInAllowedRange(RuleFilter::ACCESSIBLEVOLUME，

headWeight＊(1.confidence)＊absSupport))

passesWeightFilter＝fal?se；

}

Return Fig. 2, confirm and assessment initial candidate set of modes (step 203).In step 204, the appreciation information that the candidate pattern that will be assessed from this candidate pattern set is chosen as father's candidate pattern and safeguards this father's candidate pattern.In step 205, through expanding father's candidate pattern and considering that first filter condition generates sub-candidate pattern.Usually use at least a in above-mentioned three kinds of spread step to realize this expansion.In step 206, with respect to importing the sub-candidate pattern in the similar candidate pattern set of data assessment and considering the set of second filter condition.This assessment is based on the appreciation information of relevant father's candidate pattern.Because the form of subpattern,, only need those input transactions with father's pattern match that come to light of visit as their former generation's superset.All patterns that are assessed as potential father's pattern all are added to candidate pattern set (or rather, it can be storehouse rather than set).In step 207, with safeguarding that the relevant appreciation information that these are affirmed the sub-candidate pattern of assessing is for using future.Those are used the pattern that the 3rd filter condition set is evaluated as effective resulting schema will be stored as resulting schema.

In step 208, whether inspection is existed other candidate pattern that can be used as father's candidate pattern.Method 200 continues repeating step 203-208, up to assessing all candidate pattern.When not having other candidate pattern to can be used as potential father's pattern, processing finishes (209).

To understand, and can use various recording methods to follow the tracks of which candidate pattern by assessment and can make in all sorts of ways and safeguard the information relevant with the appreciation information of the candidate pattern of having been assessed.

It is the process of subpattern that Fig. 4 shows father's mode expansion, and Fig. 4 has done more detailed explanation to the step 205 among Fig. 2.Process flow diagram among Fig. 4 is consistent with three kinds of spread step describing with reference to figure 3a, 3b and 3c.In step 401, check at first whether current father's pattern satisfies following requirements.At first, last project set of father's pattern should only comprise a project, and the second, the project set number in the pattern is two, the 3rd, first project set of pattern should be extendible.In other words, the form of pattern is { I ₁, I ₂... I _n} → { I}.If satisfy these conditions,, come at most the individual different but similar pattern of generation N through add a project (this is consistent with top the 3rd spread step of describing with reference to figure 3c) to first set of father's pattern then in step 402.As stated, value N is the number of being scheduled to, and depends on computer hardware usually.The realistic model number that produces from step 402 can specifically depend on the item number that can add first project set to less than N.After this, return maximum N sub-modes of generation so that assess in step 408 according to the input data.

If do not satisfy the condition in the step 401, then whether can expand at last project set of step 403 inspection father pattern.If then add a project through last project set and come to generate at most N difference but similarly subpattern to pattern in step 404.This step 403 is consistent with top first spread step of describing with reference to figure 3a.After this, maximum N similar subpatterns of returning generation in step 408 are so that assess.If do not satisfy the condition in the step 403, then further check in step 405.In step 405, whether inspection can expand father's pattern through adding new projects (shown in Fig. 3 b, can add regular head to, also can add rule body to) to pattern.If then create maximum N similarly subpatterns, and return these subpatterns so that assess in step 408 in step 406.If do not satisfy any one condition in the step 401,403 or 405, then in step 407, do not return any candidate pattern of returning.

If the arbitrary step in the step 402/404/406 will produce possible similar subpattern more than the N, then correspondingly write down and after circulation in assess remaining subpattern.Usually, will a step, not assess, because assessment result difference possibly can't effectively be handled too greatly from the subpattern that different steps 402/404/406 produces.But in fact, the subpattern of step 402/404/406 generation is far more than N.Some similar candidate pattern set of returning in step 408 possibly only comprise the candidate pattern that will assess.

Fig. 5 shows the process flow diagram of method 500, wherein imports data and is converted into binary format to save storer and to allow candidate pattern set to carry out more effective assessment.In addition, with the initial statistical information of calculating the input data more effectively to assess candidate pattern.When

step

504 or 508 assessment candidate pattern, can use these initial statistical informations to remove the candidate pattern that some becomes the superset of these initial configurations, for example project is right.In addition, in the data structure that is particularly suitable for respect to scale-of-two input data assessment candidate pattern, preserve the evaluation history of candidate pattern.To understand, although this three specific character is present in this combination shown in Fig. 5, wherein any one can both independently use or be present in other property combination, so that the method that fine setting is described with reference to figure 2.In other words, can be through creating the input data of binary mode; Calculate the initial statistical information of input data; And/or use specific binary data structure to store the evaluation history of candidate pattern, strengthen the method shown in Fig. 2.In addition, the combination of various other characteristics of introducing below can existing, when those skilled in the art read this explanation, this was conspicuous.

In step 501, calculate the initial statistical information of input data.This step comprises all input data (for example, being arranged in the data of database table or flat file) of complete transmission.Frequent item will be collected and frequent item is right.Frequent item set F is the subclass of the project I that frequently occurs as the subclass of input transaction T.These projects and project are stored in the data structure statistical information, and said data structure is preferably the project hash table, wherein 64 hash table aspect order titles.If to project definition one or more classification, then with project (to) statistical information expands to all fathers' classification of finding project in input in the data.Under the hash collision situation that can not occur, execution in step 501 and use another hash seed to create hash table again again.But, if use 64 hashed values, hash collision then unlikely appears at all.In this case, on average 10 ¹⁹Hash collision only appears in the individual project name one time.Can be not right through project and the project that abandons all frequent (they being the minimum support value of frequency) less than user's appointment, and it is right perhaps to abandon by another regular filters criterion that has defined inactive project and project, comes end step 501.To specify the integer item id between 0 to #frequentItems-1 for remaining each project.Or rather, also be necessary to keep the not frequent item that those have frequent father's classification.These not frequent projects also will obtain ID (#frequentItems ..., #frequentItems+#infrequentItems-1).

If for the input data definition weight information, then can also be in the step 501 weight statistical information of identifying project.Weight information can for example be stored in the above-mentioned project hash table (or other data structures).Project weight statistical information comprises the average weight of each project and each father's classification usually at least.In addition, project weight statistical information can comprise the further information of the statistical distribution (for example, fractile, the degree of bias and/or kurtosis) of minimum and/or maximal value, standard deviation and/or related item and father's classification.All weight statistical informations be can abandon and those projects and father's classification violated based on the filter criteria of weight.

In step 502, create the input data of binary mode.Hereinafter with reference Fig. 7 a has described possible details to 9b.This step comprises for the second time complete transmission input data and is binary representation with data-switching.In binary representation, all input data recording that belong to a TA_group_ID (if perhaps there is not TA_group_ID, then being TA_ID) all are placed in the binary data object (TAGroup by name).In this TAGroup binary data object, by 16 or 32 alternative text item values of integer ID.With abandoning not frequent project fully.In some DataPage objects, organize these specific to affairs or specific to the binary data object of affairs order.The size of each DataPage object is about the 2%-5% of the overall available RAM storer that is used for the data mining program.The DataPage object of expression input data resides in the RAM usually, if but there are not enough storeies to come to store simultaneously all objects, then each DataPage object can be with self being dumped to dish and when needing once more, fetching self pellucidly again.Introduce this dynamic memory management in more detail below with reference to Figure 14.

After the execution in step 502, the project hash table that no longer need in step 501, form.16 (or 32) ID through each project describe each project now.Therefore, can abandon project and project to hash table, and with remaining frequent item ID and frequent to information and their original (text) name storage in simpler array (rather than storing all frequent item ID).At this moment, the input data are represented by the TAGROUP binary object and with the array index (item id) and the array of text names associate.No longer need visit the raw data source.

In step 503, according to preceding text refer step 203 said initially (seed) candidate pattern of calculating.In step 504, come with respect to these initial candidate patterns of input data assessment through the binary data object that uses expression input data.Hereinafter with reference step 508 is introduced the details of assessment candidate pattern.

In step 505, whether inspection exists the new candidate pattern expanded.When the filter condition that has the more projects that can add and do not forbid expanding, candidate pattern is extendible.This type of filter condition can be the maximum item number in the pattern.Find the new candidate pattern expanded (from the assessment in 504) if test 505,, shift these candidate pattern and appreciation information thereof onto storehouse then in step 506.

In step 507, use above-mentioned three kinds of spread step doubly with the candidate pattern expansion N that selects.Guarantee not violate any filter condition here.This type of filter condition can for example be that pattern should not comprise project A and item B simultaneously.

In step 508, use of the set of the binary data object of expression input data with respect to N similar candidate pattern of input data assessment.During assessing, will consider the set of the 3rd filter condition, this set restriction resulting schema set.For example, whether the degree of confidence of the inspection strictly all rules that can generate from pattern satisfies filtrator.If find resulting schema, then can they be stored as the part of net result now.

Repeated execution of steps 505-508 is up to there not being the more how new candidate pattern expanded.Method 500 continues step 509 then, and wherein whether inspection arrives the top layer of candidate pattern.If, then candidate pattern is not deleted from storehouse in step 510.In step 509, doubly with this candidate pattern expansion N.

Be appreciated that Fig. 5 shows the combination that the candidate pattern in the similar candidate pattern set of N is assessed and binary format is used to import data.But, will recognize that assessment has improved efficient in N similar candidate pattern set, this form with the input data is irrelevant.As will be hereinafter with reference to Figure 11 explanation, the combination of candidate pattern was very effective during the scale-of-two input data layout (for example, Fig. 7 a and 7b) of compression was gathered with assessment.

Hereinafter, the appreciation information with father's pattern is called father's pattern historical information.For each affairs (TA) or the transaction set (TAG) of input in the data, father's pattern historical information preferably includes a Boolean, and whether indicate given TA or TAG is that the support of father's pattern contributes.Those are called ' activity ' TA or TAG for TA or the TAG that the support of pattern contributes at this.Hereinafter, suppose to have N in the input data _TOTIndividual TA (in association rule mining) or TAG (in Cahn-Ingold-Prelog sequence rule excavates), the N in them _ActIndividual is movable for given father's pattern.

Then, the basic storage scheme of the appreciation information of father's pattern (that is father's pattern historical information) can be described below.It is N that father's pattern historical information is stored in length _TOT1 bit field in, each bit representation N _TOTAmong individual TA or the TAG one, the TA or the TAG of position=1 expression activity, position=0 inactive TA of expression or TAG.It is 16 (N that Fig. 6 a schematically shows length _TOT=16) bit field 600.Father's pattern historical information of storage comprises following information in the bit field 600: since 0 is that TA or TAG number, and TA/TAG1,3,6,7,8,11,12 and 14 supports for given father's rule contribute.

Because TA or TAG only just contribute for the support of sub-candidate pattern when this TA/TAG contributes for the support of father's pattern, therefore only need to gather to the similar candidate pattern of the movable TA/TAG assessment uncle mode expansion of father's pattern.Fig. 5 a schematically shows the appreciation information of gathering (said set has N candidate pattern) like candidate pattern with respect to the related genera of TA/TAG and how to be stored in the single signless integer variable, and said variable is re-interpreted as the bit field that length is N.Whether signless integer variable 611 to be numbered 1 TA/TAG be the information that support that N candidate pattern gathered contributes if comprising indication.Likewise, whether be information that the support of N similar candidate pattern set contribute to

signless integer variable

612 and 613 if comprising relevant 3 and 6 the TA/TAG of being numbered.All the other comprise the signless integer variable of appreciation information of the similar candidate pattern set of N not shown in Fig. 6 a.

About storage activities father pattern historical information (appreciation information of father's pattern), hereinafter will be introduced some modification.If father's pattern support is enough low, then first of the scheme shown in Fig. 6 a improvement is to select " sparse " storage format.If N _Act/ N _TOT>=1/8, if promptly the relative support of father's pattern is at least 12.5%, then the bit-by-bit storage scheme shown in Fig. 6 a is generally the most effectively scheme.Memory consumption is N _TOT/ 8 bytes are with N _ActIrrelevant.If support is lower than 12.5%, the step-length of then storing between the adjacent activities TA/TAG is more effective usually.Schematically show this situation among Fig. 6 b; Wherein with 1000 TA or TAG (be labeled as 0 ..., 999) as an example; Wherein TA or TAG3,250,493 and 617 are movable; Can father's pattern historical information be stored in 41 byte signless integers the numbering of the inertia TA/TAG between the said integer indicative of active TA/TAG: i.e. 3,246 (intervals 621 among Fig. 6 b), 142 (at interval 622) and 123 (intervals 623).This is because have 3 inactive TA or TAG before first movable TA/TAG, between first and second movable TA or TAG, has 246 inactive TA or TAG, and the rest may be inferred in the instance in Fig. 6 b.If run into step-length >=255, then must this step be divided into two or more substeps; For example, step-length 510 can be represented as 255,255,0.This 255 and 0 is that its value will be added to substep together for first 255 indication here.More generally, as step-length being stored as 1 byte signless integer, required amount of memory exceedance N never can be shown then _Act+ ((N _TOT-N _Act+ 256) div256)+(N _TOTDiv65536).Therefore, if father's support is 12.5% relatively, then the memory consumption of this storage format is about N _TOT/ 8, and, will be reduced to about N for low-down support _TOT/ 255.

As second improvement, can from storage formats different more than 2, select, specifically depend on the N of the current father's pattern that is processed _Act/ N _TOTThis second improvement is the first improved conclusion of describing with reference to figure 6b.For example, the user can be directed against N _Act/ N _TOT>=12.5% select basic storage scheme, to 12.5%>M _a>N _TOT>=0.4% selects to improve the sparse form described in 1, and to N _Act/ N _TOT<0.4% selects the second sparse form.The said second sparse form can for example be similar to the sparse form in first improvement, but with 2 byte signless integers storage step-length.For low-down support, this format only needs about 2*N _TOT/ 65535 bytes of memory devices.It is historical that common a kind of form is used for a single-unit activity father pattern.If the character of input data is all different to ending from the beginning of input table, then can use different coding to different portions.

As the 3rd improvement,, can only consider that then those are movable TA or TAG for father's pattern of father's pattern if the father's pattern that is considered itself has father's pattern.Fig. 6 c schematically shows the instance of relevant this situation.In the instance of Fig. 6 c, the activity history information of father's pattern of father's pattern is shown to have bit field 600.The activity history information 600 of father's pattern of father's pattern of therefore, in Fig. 6 c, considering has N _ActThe TA of individual activity or TAG.Be used for the historical non-improvement option of storage activities father pattern and be shown to have bit field 630, its median is N _TOT, similar with figure place in movable father's pattern historical 600 of father's pattern of father's pattern.When the activity history information of the record father pattern related, can in above-mentioned basic format or sparse form, use N with bit field 630 _ActSubstitute N _TOTFig. 6 c shows bit field 631 as an example, and it is consistent with the same basic format of using to bit field 600.Shown in Fig. 6 c, only consider TA or TAG that those father's patterns for father's pattern (that is, in bit field 600) are movable in the activity history of father's pattern (that is, in bit field 631).

Next introduce and to import the high efficiency method that data compression is a binary format.The method of these compression input data can be used with above-mentioned mode evaluation method, and in said mode evaluation method, similar candidate pattern set is assessed together.In addition, they can be applied to any other data digging method of confirming the pattern in the input data.They can for example be applied to the data mining based on depth-first search or BFS.

Fig. 7 a shows some to 7d and compresses input data computing machine data structure in order to carry out data mining.Fig. 7 a shows computer data structure 700, and this structure is particularly suitable for importing data, and wherein affairs do not comprise timing/sequencing information, and perhaps wherein said timing/sequencing information has no influence and can be dropped.Fig. 7 b shows computer data structure 710, and this structure is particularly suitable for importing data, and wherein affairs have timing/sorting data.In other words, these affairs can be grouped into a plurality of transaction set TAG.Fig. 7 c and 7d show similar data structure 720,730 as Fig. 7 a and 7b, but data structure 720,730 also comprises weight information.

Data structure 700 all comprises the information that affairs are gathered with data structure 710.Number of transactions in the data structure 700 is generally predetermined number N, and this number depends on hardware.N is the length of processor manageable bit string in a clock period.Number of transactions in the data structure 710 is defined by the number of transactions among the transaction set TAG of data structure 710 expressions.

Disparity items number in the affairs uses n (label 701 among Fig. 7 a and the 7b) to represent.The identifier 702a, the 702b that also have disparity items in the data structure 700 and 710.Identifier 702 is generally round values, and said round values is unique in the input data; Minimum requirements is that identifier 702 is unique in data structure 700/710, but extra record is carried out in this requirement.Identifier 702 may be selected as far as possible little integer, be no more than usually 16 long.

The existence of the disparity items in each affairs is by the bit field information representation in the data structure 700/710.For each identifier 702, can there be independent bit field, shown in Fig. 7 a and 7b.As further selection, can have a bigger bit field and represent and the independent identical information of bit field.As long as have indication by the bit field information that the disparity items in represented each affairs of data structure 700/710 exists, then can carry out any other modification.About data structure 700, wherein number of transactions normally is scheduled to, and the length of

bit field

703a, 703b also is (in other words, being generally N) of being scheduled to separately.In data structure 710, wherein number of transactions depends on the number of transactions among the represented transaction set TAG of data structure, and the length of

bit field

704a, 704b can arrive another data structure 710 and difference with data structure 710.Shown in Fig. 7 b, if having an independent bit field 704 for each identifier 702, then the length of bit field 704 is (N _TA+ 7) byte/8.Data structure 710 comprises the Integer N of indicating number of transactions usually _TA705 so that handle data structures.

In addition, data structure 710 comprises the sequencing information by the represented affairs of data structure 710.Shown in Fig. 7 b, said sequencing information can have difference (for example, other distances in timestamp difference or the sequencing information), and in this case, sequencing information is represented by n-1 integer 706a, 706b.The integer that n indication sequencing information (for example, timestamp) alternatively, can be arranged.

The order of each field is as follows in the data structure 700: the integer 701 of indication Projects with Different number, the identifier 702 of disparity items, and a series of bit field 703.In addition, data structure 700 can also the beginning of data structure (for example) comprises the total N of the represented affairs of designation data structure 700 _TA Integer 705 so that handle data structures.Usually, integer 705 is N for the data structure of every other expression input data, has only 1<n _TA<except the situation of N.

Order of the field in the data structure 710 is as follows: indication number of transactions N _TAInteger 705, identifier 702, sequencing information and indication affairs that the indication disparity items is counted integer 701, the disparity items of n in the bit field information 704 that exists of disparity items.Data structure 710 can further comprise the integer 707 of the project sum in the indication affairs.This allows to check effectively whether transaction set supports specific order.If the item number in the order is greater than the item number in the transaction set, support mode not then.

Data structure 700 among data structure 720 among Fig. 7 c (through the mode of instance) and Fig. 7 a is similar in other respects, but it comprises extra weight information.N storage unit 721 comprises the weight statistical information of each project in n the disparity items.Storage unit 721 can have W position, and it can be 32 single precision floating datums.Weight information storage unit 721 comprises the average weight of given project usually, and this is average to all appearance of this project in N TA.Storage unit 721 can comprise the statistics and convergence such as standard deviation in addition.The average weight of storage unit 721 expression can (also may not) be explained the repeatedly appearance of identical items in the TA.(for example, if the client buys 4 bottles of milk with the price of every bottle of 1$ in single purchase, then the milk commodity price of record can be 4$ or 1$, specifically depends on the type of the analysis that digging user hopes to carry out).Alternatively, data structure 720 can comprise further storage unit 722, and storage unit 722 comprises the accumulation weight statistical information of different TA.

Usually these storage unit 722 also are that the W position is long, and have a storage unit 722 for data structure 720 each represented TA, and it comprises the summation of all weights among the TA.

Data structure 710 among data structure 730 among Fig. 7 d (through the mode of instance) and Fig. 7 b is similar in other respects, but has extra weight information.N storage unit 731 comprises the weight statistical information of each project in n the disparity items.Storage unit 731 can have W position, and it can be 32 single precision floating datums.Weight information storage unit 731 comprises the average weight of given project usually, and this is average to all appearance of this project in N TA.Storage unit 731 can comprise the statistics and convergence such as standard deviation in addition.The average weight of storage unit 731 expression can (also may not) be explained the repeatedly appearance of identical items in the TA.Alternatively, data structure 730 can comprise further storage unit 732, and storage unit 732 comprises the accumulation weight statistical information of whole transaction set.Usually storage unit 732 also is that the W position is long.Clearly, the order of the storage unit in the data structure 720 or 730 can be different with the order shown in Fig. 7 c and the 7d.

The data structure that comprises the input data of compressed format can be included on the computer usable medium.Usually, during the assessment candidate pattern, store data structure in the storer of computer system or hard disk.Data structure can also be stored on the hard disk or such as computer disk, CD or be called on the movable computer computer-readable recording medium the equipment of memory stick.

At least has advantage with reference to figure 7a to the data structure that 7d describes.At first, will compress the input data efficiently.Compression effectiveness improves the most nearly three times: (i) original project name is substituted by less ID (each ID need be no more than 16 storer usually); (ii) be not that N all items among the TA all must be stored as 16 integers, but each disparity items among these N TA must only be stored once; And (iii) compressed weight information, because be not every input of storage data weighting information, but only store the statistics and convergence of weight information.In the field information on the throne, only use a single position to store TA mean terms purpose relation to each relation.The tabulation of disparity items being included in the data structure of expression affairs set is efficiently because each affairs only comprise usually the input datarams the considerably less part of all items.Therefore, the quantity of the disparity items quantity of all disparity items in the data usually among N TA, and to the size that has significantly reduced bit field of quoting of disparity items among N the TA.If the compression of higher data size as target, then can be resequenced to affairs before the formation group, as described in inciting somebody to action like hereinafter.Again the target that affairs is sorted is a series of creating very similar affairs (that is, said affairs comprise item id much at one) and troop.This helps to produce the optimal compression rate.Usually, size of data is reduced to 5% of about raw data size.

Secondly, said storage scheme is treatment classification efficiently.If all have a large amount of taxonomical hierarchy structures at each project top, there is significant enlarge-effect in then traditional storage scheme.That is, if exist the disparity items of specific quantity and each project to have 3 grades of father's hierarchical structures at the top, then traditional storage scheme will be amplified to 4 times.In the described herein scheme, storage is amplified less, because n project possibly shared some father's project at most, so these father's projects of sharing will only occur once in the data structure of storage.

The 3rd, can come with respect to input data detection candidate pattern (referring to following to the explanation of Figure 10 a) through the bit field information of data structure 700/710/720/730 being carried out bitmap operation efficiently to 12c.This can significantly quicken checkout procedure.In practical application, check speed is crucial.The typical operating position of search pattern is with several hours consuming time or a few days in a large amount of Transaction Informations, and said storage scheme can be so that mode and activation record history are used efficiently.In addition, storage and verification scheme are very suitable for using efficiently various mode filtering device criterions.These characteristics are introduced in the reference model check in more detail hereinafter.Said storage scheme can also be well with create simultaneously and check the candidate pattern of large-scale similar candidate rule group to create and check mechanism is used.

Fig. 8 shows the process flow diagram of the method 800 of the data that comprise in a plurality of affairs of compression.In step 801, confirm the statistical measures that related item and possible father classify.For example, can use the hashed value of the original item identifier of storage and the hash table of project frequency to accomplish aforesaid operations.If defined classification, then calculate the frequency of all father's classification and add it to project hash table.To abandon the not frequent item that does not have frequent father's classification in step 802 (this step is optional).In step 803, for each Projects with Different is distributed unique identifier, and if defined classification, then, each different father distributes unique identifier for classifying.In step 804, form the affairs set.To each affairs set repeating step 805 to 809.In step 806, confirm the disparity items number in the affairs set.In step 807, confirm the identifier of disparity items.In step 808, confirm to belong to the existence of the disparity items in the affairs of this affairs set and it is rendered as bit field information.In step 809, formation and storage package are contained in the data structure of specified message among the step 806-808.It will be apparent to those skilled in the art that before the step 809 that is formed on the data structure of introducing here, need to confirm further information with reference to figure 7a and 7b description.

Fig. 9 a shows the process flow diagram of input data compression for the method 900 of the data structure consistent with data structure 700.The method is from step 801 and 802 beginnings.After these steps, in step 901, confirm integer constant M and N.As stated, N depends on computer hardware.Usually the value of N is 32 and 64.M is that the minimum that can store the integer variable of the required different identification symbol of the project of this input in data is fit to the position size.M is generally 16 or 32.In step 402, will abandon the affairs of all (after the abandoning of step 802) item number less than the minimum regular length (or under the situation that does not define minimum regular length) of user's appointment less than 2.To it be sorted according to the similarity of residue affairs in step 903 (this step is optional).Similarity self-explanatory characters affair here has how many identical projects.This rearrangement can improve compression efficiency.At step 804a, will form the affairs set, and each affairs set have N affairs.After this, method 900 proceeds to step 805 to form and store data structure, and said data structure comprises as the item identifier of M position integer and has the n bit field (n N position integer) of N position usually.If correctly selected N, then each bit field can be implemented as a single signless integer variable, and said variable allows fast and effeciently to handle.When forming data structure, can use associative array (mapping and dictionary) to store the information of relevant affairs set efficiently in advance, in said array, item identifier is as key word, and n bit field is as value.

Fig. 9 b shows the process flow diagram of input data compression for the method 910 of the data structure consistent with data structure 710.The method also starts from step 801 and 802.After this, confirm integer constant T and M in step 911.These constants depend on the input data characteristic.M such as top to 900 definition of method.T is the integer of transaction identifiers or the position size of floating point representation.Usually, this transaction identifiers comprises sequencing information; Transaction identifiers can for example be a timestamp.Method 910 these continued execution in step 803.Form the affairs set at step 804b then, transaction set of each affairs set expression.Transaction set TAG is identified by each affairs of carrying the same transaction group identifier in the TAG.After this, in optional step, abandon the residue project and be less than the affairs set that predetermined item number or affairs are less than the predetermined transaction number.Method 910 proceeds to step 805 then, according to data structure 710 formation and store data structure.Usually, use T position integer that sequencing information is stored as different information.

Next, the input data that go through with respect to after the compression are checked candidate pattern.At first, with combining Figure 10 a and 11 that the check of correlation rule is discussed.The data structure that the check use of this correlation rule and data structure 700 are consistent.After this, with combining Figure 12 a that the check of Cahn-Ingold-Prelog sequence rule is discussed.The data structure that the check use of this Cahn-Ingold-Prelog sequence rule and data structure 710 are consistent.It being understood that if importing data comprises sequencing information and be compressed to the data layout consistent with data structure 710, then can ignore the sequencing information in the data structure 710 and in packed data, search for correlation rule.

In conjunction with Figure 10 a, suppose to create through arbitrary pattern and given candidate pattern of extension mechanism formation (that is candidate association rule).Now must be according to this single fixedly candidate pattern of data detection.In other words, need to support in the sign input data those TA of this candidate pattern.If pattern is the subclass that comprises the project set of all items that comprises among the TA when being interpreted as project set, then TA supports this pattern.In conjunction with Figure 10 a, suppose that candidate pattern comprises disparity items or the classification father project that adds up to k.As stated, suppose that each project or classification father project represented by unique integer ID.In addition, suppose that n project in TA and the candidate pattern stored with orderly fashion with classification father project, this means ID value ordering through increasing.Should be understood that this ordering is not enforceable, but it makes the processing of TA and candidate pattern more efficient.

Basically, check algorithm must be opened each TA successively.For the complicacy of evaluation test algorithm, suppose that current TA comprises the individual Projects with Different of n '.If n ' less than k, knows that TA can not support current candidate pattern just then need not any further inspection.Therefore, suppose that n ' is greater than k.In the case, must check in the individual project of n ' whether k project of candidate pattern be included in TA.Because these two bulleted lists of hypothesis can be through increasing the ID ordering, therefore can (max (n ', k))=O (n ') carries out this checking procedure with workload O.For N continuous T A (on average each has the individual disparity items of n '), therefore amount of calculation is O (n ' * N).

But as stated and shown in Fig. 7 a, if can N TA be provided with the form of the data object of a compression, then can carry out the check of all N TA through total amount of calculation O (n), wherein n is the quantity of disparity items among all N TA.If n with the N linear growth, this means if n=N*n _Avg, then any algorithm is all less than the basic check algorithm that does not adopt according to the data structure of Fig. 2 a.But prove that in fact n increases along with N sublinear ground is strong, for example n～log (N) * n for typical real data set and project probability distribution _AvgOr n～N ^1/2* n _AvgTherefore, compare with basic verification scheme, the suggestion check algorithm with amount of calculation O (n) has very big advantage.

Figure 10 a shows the process flow diagram of method 1010 as an example, wherein uses data structure after the above-mentioned compression to realize the check of association mode with effective and efficient manner.Method 1010 starts from comprising the candidate rule r of k disparity items in

step 1001.Step

1002 and 1008 shows the mode expansion algorithm of movable TA information (being the appreciation information of father's pattern) that verification scheme is very suitable for calculating and store father's pattern of current candidate pattern, and is suitable for the expansion algorithm that can not handle this type of information such as A-Priori.Under first kind of situation, will select

step

1004 and 1010, this means the TA set that must only present and check those TA that comprise at least one father's pattern of supporting candidate pattern according to pattern.Under second kind of situation, will select

step

1003 and 1009, this means fetching all TA set and according to the said set of mode checking.

Step 1005 realizes the preliminary examination of calculated amount little (O (1)): if be less than will be according to the project in the pattern of these TA checks for the disparity items that the TA set comprises, then the TA in the set does not support this pattern.Step 1006 is committed steps.Function f indActiveTAs () respectively with the orderly k project array of current TA set (" g ") and current candidate pattern (" r ") as parameter.It returns the bit field of N position, wherein the support mode r whether of i TA among the bit representation g at position i place.Through selecting suitable N, can guarantee that bit field can be implemented as single signless integer variable, for example can select N=64 on 64 bit CPUs in modern times.

Figure 10 b shows the false code section of a kind of possibility embodiment that comprises function f indActiveTAs ().This function is included in the remaining single circulation afterwards of maximum n+k step; Because k is less than n, so this is actually above-mentioned O (n).

Row in the false code of Figure 10 b (130) comprises function header.Suppose that BITFIELD_N is the signless integer type that length is at least the N position.ITEM is the integer type that length is enough to represent to import all disparity items in the data and the father's item id of classifying.ITEM [] represents the project array of type I TEM.The bit field that row (131) initialization will be returned.Each TA during at first hypothesis is gathered is support mode.Two of (132) initialization of row travel through the iterator variable of all items of the pattern (itemPosi_g) of TA set (itemPosi) respectively.Row (134) comprises the circulation framework.(134) end afterwards of maximum n+k step guaranteed to circulate in together in the statement that increases progressively in the condition in and (137) and (142).(138) the if branching representation in is a project of discovery mode in the TA set not, returns the situation of room field then.

(141) the if branching representation in has found the situation of present mode project in the TA set.In the case, bit field " activeTAs " is through logical AND operator (" & ") is combined with the bit field of indicating which TA that comprises the present mode project by turn.Using operational symbol " & " is to be directed against the great majority (programming languages-CPU) computing very fast of combination execution in the cycle at a cpu clock.Then, forward next project (142) in two tabulations to.If found last schema entry, then can withdraw from function f indActiveTAs and return bit field (144).If do not find in the TA set that all schema entries just can arrive row (146).Return the room field in the case.

Refer again to Figure 10 a, all " 1 " in the step 1007 pair bit field " activeTAs " count.(there is 1 special command of counting to the integer variable in the clock period in programming language-CPU) combination for some.But, also can realize the position in the integer is counted through mode very efficiently even without this type of order is provided.In Figure 10 c, provide an instance with programming language C.How this instance only realizes at 20 to 30 cpu clocks all 1 functions of counting " countBits " to 64 integers in the cycle if showing.It is the auxiliary array " nb1Bits " of 256 bytes that this function uses size.

In step 1011, whether inspection can fetch other TA set with respect to its assessment candidate rule.If there are not the more TA set that will fetch, then the assessment of candidate pattern is accomplished.Appreciation information with respect to the candidate rule of importing data is stored among the variable activeTAs.After this, can use identical method to assess next candidate pattern.After each pattern assessment, if pattern is through all filtrators constraints, just this pattern added to the tabulation of resulting schema.After all candidate pattern were all by assessment, data mining was accomplished.Output is through the tabulation of all patterns of all filter condition (being the tabulation of resulting schema).

Some pattern is created and the mode expansion algorithm will be created a large amount of similar candidate pattern set, and these set are shared k-1 public project and difference usually and only are a project of having added.Alternatively, the pattern of using any algorithm to generate can be ranked into this similar candidate pattern set so that test.The verification scheme that can revise combination Figure 10 a description is so that further reduce work for inspection.

If the input data structure after the compression also comprises weight information, then the step 1007 among Figure 10 a also comprises other operations relevant with the weight storage unit.These operations are common as follows:

bodyWeightStats[r]：＝bodyWeightStats[r]+

getBodyWeights(activeTAs)

headWeightStats[r]：＝headWeightStats[r]+

getHeadWeights(activeTAs)

TAWeightStats[r]：＝TAWeightStats[r]+getTAWeights(

activeTAs)

Similar calculating can be used to have the transaction set of weighted information.

Figure 11 shows the process flow diagram according to the method 1100 of amended verification scheme as an example, and said method can be summarized as follows.For all TA collective data objects (promptly for data structure 700),, then skip current object (step 1101 among Figure 11) if movable TA set historys teaches that this set right and wrong activity.If its TA does not all comprise enough projects to satisfy the regular length criteria of current minimum (this possibly be limited) in rule creation and checkout procedure, then skip current object; Step 1102 among Figure 11.After this, confirm the public keys destination locations (step 1103 among Figure 11) of all candidates in the TA set.If find all these projects (step 1104), then confirm to comprise in the set those TA (step 1105) of all these public projects.This can realize through TA item bit field being carried out fast by turn ' or ' computing.Be non-NULL (step 1106,1109) if comprise the tabulation of TA as a result of all public projects of candidate pattern set, then definite those TA (step 1107) that also support the non-public project of each candidate pattern.Step 1108 is and the relevant optional step of the project of affirming constraint; Discuss this step below in more detail.After handling all TA that comprise all public projects, with respect to the set (step 1110) of N similar candidate pattern of TA set assessment.It being understood that through the step 1103-1105 and 1107 among deletion Figure 11, the result is the general rank explanation of more detailed process flow diagram among Figure 10 a.

Figure 12 a shows the more detailed figure of this verification scheme of similar candidates set.Method 1200 starts from candidate list 1 in step 1201, and candidate list 1 comprises N candidate rule with k project, and in this k project, k-1 project all is identical for all candidate rules.Be similar to method 1000, method 1200 can adopt the appreciation information (

step

1002a, 1003,1004,1008a, 1009 and 1010) of father's rule.When the process flow diagram of the process flow diagram of comparative approach 1200 and method 1000, can comprise additional cycles (

step

1204,1205,1206,1207 and 1208) by discover method 1200.All N candidate rule " i " of the current candidate pattern set that will check of this searching loop.The additional benefit of amended method 1200 is that for all candidate rules, the largest portion of project check can withdraw from this circulation (step 1202).This means that for all N candidate rule k-1 project in k project only need be verified once, and only need check the different single project (step 1206) in the different candidate rules separately to each candidate rule.

Figure 12 b has listed the false code of the possible embodiment of function in the step 1202 " findActiveTAs ".The embodiment of the corresponding step 1006 in this embodiment and the verification scheme shown in Figure 10 a is almost completely identical.

Function in the step 1202 " findActiveTAs " has different signatures and needs different embodiment.This function is only checked the existence of single given project among the TA of current TA set.In step 1203, whether inspection finds the TA of any activity.The possible embodiment of function " findActiveTAs " has been shown among Figure 12 c.

It is to be noted that the 3rd function parameter " itemPosi " among Figure 12 c is to import into to spread out of (in-out) parameter.When this function of input, begin search item " itemID " if will organize g from TA, parameter " itemPosi " is indicated the item location of all disparity items in array.In the ending of this function, " itemPosi " points to the position of discovery project " itemID ".This position (increasing by 1) is to be used for the suitable starting point of searching for during (itemID that uses next time is greater than the itemID that had before called) call function " findActiveTAs " in next time.This means must be to this embodiment, and through increasing the add-ins purpose itemID different with candidate pattern, the different candidate pattern that circulation traveled through of step 1204 to 1208 sort among the flow figure.

Next, the further improvement of correlation rule assessment is discussed.It being understood that this improve to be fit to association and Cahn-Ingold-Prelog sequence rule, and be fit to the single candidate pattern of assessment and with a similar candidate pattern set of step evaluates.The user can the definition project retrain, and promptly the user has only specified those rules that comprise/do not comprise specific project or project category (=classification father project) are had interest.

First kind of situation is referred to herein as " project constraint certainly ", and second kind of situation is called as " negating the project constraint ".

About negating the project constraint, can note following item.If constraint is " project X can not appear in the rule ", just can neglected items X when then in original input data, finding project X at every turn.Therefore, when original input data during, possibly consider this type of negative project constraint by pre-service and boil down to binary format.If constraint is " project X can not appear in regular head/rule body ", then candidate rule is created algorithm (being not the part of this discussion) and must be handled this constraint.

About project constraint certainly, can note following item.Usually candidate rule establishment scheme adopts progressively method: they are from simple rule item1==>item2 begins, and comes progressively to construct more complex rule through repeating to add a project to rule body or rule head then.If there is project constraint certainly, then when the given candidate rule of check, this will cause three kinds of possible results.The first, possibly find (for example, if it is non-frequent) of the activity of candidate rule right and wrong.The second, possibly find that candidate rule is movable (for example, if it be frequent and comprise all required projects in the tram).The 3rd, the activity of candidate rule right and wrong but can expand (for example, if it is frequent, lacking required project, possibly be movable but comprise the expansion candidate rule that lacks project).

Combine described storage scheme of Figure 10 a to 14 and checkout procedure that the mode of these sure project constraints of effective consideration is provided at this.Provided an instance among Figure 11, wherein the project constraint is relevant certainly with consideration for step 1108.In this step 1108, for sure each project of the intrafascicular approximately appearance of project, will be to the candidate pattern of not only supporting current assessment but also those TA of comprising the intrafascicular approximately project of sure project count.In the ending of checkout procedure, this improvement helps to pick out some otherwise will be classified as the candidate pattern of " non-activity but can expand ": can abandon all not candidate pattern expanded of the frequent effective items combination that occurs in constraint certainly.This is important improvement, because restriction candidate search space is extremely important to reducing working time.

Next discuss with respect to the input data detection Cahn-Ingold-Prelog sequence rule after the compression.It being understood that Figure 11 also provides general view to this scheme (once assessing one or more candidate pattern).For Cahn-Ingold-Prelog sequence rule, relevant modification is, whether the project that step 1107 also comprises among the inspection TA occurs with the order of Cahn-Ingold-Prelog sequence rule definition.

Temporal information is used as an instance of sequencing information at this.At first, suppose to create through any pattern and given candidate pattern of extension mechanism formation (=order of candidates is regular).Suppose that this pattern comprises m project set, in other words, it comprises, and m different time stabbed and therefore comprise m-1 time step.Now must be according to this single fixedly candidate pattern of data detection.This means the TAG that must identify those support candidate pattern in the input data.If there is the subclass { TA of the TA of TAG ₁... TA _m, TAG support mode then, so that

● for i=1...m, TA _iAccording to the timestamp ordering that increases

● for i=1...m, be interpreted as the TA of project set _iIt is the superset of i project set of pattern.

Figure 13 A shows the process flow diagram of method 1300 as an example, wherein uses data structure after the above-mentioned compression with mode implementation pattern check efficiently.Process flow diagram hypothesis candidate pattern among Figure 13 a comprises s project set, has the disparity items or the classification father's project (step 1301) that add up to k.Suppose that each project or classification father project represented by unique integer ID; In addition, suppose to visit in an orderly way k project and classification father project, this means ID value ordering through increasing.This ordering has improved the efficient that realizes, but it is not enforceable.

Basically, check algorithm must be opened each TAG now successively.The step 1302 of process flow diagram and 1307 shows the mode expansion algorithm of movable TAG information that verification scheme is very suitable for calculating and store father's pattern of current candidate pattern, and is very suitable for the expansion algorithm of not handling this type of information such as A-Priori.Under first kind of situation, will select

step

1304 and 1309, this means the TAG that only needs to appear and check father's pattern of those support candidate pattern according to pattern.Under second kind of situation, will select

step

1303 and 1308, this means according to pattern and fetch and check all TAG.

Suppose that current TAG comprises n Projects with Different.If n less than k, knows that TAG can not support current candidate pattern just need not any further inspection.Therefore, suppose that n is greater than k.

Whether step 1305 inspection TA and the distribution of timestamp in TAG thereof are such distributions: it can support a series of effective s project sets.If there is not particular constraints in the permission time step in the order, this inspection is unimportant and have computation complexity O (1): only must check that the TA quantity among the TAG is not less than s.If must consider particular constraints, for example step-length must be higher than certain minimum or be lower than particular maximum value effective time each, and then this inspection may become more complicated.

Further inspection in the step 1305 is as two purposes.At first, it is second preliminary examination of picking out the specific T AG that can not support candidate pattern.Whether all disparity items in function f indItemPosisInData () the inspection candidate pattern all appear among the TAG.This means whether k project must checking candidate pattern is included in n the project of TAG.Because these two bulleted lists of hypothesis are according to the ID ordering that increases, therefore can (max (n, k))=O (k) carries out this checking procedure with workload O.

If this preliminary examination failure, function f indItemPosisInData () return false immediately then, otherwise function continues and as second purpose.This second purpose of in the step 1305 second inspection is to create the expression of the candidate pattern of all items set that comprises pattern, and each project set is represented as the project array of (with classification father project).But, for the ease of mating current TAG, hope not through its ID but represent each project or classification father project through its position in the tabulation of the disparity items of TAG.The 3rd function parameter " itemPosis " returns this information with the form of two-dimensional array, all items set of first array dimension traversal candidate pattern, and the second array dimension travels through all items in the current project set.

Figure 13 b shows the false code of a kind of possibility embodiment that comprises function f indItemPosisInData ().The overall computation complexity of this function is O (n+k)=O (n).

Row in the false code of Figure 13 b (130) comprises function header.Supposing provides TAG with the form of the structured object of TAGROUP type that method " itemID (index) " is provided.Suppose that the method returns all disparity items and classification father item ids according to the ID ordering that increases that occur among the TAG.In addition, suppose and can candidate pattern be provided with the form of the structured object of CANDIDATERULE type that following method is provided:

● when index when 0 runs to k-1, itemID (itemIndex) return occur in the candidate pattern all according to the projects that increase the ID ordering and classification father item ids.

● the value k of disparity items in the numberOfDifferentItems () backtrack mode and classification father project.

● the value s of the project set in the numberOfItemsets () backtrack mode.

● the quantity of the disparity items in the project set that numberOfItems (itemsetIndex) home position " itemsetIndex " is located.

● (itemsetIndex i) returns and must put into function itemID (itemIndex) so that obtain i project of project set of locating position " itemsetIndex " or the itemIndex (=position) of the ID of classification father project itemPosi.

Two of (131) initialization of row travel through the iterator variable of all items of the candidate rule (itemPosi_r) of TAG (itemPosi_g) respectively.Row (133) initialization length is the integer array of n.For each disparity items that occurs in the candidate pattern, this array will comprise the position of this project in TAG.From row (134) to the circulation of row (142) project of candidate pattern is compared with the project of TAG and use correct value filling array " mapItemPosis_r_to_g ".If in TAG, do not find a project of candidate pattern, then arrive row (138) and (144).In the case, function return false immediately.Just arrive row (145) when only in TAG, finding all items of candidate pattern.In the case, function is carried out its second portion responsibility and is used correct value to fill return parameters " itemPosis ", and promptly for each project set of candidate pattern, all form the position of project in TAG of project set.

Finish the discussion of function f indItemPosisInData and return the process flow diagram among Figure 13 a at this.In step 1306, call function " supports () ".This function is carried out the core of check: it checks whether current TAG supports candidate pattern.If support that function returns 1, otherwise it returns 0.Figure 13 c shows the false code of the possible embodiment that comprises function " supports () ".Workload is O (n).

Row among Figure 13 c (150) comprises function header.Supposing provides TAG with the form of the structured object of TAGROUP type that following method is provided:

● numberOfTAsInGroup () returns the sum (quantity of different time stamp value) of TA among the TAG.

● itemID (itemIndex) returns n the disparity items that occurs among the TAG and the father's item id of classifying.

● bitField (itemIndex) returns the bit field of each disparity items that occurs among the TAG or the movable TA that classifies father's project.In other words, indication ID in the i position of bitField (itemIndex) is whether the project of itemID (itemIndex) appears among i the TA of TAG.

The several local variables of row (151) to (154) initialization.The quantity of TA among row (151) the storage TAG.The iterator variable of all items set of row (152) definition traversal candidate pattern.Row (153) definition indication TAG planted agent begins to mate the variable of the beginning TA position of next candidates set.Row (154) definition will be used to represent that all are bit field variablees of TA of the superset of given project set.

Row (155) begins to travel through all items set of candidate pattern, and each TA with TAG is complementary successively.The bit field of the possible movable TA of the current project set of row (156) initialization.All be set to zero from 0 all positions, because need not check whether these TA mate current project set to the TA position of finding the set of the next item up order.(157) cycle through the movable TA bit field of all items of the stacked current project set of logical AND operator () by turn and in (158).Therefore, after the circulation, bit field activeTAs comprises the TA that all timestamps are later than the timestamp of the TA that matees the next item up order set (all items that comprises current project set).If this bit field is empty, row (159) causes withdrawing from function and rreturn value is 0; In the case, TAG does not support candidate pattern.Otherwise, after remembeing to mate the TA position of current project set, continue all candidates set of traversal.

Some pattern is created and the mode expansion algorithm will be created a large amount of similar candidate pattern set, and the project that k-1 public project and difference only are an interpolation is shared in these set usually.Alternatively, can candidate pattern that use any method to generate be organized into similar candidate pattern set.The verification scheme that can revise combination Figure 13 a description is so that further reduce work for inspection.

The instance of amended verification scheme may be summarized as follows:

● if current TAG does not comprise enough projects or project set to satisfy current minimum order length criteria, then skips this TAG.

● if current TAG does not have enough step-lengths effective time (so that forming the order through all elapsed time restrictions of digging user definition) between its project set, then skip this TAG.

● all public projects of guaranteeing all candidate pattern all are included among the TAG.If lack public project, then skip TAG, otherwise remember the position of public project in TAG.

● the specific extra items of confirming candidate pattern is also contained in those candidate pattern among the TAG, other candidate pattern of stopping using.

If ● there is remaining activity candidate pattern: the TA combination of searching the public part that comprises all candidate pattern public project set of orthochronous order (=by).If do not find this type of TA combination, then skip this TAG.

● for each remaining activity candidate pattern: the TA combination beginning that from a last step, identifies also comprises the specific TA combination of having revised project set (a plurality of) by the candidate pattern of orthochronous order with respect to the sundry item set search of candidate pattern.

A kind of optional attribute in the mode evaluation method is a dynamic memory management.This will be below with reference to Figure 14 discussion.Dynamic memory management generally includes the part of two complementations: memory manager instance 1400 itself and managed objects 1410, they have some public attribute and ability to mate the requirement of memory manager.In OO method; Can to realize dynamic memory management by the object instance of memory manager control through a classification instance of classification type " MemoryManager " and through derive all from public base class " MemoryManagedObject ".Next, will dynamic memory management be discussed, but obviously can also use additive method to realize similar characteristics and function with reference to OO method.

Memory manager 1400 has the fixed storage capacity, and it provides following method usually at least.In Figure 14, return the total volume (for example being unit) of memory manager 1400 with the byte with square 1401 schematically illustrated method getCapacity ().The method of square 1402 expression getOccupiedSize () by name, the method are returned the amount of memory (for example being unit with byte) of using in the memory manager 1400 (=take).The method of square 1403 expressions addObject (MemoryManagedObject) by name, the method is added new object 1410 to memory manager 1400.If add the capacity that new object will exceed memory manager afterwards, then memory manager 1400 is before adding new object, and the storage object that it is enough is dumped to the predetermined work space on the dish automatically.Memory manager 1400 is only preserved the information of relevant dump object in storer, in the time must visiting the dump object once more, need this information to reload the dump object.The method of square 1404 expressions removeObject (MemoryManagedObject) by name, the method is from managed object of memory manager 1400 deletions.

Storer managed object 1410 has following attribute and method usually at least.The method getMemorySize () of object total memory size (for example being unit with the byte) is returned in square 1411 expressions.Square 1412 expression is with the method lockForReading () of object tag for " current visited with read-only mode ".The object that reads locking can not keep being dumped to dish: if it before had been dumped to dish, it will be re-loaded to storer automatically through dynamic memory management.Square 1413 expressions are the method lockForWriting () of " current by visit and modification " with object tag.The object of write lock-out can not keep being dumped to dish: if it before had been dumped to dish, it will be re-loaded to storer automatically.The existing dump of object will be deleted from dish, because object is modified and dump no longer is up-to-date.Square 1414 method for expressing getState (), whether the method indication model is current is visited so that read or write, or be not dumped to dish.Square 1415 method for expressing dumpToDisk (), the method writes dish with the major part of object, only in storer, preserves after a while and fetches the required information of this object again from dish.Square 1416 method for expressing refetchFromDisk (), the method recovers before to be dumped to the object of dish, so that object resides in the storer now fully.

When using the notion of MemoryManager and MemoryManagedObject in the specific implementations in special algorithm, each data structure that algorithm is used becomes MemoryManagedObject.The attribute of MemoryManagedObject and method can just be unrolled or be combined with the particular community or the method for data structure.

For each data structure of using in the algorithm, must the specified data structure whether enough greatly with and life cycle long enough whether, be effective with proof with its expense that is regarded as MemoryManagedObject.Sometimes, can confirm the individual data structure not to be regarded as MemoryManagedObj ect, but the whole set or the array of this type of data structure is regarded as MemoryManagedObject.For example; When TA that stores the input data to data mining problem or TAG; Can confirm not make each single TA or TAG data structure to become the storer managed object, but for example make the array of 10000 or 100000 these type of continuous data structures become the storer managed object.

As first improvement, memory manager 1400 can comprise some less relatively but frequent object that uses.On the other hand, can have unwanted a large amount of objects in most of working time of mining algorithm.When memory manager 1400 need be dumped to dish with its partial content, should preferably second class object (a large amount of objects that promptly seldom need) be dumped to dish.Can be through realizing this target for each object in the memory manager 1400 is provided with " viscosity " property value.In Figure 14, can have for example three values: " alwaysInMemory ", " preferablyInMemory ", " preferablyOnDisk " with the viscosity shown in the square 1417.In this example; First kind object (" alwaysInMemory ") will never be dumped to dish, when only there is not much more not the 3rd class objects of dump (" preferablyOnDisk ") in second class object (" preferablyInMemory ") in memory manager just by dump.

The further improvement of dynamic memory management is to use " fetching again recently " timestamp (square 1418 among Figure 14) in memory manager 1400, object 1410 to be pressed prioritized.It is incoherent mutually that this second improvement improves with first: these two kinds of improvement can or be used separately.In order to introduce this notion, consider following instance.Have n similar object (for example, data page) in the memory manager at necessary frequent access of the specific run stage of data mining.In addition, memory manager 1400 comprises m current unwanted other objects.The capacity of memory manager is enough to store simultaneously n-1 object in the object of n frequent access.According to the dump strategy of memory manager, must finish from coiling under the situation of fetching this object again in the time of can needing one of n object therein at every turn.On the contrary, best attainable situation is to have only 2 objects to be dumped to dish in all m object and n object, and n-2 object in n common object forever remains among the RAM.Suppose T ₀Be the start time that current data is excavated the operation phase, and each object in the memory manager 1400 all carry and comprise recently the time stamp T of fetching the time (being object is taken back to storer from dish the up-to-date time) again.Then, following dump/fetch trial method again will reach above-mentioned " best attainable situation ": if the dump object, and memory manager 1400 those T of dump preferably then<t ₀Object (these to as if above-mentioned m other objects).At T>T ₀Object among, the object with maximum T at first is dumped to dish.Can be understood that to quoting of current generation computer program wherein carries out similar operations and the frequent particular fixed subclass of using all data available structures repeatedly, and other data structures at this moment between in time period of almost never being used.

As the 3rd improvement, can consume to other light weight object trace memorys.The instance of these light weight objects is auxiliary variable, array or other data objects.Each object in these light weight objects all too little (or having too many light weight object), thus can't to prove in them each and provide the expense of the attribute of MemoryManagedObject (in storer and execution time) be effective.But,, then also answer trace memory consumption so that memory manager is worked exactly if the specified point that all storer summations that all these objects have flow in algorithm controls reaches the considerable quantity of overall available memory.Memory manager 1400 can also be followed the tracks of by being not the amount of memory from " light weight " object consumption of MemoryManagedObject derivation.For this reason, memory manager at least need be with method reserveMemory () and releaseMemory ().They are respectively by square among Figure 14 1405 and 1406 expressions.But the light weight object can not be dumped to dish so that in memory manager 1400, be new object space for the creativity.

It being understood that above-mentioned dynamic memory management is fit to any data digging method with respect to input data assessment candidate pattern.

What it is also understood that is, can be used as the input data that original input data is provided for assessing the method for candidate pattern.In the case, the method that is used to assess candidate pattern can comprise any essential step with original input data boil down to compressed format.Alternatively, the method that is used to assess candidate pattern can just be visited the input data after the compression.

Some combination of said various characteristics has clearly been described in this explanation.It being understood that various other combinations are conspicuous for the research those skilled in the art.

In accompanying claims, Computerized method refers to the method that its step is carried out by the computing system of the appropriate combination that comprises one or more processors, storage arrangement and memory storage.

Though above reference specific embodiment of the present invention, it will be apparent to one skilled in the art that and can in these embodiment, make change and do not depart from principle of the present invention and the spirit that is defined by the following claims.

Claims

1. Computerized method that is used for detecting the pattern of the input data that comprise a plurality of affairs, each affairs has at least one project, said method comprising the steps of:

Reception is used for the filter condition of pattern interested,

The candidate pattern of selecting to have assessed is as father's candidate pattern and safeguard the appreciation information of relevant said father's candidate pattern, and said appreciation information comprises that whether affairs or the transaction set in the indication input data be the Boolean that the support of father's candidate pattern is made contributions,

Through expanding said father's candidate pattern and be suitable for the filter condition set and generate sub-candidate pattern through using first,

With respect to the input data in a plurality of similar candidate pattern set and the said appreciation information of the relevant said father's candidate pattern of basis are assessed said sub-candidate pattern jointly; Similar candidate pattern and at least one similar candidate pattern set that each similar candidate pattern set has maximum predetermined quantities have at least two similar candidate pattern; Similar candidate pattern in each similar candidate pattern set is owing to a disparity items of adding father's candidate pattern to differs from one another, and

2. Computerized method as claimed in claim 1, wherein the candidate pattern difference each other in each similar candidate pattern set is to be added to a respective item of father's candidate pattern.

3. according to claim 1 or claim 2 Computerized method, at least one step during the step that wherein generates sub-candidate pattern may further comprise the steps:

New projects are added to first project set of said father's candidate pattern;

New projects are added to last project set of said father's candidate pattern; And

New projects' set that will comprise a project appends to said father's candidate pattern.

4. according to claim 1 or claim 2 Computerized method, wherein said predetermined quantity depends on the characteristic of the computing system of carrying out said Computerized method.

5. according to claim 1 or claim 2 Computerized method; Comprise according to said input data and come counting statistics tolerance so that use at least one step in generation and appraisal procedure, said statistical measures comprises at least one in following: project is to statistical information and weight statistical information.

6. Computerized method as claimed in claim 5 comprises when said first filter condition of application is gathered the search volume of limiting said candidate pattern according to said statistical measures.

7. like claim 5 or 6 described Computerized methods, comprise according to said statistical measures and confirm at least one in following: expand which sub-candidate pattern, and the order of expanding sub-candidate pattern.

8. according to claim 1 or claim 2 Computerized method; Wherein said filter condition comprises that at least one is based at least one condition in following: the weight of weight, the total weight about the input data, the average weight of supporting affairs, rule body, the weight of rule head, about total weight of the rule head of input data, about total weight of the rule body of input data, and addressable other total weights.

9. according to claim 1 or claim 2 Computerized method comprises

The data structure of the affairs set in the said input data of expression is provided; Said data structure has: the information of the identifier quantity in the tabulation of the identifier of disparity items in the affairs set, the said tabulation of indication; And the bit field information of indicating the existence of disparity items described in the said affairs set; Organize said bit field information so that gather to come evaluation profile according to said tabulation with respect to said affairs, and

Use bitmap operation to assess said candidate pattern to said bit field information.

10. according to claim 1 or claim 2 Computerized method comprises

The appreciation information of the candidate pattern of safeguard the affairs in the said input data of expression, the candidate pattern of having assessed, having assessed, the candidate pattern that will assess, and the data structure of resulting schema; And

According to the available total memory and the use of said data structure, dynamically confirm during generating and assessing sub-candidate pattern and will which data structure is kept in the storer and will which data structure be placed into dish.

11. Computerized method as claimed in claim 10 comprises being at least the indication of first data structure whether should be with said first data structure by prioritized when confirming to be kept at which data structure in the storer.

12. like claim 10 or 11 described Computerized methods, comprise being at least second data structure indication, so that confirm and will which data structure be kept in the storer according to the said time of fetching recently from the fetching the time recently of dish.

13. Computerized method according to claim 1 or claim 2 comprises that one of following column format safeguards the appreciation information of said father's candidate pattern:

Be designated as first bit field of the input data event that the support of said father's candidate pattern contributes, the length of said first bit field equals to import the quantity of data event;

Be designated as second bit field of the input data event that the support of said father's candidate pattern contributes, the length of said second bit field equals the quantity of the input data event that the support for another father's pattern of said father's candidate pattern contributes; And

With two that contribute for the support of the said father's candidate pattern relevant information of importing continuously between the data event of input data event quantity;

Wherein importing data event is one of following item: affairs and one group of affairs.

14. Computerized method as claimed in claim 13 comprises the form of selecting the appreciation information of said father's candidate pattern according to the support of said father's candidate pattern.

15. Computerized method according to claim 1 or claim 2; Be included in the appreciation information of in the bit field of the input data event that the support that is designated as corresponding sub-candidate pattern contributes, safeguarding said sub-candidate pattern set in the said appraisal procedure, wherein importing data event is one of following item: affairs and one group of affairs.

16. Computerized method as claimed in claim 15, the length of wherein said bit field equals to import the quantity of data event.

17. like claim 15 or 16 described Computerized methods, wherein the quantity of the bit field of each sub-candidate pattern set is the quantity of the input data event that contributes of the support for corresponding father's pattern.

18. Computerized method according to claim 1 or claim 2 comprises

According to the filter condition that is received, confirm the second suitable filtrator set relevant with the said sub-candidate pattern of assessment; And

Consider the said second suitable filter condition set relevant with the said sub-candidate pattern of assessment.

19. Computerized method as claimed in claim 18 comprises

According to the filter condition that is received, confirm the 3rd filter condition set that during confirming resulting schema, is suitable for;

Consider the said three filter condition set relevant with the said sub-candidate pattern of assessment; And

To be output as resulting schema through the candidate pattern that the existing effect of said the 3rd filter condition set is assessed.

20. a computerized device that is used for detecting the pattern of the input data that comprise a plurality of affairs, each affairs has at least one project, and said device comprises:

Be used to receive the module of the filter condition that is used for pattern interested,

Be used for according to the filter condition that is received, confirm the module of the first suitable filter condition set relevant with generating candidate pattern,

The candidate pattern that is used to select to have assessed is also safeguarded the module about the appreciation information of said father's candidate pattern as father's candidate pattern; Said appreciation information comprises that whether affairs or the transaction set in the indication input data be the Boolean that the support of father's candidate pattern is made contributions

Be used for through expanding said father's candidate pattern and be suitable for the module that the filter condition set generates sub-candidate pattern through using first,

Be used for also assessing the module of said sub-candidate pattern according to the said appreciation information of relevant said father's candidate pattern with respect to common input data in a plurality of similar candidate pattern set; Similar candidate pattern and at least one similar candidate pattern set that each similar candidate pattern set has maximum predetermined quantities have at least two similar candidate pattern; Similar candidate pattern in each similar candidate pattern set is owing to a disparity items of adding father's candidate pattern to differs from one another, and

Be used for recursively using at least one successfully to pass through the module of the sub-candidate pattern of appraisal procedure as father's candidate pattern.

21. computerized device as claimed in claim 20, wherein the candidate pattern difference each other in each similar candidate pattern set is to be added to a respective item of father's candidate pattern.

22. like claim 20 or 21 described computerized device, the module that wherein is used for generating sub-candidate pattern comprises at least one module with lower module:

Be used for the module of adding new projects first project set of said father's candidate pattern to;

Be used for the module of adding new projects last project set of said father's candidate pattern to; And

Be used for to comprise that new projects' set of a project appends to the module of said father's candidate pattern.

23. like claim 20 or 21 described computerized device, wherein said predetermined quantity depends on the characteristic of the computing system of carrying out said Computerized method.

24. like claim 20 or 21 described computerized device; Comprise being used for coming counting statistics tolerance so that the module of at least one step that generates with appraisal procedure, using that said statistical measures comprises at least one item in following: project is to statistical information and weight statistical information according to said input data.

25. computerized device as claimed in claim 24 comprises being used for when said first filter condition of application is gathered, limiting the module of the search volume of said candidate pattern according to said statistical measures.

26., comprise at least one the module that is used for confirming following item according to said statistical measures like claim 24 or 25 described computerized device: expand which sub-candidate pattern, and the order of expanding sub-candidate pattern.

27. like claim 20 or 21 described computerized device; Wherein said filter condition comprises that at least one is based at least one condition in following: the weight of weight, the total weight about the input data, the average weight of supporting affairs, rule body, the weight of rule head, about total weight of the rule head of input data, about total weight of the rule body of input data, and addressable other total weights.

28., comprise like claim 20 or 21 described computerized device

Be used for providing the module of the data structure of the affairs set of representing said input data; Said data structure has: the information of the identifier quantity in the tabulation of the identifier of disparity items in the affairs set, the said tabulation of indication; And the bit field information of indicating the existence of disparity items described in the said affairs set; Organize said bit field information so that gather to come evaluation profile according to said tabulation with respect to said affairs, and

Be used for using bitmap operation to assess the module of said candidate pattern to said bit field information.

29., comprise like claim 20 or 21 described computerized device

The appreciation information of the candidate pattern of be used for safeguarding the affairs of the said input data of expression, the candidate pattern of having assessed, having assessed, the candidate pattern that will assess, and the module of the data structure of resulting schema; And

Be used for according to the available total memory and the use of said data structure, dynamically confirm during generating and assessing sub-candidate pattern, will be kept in the storer which data structure and the module that will which data structure be placed into dish.

30. computerized device as claimed in claim 29 comprises being used for being at least the indication of first data structure whether should be with the module of said first data structure by prioritized when confirming will which data structure be kept at storer.

31. like claim 29 or 30 described computerized device, comprise being used to be at least the indication of second data structure, so that confirm and will which data structure be kept at the module in the storer according to the said time of fetching recently from the fetching the time recently of dish.

32., comprise being used for the module that one of following column format is safeguarded the appreciation information of said father's candidate pattern like claim 20 or 21 described computerized device:

33. computerized device as claimed in claim 32 comprises the module of form that is used for selecting according to the support of said father's candidate pattern the appreciation information of said father's candidate pattern.

34. like claim 20 or 21 described computerized device; Comprise the module that is used in said appraisal procedure is being designated as the bit field of the input data event that the support of corresponding sub-candidate pattern contributes, safeguarding the appreciation information of said sub-candidate pattern set, wherein import data event and be one of following: affairs and one group of affairs.

35. computerized device as claimed in claim 34, the length of wherein said bit field equals to import the quantity of data event.

36. like claim 34 or 35 described computerized device, wherein the quantity of the bit field of each sub-candidate pattern set is the quantity of the input data event that contributes of the support for corresponding father's pattern.

37., comprise like claim 20 or 21 described computerized device

Be used for according to the filter condition that is received, confirm the module of the second suitable filtrator set relevant with the said sub-candidate pattern of assessment; And

Be used to consider the module of the said second suitable filter condition set relevant with the said sub-candidate pattern of assessment.

38. computerized device as claimed in claim 37 comprises

Be used for according to the filter condition that is received, confirm the module of the 3rd filter condition set suitable during confirming resulting schema;

Be used to consider the module of the said three filter condition set relevant with the said sub-candidate pattern of assessment; And

Be used for to be output as through the candidate pattern that the existing effect of said the 3rd filter condition set is assessed the module of resulting schema.