US20050278139A1 - Automatic match tuning - Google Patents
Automatic match tuning
- Publication number
- US20050278139A1 (application number US 10/856,694)
- Authority
- US
- United States
- Prior art keywords
- similarity
- weighting
- weighting vector
- matches
- degrees
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G06F40/16—Automatic learning of transformation rules, e.g. from examples
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
Definitions
- the present invention relates to data processing by digital computer, and more particularly to mapping elements between disparate schemas.
- Enterprise application integration can require significant effort when migrating from disparate legacy applications to a more integrated framework.
- Enterprise application integration can be performed using a message exchange procedure, in which messages are exchanged between different data sets.
- Application data is typically organized according to the type of application or applications with which the data is designed to operate. As a result, the organization or structure of the data can be highly specialized.
- the messages used for enterprise application integration are generally structured sets of data in a well-defined syntax. The structure of the data can be referred to as its schema.
- Countless different schemas and/or schema domains exist.
- Many different integration scenarios (e.g., business process integration, enterprise application integration, and master data management) involve schema matching, in which a mapping between the elements of two schemas is produced.
- Schema matching can also be important in data translation applications (e.g., where data from a first database is migrated into a second database for use with a different application).
- schemas in a first classification may use a composite matcher that heavily weights the contribution of a field name matcher that is a component of the composite matcher
- schemas in a second classification may use a composite matcher that heavily weights the contribution of a structural matcher that is a component of the composite matcher.
- Such an approach may provide improved performance relative to conventional simple, hybrid, or composite matchers but only works for schema domains that have previously been associated with a particular class of schema domains.
- the present invention provides methods and apparatus, including computer program products, that implement techniques for mapping schemas by tuning the relative contributions of different component matchers.
- the relative contributions (i.e., the weights) of different matchers can be tuned by optimizing a measure of ambiguity, which may be an algorithm that is based on a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches.
- the relative contributions of different matchers can be tuned by monitoring user interaction (e.g., user approvals and rejections of proposed matches) and using the user feedback to fine-tune the weights of the different matchers.
- the techniques feature calculating a degree of similarity between elements of two schemas using each of multiple matching processes and combining the calculated degrees of similarity using a first weighting vector to produce first combined degrees of similarity.
- the first weighting vector includes multiple weighting coefficients and each weighting coefficient corresponds to one of the matching processes.
- the weighting coefficients are tuned using information relating to a predicted degree of matching accuracy associated with the first weighting vector.
- the invention can be implemented to include one or more of the following advantageous features.
- the calculated degrees of similarity are combined using each of multiple weighting vectors.
- Each weighting vector includes multiple weighting coefficients, and each weighting coefficient corresponds to one of the matching processes.
- the weighting coefficients are tuned by determining, using the combined degrees of similarity for each of the weighting vectors, a predicted degree of matching accuracy associated with each of the weighting vectors.
- a second weighting vector is selected to determine possible matches between the elements of the two schemas.
- the second weighting vector is selected based on a comparison of information relating to the respective predicted degrees of matching accuracy associated with the first weighting vector and the second weighting vector.
- Each predicted degree of matching accuracy is determined using a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches.
- the weighting coefficients are tuned by identifying a set of possible matches between the elements of the two schemas based on the first combined degrees of similarity and receiving user feedback relating to a subset of the possible matches and using the user feedback to produce the information relating to a predicted degree of matching accuracy associated with the first weighting vector.
- the first weighting vector is then modified based on the information relating to the predicted degree of matching accuracy to produce a second weighting vector.
- the calculated degrees of similarity are combined using the second weighting vector to produce second combined degrees of similarity, and a modified set of possible matches between the elements of the two schemas is identified based on the second combined degrees of similarity.
- the calculated degrees of similarity are combined by multiplying each calculated degree of similarity for each matching process by the corresponding weighting coefficient to obtain weighted degrees of similarity and summing the weighted degrees of similarity.
- a degree of similarity is calculated between multiple pairs of elements. Each pair of elements includes one element selected from a source schema and one element selected from a target schema.
- Multiple weighting vectors can be used.
- a level of ambiguity is determined for each weighting vector, and a particular weighting vector to determine possible matches between the elements of the two schemas is selected based on the level of ambiguity for each weighting vector.
- a level of ambiguity can be determined by determining a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches.
- a factor is calculated, and the particular weighting vector selected is based on a value of the factor for the particular weighting vector relative to values of the factors for other weighting vectors.
- the particular weighting vector selected can be a weighting vector having a factor that tends to indicate a relatively high number of ambiguous matches or a relatively high number of unambiguous matches.
- the particular weighting vector selected can be a weighting vector having a factor that tends to indicate a relatively low number of ambiguous matches and a relatively low number of impossible matches.
- Unambiguous matches can be determined by identifying a maximum combined degree of similarity for the particular element, or identifying a combined degree of similarity for the particular element that exceeds a predetermined threshold and that exceeds all other combined degrees of similarity for the particular element by at least a predetermined amount.
- Ambiguous matches can be determined by identifying a combined degree of similarity for the particular element that exceeds a first threshold and is less than a second threshold or identifying a combined degree of similarity for the particular element that exceeds a predetermined threshold and that is within a predetermined range of other combined degrees of similarity for the particular element.
- Impossible matches can be identified by determining, for a particular element, that no combined degree of similarity for the particular element exceeds a predetermined minimum threshold.
- the matching processes can include schema-based criteria, content-based criteria, per-element criteria, structural criteria, linguistic criteria, and/or constraint-based criteria.
- User feedback relating to possible matches can be used to modify a first weighting vector to produce a second weighting vector.
- the calculated degrees of similarity can then be combined using the second weighting vector to produce second combined degrees of similarity, and a modified set of possible matches between the elements of the two schemas can be identified based on the second combined degrees of similarity.
- the first weighting vector can be selected based on a context associated with the two schemas and/or a similarity of one or more of the schemas to schemas for which the first weighting vector was previously used.
- the invention can be implemented to realize one or more of the following advantages.
- the invention can be used to provide enhanced matching performance, to improve the quality of matching, and/or, depending on the particular algorithms that are used, to regulate the number and types of possible matches that are identified for manual review and approval.
- the invention can also be used to provide enhanced matching results for unclassified schemas.
- the invention can be used to assist users with manual finishing touches because the system can provide some different mapping examples as suggestions to the user.
- the elements of disparate schemas may be mapped without detailed knowledge of the characteristics of the schemas.
- the techniques provide generic data model matching (i.e., the techniques can perform matching independent of the data model).
- mapping can be performed automatically or at least semi-automatically.
- FIG. 1 is a flow diagram of a process for identifying matches between disparate schemas.
- FIG. 2 is a block diagram of a system for identifying matches between disparate schemas.
- FIG. 3 is an illustrative example of a similarity cube that can be used in the system of FIG. 2 .
- FIG. 4 is an illustrative example of a weighting vector similarity cube.
- FIG. 5 is an illustrative diagram of a technique for categorizing match results into different levels of ambiguity.
- FIG. 1 is a flow diagram of a process 100 for identifying matches between disparate schemas.
- a degree of similarity between elements of two schemas is calculated using multiple different matching techniques (step 105 ).
- a schema can be represented graphically or by a textual description of a logical relationship among different elements of the schema.
- the elements of a schema can be graphs, nodes, vertices, fields, leafs, or branches (i.e., groups of nodes or vertices) of the schema.
- the matching techniques can use matchers that implement particular matching processes. Any number of different types of matching processes can be used.
- the matching processes may be implemented in individual matchers that are schema-based, content-based, type-based, or semantic-based matchers.
- Schema-based matchers consider schema information, while content-based matchers consider instance data within a particular schema.
- Schema-based matchers can include per-element matchers, which can be linguistic (e.g., using element names or descriptions) or constraint-based (e.g., using types or keys).
- Schema-based matchers can also include structural matchers, which match combinations of elements or nodes and may be constraint-based (e.g., graph matchers).
- Content-based matchers can include per-element matchers, which can be linguistic (e.g., using word frequencies or key terms) or constraint-based (e.g., using value patterns and ranges).
- Type-based matchers can include per-element matchers, which can perform matching based on the type of node (e.g., characteristics, facets, regular expressions), and semantic matchers can analyze the semantic context of the definition and name of each node. Matching processes may also be implemented in combined matchers, which may be hybrid (e.g., using multiple match criteria) or composite (e.g., using manually or automatically determined combinations of results from different match algorithms). One or more of these different matching techniques can be used in step 105 . Other types of matchers that are known or that may be developed in the future can also be used.
- Each matching technique produces results that indicate a degree of similarity between an element in a first schema and an element in a second schema. For example, for every pair of elements between the two schemas, a matching technique may assign a value between zero and one, which indicates a probability estimate that the two elements match, with a value of zero indicating an absolute impossibility and a value of one indicating an absolute certainty of a match.
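As a minimal sketch of what one such matching technique might look like, the following hypothetical linguistic per-element matcher compares element names and returns a value between zero and one for every source/target pair. The function name `name_similarity`, the sample element names, and the use of Python's `difflib` string ratio are all assumptions made for illustration; a string ratio is only a rough stand-in for a probability estimate.

```python
from difflib import SequenceMatcher


def name_similarity(source_name: str, target_name: str) -> float:
    """Return a value in [0, 1] based on the element names alone."""
    return SequenceMatcher(None, source_name.lower(), target_name.lower()).ratio()


# Hypothetical element names from a source schema and a target schema.
source_elements = ["CustomerName", "PostalCode"]
target_elements = ["customer_name", "zip"]

# One degree of similarity for every pair of elements between the two schemas.
similarities = {
    (s, t): name_similarity(s, t)
    for s in source_elements
    for t in target_elements
}
```

A name-only matcher of this kind rates "PostalCode" and "zip" as dissimilar even though they may well match, which is one reason to combine the results of several matchers.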
- the calculated degrees of similarity are then combined using one or more weighting vectors to provide composite match results (step 110 ).
- Each weighting vector includes multiple weighting coefficients, with each weighting coefficient corresponding to a particular matching process. By multiplying each degree of similarity for a specific matching process by the corresponding weighting coefficient, the degree of similarity can be weighted to provide more or less of a contribution relative to other matching processes.
- the weighted degree of similarity for the specific matching process is then added to the weighted degrees of similarity for the other matching processes to obtain a combined degree of similarity.
- Each possible pairing of elements thus has a corresponding combined degree of similarity.
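The weighted sum described above can be sketched as follows for a single pair of elements. The three matchers, their similarity values, and the weighting coefficients are hypothetical; having the coefficients sum to one is an assumption that keeps the combined value between zero and one, not something the text requires.

```python
def combine(similarities, weights):
    """Weighted sum of the per-matcher degrees of similarity for one element pair."""
    if len(similarities) != len(weights):
        raise ValueError("one weighting coefficient is needed per matching process")
    return sum(w * s for w, s in zip(weights, similarities))


# Degrees of similarity for one pair from three hypothetical matchers
# (name, structure, data type) and a weighting vector that favors the name matcher.
pair_similarities = [0.9, 0.4, 0.5]
weighting_vector = [0.6, 0.2, 0.2]

combined = combine(pair_similarities, weighting_vector)
```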
- the weighting vector will typically provide relatively more accurate or less accurate results (e.g., compared to a different weighting vector or an even weighting of all calculated degrees of similarity).
- the initial weighting vector or vectors that are used may be selected based on characteristics of the schema to be matched.
- parameters relating to the schema and/or the matching process can be manually input into, or automatically generated by (e.g., by performing an automated analysis of the schema's structure, type, etc.), a system that performs the matching. These parameters can be used to influence which weighting vector or vectors are initially selected.
- the parameters may relate to, e.g., the schema domain, a context of the schema and/or the matching process, etc.
- a schema that is similar to a previously mapped schema is assigned a weighting vector that is the same as or otherwise corresponds to (e.g., a modified or tuned weighting vector, as described below) the weighting vector for the previously mapped schema.
- Parameters that relate to the context of the schema can also affect the weighting vectors. For example, if a specific schema comes from a specific industry (e.g., automotive), the weighting vectors can be adjusted according to the requirements of the specific industry. Different industries may have different specific requirements for the matching process and thus the weighting vectors may be adjusted in accordance with these requirements.
- Context drivers can include, for example: a business process type, a business document type, an industry category, a product category, a geopolitical area, and/or a system type.
- Which weighting vectors are used for particular contexts can be manually preprogrammed or can be selected based on an automated or partially automated tuning process, through which weighting vectors used in a particular context are adjusted through a “learning” process and the adjusted weighting vectors are subsequently used for matching other schemas with the same context.
- the weighting coefficients are tuned using information relating to a predicted degree of matching accuracy associated with the one or more weighting vectors (step 115 ).
- the weight coefficients can be adjusted based on one or more predicted degrees of matching accuracy, or a specific weighting vector can be selected over other possible weighting vectors based on a comparison of predicted degrees of matching accuracy for the various possible weighting vectors.
- the adjustment can be performed by a user, after receiving the comparison results, or automatically by analyzing other comparison results, in which similar schema structures are mapped.
- the predicted degree of matching accuracy is a calculation of a level of ambiguity associated with a particular weighting vector.
- the combined degree of similarity for a particular pair of elements (i.e., an element from a source schema and a potentially matching element from a target schema)
- the level of ambiguity can be calculated based on a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches.
- An ambiguous match generally means that a statistical possibility exists that the pair of elements actually match.
- multiple ambiguous matches can be associated with a particular element.
- a particular source element might have several statistically possible matches in a target schema.
- Each of the statistically possible matches can be an ambiguous match.
- an unambiguous match generally means that it is at least statistically probable that the pair of elements actually matches.
- an impossible match generally means that it is statistically improbable or impossible that the pair of elements actually match.
- an unambiguous match can be defined by combined degrees of similarity for which the maximum probability of a match, among all possible matches, exceeds 70%.
- an impossible match can be defined by combined degrees of similarity for which the maximum probability of a match, among all possible matches, is less than 50%.
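A minimal sketch of this categorization, using the 70% and 50% example thresholds, is shown below. Treating every element whose maximum falls between the two thresholds as ambiguous is an assumption; the function name `classify` and the sample values are hypothetical.

```python
UNAMBIGUOUS_MIN = 0.70  # example threshold from the text
IMPOSSIBLE_MAX = 0.50   # example threshold from the text


def classify(combined_for_element):
    """Categorize one element from its combined degrees of similarity.

    combined_for_element maps each candidate partner element to the
    combined degree of similarity for that pairing.
    """
    best = max(combined_for_element.values(), default=0.0)
    if best > UNAMBIGUOUS_MIN:
        return "unambiguous"
    if best < IMPOSSIBLE_MAX:
        return "impossible"
    # Assumption: anything between the two thresholds counts as ambiguous.
    return "ambiguous"


category = classify({"amount": 0.62, "total": 0.58})
```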
- Classifying a match as unambiguous does not necessarily mean that two identified elements actually do match, just that the particular matching process (or combination of processes) used to predict matches generates matching results that suggest a statistical probability of a match. Similarly, classifying a match as impossible does not necessarily mean that a match does not exist, just that the particular matching process (or combination of processes) used to predict matches is unable to predict a match with a sufficient degree of confidence.
- Matches between two schemas can be categorized based on combined degrees of similarity in both directions or in only one direction (i.e., from a source to a target schema). For example, if matching is performed in both directions, a particular pair of elements may be identified as unambiguous only if the pair of elements meet the criteria for an unambiguous match in both directions (e.g., target element t and source element s represent an unambiguous match only if the corresponding probability of a match: (a) exceeds 70%, (b) is the maximum probability associated with target element t for all possible source elements, and (c) is the maximum probability associated with source element s for all possible target elements).
- the particular pair of elements may be identified as unambiguous if the pair of elements meet the criteria for an unambiguous match in only one direction (e.g., target element t and source element s represent an unambiguous match if the corresponding probability of a match exceeds 70% and is the maximum probability associated with target element t for all possible source elements, but is not necessarily the maximum probability associated with source element s for all possible target elements).
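The difference between the two-direction and one-direction criteria can be sketched as follows. The function `is_unambiguous`, the element names, and the similarity values are hypothetical; the 70% threshold is the example value used above.

```python
def is_unambiguous(combined, s, t, threshold=0.70, both_directions=True):
    """combined maps (source, target) pairs to combined degrees of similarity."""
    value = combined[(s, t)]
    # Maximum associated with target t over all possible source elements.
    best_for_target = max(v for (src, tgt), v in combined.items() if tgt == t)
    # Maximum associated with source s over all possible target elements.
    best_for_source = max(v for (src, tgt), v in combined.items() if src == s)
    if value <= threshold or value < best_for_target:
        return False
    if both_directions:
        return value >= best_for_source
    return True


combined = {
    ("s1", "t1"): 0.80, ("s1", "t2"): 0.90,
    ("s2", "t1"): 0.30, ("s2", "t2"): 0.85,
}
```

In this example, the pair (s1, t1) satisfies the one-direction criteria because 0.80 is the maximum for t1, but it fails the two-direction criteria because s1 has a higher value with t2.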
- the values of U, A, I, and N can correspond to the number of target elements, source elements, or total elements that fit into each category. Generally, the values of U, A, I, and N should be expressed in the same units (e.g., if U is the number of target elements that are classified as unambiguous, then A, I, and N should be expressed as a number of target elements, rather than a number of source elements or total elements).
- the value of a for the particular weighting vector can then be compared to the value of a for other predefined weighting vectors to find the lowest overall level of ambiguity a.
- the weighting coefficients can be adjusted using an adjustment algorithm to optimize or improve (e.g., reduce) the overall level of ambiguity a.
- the calculated overall level of ambiguity can serve as a measure of a predicted degree of matching accuracy for weighting vectors.
- the goal may be to reduce the overall level of ambiguity a as much as possible, thereby favoring weighting vectors that minimize the number of ambiguous matches.
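The selection of a weighting vector by lowest overall level of ambiguity can be sketched as below. The formula relating U, A, I, and N to the overall level of ambiguity a is not reproduced in this text, so the sketch uses a stand-in, a = A / N (the share of elements categorized as ambiguous), which is purely an assumption. The vector names and category labels are hypothetical, and all counts are expressed per target element.

```python
from collections import Counter


def count_categories(levels):
    """levels holds one category label per target element; returns (U, A, I, N)."""
    counts = Counter(levels)
    return counts["unambiguous"], counts["ambiguous"], counts["impossible"], len(levels)


def ambiguity(levels):
    # Stand-in formula (an assumption, not the source's): a = A / N.
    _, a_count, _, n = count_categories(levels)
    return a_count / n if n else 0.0


# Hypothetical categorization results for two predefined weighting vectors.
results_by_vector = {
    "w0": ["unambiguous", "ambiguous", "ambiguous", "impossible"],
    "w1": ["unambiguous", "unambiguous", "ambiguous", "impossible"],
}

best_vector = min(results_by_vector, key=lambda name: ambiguity(results_by_vector[name]))
```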
- it may be desirable to reduce (or increase) the number of impossible assignments, to reduce (or increase) the number of unambiguous matches, or to perform some combination of these alternatives (e.g., to reduce the number of unambiguous matches while increasing (or maximizing) the number of ambiguous matches).
- a matching process such as process 100
- the tool produces proposed mappings that are reviewed by a user to approve or reject each individual mapping and/or to identify mappings that may not have been proposed by the tool.
- the tool can present the proposed mappings to the user on a user interface that distinguishes between mappings that are unambiguous, ambiguous, or impossible.
- unambiguous results can be color-coded in green, ambiguous matches in yellow, and impossible matches in red.
- the tool may be used to reduce the workload of the user by reducing the number of ambiguous matches. In other cases, the tool may be used to reduce the number of unambiguous matches to prevent the possibility that the user will incorrectly assume that the tool made a correct mapping.
- the tool may be used for different purposes at different stages of a mapping procedure.
- the tool may be initially used to minimize the number of ambiguous matches. Subsequently, after the user has approved some of the proposed matches, settings for the tool can be changed to favor minimizing the number of unambiguous matches.
- the results of the composite matcher can also be influenced by adjusting threshold levels or other criteria for determining whether pairs of elements represent unambiguous, ambiguous, or impossible matches.
- the categorization among ambiguous, unambiguous, and impossible matches is arbitrary in that the categories can be defined differently for different implementations (e.g., what constitutes an unambiguous match can differ between different implementations or even in the same implementation depending on other characteristics of the element).
- the criteria used to categorize a particular combined degree of similarity as ambiguous, unambiguous, or impossible can be selected by a developer (e.g., programmer) of software that implements the process 100 or can be set by a user of such software.
- unambiguous matches and impossible matches do not necessarily require a probabilistic certainty.
- a fewer or greater number of levels can also be defined.
- some implementations may use only the ambiguous and impossible match categorizations, while other implementations may categorize the combined degrees of similarity into a greater number of different levels of ambiguity (e.g., unambiguous, mildly ambiguous, moderately ambiguous, highly ambiguous, and impossible).
- Other techniques for determining a level of ambiguity associated with a particular weighting vector can also be used (e.g., using an algorithm that performs computations using some or all of the combined degrees of similarity).
- the predicted degree of matching accuracy can be based on feedback from a user.
- the combined degrees of similarity generally provide composite match results that indicate which pairs of elements between the source and target schemas are likely and/or unlikely to represent actual matches.
- a user can review a subset (e.g., ten possible matches or 5% of the possible matches) of the total set of possible matches and provide feedback regarding whether the possible matches in the subset represent actual matches. This feedback can be used to modify the weighting vector.
- the correct matches identified by the user can be compared with results of the various matching processes to determine correlations (i.e., which matching processes were most likely to predict the correct match).
- the weighting vector can then be adjusted to more heavily weight the matching processes that showed the greatest correlations.
- the adjusted weighting vector can then be used to generate new combined degrees of similarity.
- the user feedback on a subset of the possible matches provides a measure of a predicted degree of matching accuracy for weighting vectors.
- the use of user feedback to adjust the weighting vector can be applied iteratively, such that the matching process continuously “learns” how to better predict matches between the particular schemas being mapped.
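One possible way to turn user approvals and rejections into a weight adjustment is sketched below. The update rule is an assumption chosen for illustration: each matcher's coefficient is scaled according to how much higher it scored the approved pairs than the rejected pairs, and the vector is then renormalized. The function `tune_weights`, the `rate` parameter, and the feedback data are hypothetical.

```python
def tune_weights(weights, feedback, rate=0.5):
    """feedback is a list of (per_matcher_similarities, approved) tuples."""
    approved = [sims for sims, ok in feedback if ok]
    rejected = [sims for sims, ok in feedback if not ok]

    def mean(rows, k):
        return sum(r[k] for r in rows) / len(rows) if rows else 0.0

    adjusted = []
    for k, w in enumerate(weights):
        # Positive when matcher k rates approved pairs above rejected pairs.
        agreement = mean(approved, k) - mean(rejected, k)
        adjusted.append(max(0.0, w * (1.0 + rate * agreement)))

    total = sum(adjusted)
    return [w / total for w in adjusted] if total else list(weights)


# Two matchers; the first agrees with the user's decisions, the second does not.
feedback = [
    ([0.9, 0.5], True),
    ([0.8, 0.4], True),
    ([0.2, 0.6], False),
    ([0.1, 0.5], False),
]
tuned = tune_weights([0.5, 0.5], feedback)
```

Calling `tune_weights` again after each new batch of feedback gives the iterative behavior described above.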
- the settings of the weighting vectors are changed according to feedback from the user.
- the user can influence the weighting of each type of matcher. For example, if the user indicates that the matching results based on name or definition are mostly wrong, then the weighting of the semantic or name matcher will be changed.
- User feedback can also be used to fine-tune a weighting vector that is selected from one or more candidate weighting vectors using a calculated level of ambiguity. For example, by identifying a particular weighting vector having a lowest calculated level of ambiguity among a set of predefined weighting vectors, the particular weighting vector can be selected as a “best” candidate for producing matching proposals. The particular weighting vector can then be fine-tuned by adjusting the weighting coefficients based on feedback from a user.
- the performance of a particular matching process can be assessed based on certain metrics.
- the precision of the matching process is a measure of the reliability of the proposed matches and can be calculated as the number of correct matches divided by the total number of proposed matches.
- the recall, precision, and overall measurements can only be calculated once all correct matches are known. Thus, these metrics do not generally provide an estimate of performance for a generic matching process.
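These metrics can be sketched as follows once the correct matches are known. Precision follows the definition given above. The definitions of recall (correct proposed matches divided by all actual matches) and of the overall measurement (correct proposals minus incorrect proposals, divided by all actual matches) are the ones commonly used in the schema matching literature and are assumptions here, since this text does not spell them out. The sample pairs are hypothetical.

```python
def match_metrics(proposed, actual):
    """proposed and actual are sets of (source, target) pairs."""
    correct = len(proposed & actual)
    precision = correct / len(proposed) if proposed else 0.0
    # Assumed definitions (not stated in the text):
    recall = correct / len(actual) if actual else 0.0
    incorrect = len(proposed) - correct
    overall = (correct - incorrect) / len(actual) if actual else 0.0
    return precision, recall, overall


proposed = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
actual = {("a", "x"), ("b", "y"), ("c", "q")}

precision, recall, overall = match_metrics(proposed, actual)
```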
- the process 100 can be used to predict whether a particular weighting vector will produce results with a favorable overall measurement and, thus, can be used to improve performance.
- FIG. 2 is a block diagram of a system 200 for identifying matches between disparate schemas.
- a source schema 205 and a target schema 210 represent schemas to be mapped to one another (or from one to the other).
- Multiple different matchers 215 ( 1 ), 215 ( 2 ), . . . 215 ( n ) are used to calculate degrees of similarity between elements of the two schemas 205 and 210 .
- the calculated degrees of similarity are stored in a similarity cube 220 , which can be maintained in a buffer or a memory.
- the similarity cube 220 includes a storage location for each combination of a matcher 215 , a source schema 205 element, and a target schema 210 element.
- the similarity cube 220 can include storage locations that, conceptually, have x, y, and z coordinates.
- FIG. 3 illustrates an example of a similarity cube 220 .
- Each level 325 in the z direction 330 represents a different matcher 215 (e.g., matcher 215 ( 1 ), matcher 215 ( 2 ), . . . matcher 215 ( n )).
- a degree of similarity can be calculated for each source schema element-target schema element pair, as analyzed by each different matcher 215 , and the degree of similarity can be stored in a storage location 335 corresponding to the source schema element, the target schema element, and the matcher 215 .
- a branch of the source schema 205 might include elements that exclusively store text data. The possibility that such a branch matches a branch of the target schema 210 having elements that exclusively store floating-point numbers can be easily rejected. As a result, degrees of similarity do not need to be calculated for elements in these branches, and the similarity cube may include empty storage locations.
- Which element pairs can be omitted from the degree of similarity calculation can be determined on a matcher-by-matcher basis (e.g., one matcher calculates a degree of similarity while another does not) or for all matchers 215 (e.g., a particular element pair is omitted from the degree of similarity calculation for all matchers 215 ).
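A similarity cube with empty storage locations could be represented, for example, as a sparse mapping keyed by matcher, source element, and target element. The class name `SimilarityCube`, its methods, and the sample elements below are hypothetical and only illustrate the idea.

```python
class SimilarityCube:
    """Stores one degree of similarity per (matcher, source, target) triple."""

    def __init__(self):
        self._cells = {}

    def store(self, matcher, source, target, similarity):
        self._cells[(matcher, source, target)] = similarity

    def get(self, matcher, source, target):
        # None marks an empty storage location, i.e. a pair that was
        # omitted from the calculation for this matcher.
        return self._cells.get((matcher, source, target))


cube = SimilarityCube()
cube.store("name", "price", "amount", 0.3)
cube.store("type", "price", "amount", 1.0)
# The pair ("comment", "amount") is never stored: a text element and a
# floating-point element are rejected before any matcher runs.
```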
- the calculated degrees of similarity from the similarity cube 220 are combined by a match results combiner 225 in accordance with one or more weighting vectors.
- the calculated degrees of similarity from each matcher 215 are weighted by a weighting coefficient corresponding to the matcher 215 , and the weighted degrees of similarity for each element pair are added together.
- Each weighting coefficient represents a level of importance for the calculated degree of similarity relative to the calculations from other matchers.
- all of the calculated degrees of similarity for a particular matcher are given the same weight. Accordingly, the weighting vector is used to attribute greater importance to some matchers relative to others.
- Ontology information (e.g., information about a classification of each schema) can also be used, if available, to obtain match results.
- the combined degrees of similarity are used to identify which element pairs are likely to match, might match, or are unlikely to match.
- the likely or possible matches can be used to generate at least a partial mapping of elements between schemas (e.g., from the source schema 205 to the target schema 210 , from the target schema 210 to the source schema 205 , or both).
- a “threshold” selection algorithm identifies all element pairs with a combined degree of similarity over a certain threshold.
- a “MaxN” type of selection algorithm identifies the n largest combined degrees of similarity, where n is an integer greater than or equal to one, and a “Max Delta” type of selection algorithm identifies: (a) the element pair with the largest combined degree of similarity, and (b) all element pairs having a combined degree of similarity within some delta value of the largest value.
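The three selection algorithms can be sketched as below. The function names and sample values are hypothetical, and whether the selection is applied per element or across all pairs is left open here as an implementation choice.

```python
def select_threshold(combined, threshold):
    """All element pairs with a combined degree of similarity over the threshold."""
    return {pair for pair, value in combined.items() if value > threshold}


def select_max_n(combined, n):
    """The n element pairs with the largest combined degrees of similarity."""
    ranked = sorted(combined, key=combined.get, reverse=True)
    return set(ranked[:n])


def select_max_delta(combined, delta):
    """The best pair plus every pair within delta of the best value."""
    if not combined:
        return set()
    top = max(combined.values())
    return {pair for pair, value in combined.items() if top - value <= delta}


# Combined degrees of similarity for one source element "s".
combined = {("s", "t1"): 0.90, ("s", "t2"): 0.85, ("s", "t3"): 0.40}
```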
- a set of combined degrees of similarity for a specific weighting vector can be used as an initial estimation for predicting matches or can simply be compared to combined degrees of similarity for other weighting vectors to narrow the selection of weighting vectors.
- the weighting coefficients are tuned to obtain an improved mapping of the schemas and/or to improve the identification of likely or probable matches.
- FIG. 4 illustrates an example of a weighting vector similarity cube 400 .
- each row 405 in the x direction 410 represents a different source schema 205 element.
- each column 415 in the y direction 420 represents a different target schema 210 element.
- each level 425 in the z direction 430 represents a different weighting vector (w 0 , w 1 , . . . w j ).
- each storage location 435 in the weighting vector similarity cube 400 contains a combination of the degrees of similarity for the corresponding source schema element and target schema element across all of the matchers 215 (e.g., a weighted combination of the storage locations 335 in the z-direction 330 from FIG. 3 ).
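One way to picture the weighting vector similarity cube is as a list of levels, one per weighting vector, each holding a source-by-target matrix of combined degrees of similarity. The sketch below assumes a nested-list layout (`similarity_cube[m][i][j]` for matcher m, source element i, target element j); that layout and all names are illustrative assumptions, not taken from the source.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def build_weighting_vector_cube(similarity_cube: Sequence[Matrix],
                                weighting_vectors: Sequence[Sequence[float]]) -> List[Matrix]:
    """Return one level per weighting vector; each storage location holds the
    weighted combination of the per-matcher degrees of similarity for that pair."""
    n_src = len(similarity_cube[0])
    n_tgt = len(similarity_cube[0][0])
    levels: List[Matrix] = []
    for weights in weighting_vectors:
        if len(weights) != len(similarity_cube):
            raise ValueError("each weighting vector needs one coefficient per matcher")
        level = [
            [sum(w * similarity_cube[m][i][j] for m, w in enumerate(weights))
             for j in range(n_tgt)]
            for i in range(n_src)
        ]
        levels.append(level)
    return levels


# Hypothetical similarity cube: two matchers, two source and two target elements.
similarity_cube = [
    [[1.0, 0.0], [0.5, 0.5]],  # matcher 0
    [[0.0, 1.0], [0.5, 1.0]],  # matcher 1
]
weighting_vectors = [[1.0, 0.0], [0.5, 0.5]]  # w0 and w1
cube = build_weighting_vector_cube(similarity_cube, weighting_vectors)
```

Each entry of `cube` corresponds to one level in the z direction and can be compared against the other levels.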
- Each level of the weighting vector similarity cube 400 can be compared to the other levels to identify one or more weighting vectors that provide the most desirable results according to a measure of ambiguity in the results.
- the measure of ambiguity that is most desirable and how the measure of ambiguity is defined can be selected by a user of the system 200 or can be predefined in the system 200 .
- a weighting vector that provides a minimum number of ambiguous matches and a minimum number of impossible matches relative to other weighting vectors may be selected as the most desirable.
- the weighting vector with weighting coefficients that produce the most desirable results can be selected, thereby performing a tuning operation.
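As a rough sketch of this selection step, the snippet below picks, from a set of candidate weighting vectors, the one with the fewest ambiguous plus impossible matches. Summing the two counts into a single score is an assumption made for illustration; the text leaves the exact measure of ambiguity open. The `Counts` type and the sample numbers are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple

WeightingVector = Tuple[float, ...]


class Counts(NamedTuple):
    ambiguous: int
    unambiguous: int
    impossible: int


def select_weighting_vector(counts_by_vector: Dict[WeightingVector, Counts]) -> WeightingVector:
    """Pick the vector minimizing ambiguous + impossible matches (assumed score)."""
    return min(
        counts_by_vector,
        key=lambda w: counts_by_vector[w].ambiguous + counts_by_vector[w].impossible,
    )


# Hypothetical match counts for three candidate weighting vectors.
counts_by_vector = {
    (0.5, 0.5): Counts(ambiguous=3, unambiguous=5, impossible=2),
    (0.8, 0.2): Counts(ambiguous=1, unambiguous=8, impossible=1),
    (0.2, 0.8): Counts(ambiguous=4, unambiguous=4, impossible=2),
}
best_vector = select_weighting_vector(counts_by_vector)
```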
- tuning is performed by selecting a particular weighting vector among a limited set of weighting vectors defined prior to performing the tuning operation.
- tuning is performed by generating new weighting coefficients (e.g., identifying one or more additional candidate weighting vectors) after making an initial selection of a weighting vector. For example, when only one weighting vector is initially used to calculate combined degrees of similarity, the weighting coefficients for the weighting vector can be modified or tuned after obtaining the initial results. As shown in FIG. 2 , tuning can be performed based on user feedback (as received at 230 ) and/or based on one or more calculated levels of ambiguity. For example, the results associated with several weighting vectors may tend to indicate trends in how weighting coefficients affect levels of ambiguity. By analyzing such trends, fine-tuning of a weighting vector can be performed.
- optional user feedback involves approving or rejecting matches proposed by the match results combiner 225 .
- the user feedback can be used to generate a final mapping 245 of elements between the source schema 205 and the target schema 210 .
- the user feedback can be used to fine-tune the mapping results.
- additional match iterations are performed.
- Subsequent match iterations may involve re-executing at least some of the matchers 215 , such as when some of the matchers 215 themselves are hybrid matchers that take into account user feedback.
- subsequent match iterations do not impact the results produced by the matchers 215 or the corresponding degree of similarity information stored in the similarity cube 220 and, thus, do not involve any re-execution of the matchers 215 .
- Such match iterations can instead involve merely looping back to the match results combiner 225 (as indicated at 240 ).
- the weight vectors applied in the match results combiner 225 can be adjusted in an attempt to produce more desirable matching results (e.g., a lower measure of ambiguity, results that have a higher percentage of correct matches, results that have a lower percentage of incorrect matches, results that identify a correct match as one of the possible matches, etc.).
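The source does not specify how user feedback is turned into adjusted weights, so the following is only one possible sketch. It assumes that candidate weighting vectors have already been applied (one matrix of combined degrees of similarity per candidate) and scores each candidate by how often its top-ranked target for a source element was approved rather than rejected by the user. The scoring rule and all names are assumptions.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]
Pair = Tuple[int, int]  # (source index, target index)


def agreement(level: Matrix, approved: Set[Pair], rejected: Set[Pair]) -> int:
    """+1 for each source whose top-ranked target was approved, -1 if it was rejected."""
    score = 0
    for i, row in enumerate(level):
        top = max(range(len(row)), key=lambda j: row[j])
        if (i, top) in approved:
            score += 1
        elif (i, top) in rejected:
            score -= 1
    return score


def pick_by_feedback(levels: List[Matrix], approved: Set[Pair], rejected: Set[Pair]) -> int:
    """Return the index of the candidate weighting vector that best agrees with feedback."""
    return max(range(len(levels)), key=lambda k: agreement(levels[k], approved, rejected))


# Hypothetical combined degrees of similarity for two candidate weighting vectors.
levels = [
    [[0.9, 0.2], [0.6, 0.4]],  # candidate 0 proposes (0, 0) and (1, 0)
    [[0.7, 0.3], [0.3, 0.8]],  # candidate 1 proposes (0, 0) and (1, 1)
]
approved = {(0, 0), (1, 1)}
rejected = {(1, 0)}
chosen = pick_by_feedback(levels, approved, rejected)
```

Because only the combiner is re-run here, the per-matcher degrees of similarity stay untouched, which matches the case where match iterations merely loop back to the match results combiner.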
- What defines desirable matching results can depend on the particular environment in which the system is used, the types of schemas on which the system operates, user-selected settings, and/or settings that are predefined in the system 200 .
- FIG. 5 is an illustrative diagram of a technique 500 for categorizing match results into different levels of ambiguity.
- a calculation of a degree of similarity between each pair of elements in a source schema 505 and a target schema 510 results in a factor between zero and one hundred percent, with the factor reflecting a percent likelihood that the element pair matches, as determined by the particular matching process used.
- the categorization technique 500 is used for matching processes that involve a weighted combination of other matching processes, but the categorization technique 500 can be applied to any type of matching process.
- although the categorization technique 500 is discussed below in the direction of finding elements in the target schema 510 that match elements in the source schema 505 , the technique can alternatively or additionally be used for categorizing matches in the opposite direction.
- each source schema 505 element for which the maximum calculated degree of similarity 515 among all possible matches for the source schema 505 element is less than a first threshold value 520 equal to 0.3 (i.e., thirty percent) is considered to be an impossible match. In other words, it is impossible for the matching process to predict a match involving the source schema 505 element.
- Each source schema 505 element for which the maximum calculated degree of similarity among all possible matches for the source schema 505 element is greater than the first threshold value (or a larger, second threshold value) and is greater than the next largest calculated degree of similarity for the source schema 505 element by at least a value Δt 525 is considered to be an unambiguous match.
- each source schema 505 element for which at least two calculated degrees of similarity are greater than the first threshold value and are within a range value 530 equal to 0.1 (i.e., a ten percent interval) of the maximum calculated degree of similarity for the source schema 505 element is considered to be an ambiguous match.
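The three categories can be sketched as a function over the degrees of similarity calculated for one source element. The first threshold (0.3) and range value (0.1) come from the example above; the default for Δt (0.2), the order in which the rules are checked, and the extra "uncategorized" label for rows that satisfy none of the rules are assumptions made for this illustration.

```python
from typing import Sequence


def categorize(row: Sequence[float],
               first_threshold: float = 0.3,
               delta_t: float = 0.2,       # assumed value; not given in the text
               range_value: float = 0.1) -> str:
    """Categorize one source element from its degrees of similarity to all targets."""
    if not row:
        raise ValueError("row must contain at least one degree of similarity")
    ranked = sorted(row, reverse=True)
    best = ranked[0]
    if best < first_threshold:
        return "impossible"
    runner_up = ranked[1] if len(ranked) > 1 else None
    if best > first_threshold and (runner_up is None or best - runner_up >= delta_t):
        return "unambiguous"
    # Count values above the threshold that lie within range_value of the maximum
    # (the maximum itself is one of them).
    close = [v for v in ranked if v > first_threshold and best - v <= range_value]
    if len(close) >= 2:
        return "ambiguous"
    return "uncategorized"  # assumption: the rules above need not cover every row


categories = [
    categorize([0.1, 0.2]),    # maximum below the first threshold
    categorize([0.9, 0.5]),    # clear winner
    categorize([0.8, 0.75]),   # two close candidates
    categorize([0.8, 0.65]),   # gap between range_value and delta_t
]
```

Counting how many source elements fall into each category then gives the inputs for a measure of ambiguity.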
- the number of ambiguous, impossible, and/or unambiguous matches can be used to calculate a measure of ambiguity.
- the measure of ambiguity can, in turn, be used to compare the weighting vector used to generate the matching results with other weighting vectors or to otherwise tune the weighting vector (e.g., by comparing the measure of ambiguity with corresponding measures for similar weighting vectors in which the weighting coefficients have been adjusted).
- the invention and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them.
- the invention can be implemented as one or more computer program products, i.e., one or more computer programs tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers.
- a computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file.
- a program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described herein, including the method steps of the invention, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the invention by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read-only memory or a random access memory or both.
- the essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
- Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.
- the invention can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- the invention can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention), or any combination of such back-end, middleware, and front-end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Methods and apparatus, including computer program products, for identifying matches between disparate schemas calculate a degree of similarity between elements of two schemas using each of multiple matching processes. The calculated degrees of similarity are combined using a first weighting vector to produce first combined degrees of similarity. The first weighting vector includes multiple weighting coefficients and each weighting coefficient corresponds to one of the matching processes. The weighting coefficients are tuned using information relating to a predicted degree of matching accuracy associated with the first weighting vector.
Description
- The present invention relates to data processing by digital computer, and more particularly to mapping elements between disparate schemas.
- Integration of applications in an enterprise can lead to more efficient operations. Enterprise application integration can require significant effort when migrating from disparate legacy applications to a more integrated framework. Enterprise application integration can be performed using a message exchange procedure, in which messages are exchanged between different data sets. Application data is typically organized according to the type of application or applications with which the data is designed to operate. As a result, the organization or structure of the data can be highly specialized. The messages used for enterprise application integration are generally structured sets of data in a well-defined syntax. The structure of the data can be referred to as its schema. Countless different schemas and/or schema domains (e.g., SQL DDL, XML-based dialects (such as xCBL), OWL, RDF, ODMG, SAP-IDoc, EDI, UBL, etc.) exist. Many different integration scenarios (e.g., business process integration, enterprise application integration, and master data management) require schema matching, in which a mapping between the elements of two schemas is produced. Schema matching can also be important in data translation applications (e.g., where data from a first database is migrated into a second database for use with a different application).
- Existing techniques for schema matching primarily rely upon manual mapping of elements from one schema to another. Some approaches exist, however, for partially automating the schema matching process using simple algorithms for field name or database structure matching or using machine learning technologies. Some approaches combine the criteria of different matching algorithms to produce a more complex matching technique (i.e., hybrid and composite matchers). Simple, hybrid, and composite matchers, however, are inflexible and tend to produce good results for some types of schemas while producing poor results for other types of schemas.
- Techniques have also been proposed for building ontologies for different schema domains. By building an ontology, schemas can be classified by type, and different weights can be applied to different individual matchers based on the class or classes of the schemas to be matched. For example, schemas in a first classification may use a composite matcher that heavily weights the contribution of a field name matcher that is a component of the composite matcher, while schemas in a second classification may use a composite matcher that heavily weights the contribution of a structural matcher that is a component of the composite matcher. Such an approach may provide improved performance relative to conventional simple, hybrid, or composite matchers but only works for schema domains that have previously been associated with a particular class of schema domains.
- The present invention provides methods and apparatus, including computer program products, that implement techniques for mapping schemas by tuning the relative contributions of different component matchers. The relative contributions (i.e., the weights) of different matchers can be tuned by optimizing a measure of ambiguity, which may be an algorithm that is based on a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches. In addition or as an alternative, the relative contributions of different matchers can be tuned by monitoring user interaction (e.g., user approvals and rejections of proposed matches) and using the user feedback to fine-tune the weights of the different matchers.
- In one general aspect, the techniques feature calculating a degree of similarity between elements of two schemas using each of multiple matching processes and combining the calculated degrees of similarity using a first weighting vector to produce first combined degrees of similarity. The first weighting vector includes multiple weighting coefficients and each weighting coefficient corresponds to one of the matching processes. The weighting coefficients are tuned using information relating to a predicted degree of matching accuracy associated with the first weighting vector.
- The invention can be implemented to include one or more of the following advantageous features. The calculated degrees of similarity are combined using each of multiple weighting vectors. Each weighting vector includes multiple weighting coefficients, and each weighting coefficient corresponds to one of the matching processes. The weighting coefficients are tuned by determining, using the combined degrees of similarity for each of the weighting vectors, a predicted degree of matching accuracy associated with each of the weighting vectors. A second weighting vector is selected to determine possible matches between the elements of the two schemas. The second weighting vector is selected based on a comparison of information relating to the respective predicted degrees of matching accuracy associated with the first weighting vector and the second weighting vector. Each predicted degree of matching accuracy is determined using a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches.
- The weighting coefficients are tuned by identifying a set of possible matches between the elements of the two schemas based on the first combined degrees of similarity and receiving user feedback relating to a subset of the possible matches and using the user feedback to produce the information relating to a predicted degree of matching accuracy associated with the first weighting vector. The first weighting vector is then modified based on the information relating to the predicted degree of matching accuracy to produce a second weighting vector. The calculated degrees of similarity are combined using the second weighting vector to produce second combined degrees of similarity, and a modified set of possible matches between the elements of the two schemas is identified based on the second combined degrees of similarity.
- The calculated degrees of similarity are combined by multiplying each calculated degree of similarity for each matching process by the corresponding weighting coefficient to obtain weighted degrees of similarity and summing the weighted degrees of similarity. A degree of similarity is calculated between multiple pairs of elements. Each pair of elements includes one element selected from a source schema and one element selected from a target schema.
- Multiple different weighting vectors can be used. A level of ambiguity is determined for each weighting vector, and a particular weighting vector to determine possible matches between the elements of the two schemas is selected based on the level of ambiguity for each weighting vector. A level of ambiguity can be determined by determining a number of ambiguous matches, a number of unambiguous matches, and/or a number of impossible matches. For each weighting vector, a factor is calculated, and the particular weighting vector selected is based on a value of the factor for the particular weighting vector relative to values of the factors for other weighting vectors. The particular weighting vector selected can be a weighting vector having a factor that tends to indicate a relatively high number of ambiguous matches or a relatively high number of unambiguous matches. Alternatively, the particular weighting vector selected can be a weighting vector having a factor that tends to indicate a relatively low number of ambiguous matches and a relatively low number of impossible matches.
- Unambiguous matches can be determined by identifying a maximum combined degree of similarity for the particular element, or identifying a combined degree of similarity for the particular element that exceeds a predetermined threshold and that exceeds all other combined degrees of similarity for the particular element by at least a predetermined amount. Ambiguous matches can be determined by identifying a combined degree of similarity for the particular element that exceeds a first threshold and is less than a second threshold or identifying a combined degree of similarity for the particular element that exceeds a predetermined threshold and that is within a predetermined range of other combined degrees of similarity for the particular element. Impossible matches can be identified by determining, for a particular element, that no combined degree of similarity for the particular element exceeds a predetermined minimum threshold. The matching processes can include schema-based criteria, content-based criteria, per-element criteria, structural criteria, linguistic criteria, and/or constraint-based criteria.
- User feedback relating to possible matches can be used to modify a first weighting vector to produce a second weighting vector. The calculated degrees of similarity can then be combined using the second weighting vector to produce second combined degrees of similarity, and a modified set of possible matches between the elements of the two schemas can be identified based on the second combined degrees of similarity. The first weighting vector can be selected based on a context associated with the two schemas and/or a similarity of one or more of the schemas to schemas for which the first weighting vector was previously used.
- The invention can be implemented to realize one or more of the following advantages. The invention can be used to provide enhanced matching performance, to improve the quality of matching, and/or, depending on the particular algorithms that are used, regulate the number and types of possible matches that are identified for manual review and approval. In addition to providing improved matching results for schemas that previously have been classified, the invention can also be used to provide enhanced matching results for unclassified schemas. In addition, the invention can be used to assist users with manual finishing touches because the system can provide some different mapping examples as suggestions to the user. In other words, the elements of disparate schemas may be mapped without detailed knowledge of the characteristics of the schemas. In this regard, the techniques provide generic data model matching (i.e., the techniques can perform matching independent of the data model). Furthermore, mapping can be performed automatically or at least semi-automatically. One implementation of the invention provides all of the above advantages.
- Details of one or more implementations of the invention are set forth in the accompanying drawings and in the description below. Further features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.
- FIG. 1 is a flow diagram of a process for identifying matches between disparate schemas.
- FIG. 2 is a block diagram of a system for identifying matches between disparate schemas.
- FIG. 3 is an illustrative example of a similarity cube that can be used in the system of FIG. 2.
- FIG. 4 is an illustrative example of a weighting vector similarity cube.
- FIG. 5 is an illustrative diagram of a technique for categorizing match results into different levels of ambiguity.
- Like reference numbers and designations in the various drawings indicate like elements.
- FIG. 1 is a flow diagram of a process 100 for identifying matches between disparate schemas. A degree of similarity between elements of two schemas is calculated using multiple different matching techniques (step 105). Generally, a schema can be represented graphically or by a textual description of a logical relationship among different elements of the schema. The elements of a schema can be graphs, nodes, vertices, fields, leafs, or branches (i.e., groups of nodes or vertices) of the schema.
- The matching techniques can use matchers that implement particular matching processes. Any number of different types of matching processes can be used. For example, the matching processes may be implemented in individual matchers that are schema-based, content-based, type-based, or semantic-based matchers. Schema-based matchers consider schema information, while content-based matchers consider instance data within a particular schema. Schema-based matchers can include per-element matchers, which can be linguistic (e.g., using element names or descriptions) or constraint-based (e.g., using types or keys). Schema-based matchers can also include structural matchers, which match combinations of elements or nodes and may be constraint based (e.g., graph matchers). Content-based matchers can include per-element matchers, which can be linguistic (e.g., using word frequencies or key terms) or constraint-based (e.g., using value patterns and ranges). Type-based matchers can include per-element matchers, which can perform matching based on the type of node (e.g., characteristics, facets, regular expressions), and semantic matchers can analyze the semantical context of the definition and name of each node. Matching processes may also be implemented in combined matchers, which may be hybrid (e.g., using multiple match criteria) or composite (e.g., using manually or automatically determined combinations of results from different match algorithms). One or more of these various different matching techniques can be used in step 105. Other types of matchers that are known or that may be developed in the future can also be used.
- Each matching technique produces results that indicate a degree of similarity between an element in a first schema and an element in a second schema. For example, for every pair of elements between the two schemas, a matching technique may assign a value between zero and one, which indicates a probability estimate that the two elements match, with a value of zero indicating an absolute impossibility and a value of one indicating an absolute certainty of a match.
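To make the idea of a per-element linguistic matcher concrete, the sketch below scores every source/target pair by the string similarity of the element names, using Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for illustration, not a matcher described in the source, and its ratio is a similarity score in the range zero to one rather than a calibrated probability. The element names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def name_similarity(source_name: str, target_name: str) -> float:
    """Hypothetical linguistic matcher: case-insensitive similarity of element names."""
    return SequenceMatcher(None, source_name.lower(), target_name.lower()).ratio()


def similarity_matrix(source: List[str], target: List[str]) -> List[List[float]]:
    """Degree of similarity for every pair: rows are source elements, columns targets."""
    return [[name_similarity(s, t) for t in target] for s in source]


# Hypothetical element names from a source schema and a target schema.
source_elements = ["CustomerName", "PostalCode"]
target_elements = ["customer_name", "zip"]
matrix = similarity_matrix(source_elements, target_elements)
```

Running several such matchers over the same element pairs yields one matrix per matcher, which is the raw material for the weighted combination in step 110.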
- The calculated degrees of similarity are then combined using one or more weighting vectors to provide composite match results (step 110). Each weighting vector includes multiple weighting coefficients, with each weighting coefficient corresponding to a particular matching process. By multiplying each degree of similarity for a specific matching process by the corresponding weighting coefficient, the degree of similarity can be weighted to provide more or less of a contribution relative to other matching processes. The weighted degree of similarity for the specific matching process is then added to the weighted degrees of similarity for the other matching processes to obtain a combined degree of similarity. Each possible pairing of elements thus has a corresponding combined degree of similarity. Depending on the type of schemas to be combined, the weighting vector will typically provide relatively more accurate or less accurate results (e.g., compared to a different weighting vector or an even weighting of all calculated degrees of similarity).
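The weighted combination described in step 110 can be written compactly as a sum over matching processes. The symbols below are notation assumed for this illustration and do not appear in the source.

```latex
% Assumed notation (not from the source):
%   s_m(e_s, e_t)  degree of similarity from matching process m
%                  for source element e_s and target element e_t
%   w_m            weighting coefficient for matching process m
%   M              number of matching processes
\[
  S_{w}(e_s, e_t) \;=\; \sum_{m=1}^{M} w_m \, s_m(e_s, e_t)
\]
```

For instance, with two matching processes, weighting coefficients of 0.75 and 0.25, and degrees of similarity of 0.9 and 0.5 for a given pair, the combined degree of similarity is 0.75 × 0.9 + 0.25 × 0.5 = 0.8.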
- It is possible to define the weighting vector for each matching procedure. The initial weighting vector or vectors that are used may be selected based on characteristics of the schemas to be matched. When schemas are to be matched, parameters relating to the schemas and/or the matching process can be manually input into, or automatically generated by (e.g., by performing an automated analysis of a schema's structure, type, etc.), a system that performs the matching. These parameters can be used to influence which weighting vector or vectors are initially selected. The parameters may relate to, e.g., the schema domain, a context of the schema and/or the matching process, etc. For example, a schema that is similar to a previously mapped schema (e.g., a schema that is a different version of a previously mapped dialect) is assigned a weighting vector that is the same as or otherwise corresponds to (e.g., a modified or tuned weighting vector, as described below) the weighting vector for the previously mapped schema.
- Parameters that relate to the context of the schema can also affect the weighting vectors. For example, if a specific schema comes from a specific industry (e.g., automotive), the weighting vectors can be adjusted according to the requirements of the specific industry. Different industries may have different specific requirements for the matching process and thus the weighting vectors may be adjusted in accordance with these requirements. Context drivers can include, for example: a business process type, a business document type, an industry category, a product category, a geopolitical area, and/or a system type. Which weighting vectors are used for particular contexts can be manually preprogrammed or can be selected based on an automated or partially automated tuning process, through which weighting vectors used in a particular context are adjusted through a “learning” process and the adjusted weighting vectors are subsequently used for matching other schemas with the same context.
- To improve the accuracy of the composite match results, the weighting coefficients are tuned using information relating to a predicted degree of matching accuracy associated with the one or more weighting vectors (step 115). In other words, the weight coefficients can be adjusted based on one or more predicted degrees of matching accuracy, or a specific weighting vector can be selected over other possible weighting vectors based on a comparison of predicted degrees of matching accuracy for the various possible weighting vectors. The adjustment can be performed by a user, after receiving the comparison results, or automatically by analyzing other comparison results, in which similar schema structures are mapped.
- In some implementations, the predicted degree of matching accuracy is a calculation of a level of ambiguity associated with a particular weighting vector. The combined degree of similarity for a particular pair of elements (i.e., an element from a source schema and a potentially matching element from a target schema) can be used to categorize the potential match as ambiguous, unambiguous, or impossible. Thereafter, the level of ambiguity can be calculated based on a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches.
- An ambiguous match generally means that a statistical possibility exists that the pair of elements actually match. In some implementations, multiple ambiguous matches can be associated with a particular element. For example, a particular source element might have several statistically possible matches in a target schema. Each of the statistically possible matches can be an ambiguous match. For purposes of this description, an unambiguous match generally means that it is at least statistically probable that the pair of elements actually matches, and an impossible match generally means that it is statistically improbable or impossible that the pair of elements actually match. For example, an unambiguous match can be defined by combined degrees of similarity for which the maximum probability of a match, among all possible matches, exceeds 70%, while an impossible match can be defined by combined degrees of similarity for which the maximum probability of a match, among all possible matches, is less than 50%.
- Classifying a match as unambiguous does not necessarily mean that two identified elements actually do match, just that the particular matching process (or combination of processes) used to predict matches generates matching results that suggest a statistical probability of a match. Similarly, classifying a match as impossible does not necessarily mean that a match does not exist, just that the particular matching process (or combination of processes) used to predict matches is unable to predict a match with a sufficient degree of confidence.
- Matches between two schemas can be categorized based on combined degrees of similarity in both directions or in only one direction (i.e., from a source to a target schema). For example, if matching is performed in both directions, a particular pair of elements may be identified as unambiguous only if the pair of elements meet the criteria for an unambiguous match in both directions (e.g., target element t and source element s represent an unambiguous match only if the corresponding probability of a match: (a) exceeds 70%, (b) is the maximum probability associated with target element t for all possible source elements, and (c) is the maximum probability associated with source element s for all possible target elements). If matching is performed in a single direction, on the other hand, the particular pair of elements may be identified as unambiguous if the pair of elements meet the criteria for an unambiguous match in only one direction (e.g., target element t and source element s represent an unambiguous match if the corresponding probability of a match exceeds 70% and is the maximum probability associated with target element t for all possible source elements, but is not necessarily the maximum probability associated with source element s for all possible target elements).
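- By way of illustration only, the one-direction and two-direction criteria described above can be sketched as follows. The function name `is_unambiguous`, the matrix layout `prob[s][t]`, and the handling of ties (a tied maximum still counts as a maximum) are assumptions made for this sketch; the 70% threshold is the example value given above.

```python
from typing import List

UNAMBIGUOUS_MIN = 0.70  # example threshold from the description (70%)


def is_unambiguous(prob: List[List[float]], s: int, t: int,
                   both_directions: bool = True) -> bool:
    """Return True if source element s and target element t qualify.

    prob[s][t] is the combined degree of similarity for the pair,
    expressed as a probability between 0 and 1 (assumed layout).
    """
    p = prob[s][t]
    # (a) the probability must exceed the threshold.
    if p <= UNAMBIGUOUS_MIN:
        return False
    # (b) p must be the maximum for target t over all source elements.
    if p < max(row[t] for row in prob):
        return False
    # (c) only in two-direction mode: p must also be the maximum for
    # source s over all target elements.
    if both_directions and p < max(prob[s]):
        return False
    return True


# Hypothetical probabilities: rows are source elements, columns are targets.
prob = [
    [0.90, 0.95],  # s0
    [0.20, 0.10],  # s1
]

one_way = is_unambiguous(prob, 0, 0, both_directions=False)
two_way = is_unambiguous(prob, 0, 0, both_directions=True)
```

- In this hypothetical data, the pair (s0, t0) passes the single-direction test because 0.90 is the best score for t0, but fails the two-direction test because s0 scores higher against t1.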
- Once a categorization is made among the different levels of ambiguity, a calculation of the overall level of ambiguity for a particular weighting vector can be made. For example, an overall level of ambiguity a can be calculated by:
a=(U+A+I)/N,
where U is the number of unambiguous matches, A is the number of ambiguous matches including all proposed matches (e.g., if one node of a source schema is ambiguously assigned to five potential target nodes, there are five ambiguous matches), I is the number of impossible matches, and N is the total number of nodes or elements and is used for normalizing the value of the overall level of ambiguity a. The values of U, A, I, and N can correspond to the number of target elements, source elements, or total elements that fit into each category. Generally, the values of U, A, I, and N should be expressed in the same units (e.g., if U is the number of target elements that are classified as unambiguous, then A, I, and N should be expressed as a number of target elements, rather than a number of source elements or total elements).
- The value of a for the particular weighting vector can then be compared to the value of a for other predefined weighting vectors to find the lowest overall level of ambiguity a. Alternatively, the weighting coefficients can be adjusted using an adjustment algorithm to optimize or improve (e.g., reduce) the overall level of ambiguity a. Thus, the calculated overall level of ambiguity can serve as a measure of a predicted degree of matching accuracy for weighting vectors.
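- As a worked illustration, the example formula a=(U+A+I)/N can be evaluated for several candidate weighting vectors and the vector with the lowest value selected. The vector names `w0` and `w1` and all counts below are hypothetical and are chosen only to show the arithmetic.

```python
from typing import Dict, Tuple


def ambiguity_level(u: int, a: int, i: int, n: int) -> float:
    """Overall level of ambiguity a = (U + A + I) / N."""
    if n <= 0:
        raise ValueError("n must be a positive number of elements")
    return (u + a + i) / n


# Hypothetical counts (U, A, I) for two weighting vectors over N = 10
# source elements. A counts every proposed ambiguous match, so for w0
# two ambiguous source elements with five candidates each give A = 10.
N = 10
counts: Dict[str, Tuple[int, int, int]] = {
    "w0": (6, 10, 2),
    "w1": (7, 4, 2),
}

levels = {name: ambiguity_level(u, a, i, N)
          for name, (u, a, i) in counts.items()}

# Select the weighting vector with the lowest overall level of ambiguity.
best = min(levels, key=levels.get)
```

- Because A counts every proposed ambiguous match, the sum U+A+I can exceed N, so a is not bounded by one. In this sketch `w1` is selected because its value of a (13/10) is lower than that of `w0` (18/10).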
- Other algorithms for calculating the overall level of ambiguity for weighting vectors can also be used. In the above example, the goal may be to reduce the overall level of ambiguity a as much as possible, thereby favoring weighting vectors that minimize the number of ambiguous matches. In other implementations, it may be desirable to reduce (or increase) the number of impossible assignments, to reduce (or increase) the number of unambiguous matches, or to perform some combination of these alternatives (e.g., to reduce the number of unambiguous matches while increasing (or maximizing) the number of ambiguous matches).
- Which type of weighting vector tends to be favored and how the level of ambiguity is calculated generally depends on the desired results. Typically, implementations of a matching process, such as process 100, act as a tool for performing a semi-automated mapping of elements between two or more schemas. The tool produces proposed mappings that are reviewed by a user to approve or reject each individual mapping and/or to identify mappings that may not have been proposed by the tool. Accordingly, the tool can present the proposed mappings to the user on a user interface that distinguishes between mappings that are unambiguous, ambiguous, or impossible. For example, unambiguous results can be color-coded in green, ambiguous matches in yellow, and impossible matches in red. A user can use this information to assume that unambiguous matches are correct, to assume that impossible matches can be ignored, and to devote their primary attention to reviewing ambiguous matches to identify which ones are correct. In some cases, the tool may be used to reduce the workload of the user by reducing the number of ambiguous matches. In other cases, the tool may be used to reduce the number of unambiguous matches to prevent the possibility that the user will incorrectly assume that the tool made a correct mapping.
- Furthermore, the tool may be used for different purposes at different stages of a mapping procedure. For example, the tool may be initially used to minimize the number of ambiguous matches. Subsequently, after the user has approved some of the proposed matches, settings for the tool can be changed to favor minimizing the number of unambiguous matches. In addition to favoring different levels of ambiguity using different weighting vectors, the results of the composite matcher can also be influenced by adjusting threshold levels or other criteria for determining whether pairs of elements represent unambiguous, ambiguous, or impossible matches.
- The categorization among ambiguous, unambiguous, and impossible matches is arbitrary in that the categories can be defined differently for different implementations (e.g., what constitutes an unambiguous match can differ between different implementations or even in the same implementation depending on other characteristics of the element). The criteria used to categorize a particular combined degree of similarity as ambiguous, unambiguous, or impossible can be selected by a developer (e.g., programmer) of software that implements the process 100 or can be set by a user of such software. As can be seen from the example above, unambiguous matches and impossible matches do not necessarily require a probabilistic certainty. A fewer or greater number of levels can also be defined. For example, some implementations may use only the ambiguous and impossible match categorizations, while other implementations may categorize the combined degrees of similarity into a greater number of different levels of ambiguity (e.g., unambiguous, mildly ambiguous, moderately ambiguous, highly ambiguous, and impossible). Other techniques for determining a level of ambiguity associated with a particular weighting vector can also be used (e.g., using an algorithm that performs computations using some or all of the combined degrees of similarity).
- In other implementations, instead of defining the predicted degree of matching accuracy as a calculation of a level of ambiguity associated with a particular weighting vector, the predicted degree of matching accuracy can be based on feedback from a user. For example, the combined degrees of similarity generally provide composite match results that indicate which pairs of elements between the source and target schemas are likely and/or unlikely to represent actual matches. A user can review a subset (e.g., ten possible matches or 5% of the possible matches) of the total set of possible matches and provide feedback regarding whether the possible matches in the subset represent actual matches. This feedback can be used to modify the weighting vector. For instance, the correct matches identified by the user can be compared with results of the various matching processes to determine correlations (i.e., which matching processes were most likely to predict the correct match). The weighting vector can then be adjusted to more heavily weight the matching processes that showed the greatest correlations. 
The adjusted weighting vector can then be used to generate new combined degrees of similarity. Thus, the user feedback on a subset of the possible matches provides a measure of a predicted degree of matching accuracy for weighting vectors. The use of user feedback to adjust the weighting vector can be applied iteratively, such that the matching process continuously “learns” how to better predict matches between the particular schemas being mapped. The settings of the weighting vectors are changed according to feedback from the user, so the user can influence the weighting applied to each type of matcher. For example, if the user indicates that the matching results based on name or definition are mostly wrong, then the weighting applied to the semantic or name matcher is changed accordingly.
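- One possible way to adjust a weighting vector from user feedback is sketched below. The description above does not prescribe a particular update rule, so the agreement score, the blending factor `rate`, the renormalization, and the names `adjust_weights`, `scores`, and `feedback` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source element, target element)


def adjust_weights(weights: List[float],
                   scores: List[Dict[Pair, float]],
                   feedback: Dict[Pair, bool],
                   rate: float = 0.5) -> List[float]:
    """Shift weight toward matchers that agree with the user's feedback.

    scores[k][pair] is the degree of similarity from matcher k (0..1).
    feedback[pair] is True if the user accepted the pair, else False.
    A pair that a matcher did not score is treated as 0.0 (assumption).
    """
    if not feedback:
        return list(weights)
    agreement = []
    for matcher_scores in scores:
        total = 0.0
        for pair, accepted in feedback.items():
            s = matcher_scores.get(pair, 0.0)
            # High scores agree with accepted pairs, low scores with
            # rejected pairs.
            total += s if accepted else 1.0 - s
        agreement.append(total / len(feedback))
    # Blend the old coefficients with the agreement scores, then
    # renormalize so the coefficients sum to one.
    blended = [(1 - rate) * w + rate * a
               for w, a in zip(weights, agreement)]
    norm = sum(blended)
    if norm == 0:
        return list(weights)
    return [b / norm for b in blended]


# Hypothetical data: matcher 0 is a name matcher, matcher 1 a type matcher.
scores = [
    {("s0", "t0"): 0.9, ("s1", "t1"): 0.8},
    {("s0", "t0"): 0.6, ("s1", "t1"): 0.1},
]
feedback = {("s0", "t0"): True, ("s1", "t1"): False}

new_weights = adjust_weights([0.5, 0.5], scores, feedback)
```

- In this hypothetical data the name matcher gave a high score to a pair the user rejected, so its coefficient drops below that of the type matcher. Repeating the call after each round of feedback gives the iterative behavior described above.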
- User feedback can also be used to fine-tune a weighting vector that is selected from one or more candidate weighting vectors using a calculated level of ambiguity. For example, by identifying a particular weighting vector having a lowest calculated level of ambiguity among a set of predefined weighting vectors, the particular weighting vector can be selected as a “best” candidate for producing matching proposals. The particular weighting vector can then be fine-tuned by adjusting the weighting coefficients based on feedback from a user.
- In general, the performance of a particular matching process can be assessed based on certain metrics. The precision of the matching process is a measure of the reliability of the proposed matches and can be calculated as the number of correct matches divided by the total number of proposed matches. The recall of the matching process indicates the percentage of correct matches found and can be calculated as the number of correct matches divided by the number of actual matches. Neither precision nor recall alone, however, provides a good assessment of performance. Generally, high precision can be obtained at the expense of recall, and vice versa. Performance can more accurately be assessed by an overall measurement, which is calculated as:
Overall=Recall*(2−1/Precision).
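- For illustration, the precision, recall, and overall measurements defined above can be computed from three counts. The function name `overall_measure` and the example counts are hypothetical.

```python
from typing import Tuple


def overall_measure(correct: int, proposed: int,
                    actual: int) -> Tuple[float, float, float]:
    """Return (precision, recall, overall).

    correct:  number of proposed matches that are correct
    proposed: total number of proposed matches
    actual:   number of matches that actually exist
    """
    if correct <= 0 or proposed <= 0 or actual <= 0:
        raise ValueError("all counts must be positive for this sketch")
    precision = correct / proposed
    recall = correct / actual
    overall = recall * (2 - 1 / precision)
    return precision, recall, overall


# 8 of 10 proposals are correct, out of 16 actual matches.
p1, r1, o1 = overall_measure(correct=8, proposed=10, actual=16)

# 4 of 10 proposals are correct, out of 8 actual matches.
p2, r2, o2 = overall_measure(correct=4, proposed=10, actual=8)
```

- Both cases have a recall of 0.5, but the overall value is about 0.375 in the first case and about -0.25 in the second. The overall value becomes negative whenever precision falls below 0.5, that is, when more than half of the proposed matches are wrong.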
The recall, precision, and overall measurements can only be calculated once all correct matches are known. Thus, these metrics do not generally provide an estimate of performance for a generic matching process. The process 100, however, can be used to predict whether a particular weighting vector will produce results with a favorable overall measurement and, thus, can be used to improve performance.
- FIG. 2 is a block diagram of a system 200 for identifying matches between disparate schemas. A source schema 205 and a target schema 210 represent schemas to be mapped to one another (or from one to the other). Multiple different matchers 215(1), 215(2), . . . 215(n) are used to calculate degrees of similarity between elements of the two schemas 205 and 210, and the calculated degrees of similarity are stored in a similarity cube 220, which can be maintained in a buffer or a memory. The similarity cube 220 includes a storage location for each combination of a matcher 215, a source schema 205 element, and a target schema 210 element. For example, the similarity cube 220 can include storage locations that, conceptually, have x, y, and z coordinates.
- FIG. 3 illustrates an example of a similarity cube 220. Each row 305 in the x direction 310 represents a different source schema 205 element (s0, s1, . . . sm-1, where m is the number of elements in the source schema 205), and each column 315 in the y direction 320 represents a different target schema 210 element (t0, t1, . . . ti-1, where i is the number of elements in the target schema 210, with m=i, m>i, or m<i). Each level 325 in the z direction 330 represents a different matcher 215 (e.g., matcher 215(1), matcher 215(2), . . . matcher 215(n)). A degree of similarity can be calculated for each source schema element-target schema element pair, as analyzed by each different matcher 215, and the degree of similarity can be stored in a storage location 335 corresponding to the source schema element, the target schema element, and the matcher 215.
- In some implementations, however, it may be unnecessary to calculate a degree of similarity for every source schema element-target schema element pair because some pairs (or entire branches of a schema) may be easily rejected without having to calculate a degree of similarity. For example, a branch of the source schema 205 might include elements that exclusively store text data. The possibility that such a branch matches a branch of the target schema 210 having elements that exclusively store floating-point numbers can be easily rejected. As a result, degrees of similarity do not need to be calculated for elements in these branches, and the similarity cube may include empty storage locations. Which element pairs can be omitted from the degree of similarity calculation can be determined on a matcher-by-matcher basis (e.g., one matcher calculates a degree of similarity while another does not) or for all matchers 215 (e.g., a particular element pair is omitted from the degree of similarity calculation for all matchers 215).
- As shown in FIG. 2, the calculated degrees of similarity from the similarity cube 220 are combined by a match results combiner 225 in accordance with one or more weighting vectors. For example, the calculated degrees of similarity from each matcher 215 are weighted by a weighting coefficient corresponding to the matcher 215, and the weighted degrees of similarity for each element pair are added together. Each weighting coefficient represents a level of importance for the calculated degree of similarity relative to the calculations from other matchers. Typically, for a given weighting vector, all of the calculated degrees of similarity for a particular matcher are given the same weight. Accordingly, the weighting vector is used to attribute greater importance to some matchers relative to others. Ontology information (e.g., information about a classification of each schema) can also be used, if available, to obtain match results. The combined degrees of similarity are used to identify which element pairs are likely to match, might match, or are unlikely to match. The likely or possible matches can be used to generate at least a partial mapping of elements between schemas (e.g., from the source schema 205 to the target schema 210, from the target schema 210 to the source schema 205, or both).
- Which element pairs are identified as likely or possible matches depends on a type of selection algorithm used. A “threshold” selection algorithm identifies all element pairs with a combined degree of similarity over a certain threshold. A “MaxN” type of selection algorithm identifies the n largest combined degrees of similarity, where n is an integer greater than or equal to one, and a “Max Delta” type of selection algorithm identifies: (a) the element pair with the largest combined degree of similarity, and (b) all element pairs having a combined degree of similarity within some delta value of the largest value. 
These selection algorithms can be combined and/or other selection algorithms can be used.
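- The weighted combination and the three selection algorithms described above can be sketched as follows. The nested-list layout `cube[k][s][t]` (matcher k, source element s, target element t), the assumption that every storage location is filled, and the choice to apply each selection algorithm to the candidates of one source element at a time are all assumptions of this sketch, as are the function names.

```python
from typing import List

Cube = List[List[List[float]]]  # cube[k][s][t]


def combine(cube: Cube, weights: List[float]) -> List[List[float]]:
    """Weighted sum over matchers for every (source, target) pair."""
    n_src, n_tgt = len(cube[0]), len(cube[0][0])
    return [[sum(w * layer[s][t] for w, layer in zip(weights, cube))
             for t in range(n_tgt)]
            for s in range(n_src)]


def select_threshold(row: List[float], threshold: float) -> List[int]:
    """All targets whose combined similarity is over the threshold."""
    return [t for t, v in enumerate(row) if v > threshold]


def select_max_n(row: List[float], n: int) -> List[int]:
    """The n targets with the largest combined similarity."""
    order = sorted(range(len(row)), key=lambda t: row[t], reverse=True)
    return order[:n]


def select_max_delta(row: List[float], delta: float) -> List[int]:
    """The best target plus all targets within delta of the best."""
    best = max(row)
    return [t for t, v in enumerate(row) if best - v <= delta]


# Hypothetical cube: two matchers, two source and three target elements.
cube = [
    [[1.0, 0.25, 0.75], [0.0, 0.5, 0.25]],  # matcher 0
    [[0.5, 0.25, 0.75], [0.5, 0.5, 0.25]],  # matcher 1
]
combined = combine(cube, [0.5, 0.5])

over_half = select_threshold(combined[0], 0.5)
top_one = select_max_n(combined[0], 1)
near_best = select_max_delta(combined[1], 0.125)
```

- With equal weights the first source element has two targets tied at 0.75, so the threshold selection returns both, while MaxN with n equal to one returns only the first of them.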
- Depending on the particular implementation, a set of combined degrees of similarity for a specific weighting vector can be used as an initial estimation for predicting matches or can simply be compared to combined degrees of similarity for other weighting vectors to narrow the selection of weighting vectors. In either case, the weighting coefficients are tuned to obtain an improved mapping of the schemas and/or to improve the identification of likely or probable matches.
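- To illustrate how results for several weighting vectors can be held side by side for comparison, the sketch below combines one set of matcher results under each candidate vector and stacks the outcomes, one level per weighting vector. The layout `cube[k][s][t]` and the names `combine` and `weighting_vector_cube` are assumptions made for this sketch.

```python
from typing import List

Cube = List[List[List[float]]]  # cube[k][s][t]: matcher, source, target


def combine(cube: Cube, weights: List[float]) -> List[List[float]]:
    """Weighted sum over matchers for every (source, target) pair."""
    n_src, n_tgt = len(cube[0]), len(cube[0][0])
    return [[sum(w * layer[s][t] for w, layer in zip(weights, cube))
             for t in range(n_tgt)]
            for s in range(n_src)]


def weighting_vector_cube(cube: Cube,
                          weight_vectors: List[List[float]]) -> Cube:
    """One level of combined similarities per weighting vector."""
    return [combine(cube, w) for w in weight_vectors]


# Hypothetical input: two matchers that disagree about one source element.
cube = [
    [[1.0, 0.0]],  # matcher 0 prefers target 0
    [[0.0, 1.0]],  # matcher 1 prefers target 1
]
vectors = [[0.75, 0.25], [0.25, 0.75]]

wv_cube = weighting_vector_cube(cube, vectors)

# Best target for source 0 under each weighting vector.
best_targets = [max(range(len(level[0])), key=lambda t: level[0][t])
                for level in wv_cube]
```

- Here the preferred target for the single source element changes from target 0 to target 1 as weight moves from the first matcher to the second, which is the kind of difference a comparison across weighting vectors is meant to expose.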
- When multiple weighting vectors are applied to the similarity cube 220, the result is essentially a new similarity cube in which each level in the z-direction corresponds to results from a particular weighting vector instead of from a particular matcher 215. FIG. 4 illustrates an example of a weighting vector similarity cube 400. As with the original similarity cube 220, each row 405 in the x direction 410 represents a different source schema 205 element, and each column 415 in the y direction 420 represents a different target schema 210 element. However, each level 425 in the z direction 430 represents a different weighting vector (w0, w1, . . . wj). Thus, each storage location 435 in the weighting vector similarity cube 400 contains a combination of the degrees of similarity for the corresponding source schema element and target schema element across all of the matchers 215 (e.g., a weighted combination of the storage locations 335 in the z-direction 330 from FIG. 3).
- Each level of the weighting vector similarity cube 400 can be compared to the other levels to identify one or more weighting vectors that provide the most desirable results according to a measure of ambiguity in the results. The measure of ambiguity that is most desirable and how the measure of ambiguity is defined can be selected by a user of the system 200 or can be predefined in the system 200. For example, in one possible implementation, a weighting vector that provides a minimum number of ambiguous matches and a minimum number of impossible matches relative to other weighting vectors may be selected as the most desirable. By comparing the results of multiple weighting vectors, the weighting vector with weighting coefficients that produce the most desirable results can be selected, thereby performing a tuning operation. Thus, tuning is performed by selecting a particular weighting vector among a limited set of weighting vectors defined prior to performing the tuning operation.
- In some implementations, tuning (or fine-tuning) is performed by generating new weighting coefficients (e.g., identifying one or more additional candidate weighting vectors) after making an initial selection of a weighting vector. For example, when only one weighting vector is initially used to calculate combined degrees of similarity, the weighting coefficients for the weighting vector can be modified or tuned after obtaining the initial results. As shown in FIG. 2, tuning can be performed based on user feedback (as received at 230) and/or based on one or more calculated levels of ambiguity. For example, the results associated with several weighting vectors may tend to indicate trends in how weighting coefficients affect levels of ambiguity. By analyzing such trends, fine-tuning of a weighting vector can be performed.
- In some implementations, optional user feedback (as indicated at 230) involves approving or rejecting matches proposed by the match results combiner 225. The user feedback can be used to generate a final mapping 245 of elements between the source schema 205 and the target schema 210. In addition or as an alternative, the user feedback can be used to fine-tune the mapping results. In the latter situation, additional match iterations (as indicated at 235) are performed. Subsequent match iterations may involve re-executing at least some of the matchers 215, such as when some of the matchers 215 themselves are hybrid matchers that take into account user feedback. In other cases, however, and for some matchers 215, subsequent match iterations do not impact the results produced by the matchers 215 or the corresponding degree of similarity information stored in the similarity cube 220 and, thus, do not involve any re-execution of the matchers 215. Such match iterations, instead, can involve merely looping back to the match results combiner 225 (as indicated at 240). In subsequent match iterations, the weight vectors applied in the match results combiner 225 can be adjusted in an attempt to produce more desirable matching results (e.g., a lower measure of ambiguity, results that have a higher percentage of correct matches, results that have a lower percentage of incorrect matches, results that identify a correct match as one of the possible matches, etc.). What defines desirable matching results can depend on the particular environment in which the system is used, the types of schemas on which the system operates, user-selected settings, and/or settings that are predefined in the system 200.
- FIG. 5 is an illustrative diagram of a technique 500 for categorizing match results into different levels of ambiguity. A calculation of a degree of similarity between each pair of elements in a source schema 505 and a target schema 510 results in a factor between zero and one hundred percent, with the factor reflecting a percent likelihood that the element pair matches, as determined by the particular matching process used. Typically, the categorization technique 500 is used for matching processes that involve a weighted combination of other matching processes, but the categorization technique 500 can be applied to any type of matching process. Although the categorization technique 500 is discussed below in the direction of finding elements in the target schema 510 that match elements in the source schema 505, the technique can alternatively or additionally be used for categorizing matches in the opposite direction.
- In the illustrated example of FIG. 5, each source schema 505 element for which the maximum calculated degree of similarity 515 among all possible matches for the source schema 505 element is less than a first threshold value 520 equal to 0.3 (i.e., thirty percent) is considered to be an impossible match. In other words, it is impossible for the matching process to predict a match involving the source schema 505 element. Each source schema 505 element for which the maximum calculated degree of similarity among all possible matches for the source schema 505 element is greater than the first threshold value (or a larger, second threshold value) and is greater than the next largest calculated degree of similarity for the source schema 505 element by at least a value Δt 525 is considered to be an unambiguous match. Finally, each source schema 505 element for which at least two calculated degrees of similarity are greater than the first threshold value and are within a range value 530 equal to 0.1 (i.e., a ten percent interval) of the maximum calculated degree of similarity for the source schema 505 element is considered to be an ambiguous match.
- The number of ambiguous, impossible, and/or unambiguous matches can be used to calculate a measure of ambiguity. The measure of ambiguity can, in turn, be used to compare the weighting vector used to generate the matching results with other weighting vectors or to otherwise tune the weighting vector (e.g., by comparing the measure of ambiguity with corresponding measures for similar weighting vectors in which the weighting coefficients have been adjusted).
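- The categorization of FIG. 5 and the resulting counts can be sketched as follows. The first threshold (0.3) and the range value (0.1) are the example values given above. The value used for Δt (0.1), the treatment of a maximum exactly equal to the threshold as impossible, and the treatment of any remaining case as ambiguous are assumptions of this sketch, since the description leaves them open.

```python
from typing import List, Tuple


def categorize(row: List[float], threshold: float = 0.3,
               delta_t: float = 0.1,
               range_value: float = 0.1) -> Tuple[str, List[int]]:
    """Categorize one source element from its similarities to all targets.

    Returns the category and the indices of the proposed target elements.
    """
    if not row:
        return "impossible", []
    ranked = sorted(row, reverse=True)
    best = ranked[0]
    # Assumption: a maximum equal to the threshold is also impossible.
    if best <= threshold:
        return "impossible", []
    second = ranked[1] if len(ranked) > 1 else float("-inf")
    if best - second >= delta_t:
        return "unambiguous", [row.index(best)]
    # Candidates above the threshold and within range_value of the best.
    candidates = [t for t, v in enumerate(row)
                  if v > threshold and best - v <= range_value]
    # Assumption: anything not classified above is treated as ambiguous.
    return "ambiguous", candidates


# Hypothetical combined degrees of similarity for three source elements.
rows = [
    [0.20, 0.10, 0.25],  # nothing above 0.3
    [0.90, 0.40, 0.10],  # clear winner
    [0.80, 0.75, 0.20],  # two close candidates
]
results = [categorize(r) for r in rows]

U = sum(1 for cat, _ in results if cat == "unambiguous")
I = sum(1 for cat, _ in results if cat == "impossible")
# A counts every proposed ambiguous match, not every ambiguous element.
A = sum(len(c) for cat, c in results if cat == "ambiguous")
```

- For this hypothetical data, U is 1, I is 1, and A is 2, and these counts can be used to compute a measure of ambiguity for the weighting vector that produced the rows.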
- The invention and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The invention can be implemented as one or more computer program products, i.e., one or more computer programs tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
- The processes and logic flows described herein, including the method steps of the invention, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the invention by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the invention can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
- Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.
- To provide for interaction with a user, the invention can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- The invention can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the invention), or any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- The invention has been described in terms of particular embodiments, but other embodiments can be implemented and are within the scope of the following claims. For example, the operations of the invention can be performed in a different order and still achieve desirable results. Other embodiments are within the scope of the following claims.
Claims (25)
1. A computer program product, tangibly embodied in an information carrier, for identifying matches between disparate schemas, the computer program product being operable to cause data processing apparatus to:
calculate a degree of similarity between elements of two schemas using each of a plurality of matching processes;
combine the calculated degrees of similarity using a first weighting vector to produce first combined degrees of similarity, with the first weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes; and
tune the weighting coefficients using information relating to a predicted degree of matching accuracy associated with the first weighting vector.
2. The computer program product of claim 1 wherein:
the calculated degrees of similarity are combined using each of a plurality of weighting vectors, with each weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes; and
the weighting coefficients are tuned by determining, using the combined degrees of similarity for each of the plurality of weighting vectors, a predicted degree of matching accuracy associated with each of the plurality of weighting vectors and selecting a second weighting vector to determine possible matches between the elements of the two schemas, with the second weighting vector selected based on a comparison of information relating to the respective predicted degrees of matching accuracy associated with the first weighting vector and the second weighting vector.
3. The computer program product of claim 2 wherein each predicted degree of matching accuracy is determined using at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches.
4. The computer program product of claim 1 wherein the weighting coefficients are tuned by:
identifying a set of possible matches between the elements of the two schemas based on the first combined degrees of similarity;
receiving user feedback relating to a subset of the possible matches and using the user feedback to produce the information relating to a predicted degree of matching accuracy associated with the first weighting vector; and
modifying the first weighting vector based on the information relating to the predicted degree of matching accuracy to produce a second weighting vector.
5. The computer program product of claim 4, with the computer program product being operable to cause data processing apparatus to further:
combine the calculated degrees of similarity using the second weighting vector to produce second combined degrees of similarity; and
identify a modified set of possible matches between the elements of the two schemas based on the second combined degrees of similarity.
6. The computer program product of claim 1 wherein the calculated degrees of similarity are combined by multiplying each calculated degree of similarity for each matching process by the corresponding weighting coefficient to obtain weighted degrees of similarity and summing the weighted degrees of similarity.
7. The computer program product of claim 1 wherein a degree of similarity is calculated between multiple pairs of elements, with each pair of elements having one element selected from a source schema and one element selected from a target schema.
8. A method for identifying matches between disparate schemas, the method comprising:
calculating a degree of similarity between elements of two schemas using each of a plurality of matching processes;
combining the calculated degrees of similarity using each of a plurality of weighting vectors, with each weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes;
determining, using the combined degrees of similarity, a level of ambiguity for each weighting vector; and
selecting a particular weighting vector to determine possible matches between the elements of the two schemas, wherein the particular weighting vector is selected based on the level of ambiguity for each weighting vector.
9. The method of claim 8 wherein determining a level of ambiguity comprises determining at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches and the particular weighting vector is selected based on at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches.
10. The method of claim 9 further comprising:
for each weighting vector, calculating a factor using at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches; and
wherein selecting the particular weighting vector is based on a value of the factor for the particular weighting vector relative to values of the factors for others of the plurality of weighting vectors.
11. The method of claim 10 wherein selecting the particular weighting vector based on the value of the factor for the particular weighting vector comprises selecting, as the particular weighting vector, a weighting vector having a factor that tends to indicate one of a relatively low number of ambiguous matches or a relatively high number of unambiguous matches.
12. The method of claim 10 wherein selecting the particular weighting vector based on the value of the factor for the particular weighting vector comprises selecting, as the particular weighting vector, a weighting vector having a factor that tends to indicate at least one of a relatively low number of ambiguous matches, a relatively low number of impossible matches, or a relatively high number of unambiguous matches.
13. The method of claim 12 wherein selecting the particular weighting vector based on the value of the factor for the particular weighting vector comprises selecting, as the particular weighting vector, a weighting vector having a factor that tends to indicate a relatively low number of ambiguous matches and a relatively low number of impossible matches.
14. The method of claim 10 wherein selecting the particular weighting vector further comprises:
selecting a candidate weighting vector; and
tuning the candidate weighting vector by modifying the weighting coefficients for the candidate weighting vector to produce the particular weighting vector, wherein the factor for the particular weighting vector indicates a favorable weighting relative to the factor for the candidate weighting vector.
15. The method of claim 9 wherein determining the number of unambiguous matches comprises one of:
identifying, as representing an unambiguous match for a particular element, a maximum combined degree of similarity for the particular element; or
identifying, as representing an unambiguous match for a particular element, a combined degree of similarity for the particular element that exceeds a predetermined threshold and that exceeds all other combined degrees of similarity for the particular element by at least a predetermined amount.
16. The method of claim 9 wherein determining the number of ambiguous matches comprises at least one of:
identifying, as representing an ambiguous match for a particular element, a combined degree of similarity for the particular element that exceeds a first threshold and is less than a second threshold; or
identifying, as representing an ambiguous match for a particular element, a combined degree of similarity for the particular element that exceeds a predetermined threshold and that is within a predetermined range of other combined degrees of similarity for the particular element.
17. The method of claim 9 wherein determining the number of impossible matches comprises identifying an impossible match by determining, for a particular element, that no combined degree of similarity for the particular element exceeds a predetermined minimum threshold.
18. The method of claim 8 wherein the plurality of matching processes include matching criteria selected from the group consisting of schema-based criteria, content-based criteria, per-element criteria, structural criteria, linguistic criteria, and constraint-based criteria.
19. The method of claim 8 further comprising:
determining a set of possible matches between the elements of the two schemas using the combined degrees of similarity for the particular weighting vector;
receiving user feedback relating to a subset of the possible matches;
tuning the particular weighting vector based on the user feedback;
combining the calculated degrees of similarity using the tuned weighting vector; and
determining a new set of possible matches between the elements of the two schemas using the combined degrees of similarity for the tuned weighting vector.
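One possible reading of the selection recited in claims 8 through 17 is sketched below. It is a sketch under stated assumptions, not the claimed implementation: the threshold of 0.5, the margin of 0.1, the use of a single threshold for both the "predetermined threshold" and the "predetermined minimum threshold", and the factor formula (unambiguous minus ambiguous minus impossible) are all hypothetical choices, as are the element and vector names.

```python
from collections import Counter
from typing import Dict

# combined[source_element][target_element] -> combined degree of similarity
Combined = Dict[str, Dict[str, float]]


def classify_element(similarities: Dict[str, float],
                     threshold: float = 0.5,
                     margin: float = 0.1) -> str:
    """Label one source element by the shape of its combined similarities."""
    ranked = sorted(similarities.values(), reverse=True)
    if not ranked or ranked[0] <= threshold:
        return "impossible"   # nothing exceeds the minimum threshold
    if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
        return "unambiguous"  # best score clearly beats every other score
    return "ambiguous"        # another candidate lies within the margin


def count_matches(combined: Combined) -> Counter:
    """Count unambiguous, ambiguous, and impossible elements."""
    return Counter(classify_element(row) for row in combined.values())


def factor(counts: Counter) -> int:
    """Assumed scoring rule: reward unambiguous matches, penalize the rest."""
    return counts["unambiguous"] - counts["ambiguous"] - counts["impossible"]


def select_weighting_vector(combined_by_vector: Dict[str, Combined]) -> str:
    """Pick the weighting vector whose combined similarities score best."""
    return max(combined_by_vector,
               key=lambda v: factor(count_matches(combined_by_vector[v])))


# Hypothetical combined similarities produced by two weighting vectors.
combined_by_vector = {
    "name_heavy": {
        "CustName": {"CustomerName": 0.93, "OrderId": 0.07},
        "ShipDt": {"ShipDate": 0.71, "OrderDate": 0.68},
        "Misc": {"CustomerName": 0.20, "OrderId": 0.15},
    },
    "structure_heavy": {
        "CustName": {"CustomerName": 0.88, "OrderId": 0.10},
        "ShipDt": {"ShipDate": 0.80, "OrderDate": 0.55},
        "Misc": {"CustomerName": 0.25, "OrderId": 0.10},
    },
}
best = select_weighting_vector(combined_by_vector)
```

With these made-up numbers, `name_heavy` leaves `ShipDt` ambiguous because its two best candidates differ by only 0.03, so `structure_heavy` is selected. No user-confirmed matches are required for this step, since the counts are derived from the combined similarities alone.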
20. A method for identifying matches between disparate schemas, the method comprising:
calculating a degree of similarity between elements of two schemas using each of a plurality of matching processes;
combining the calculated degrees of similarity using a first weighting vector to produce first combined degrees of similarity, with the first weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes;
identifying a set of possible matches between the elements of the two schemas based on the first combined degrees of similarity;
receiving user feedback relating to a subset of the possible matches;
modifying the first weighting vector based on the user feedback to produce a second weighting vector;
combining the calculated degrees of similarity using the second weighting vector to produce second combined degrees of similarity; and
identifying a modified set of possible matches between the elements of the two schemas based on the second combined degrees of similarity.
21. The method of claim 20 wherein the first weighting vector comprises one of a plurality of weighting vectors and modifying the first weighting vector based on the user feedback comprises adjusting the first weighting vector to incorporate weighting features of another of the plurality of weighting vectors selected based on the user feedback.
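The feedback loop of claims 20 and 21 leaves the update rule open. The sketch below assumes one simple rule purely for illustration: each matching process is rewarded in proportion to the score it gave a pair the user accepted and penalized in proportion to the score it gave a pair the user rejected, after which the coefficients are clipped at zero and renormalized. The function `tune_weights`, the `learning_rate` parameter, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # per-matcher similarity for one element pair


def tune_weights(weights: Dict[str, float],
                 feedback: List[Tuple[Scores, bool]],
                 learning_rate: float = 0.2) -> Dict[str, float]:
    """Return a second weighting vector nudged by user feedback."""
    updated = dict(weights)
    for scores, accepted in feedback:
        direction = 1.0 if accepted else -1.0
        for matcher, score in scores.items():
            # Reward matchers that scored accepted pairs highly,
            # penalize matchers that scored rejected pairs highly.
            updated[matcher] += direction * learning_rate * score
    # Keep coefficients non-negative and renormalize so they sum to 1.
    updated = {m: max(w, 0.0) for m, w in updated.items()}
    total = sum(updated.values())
    if total == 0:
        return dict(weights)  # degenerate case: fall back to the first vector
    return {m: w / total for m, w in updated.items()}


first = {"name": 0.5, "structure": 0.5}
feedback = [
    ({"name": 0.9, "structure": 0.2}, True),   # user accepted this match
    ({"name": 0.1, "structure": 0.8}, False),  # user rejected this match
]
second = tune_weights(first, feedback)
```

The second weighting vector would then be used to recombine the already calculated degrees of similarity, so the matching processes themselves need not be rerun.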
22. A system for identifying matches between disparate schemas, the system comprising:
means for calculating a degree of similarity between elements of two schemas using each of a plurality of matching processes;
means for combining the calculated degrees of similarity using a first weighting vector to produce first combined degrees of similarity, with the first weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes; and
means for tuning the weighting coefficients using information relating to a predicted degree of matching accuracy associated with the first weighting vector.
23. The system of claim 22 wherein the means for combining the calculated degrees of similarity is operable to combine the calculated degrees of similarity using each of a plurality of weighting vectors, with each weighting vector including a plurality of weighting coefficients and each weighting coefficient corresponding to one of the plurality of matching processes, and the means for tuning comprises:
means for determining, using the combined degrees of similarity for each of the plurality of weighting vectors, at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches; and
means for selecting a second weighting vector to determine possible matches between the elements of the two schemas, wherein the second weighting vector is selected based on a comparison of information relating to a predicted degree of accuracy associated with each of the first weighting vector and the second weighting vector, with the information relating to the predicted degree of accuracy determined using at least one quantity selected from the group consisting of a number of ambiguous matches, a number of unambiguous matches, and a number of impossible matches.
24. The system of claim 22 wherein the means for tuning comprises:
means for identifying a set of possible matches between the elements of the two schemas based on the first combined degrees of similarity;
means for receiving user feedback relating to a subset of the possible matches and using the user feedback to produce the information relating to a predicted degree of matching accuracy associated with the first weighting vector; and
means for modifying the first weighting vector based on the information relating to the predicted degree of matching accuracy to produce a second weighting vector, the system further comprising:
means for combining the calculated degrees of similarity using the second weighting vector to produce second combined degrees of similarity; and
means for identifying a modified set of possible matches between the elements of the two schemas based on the second combined degrees of similarity.
25. The system of claim 22 wherein the first weighting vector is selected based on at least one selected from the group consisting of a context associated with the two schemas and a similarity of at least one of the schemas to a schema for which the first weighting vector was previously used.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/856,694 US20050278139A1 (en) | 2004-05-28 | 2004-05-28 | Automatic match tuning |
US12/796,192 US8271503B2 (en) | 2004-05-28 | 2010-06-08 | Automatic match tuning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/856,694 US20050278139A1 (en) | 2004-05-28 | 2004-05-28 | Automatic match tuning |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/796,192 Continuation US8271503B2 (en) | 2004-05-28 | 2010-06-08 | Automatic match tuning |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050278139A1 (en) | 2005-12-15 |
Family ID: 35461593
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/856,694 Abandoned US20050278139A1 (en) | 2004-05-28 | 2004-05-28 | Automatic match tuning |
US12/796,192 Active 2025-01-15 US8271503B2 (en) | 2004-05-28 | 2010-06-08 | Automatic match tuning |
Family Applications After (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/796,192 Active 2025-01-15 US8271503B2 (en) | 2004-05-28 | 2010-06-08 | Automatic match tuning |
Country Status (1)
Country | Link |
---|---|
US (2) | US20050278139A1 (en) |
Cited By (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060212860A1 (en) * | 2004-09-30 | 2006-09-21 | Benedikt Michael A | Method for performing information-preserving DTD schema embeddings |
US20060242142A1 (en) * | 2005-04-22 | 2006-10-26 | The Boeing Company | Systems and methods for performing schema matching with data dictionaries |
US20070005658A1 (en) * | 2005-07-02 | 2007-01-04 | International Business Machines Corporation | System, service, and method for automatically discovering universal data objects |
US20080071887A1 (en) * | 2006-09-19 | 2008-03-20 | Microsoft Corporation | Intelligent translation of electronic data interchange documents to extensible markup language representations |
US20080098010A1 (en) * | 2004-09-03 | 2008-04-24 | Carmel-Haifa University Economic Corp. Ltd | System and Method for Classifying, Publishing, Searching and Locating Electronic Documents |
US20080126386A1 (en) * | 2006-09-20 | 2008-05-29 | Microsoft Corporation | Translation of electronic data interchange messages to extensible markup language representation(s) |
US20080168081A1 (en) * | 2007-01-09 | 2008-07-10 | Microsoft Corporation | Extensible schemas and party configurations for edi document generation or validation |
US20080263104A1 (en) * | 2006-06-15 | 2008-10-23 | Chowdhary Pawan R | Updating a data warehouse schema based on changes in an observation model |
US20090112916A1 (en) * | 2007-10-30 | 2009-04-30 | Gunther Stuhec | Creating a mapping |
US20090132569A1 (en) * | 2007-11-15 | 2009-05-21 | Canon Kabushiki Kaisha | Data compression apparatus, data decompression apparatus, and method for compressing data |
US20090248587A1 (en) * | 2007-08-31 | 2009-10-01 | Van Buskirk Peter C | Selectively negotiated ridershare system comprising riders, drivers, and vehicles |
US20120095973A1 (en) * | 2010-10-15 | 2012-04-19 | Expressor Software | Method and system for developing data integration applications with reusable semantic types to represent and process application data |
US20120144028A1 (en) * | 2010-12-07 | 2012-06-07 | Mark Blackburn | Monitoring processes in a computer |
US20120179644A1 (en) * | 2010-07-09 | 2012-07-12 | Daniel Paul Miranker | Automatic Synthesis and Presentation of OLAP Cubes from Semantically Enriched Data Sources |
US20130081065A1 (en) * | 2010-06-02 | 2013-03-28 | Dhiraj Sharan | Dynamic Multidimensional Schemas for Event Monitoring |
US20130141585A1 (en) * | 2011-12-02 | 2013-06-06 | Hidehiro Naito | Checkout system and method for operating checkout system |
US20130297661A1 (en) * | 2012-05-03 | 2013-11-07 | Salesforce.Com, Inc. | System and method for mapping source columns to target columns |
US20130311456A1 (en) * | 2012-05-17 | 2013-11-21 | Sap Ag | Systems and Methods for Performing Data Analysis for Model Proposals |
US20150046150A1 (en) * | 2013-08-12 | 2015-02-12 | International Business Machines Corporation | Identifying and amalgamating conditional actions in business processes |
US20150193478A1 (en) * | 2014-01-09 | 2015-07-09 | International Business Machines Corporation | Method and Apparatus for Determining the Schema of a Graph Dataset |
US9311429B2 (en) | 2013-07-23 | 2016-04-12 | Sap Se | Canonical data model for iterative effort reduction in business-to-business schema integration |
US20160224996A1 (en) * | 2007-01-26 | 2016-08-04 | Information Resources, Inc. | Similarity matching of products based on multiple classification schemes |
US9646246B2 (en) | 2011-02-24 | 2017-05-09 | Salesforce.Com, Inc. | System and method for using a statistical classifier to score contact entities |
CN106651317A (en) * | 2016-12-28 | 2017-05-10 | 浙江省公众信息产业有限公司 | Method and device for judging business process correlation |
US20170149907A1 (en) * | 2015-11-23 | 2017-05-25 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US20180107731A1 (en) * | 2012-10-22 | 2018-04-19 | Palantir Technologies Inc. | Sharing information between nexuses that use different classification schemes for information access control |
CN108416525A (en) * | 2018-03-13 | 2018-08-17 | 三峡大学 | A kind of procedural model method for measuring similarity based on metadata |
US10324925B2 (en) | 2016-06-19 | 2019-06-18 | Data.World, Inc. | Query generation for collaborative datasets |
US10346429B2 (en) | 2016-06-19 | 2019-07-09 | Data.World, Inc. | Management of collaborative datasets via distributed computer networks |
US10353911B2 (en) | 2016-06-19 | 2019-07-16 | Data.World, Inc. | Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets |
WO2019182977A1 (en) * | 2018-03-19 | 2019-09-26 | Perkinelmer Informatics, Inc. | Methods and systems for automating clinical data mapping and transformation |
US10438013B2 (en) | 2016-06-19 | 2019-10-08 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US10452677B2 (en) | 2016-06-19 | 2019-10-22 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US10452975B2 (en) | 2016-06-19 | 2019-10-22 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US10515085B2 (en) | 2016-06-19 | 2019-12-24 | Data.World, Inc. | Consolidator platform to implement collaborative datasets via distributed computer networks |
US20200012626A1 (en) * | 2018-07-06 | 2020-01-09 | Capital One Services, Llc | Systems and methods for a data search engine based on data profiles |
US10621203B2 (en) | 2007-01-26 | 2020-04-14 | Information Resources, Inc. | Cross-category view of a dataset using an analytic platform |
US10645548B2 (en) | 2016-06-19 | 2020-05-05 | Data.World, Inc. | Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets |
US10691710B2 (en) | 2016-06-19 | 2020-06-23 | Data.World, Inc. | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
US10699027B2 (en) | 2016-06-19 | 2020-06-30 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US10747774B2 (en) | 2016-06-19 | 2020-08-18 | Data.World, Inc. | Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets |
CN111782817A (en) * | 2020-05-30 | 2020-10-16 | 国网福建省电力有限公司信息通信分公司 | Knowledge graph construction method and device for information system and electronic equipment |
US10824637B2 (en) | 2017-03-09 | 2020-11-03 | Data.World, Inc. | Matching subsets of tabular data arrangements to subsets of graphical data arrangements at ingestion into data driven collaborative datasets |
US10853376B2 (en) | 2016-06-19 | 2020-12-01 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
US10860653B2 (en) | 2010-10-22 | 2020-12-08 | Data.World, Inc. | System for accessing a relational database using semantic queries |
US10922308B2 (en) | 2018-03-20 | 2021-02-16 | Data.World, Inc. | Predictive determination of constraint data for application with linked data in graph-based datasets associated with a data-driven collaborative dataset platform |
US10984008B2 (en) | 2016-06-19 | 2021-04-20 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
USD920353S1 (en) | 2018-05-22 | 2021-05-25 | Data.World, Inc. | Display screen or portion thereof with graphical user interface |
US11016931B2 (en) | 2016-06-19 | 2021-05-25 | Data.World, Inc. | Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets |
US11023104B2 (en) | 2016-06-19 | 2021-06-01 | Data.World, Inc. | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
US11036697B2 (en) | 2016-06-19 | 2021-06-15 | Data.World, Inc. | Transmuting data associations among data arrangements to facilitate data operations in a system of networked collaborative datasets |
US11036716B2 (en) | 2016-06-19 | 2021-06-15 | Data.World, Inc. | Layered data generation and data remediation to facilitate formation of interrelated data in a system of networked collaborative datasets |
US11042548B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Aggregation of ancillary data associated with source data in a system of networked collaborative datasets |
US11042560B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Extended computerized query language syntax for analyzing multiple tabular data arrangements in data-driven collaborative projects |
US11042537B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Link-formative auxiliary queries applied at data ingestion to facilitate data operations in a system of networked collaborative datasets |
US11042556B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Localized link formation to perform implicitly federated queries using extended computerized query language syntax |
US11068847B2 (en) | 2016-06-19 | 2021-07-20 | Data.World, Inc. | Computerized tools to facilitate data project development via data access layering logic in a networked computing platform including collaborative datasets |
US11068453B2 (en) | 2017-03-09 | 2021-07-20 | Data.World, Inc. | Determining a degree of similarity of a subset of tabular data arrangements to subsets of graph data arrangements at ingestion into a data-driven collaborative dataset platform |
US11068475B2 (en) | 2016-06-19 | 2021-07-20 | Data.World, Inc. | Computerized tools to develop and manage data-driven projects collaboratively via a networked computing platform and collaborative datasets |
US11086896B2 (en) | 2016-06-19 | 2021-08-10 | Data.World, Inc. | Dynamic composite data dictionary to facilitate data operations via computerized tools configured to access collaborative datasets in a networked computing platform |
USD940169S1 (en) | 2018-05-22 | 2022-01-04 | Data.World, Inc. | Display screen or portion thereof with a graphical user interface |
USD940732S1 (en) | 2018-05-22 | 2022-01-11 | Data.World, Inc. | Display screen or portion thereof with a graphical user interface |
US11238109B2 (en) | 2017-03-09 | 2022-02-01 | Data.World, Inc. | Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform |
US11243960B2 (en) | 2018-03-20 | 2022-02-08 | Data.World, Inc. | Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures |
US11327991B2 (en) | 2018-05-22 | 2022-05-10 | Data.World, Inc. | Auxiliary query commands to deploy predictive data models for queries in a networked computing platform |
US11334625B2 (en) | 2016-06-19 | 2022-05-17 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US11360990B2 (en) | 2019-06-21 | 2022-06-14 | Salesforce.Com, Inc. | Method and a system for fuzzy matching of entities in a database system based on machine learning |
US11436500B2 (en) * | 2019-12-05 | 2022-09-06 | PeerNova, Inc. | Schema correspondence rule generation using machine learning |
US11442988B2 (en) | 2018-06-07 | 2022-09-13 | Data.World, Inc. | Method and system for editing and maintaining a graph schema |
US11468049B2 (en) | 2016-06-19 | 2022-10-11 | Data.World, Inc. | Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets |
US11537990B2 (en) | 2018-05-22 | 2022-12-27 | Data.World, Inc. | Computerized tools to collaboratively generate queries to access in-situ predictive data models in a networked computing platform |
US11675808B2 (en) | 2016-06-19 | 2023-06-13 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US11755602B2 (en) | 2016-06-19 | 2023-09-12 | Data.World, Inc. | Correlating parallelized data from disparate data sources to aggregate graph data portions to predictively identify entity data |
US11941140B2 (en) | 2016-06-19 | 2024-03-26 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11947600B2 (en) | 2021-11-30 | 2024-04-02 | Data.World, Inc. | Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures |
US11947554B2 (en) | 2016-06-19 | 2024-04-02 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US11947529B2 (en) | 2018-05-22 | 2024-04-02 | Data.World, Inc. | Generating and analyzing a data model to identify relevant data catalog data derived from graph-based data arrangements to perform an action |
US12008050B2 (en) | 2017-03-09 | 2024-06-11 | Data.World, Inc. | Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8463743B2 (en) * | 2009-02-17 | 2013-06-11 | Microsoft Corporation | Shared composite data representations and interfaces |
US8738584B2 (en) * | 2009-02-17 | 2014-05-27 | Microsoft Corporation | Context-aware management of shared composite data |
US9400647B2 (en) | 2013-03-15 | 2016-07-26 | Sap Se | Application discovery and integration using semantic metamodels |
US10505873B2 (en) | 2014-12-30 | 2019-12-10 | Sap Se | Streamlining end-to-end flow of business-to-business integration processes |
US10192202B2 (en) | 2014-12-31 | 2019-01-29 | Sap Se | Mapping for collaborative contribution |
US11128664B1 (en) * | 2016-12-08 | 2021-09-21 | Trend Micro Incorporated | Intrusion prevention system with machine learning model for real-time inspection of network traffic |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030120651A1 (en) * | 2001-12-20 | 2003-06-26 | Microsoft Corporation | Methods and systems for model matching |
US6618727B1 (en) * | 1999-09-22 | 2003-09-09 | Infoglide Corporation | System and method for performing similarity searching |
US6681223B1 (en) * | 2000-07-27 | 2004-01-20 | International Business Machines Corporation | System and method of performing profile matching with a structured document |
US20040158567A1 (en) * | 2003-02-12 | 2004-08-12 | International Business Machines Corporation | Constraint driven schema association |
US20050060345A1 (en) * | 2003-09-11 | 2005-03-17 | Andrew Doddington | Methods and systems for using XML schemas to identify and categorize documents |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6269368B1 (en) * | 1997-10-17 | 2001-07-31 | Textwise Llc | Information retrieval using dynamic evidence combination |
US6618223B1 (en) | 2000-07-18 | 2003-09-09 | Read-Rite Corporation | High speed, high areal density inductive writer |
CA2475319A1 (en) * | 2002-02-04 | 2003-08-14 | Cataphora, Inc. | A method and apparatus to visually present discussions for data mining purposes |
US7490116B2 (en) * | 2003-01-23 | 2009-02-10 | Verdasys, Inc. | Identifying history of modification within large collections of unstructured data |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6618727B1 (en) * | 1999-09-22 | 2003-09-09 | Infoglide Corporation | System and method for performing similarity searching |
US6681223B1 (en) * | 2000-07-27 | 2004-01-20 | International Business Machines Corporation | System and method of performing profile matching with a structured document |
US20030120651A1 (en) * | 2001-12-20 | 2003-06-26 | Microsoft Corporation | Methods and systems for model matching |
US20040158567A1 (en) * | 2003-02-12 | 2004-08-12 | International Business Machines Corporation | Constraint driven schema association |
US20050060345A1 (en) * | 2003-09-11 | 2005-03-17 | Andrew Doddington | Methods and systems for using XML schemas to identify and categorize documents |
Cited By (131)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080098010A1 (en) * | 2004-09-03 | 2008-04-24 | Carmel-Haifa University Economic Corp. Ltd | System and Method for Classifying, Publishing, Searching and Locating Electronic Documents |
US8799289B2 (en) * | 2004-09-03 | 2014-08-05 | Carmel-Haifa University Economic Corp. Ltd. | System and method for classifying, publishing, searching and locating electronic documents |
US20060212860A1 (en) * | 2004-09-30 | 2006-09-21 | Benedikt Michael A | Method for performing information-preserving DTD schema embeddings |
US7496571B2 (en) * | 2004-09-30 | 2009-02-24 | Alcatel-Lucent Usa Inc. | Method for performing information-preserving DTD schema embeddings |
US7353226B2 (en) * | 2005-04-22 | 2008-04-01 | The Boeing Company | Systems and methods for performing schema matching with data dictionaries |
US20060242142A1 (en) * | 2005-04-22 | 2006-10-26 | The Boeing Company | Systems and methods for performing schema matching with data dictionaries |
US20070005658A1 (en) * | 2005-07-02 | 2007-01-04 | International Business Machines Corporation | System, service, and method for automatically discovering universal data objects |
US20080263104A1 (en) * | 2006-06-15 | 2008-10-23 | Chowdhary Pawan R | Updating a data warehouse schema based on changes in an observation model |
US8024305B2 (en) * | 2006-06-15 | 2011-09-20 | International Business Machines Corporation | Updating a data warehouse schema based on changes in an observation model |
US20080071887A1 (en) * | 2006-09-19 | 2008-03-20 | Microsoft Corporation | Intelligent translation of electronic data interchange documents to extensible markup language representations |
US20080126386A1 (en) * | 2006-09-20 | 2008-05-29 | Microsoft Corporation | Translation of electronic data interchange messages to extensible markup language representation(s) |
US20080168081A1 (en) * | 2007-01-09 | 2008-07-10 | Microsoft Corporation | Extensible schemas and party configurations for edi document generation or validation |
US10621203B2 (en) | 2007-01-26 | 2020-04-14 | Information Resources, Inc. | Cross-category view of a dataset using an analytic platform |
US20160224996A1 (en) * | 2007-01-26 | 2016-08-04 | Information Resources, Inc. | Similarity matching of products based on multiple classification schemes |
US20090248587A1 (en) * | 2007-08-31 | 2009-10-01 | Van Buskirk Peter C | Selectively negotiated ridershare system comprising riders, drivers, and vehicles |
US20090112916A1 (en) * | 2007-10-30 | 2009-04-30 | Gunther Stuhec | Creating a mapping |
US8041746B2 (en) | 2007-10-30 | 2011-10-18 | Sap Ag | Mapping schemas using a naming rule |
EP2077505A3 (en) * | 2007-11-15 | 2011-06-22 | Canon Kabushiki Kaisha | Data compression apparatus, data decompression apparatus, and method for compressing data |
EP2077505A2 (en) * | 2007-11-15 | 2009-07-08 | Canon Kabushiki Kaisha | Data compression apparatus, data decompression apparatus, and method for compressing data |
US20090132569A1 (en) * | 2007-11-15 | 2009-05-21 | Canon Kabushiki Kaisha | Data compression apparatus, data decompression apparatus, and method for compressing data |
US8229975B2 (en) | 2007-11-15 | 2012-07-24 | Canon Kabushiki Kaisha | Data compression apparatus, data decompression apparatus, and method for compressing data |
US20130081065A1 (en) * | 2010-06-02 | 2013-03-28 | Dhiraj Sharan | Dynamic Multidimensional Schemas for Event Monitoring |
US20120179644A1 (en) * | 2010-07-09 | 2012-07-12 | Daniel Paul Miranker | Automatic Synthesis and Presentation of OLAP Cubes from Semantically Enriched Data Sources |
US9495429B2 (en) * | 2010-07-09 | 2016-11-15 | Daniel Paul Miranker | Automatic synthesis and presentation of OLAP cubes from semantically enriched data sources |
US20120095973A1 (en) * | 2010-10-15 | 2012-04-19 | Expressor Software | Method and system for developing data integration applications with reusable semantic types to represent and process application data |
US8954375B2 (en) * | 2010-10-15 | 2015-02-10 | Qliktech International Ab | Method and system for developing data integration applications with reusable semantic types to represent and process application data |
US10860653B2 (en) | 2010-10-22 | 2020-12-08 | Data.World, Inc. | System for accessing a relational database using semantic queries |
US11409802B2 (en) | 2010-10-22 | 2022-08-09 | Data.World, Inc. | System for accessing a relational database using semantic queries |
US20120144028A1 (en) * | 2010-12-07 | 2012-06-07 | Mark Blackburn | Monitoring processes in a computer |
US9646246B2 (en) | 2011-02-24 | 2017-05-09 | Salesforce.Com, Inc. | System and method for using a statistical classifier to score contact entities |
US20130141585A1 (en) * | 2011-12-02 | 2013-06-06 | Hidehiro Naito | Checkout system and method for operating checkout system |
US20130297661A1 (en) * | 2012-05-03 | 2013-11-07 | Salesforce.Com, Inc. | System and method for mapping source columns to target columns |
US8972336B2 (en) * | 2012-05-03 | 2015-03-03 | Salesforce.Com, Inc. | System and method for mapping source columns to target columns |
US9098550B2 (en) * | 2012-05-17 | 2015-08-04 | Sap Se | Systems and methods for performing data analysis for model proposals |
US20130311456A1 (en) * | 2012-05-17 | 2013-11-21 | Sap Ag | Systems and Methods for Performing Data Analysis for Model Proposals |
US20180107731A1 (en) * | 2012-10-22 | 2018-04-19 | Palantir Technologies Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US10891312B2 (en) * | 2012-10-22 | 2021-01-12 | Palantir Technologies Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US9626451B2 (en) | 2013-07-23 | 2017-04-18 | Sap Se | Canonical data model for iterative effort reduction in business-to-business schema integration |
US9311429B2 (en) | 2013-07-23 | 2016-04-12 | Sap Se | Canonical data model for iterative effort reduction in business-to-business schema integration |
US9262735B2 (en) * | 2013-08-12 | 2016-02-16 | International Business Machines Corporation | Identifying and amalgamating conditional actions in business processes |
US9558462B2 (en) * | 2013-08-12 | 2017-01-31 | International Business Machines Corporation | Identifying and amalgamating conditional actions in business processes |
US20150046150A1 (en) * | 2013-08-12 | 2015-02-12 | International Business Machines Corporation | Identifying and amalgamating conditional actions in business processes |
US11573935B2 (en) | 2014-01-09 | 2023-02-07 | International Business Machines Corporation | Determining the schema of a graph dataset |
US9710496B2 (en) * | 2014-01-09 | 2017-07-18 | International Business Machines Corporation | Determining the schema of a graph dataset |
US20150193478A1 (en) * | 2014-01-09 | 2015-07-09 | International Business Machines Corporation | Method and Apparatus for Determining the Schema of a Graph Dataset |
US20170149718A1 (en) * | 2015-11-23 | 2017-05-25 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US10230677B2 (en) * | 2015-11-23 | 2019-03-12 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US10225227B2 (en) * | 2015-11-23 | 2019-03-05 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US20170149907A1 (en) * | 2015-11-23 | 2017-05-25 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US10642802B2 (en) | 2015-11-23 | 2020-05-05 | International Business Machines Corporation | Identifying an entity associated with an online communication |
US10853376B2 (en) | 2016-06-19 | 2020-12-01 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
US11086896B2 (en) | 2016-06-19 | 2021-08-10 | Data.World, Inc. | Dynamic composite data dictionary to facilitate data operations via computerized tools configured to access collaborative datasets in a networked computing platform |
US10452975B2 (en) | 2016-06-19 | 2019-10-22 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US10515085B2 (en) | 2016-06-19 | 2019-12-24 | Data.World, Inc. | Consolidator platform to implement collaborative datasets via distributed computer networks |
US12061617B2 (en) | 2016-06-19 | 2024-08-13 | Data.World, Inc. | Consolidator platform to implement collaborative datasets via distributed computer networks |
US10438013B2 (en) | 2016-06-19 | 2019-10-08 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11609680B2 (en) | 2016-06-19 | 2023-03-21 | Data.World, Inc. | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
US10645548B2 (en) | 2016-06-19 | 2020-05-05 | Data.World, Inc. | Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets |
US10691710B2 (en) | 2016-06-19 | 2020-06-23 | Data.World, Inc. | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
US10699027B2 (en) | 2016-06-19 | 2020-06-30 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US10747774B2 (en) | 2016-06-19 | 2020-08-18 | Data.World, Inc. | Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets |
US11675808B2 (en) | 2016-06-19 | 2023-06-13 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US11726992B2 (en) | 2016-06-19 | 2023-08-15 | Data.World, Inc. | Query generation for collaborative datasets |
US10353911B2 (en) | 2016-06-19 | 2019-07-16 | Data.World, Inc. | Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets |
US10346429B2 (en) | 2016-06-19 | 2019-07-09 | Data.World, Inc. | Management of collaborative datasets via distributed computer networks |
US10860613B2 (en) | 2016-06-19 | 2020-12-08 | Data.World, Inc. | Management of collaborative datasets via distributed computer networks |
US10860601B2 (en) | 2016-06-19 | 2020-12-08 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US10860600B2 (en) | 2016-06-19 | 2020-12-08 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US10324925B2 (en) | 2016-06-19 | 2019-06-18 | Data.World, Inc. | Query generation for collaborative datasets |
US11468049B2 (en) | 2016-06-19 | 2022-10-11 | Data.World, Inc. | Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets |
US10963486B2 (en) | 2016-06-19 | 2021-03-30 | Data.World, Inc. | Management of collaborative datasets via distributed computer networks |
US10984008B2 (en) | 2016-06-19 | 2021-04-20 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
US11947554B2 (en) | 2016-06-19 | 2024-04-02 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US11016931B2 (en) | 2016-06-19 | 2021-05-25 | Data.World, Inc. | Data ingestion to generate layered dataset interrelations to form a system of networked collaborative datasets |
US11023104B2 (en) | 2016-06-19 | 2021-06-01 | Data.World, Inc. | Interactive interfaces as computerized tools to present summarization data of dataset attributes for collaborative datasets |
US11036697B2 (en) | 2016-06-19 | 2021-06-15 | Data.World, Inc. | Transmuting data associations among data arrangements to facilitate data operations in a system of networked collaborative datasets |
US11036716B2 (en) | 2016-06-19 | 2021-06-15 | Data.World, Inc. | Layered data generation and data remediation to facilitate formation of interrelated data in a system of networked collaborative datasets |
US11042548B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Aggregation of ancillary data associated with source data in a system of networked collaborative datasets |
US11042560B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Extended computerized query language syntax for analyzing multiple tabular data arrangements in data-driven collaborative projects |
US11042537B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Link-formative auxiliary queries applied at data ingestion to facilitate data operations in a system of networked collaborative datasets |
US11042556B2 (en) | 2016-06-19 | 2021-06-22 | Data.World, Inc. | Localized link formation to perform implicitly federated queries using extended computerized query language syntax |
US11068847B2 (en) | 2016-06-19 | 2021-07-20 | Data.World, Inc. | Computerized tools to facilitate data project development via data access layering logic in a networked computing platform including collaborative datasets |
US11734564B2 (en) | 2016-06-19 | 2023-08-22 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11068475B2 (en) | 2016-06-19 | 2021-07-20 | Data.World, Inc. | Computerized tools to develop and manage data-driven projects collaboratively via a networked computing platform and collaborative datasets |
US11755602B2 (en) | 2016-06-19 | 2023-09-12 | Data.World, Inc. | Correlating parallelized data from disparate data sources to aggregate graph data portions to predictively identify entity data |
US11093633B2 (en) | 2016-06-19 | 2021-08-17 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11163755B2 (en) | 2016-06-19 | 2021-11-02 | Data.World, Inc. | Query generation for collaborative datasets |
US11176151B2 (en) | 2016-06-19 | 2021-11-16 | Data.World, Inc. | Consolidator platform to implement collaborative datasets via distributed computer networks |
US11194830B2 (en) | 2016-06-19 | 2021-12-07 | Data.World, Inc. | Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets |
US11210307B2 (en) | 2016-06-19 | 2021-12-28 | Data.World, Inc. | Consolidator platform to implement collaborative datasets via distributed computer networks |
US11210313B2 (en) | 2016-06-19 | 2021-12-28 | Data.World, Inc. | Computerized tools to discover, form, and analyze dataset interrelations among a system of networked collaborative datasets |
US10452677B2 (en) | 2016-06-19 | 2019-10-22 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
US11941140B2 (en) | 2016-06-19 | 2024-03-26 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11423039B2 (en) | 2016-06-19 | 2022-08-23 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
US11246018B2 (en) | 2016-06-19 | 2022-02-08 | Data.World, Inc. | Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets |
US11386218B2 (en) | 2016-06-19 | 2022-07-12 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11373094B2 (en) | 2016-06-19 | 2022-06-28 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11277720B2 (en) | 2016-06-19 | 2022-03-15 | Data.World, Inc. | Computerized tool implementation of layered data files to discover, form, or analyze dataset interrelations of networked collaborative datasets |
US11314734B2 (en) | 2016-06-19 | 2022-04-26 | Data.World, Inc. | Query generation for collaborative datasets |
US11327996B2 (en) | 2016-06-19 | 2022-05-10 | Data.World, Inc. | Interactive interfaces to present data arrangement overviews and summarized dataset attributes for collaborative datasets |
US11928596B2 (en) | 2016-06-19 | 2024-03-12 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11334625B2 (en) | 2016-06-19 | 2022-05-17 | Data.World, Inc. | Loading collaborative datasets into data stores for queries via distributed computer networks |
US11334793B2 (en) | 2016-06-19 | 2022-05-17 | Data.World, Inc. | Platform management of integrated access of public and privately-accessible datasets utilizing federated query generation and query schema rewriting optimization |
US11816118B2 (en) | 2016-06-19 | 2023-11-14 | Data.World, Inc. | Collaborative dataset consolidation via distributed computer networks |
US11366824B2 (en) | 2016-06-19 | 2022-06-21 | Data.World, Inc. | Dataset analysis and dataset attribute inferencing to form collaborative datasets |
CN106651317A (en) * | 2016-12-28 | 2017-05-10 | 浙江省公众信息产业有限公司 | Method and device for judging business process correlation |
US11238109B2 (en) | 2017-03-09 | 2022-02-01 | Data.World, Inc. | Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform |
US11669540B2 (en) | 2017-03-09 | 2023-06-06 | Data.World, Inc. | Matching subsets of tabular data arrangements to subsets of graphical data arrangements at ingestion into data-driven collaborative datasets |
US11068453B2 (en) | 2017-03-09 | 2021-07-20 | Data.World, Inc. | Determining a degree of similarity of a subset of tabular data arrangements to subsets of graph data arrangements at ingestion into a data-driven collaborative dataset platform |
US10824637B2 (en) | 2017-03-09 | 2020-11-03 | Data.World, Inc. | Matching subsets of tabular data arrangements to subsets of graphical data arrangements at ingestion into data driven collaborative datasets |
US12008050B2 (en) | 2017-03-09 | 2024-06-11 | Data.World, Inc. | Computerized tools configured to determine subsets of graph data arrangements for linking relevant data to enrich datasets associated with a data-driven collaborative dataset platform |
CN108416525A (en) * | 2018-03-13 | 2018-08-17 | 三峡大学 | Method for measuring procedural model similarity based on metadata |
US11263185B2 (en) | 2018-03-19 | 2022-03-01 | Perkinelmer Informatics, Inc. | Methods and systems for automating clinical data mapping and transformation |
WO2019182977A1 (en) * | 2018-03-19 | 2019-09-26 | Perkinelmer Informatics, Inc. | Methods and systems for automating clinical data mapping and transformation |
US11243960B2 (en) | 2018-03-20 | 2022-02-08 | Data.World, Inc. | Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures |
US10922308B2 (en) | 2018-03-20 | 2021-02-16 | Data.World, Inc. | Predictive determination of constraint data for application with linked data in graph-based datasets associated with a data-driven collaborative dataset platform |
US11573948B2 (en) | 2018-03-20 | 2023-02-07 | Data.World, Inc. | Predictive determination of constraint data for application with linked data in graph-based datasets associated with a data-driven collaborative dataset platform |
USD940169S1 (en) | 2018-05-22 | 2022-01-04 | Data.World, Inc. | Display screen or portion thereof with a graphical user interface |
USD920353S1 (en) | 2018-05-22 | 2021-05-25 | Data.World, Inc. | Display screen or portion thereof with graphical user interface |
US11537990B2 (en) | 2018-05-22 | 2022-12-27 | Data.World, Inc. | Computerized tools to collaboratively generate queries to access in-situ predictive data models in a networked computing platform |
US11947529B2 (en) | 2018-05-22 | 2024-04-02 | Data.World, Inc. | Generating and analyzing a data model to identify relevant data catalog data derived from graph-based data arrangements to perform an action |
USD940732S1 (en) | 2018-05-22 | 2022-01-11 | Data.World, Inc. | Display screen or portion thereof with a graphical user interface |
US11327991B2 (en) | 2018-05-22 | 2022-05-10 | Data.World, Inc. | Auxiliary query commands to deploy predictive data models for queries in a networked computing platform |
US11442988B2 (en) | 2018-06-07 | 2022-09-13 | Data.World, Inc. | Method and system for editing and maintaining a graph schema |
US11657089B2 (en) | 2018-06-07 | 2023-05-23 | Data.World, Inc. | Method and system for editing and maintaining a graph schema |
US11474978B2 (en) * | 2018-07-06 | 2022-10-18 | Capital One Services, Llc | Systems and methods for a data search engine based on data profiles |
US20200012626A1 (en) * | 2018-07-06 | 2020-01-09 | Capital One Services, Llc | Systems and methods for a data search engine based on data profiles |
US11360990B2 (en) | 2019-06-21 | 2022-06-14 | Salesforce.Com, Inc. | Method and a system for fuzzy matching of entities in a database system based on machine learning |
US11436500B2 (en) * | 2019-12-05 | 2022-09-06 | PeerNova, Inc. | Schema correspondence rule generation using machine learning |
CN111782817A (en) * | 2020-05-30 | 2020-10-16 | 国网福建省电力有限公司信息通信分公司 | Knowledge graph construction method and device for information system and electronic equipment |
US11947600B2 (en) | 2021-11-30 | 2024-04-02 | Data.World, Inc. | Content addressable caching and federation in linked data projects in a data-driven collaborative dataset platform using disparate database architectures |
Also Published As
Publication number | Publication date |
---|---|
US20100250559A1 (en) | 2010-09-30 |
US8271503B2 (en) | 2012-09-18 |
Similar Documents
Publication | Title |
---|---|
US8271503B2 (en) | Automatic match tuning |
US11256555B2 (en) | Automatically scalable system for serverless hyperparameter tuning |
EP3591586A1 (en) | Data model generation using generative adversarial networks and fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome |
US7958113B2 (en) | Automatically and adaptively determining execution plans for queries with parameter markers |
US9773053B2 (en) | Method and apparatus for processing electronic data |
CN114175010A (en) | Finding semantic meaning of data fields from profile data of the data fields |
US8612367B2 (en) | Learning similarity function for rare queries |
CN112231592B (en) | Graph-based network community discovery method, device, equipment and storage medium |
US11366806B2 (en) | Automated feature generation for machine learning application |
Zhang et al. | Influence-aware truth discovery |
US20230195809A1 (en) | Joint personalized search and recommendation with hypergraph convolutional networks |
US9400826B2 (en) | Method and system for aggregate content modeling |
US11615080B1 (en) | System, method, and computer program for converting a natural language query to a nested database query |
US8650180B2 (en) | Efficient optimization over uncertain data |
Wang et al. | Approximate truth discovery via problem scale reduction |
Gao et al. | A user-knowledge dynamic pattern matching process and optimization strategy based on the expert knowledge recommendation system |
Marie et al. | Managing uncertainty in schema matcher ensembles |
Qinl et al. | Synthesizing privacy preserving entity resolution datasets |
Gu et al. | Improving the quality of web-based data imputation with crowd intervention |
US11922326B2 (en) | Data management suggestions from knowledge graph actions |
US11748561B1 (en) | Apparatus and methods for employment application assessment |
US20230100716A1 (en) | Self-optimizing context-aware problem identification from information technology incident reports |
Wan et al. | Multivariate time series data clustering method based on dynamic time warping and affinity propagation |
Trummer | BABOONS: Black-box optimization of data summaries in natural language |
Vargas-Vera et al. | A framework for detecting and removing knowledge overlaps in a collaborative environment: case of study a computer configuration problem |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: SAP AKTIENGESELLSCHAFT, GERMANY. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: GLAENZER, HELMUT K.; STUHEC, GUNTHER; REEL/FRAME: 015043/0135. Effective date: 20040719 |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |