
Instance matching is the problem of matching pairs of instances that refer to the same underlying entity, with numerous documented applications in the Semantic Web community [4]. Current state-of-the-art instance matchers use a variety of machine learning techniques to achieve effective performance. Many of these systems are supervised, and require sets of manually annotated samples to train their classifiers. This manual effort is expensive in open communities.

In recent years, minimally supervised approaches have been devised to alleviate extensive labeling effort [2]. While such approaches perform reasonably in many cases, a comparative analysis shows that there is still a considerable gap between their performance and that of supervised systems. An additional problem is that such systems rely on the specification of a function called the pseudo F-Measure (PFM). Intuitively, the PFM serves as a proxy for the true F-Measure, with minimally supervised instance matchers heuristically attempting to optimize the PFM instead of the true (unknown) F-Measure [3]. A recent study found the PFM to be uncorrelated (or negatively correlated) with the true F-Measure in several cases [2], raising concerns about whether currently defined PFMs are appropriate proxies.

Fig. 1. The proposed instance matching system. The dotted component (the classification step of instance matching) is iteratively executed for a pre-defined number of rounds and constitutes the key innovation of this work

This paper presents an alternative minimally supervised instance matching approach that offers a practical compromise between the two paradigms above. The architecture of the system is illustrated in Fig. 1. The proposed system expects a few seed training samples to bootstrap itself. Preprocessing steps that are often specific to instance matching systems (blocking and restriction set generation) are then performed; we implemented existing state-of-the-art modules for these steps to maximize system performance. Two sets of real-valued feature vectors (see footnote 1) are output by the pipeline prior to the classification step (the dotted component in Fig. 1).
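To make the preprocessing step concrete, the following is a minimal sketch of key-based blocking, which restricts candidate generation to instance pairs that share a blocking key rather than comparing all pairs. The instances, the `block_by_key` helper, and the choice of key function are all hypothetical illustrations, not the system's actual modules.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(instances, key_fn):
    """Group instances by a blocking key; only pairs that share a key
    become candidate pairs, avoiding a quadratic all-pairs comparison."""
    blocks = defaultdict(list)
    for inst in instances:
        blocks[key_fn(inst)].append(inst)
    candidates = set()
    for group in blocks.values():
        candidates.update(combinations(sorted(group), 2))
    return candidates

# Toy instances, keyed on the first three characters of a name attribute.
people = ["alice smith", "alicia smythe", "bob jones", "bobby jones"]
pairs = block_by_key(people, key_fn=lambda s: s[:3])
# Only within-block pairs survive; cross-block pairs are never compared.
```

Each surviving candidate pair would then be converted into a real-valued feature vector (e.g. string similarities between attribute values) for the classification step.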

The first output is the set of seed training samples, with each matching and non-matching instance pair in the set converted to a positively or negatively labeled feature vector; these vectors are used to train an ensemble classifier. An ensemble classifier takes a base classifier as input and trains it using a meta-classification strategy called boosting [1]. The goal is to obtain a committee (or ensemble) of base classifiers that classify samples by weighted majority voting, which has been shown to improve performance on many challenging tasks [1].
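The boosting step can be sketched with scikit-learn's `AdaBoostClassifier`. The feature vectors below are synthetic stand-ins for the similarity features of matching and non-matching pairs, and the default decision-stump base learner is used purely for brevity; the actual system boosts random forests or multilayer perceptrons.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
# Toy feature vectors: similarity scores for matching (label 1) and
# non-matching (label 0) instance pairs; purely illustrative data.
X_pos = rng.normal(0.8, 0.1, size=(40, 3))
X_neg = rng.normal(0.2, 0.1, size=(40, 3))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 40 + [0] * 40)

# AdaBoost trains a weighted committee of base classifiers that vote
# on each sample; here the base learner is the library default.
clf = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # match likelihood per feature vector
```

The `predict_proba` scores are what the probabilistic matching step below ranks candidates by.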

The second output is the candidate set, which represents the unseen (or test) data in the classification task. These are unlabeled feature vectors, with each vector representing an underlying instance pair. The trained ensemble classifier is used for probabilistic instance matching, where the classifier scores each feature vector in the candidate set according to its likelihood of having a positive label. If the seed training set is small (\({\le }2\,\%\) of the ground-truth), the initial output is not expected to have high quality. Instead, the system assumes that a percentage of the top-scoring feature vectors are correctly labeled, and uses them to iteratively self-train in a semi-supervised fashion. In our work, we showed that an aggressive strategy (doubling the size of the self-training set in each successive iteration) performed well, with convergence achieved within 7 self-training iterations. The overall intent of self-training is to improve generalization performance with each iteration, with large gains anticipated in the initial iterations.
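The self-training loop described above can be sketched as follows. This is a simplified illustration, not the system's implementation: it pseudo-labels the most and least confident candidates as matches and non-matches respectively (the paper only specifies trusting top-scoring vectors), and the `self_train` function, its parameters, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def self_train(clf, X_seed, y_seed, X_cand, start=4, rounds=7):
    """Iterative self-training: after each round, trust the classifier's
    most confident candidate scores as pseudo-labels, doubling the size
    of the self-training set in each successive iteration."""
    X_train, y_train = X_seed, y_seed
    take = start
    for _ in range(rounds):
        clf.fit(X_train, y_train)
        probs = clf.predict_proba(X_cand)[:, 1]
        k = min(take, len(X_cand) // 2)     # keep the two slices disjoint
        order = np.argsort(probs)
        top, bottom = order[-k:], order[:k]  # confident matches / non-matches
        X_train = np.vstack([X_seed, X_cand[top], X_cand[bottom]])
        y_train = np.concatenate([y_seed, np.ones(k), np.zeros(k)])
        take *= 2  # the aggressive doubling strategy
    return clf

rng = np.random.default_rng(1)
# A tiny seed set and a larger unlabeled candidate set (toy data).
X_seed = np.vstack([rng.normal(0.8, 0.1, (3, 2)), rng.normal(0.2, 0.1, (3, 2))])
y_seed = np.array([1, 1, 1, 0, 0, 0])
X_cand = np.vstack([rng.normal(0.8, 0.1, (20, 2)), rng.normal(0.2, 0.1, (20, 2))])
clf = self_train(AdaBoostClassifier(n_estimators=10, random_state=0),
                 X_seed, y_seed, X_cand)
```

With only six seed samples, each round enlarges the effective training set with pseudo-labeled candidates, which is where the early generalization gains come from.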

To the best of our knowledge, this is the first minimally supervised instance matching system that combines boosting methods with iterative semi-supervised learning to achieve effective performance. The ensemble classifier is trained using the AdaBoost algorithm with a choice of two base classifiers, random forests (RFs) and multilayer perceptrons (MLPs), both of which have been individually validated for good instance matching performance in related work.

Experiments: We compare the two settings of the proposed system (with RF or MLP as the base classifier) on six real-world benchmarks covering the domains of people (Persons 1 & 2), restaurants (Restaurants), bibliographies (ACM-DBLP) and e-commerce (Amazon-GP, Abt-Buy). In our original work (see footnote 1), the highest F-Measure (averaged over all the benchmarks) achieved by the MLP setting of the system (76.78 %; see also Table 1) was within 2.5 % of that of state-of-the-art supervised approaches (79.25 %), and outperformed various PFM optimization-based minimally supervised approaches (72.18 %). This was not true of the RF-based setting. Table 1 directly compares the two settings. Essentially, we record the highest achieved F-Measure of each setting on each benchmark over 1–7 self-training iterations (equivalently, runs) and report the run in which the highest F-Measure was recorded, along with the corresponding precision and recall. The table shows that, while the difference between the settings on Amazon-GP, Abt-Buy and on average is considerable, it is quite small on the other four test cases. RF even outperforms MLP by 5.32 % on Restaurants.

An analysis of Table 1 shows that MLP is preferable to RF. However, the recorded training times of both settings (Table 2) show that there is a cost to using MLP over RF. The training time of MLP depends at least linearly on the number of training samples, roughly doubling in each successive iteration (when the size of the self-training set also doubles), while RF training time grows slowly. To illustrate the difference, consider that, in Table 1, the best performance of MLP on Abt-Buy occurs in the fifth iteration, with a training time of 887.28 s. RF achieves its best performance in the seventh iteration, with a training time of 46.49 s. On large datasets, these differences (a factor of 19) will amplify even further, calling into question the practical feasibility of MLP.

Table 1. A comparison of the highest achieved F-Measure (and corresponding precision and recall) over 7 self-training runs of the system when using MLP/RF as base classifier in the AdaBoost algorithm
Table 2. A comparison of the training times (in seconds) over 7 self-training iterations of the system when using MLP/RF as base classifier in the AdaBoost algorithm. x indicates that reliable results could not be obtained for that run due to system-specific factors. Averages are computed over non-x values

Future Work: The excellent performance achieved by boosted MLP on the benchmarks showed that it is feasible to devise minimally supervised approaches that nearly equal the performance of supervised approaches, which require training on a non-trivially obtained manually labeled set (\({\ge }10\,\%\) of ground-truth). However, a close look at the training times of the MLP against a cheaper alternative, namely RFs, shows that the cost of training the MLP relative to the RF can be prohibitive. Future work will seek to make progress on this front by (i) predicting, through meta-strategies, when to use RF vs. MLP to secure the best cost-performance trade-off, and (ii) developing classifiers that can match the benefits of MLPs but without their prohibitive training times.