
Instance matching is the problem of matching pairs of instances that refer to the same underlying entity, with numerous documented applications in the Semantic Web community [4]. Current state-of-the-art instance matchers use a variety of machine learning techniques to achieve effective performance. Many of these systems are supervised, and require sets of manually annotated samples to train their classifiers. This manual effort is expensive in open communities.

In recent years, minimally supervised approaches have been devised to alleviate extensive labeling effort [2]. While such approaches perform reasonably in many cases, a comparative analysis shows that there is still a considerable gap between their performance and that of supervised systems. An additional problem is that such systems rely on the specification of a function called the pseudo F-Measure (PFM). Intuitively, the PFM serves as a proxy for the true F-Measure, with minimally supervised instance matchers heuristically attempting to optimize the PFM instead of the true (unknown) F-Measure [3]. A recent study found the PFM to be uncorrelated (or negatively correlated) with the true F-Measure in several cases [2], raising concerns about whether currently defined PFMs are appropriate proxies.

Fig. 1. The proposed instance matching system. The dotted component (the classification step of instance matching) is iteratively executed for a pre-defined number of rounds and constitutes the key innovation of this work

This paper presents an alternative minimally supervised instance matching approach that offers a practical compromise between the two paradigms above. The architecture of the system is illustrated in Fig. 1. The proposed system expects a few seed training samples to bootstrap itself. Preprocessing steps that are often specific to instance matching systems (blocking and restriction set generation) are then performed; we implemented existing state-of-the-art modules for these steps to maximize system performance. Two sets of real-valued feature vectors (see footnote 1) are output by the pipeline prior to the classification step (the dotted component in Fig. 1).
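To make the preprocessing step concrete, the following is a minimal sketch of key-based blocking, which restricts candidate generation to instance pairs that share a blocking key rather than comparing all pairs. The instances, the `block_by_key` helper, and the choice of key function are all hypothetical illustrations, not the system's actual modules.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(instances, key_fn):
    """Group instances by a blocking key; only pairs that share a key
    become candidate pairs, avoiding a quadratic all-pairs comparison."""
    blocks = defaultdict(list)
    for inst in instances:
        blocks[key_fn(inst)].append(inst)
    candidates = set()
    for group in blocks.values():
        candidates.update(combinations(sorted(group), 2))
    return candidates

# Toy instances, keyed on the first three characters of a name attribute.
people = ["alice smith", "alicia smythe", "bob jones", "bobby jones"]
pairs = block_by_key(people, key_fn=lambda s: s[:3])
# Only within-block pairs survive; cross-block pairs are never compared.
```

Each surviving candidate pair would then be converted into a real-valued feature vector (e.g. string similarities between attribute values) for the classification step.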

The first output is the set of seed training samples, with each matching and non-matching instance pair in the set converted to a positively or negatively labeled feature vector; these vectors are used to train an ensemble classifier. An ensemble classifier takes a base classifier as input and trains it using a meta-classification strategy called boosting [1]. The goal is to obtain a committee (or ensemble) of base classifiers that classify samples by weighted majority voting, which has been shown to improve performance on many challenging tasks [1].
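The boosting step can be sketched with scikit-learn's `AdaBoostClassifier`. The feature vectors below are synthetic stand-ins for the similarity features of matching and non-matching pairs, and the default decision-stump base learner is used purely for brevity; the actual system boosts random forests or multilayer perceptrons.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
# Toy feature vectors: similarity scores for matching (label 1) and
# non-matching (label 0) instance pairs; purely illustrative data.
X_pos = rng.normal(0.8, 0.1, size=(40, 3))
X_neg = rng.normal(0.2, 0.1, size=(40, 3))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 40 + [0] * 40)

# AdaBoost trains a weighted committee of base classifiers that vote
# on each sample; here the base learner is the library default.
clf = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # match likelihood per feature vector
```

The `predict_proba` scores are what the probabilistic matching step below ranks candidates by.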

The second output is the candidate set, which represents the unseen (or test) data in the classification task. These are unlabeled feature vectors, with each vector representing an underlying instance pair. The trained ensemble classifier is used for probabilistic instance matching, where the classifier scores each feature vector in the candidate set according to its likelihood of having a positive label. If the seed training set is small (\({\le }2\,\%\) of the ground-truth), the initial output is not expected to have high quality. Instead, the system assumes that a percentage of the top-scoring feature vectors are correctly labeled, and uses them to iteratively self-train in a semi-supervised fashion. In our work, we showed that an aggressive strategy (doubling the size of the self-training set in each successive iteration) performed well, with convergence achieved within 7 self-training iterations. The overall intent of self-training is to improve generalization performance with each iteration, with large gains anticipated in the initial iterations.
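The self-training loop described above can be sketched as follows. This is a simplified illustration, not the system's implementation: it pseudo-labels the most and least confident candidates as matches and non-matches respectively (the paper only specifies trusting top-scoring vectors), and the `self_train` function, its parameters, and the toy data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def self_train(clf, X_seed, y_seed, X_cand, start=4, rounds=7):
    """Iterative self-training: after each round, trust the classifier's
    most confident candidate scores as pseudo-labels, doubling the size
    of the self-training set in each successive iteration."""
    X_train, y_train = X_seed, y_seed
    take = start
    for _ in range(rounds):
        clf.fit(X_train, y_train)
        probs = clf.predict_proba(X_cand)[:, 1]
        k = min(take, len(X_cand) // 2)     # keep the two slices disjoint
        order = np.argsort(probs)
        top, bottom = order[-k:], order[:k]  # confident matches / non-matches
        X_train = np.vstack([X_seed, X_cand[top], X_cand[bottom]])
        y_train = np.concatenate([y_seed, np.ones(k), np.zeros(k)])
        take *= 2  # the aggressive doubling strategy
    return clf

rng = np.random.default_rng(1)
# A tiny seed set and a larger unlabeled candidate set (toy data).
X_seed = np.vstack([rng.normal(0.8, 0.1, (3, 2)), rng.normal(0.2, 0.1, (3, 2))])
y_seed = np.array([1, 1, 1, 0, 0, 0])
X_cand = np.vstack([rng.normal(0.8, 0.1, (20, 2)), rng.normal(0.2, 0.1, (20, 2))])
clf = self_train(AdaBoostClassifier(n_estimators=10, random_state=0),
                 X_seed, y_seed, X_cand)
```

With only six seed samples, each round enlarges the effective training set with pseudo-labeled candidates, which is where the early generalization gains come from.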

To the best of our knowledge, this is the first minimally supervised instance matching system that combines boosting methods with iterative semi-supervised learning to achieve effective performance. The ensemble classifier is trained using the AdaBoost algorithm with a choice of two base classifiers, random forests (RFs) and multilayer perceptrons (MLPs), both of which have been individually validated for good instance matching performance in related work.

Experiments: We compare the two settings of the proposed system (with RF or MLP as the base classifier) on six real-world benchmarks covering the domains of people (Persons 1 & 2), restaurants (Restaurants), bibliographies (ACM-DBLP) and e-commerce (Amazon-GP, Abt-Buy). In our original work (see footnote 1), the highest F-Measure (averaged over all the benchmarks) achieved by the MLP setting of the system (76.78 %; see also Table 1) was within 2.5 % of that of state-of-the-art supervised approaches (79.25 %), and outperformed various PFM optimization-based minimally supervised approaches (72.18 %). This was not true of the RF-based setting. Table 1 directly compares the two settings. Essentially, we record the highest achieved F-Measure of each setting on each benchmark over 1–7 self-training iterations (equivalently, runs) and report the run in which the highest F-Measure was recorded, along with the corresponding precision and recall. The table shows that, while the difference between the settings on Amazon-GP, Abt-Buy and on average is considerable, it is quite small on the other four test cases. RF even outperforms MLP by 5.32 % on Restaurants.

An analysis of Table 1 shows that MLP is preferable to RF. However, the recorded training times of both settings (Table 2) show that there is a cost to using MLP over RF. The training time of MLP depends at least linearly on the number of training samples, roughly doubling in each successive iteration (when the size of the self-training set also doubles), while RF training time grows slowly. To illustrate the difference, consider that, in Table 1, the best performance of MLP on Abt-Buy occurs in the fifth iteration, with a training time of 887.28 s. RF achieves its best performance in the seventh iteration, with a training time of 46.49 s. On large datasets, these differences (a factor of 19) will amplify even further, calling into question the practical feasibility of MLP.

Table 1. A comparison of the highest achieved F-Measure (and corresponding precision and recall) over 7 self-training runs of the system when using MLP/RF as base classifier in the AdaBoost algorithm
Table 2. A comparison of the training times (in seconds) over 7 self-training iterations of the system when using MLP/RF as base classifier in the AdaBoost algorithm. x indicates that reliable results could not be obtained for that run due to system-specific factors. Averages are computed over non-x values

Future Work: The excellent performance achieved by boosted MLP on the benchmarks showed that it is feasible to devise minimally supervised approaches that nearly equal the performance of supervised approaches, which require training on a non-trivially obtained manually labeled set (\({\ge }10\,\%\) of ground-truth). However, a close look at the training times of the MLP against a cheaper alternative, namely RFs, shows that the cost of training the MLP relative to the RF can be prohibitive. Future work will seek to make progress on this front by (i) predicting, through meta-strategies, when to use RF vs. MLP to secure the best cost-performance trade-off, and (ii) developing classifiers that can match the benefits of MLPs but without their prohibitive training times.