Abstract
Dataset creation for training natural language processing (NLP)
algorithms is often accompanied by uncertainty about how the
target concept is represented in the
data. Extracting such data from web pages and verifying its
quality is a non-trivial task, due to the Web's unstructured and
heterogeneous nature and the cost of annotation. In that
situation, annotation heuristics can be employed to create a
dataset that captures the target concept, but this may in turn
lead to unstable downstream performance. On the one hand, a
trade-off between cost, quality, and magnitude exists for
annotation heuristics in tasks such as classification, leading to
fluctuations in the performance of trained models. On the other hand,
general-purpose NLP tools like BERT are now commonly used to
benchmark new models on a range of tasks on static datasets. We
utilize this standardization as a means to assess dataset
quality instead, as most applications are dataset-specific. In this
study, we investigate and evaluate the performance of three
annotation heuristics for a classification task on extracted web
data using BERT. We present multiple datasets from which the
classifier is to learn to identify web pages that are centered
around an individual in the academic domain. In addition, we
assess the relationship between the performance of the trained
classifier and the training data size. The models are further
tested on out-of-domain web pages to assess the influence of the
individuals' occupation and web page domain.
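To make the classification setup concrete, the following is a minimal sketch, not the authors' implementation, of fine-tuning a BERT classifier on labelled web page texts, assuming the HuggingFace transformers library; the example pages, labels, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of fine-tuning BERT to classify whether a web page is
# centered around an individual. Example pages, labels, and hyperparameters
# are illustrative assumptions, not the authors' setup.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical training examples: (page text, label), 1 = person-centered page.
pages = [
    ("Dr. Jane Doe is an associate professor of biology at ...", 1),
    ("Department of Biology: course catalogue, news, and events.", 0),
]

def collate(batch):
    texts, labels = zip(*batch)
    enc = tokenizer(list(texts), truncation=True, padding=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(pages, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        out = model(**batch)  # loss is computed internally from the "labels" field
        out.loss.backward()
        optimizer.step()
```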