DOI: 10.1145/3544548.3581431
Interface Design for Crowdsourcing Hierarchical Multi-Label Text Annotations

Published: 19 April 2023

Abstract

Human data labeling is an important and expensive task at the heart of supervised learning systems. Hierarchies help humans understand and organize concepts. We ask whether and how concept hierarchies can inform the design of annotation interfaces to improve labeling quality and efficiency. We study this question through annotation of vaccine misinformation, where the labeling task is difficult and highly subjective. We investigate 6 user interface designs for crowdsourcing hierarchical labels by collecting over 18,000 individual annotations. Under a fixed budget, integrating hierarchies into the design improves crowdsource workers’ F1 scores. We attribute this to (1) Grouping similar concepts, improving F1 scores by +0.16 over random groupings, (2) Strong relative performance on high-difficulty examples (relative F1 score difference of +0.40), and (3) Filtering out obvious negatives, increasing precision by +0.07. Ultimately, labeling schemes integrating the hierarchy outperform those that do not — achieving mean F1 of 0.70.

1 Introduction

To both build and evaluate machine learning systems, researchers often rely on human-labeled datasets [13, 41, 54]. Gathering this labeled data efficiently and at high quality is a well-studied problem when labels are binary [25, 34, 55] or a flat list of choices [14, 33, 47], but labels can often be grouped into other structures as well, such as species in a taxonomy [50].
Figure 1:
Figure 1: Pass-logic options when annotating a single passage. Each orange box represents a single question asked to workers (denoted A, B,...) and “groups” refers to partitioning the labels into smaller sets of labels. Workers perform best on mean F1 score when using hierarchical multi-pass schemes.
Concept hierarchies (or taxonomies; ontologies) are used in many applications ([10, 17, 18]) to describe concepts at a flexible granularity, and generally serve to help organize and structure both language and thought around a topic. In certain situations, they may become a target for data labeling itself, where multiple hierarchically-structured class labels can be chosen for a given instance, a setting known as hierarchical multi-label annotation [7, 56]. Given their usefulness in organizing thought, one might expect that leveraging the hierarchy during annotation may yield higher quality and efficiency. However, there are many design choices to consider: Does interface complexity increase cognitive load ([19, 40])? Will false negatives in an upper level of the hierarchy end up amplifying errors into annotations on a lower level [7]? Will presenting one part of the hierarchy while hiding the rest create a lack of context that leads to misinterpreted label definitions?
In this paper, we study how to incorporate the concept hierarchy into labeling schemes for crowdsource data annotation platforms such as Amazon Mechanical Turk (AMT). We focus on a difficult annotation task, assigning vaccine concerns ([46]) to text passages taken from anti-vaccination websites. The hierarchy of vaccine concerns includes labels such as “Health risks” and “Issues with research”. Small, purpose-built taxonomies are common in the domain of misinformation research [1, 10, 24, 42]. In the setting of difficult annotation tasks, we show that labeling schemes incorporating hierarchies can help annotators perform better against ground-truth labels.
We investigate two separate design choices when annotating a single passage: (1) how to format the hierarchical labels when shown on the interface, and (2) the pass-logic that decides how to coordinate multiple workers towards labeling that passage. We compare two formats for presenting the hierarchy to annotators (see Figure 3):
multi-label, which simply presents labels as a flat multiple choice list of options
hierarchical multi-label, which presents the entire hierarchy directly to the worker who then marks all relevant labels
For pass-logic, we look at three options (see Figure 1):
single-pass, where all labels are presented to a single worker, who annotates the passage on their own
multi-pass, which combines multiple workers’ annotations for a single passage by partitioning the labels into groups (each worker focuses on a small subset of labels at a time).
hierarchical multi-pass, in which a preliminary stage of annotation determines if child-labels need to be annotated.
We compare all valid combinations of these formats and pass-logic options under a fixed-budget setting, which provides practical insights for research and engineering teams interested in data collection of hierarchical multi-label tasks. For multi-pass logic, we consider both randomly partitioning labels into smaller subsets or utilizing the groupings given to us by the hierarchy. Our results point to a few statistically significant factors:
(1) Grouping similar concepts together: When partitioning labels using the hierarchy as opposed to a random partition, we see significantly better performance for the groupings informed by the hierarchy (F1 score of 0.50 grouped vs 0.34 random).
(2) Relative performance boost on difficult examples: Explicit access to the hierarchy increases workers' performance on more difficult questions (as much as +0.40 in F1 as compared to multi-label).
(3) Boosting true positive frequencies: By filtering out irrelevant passages from stage 2 annotation, more of the examples shown to workers are true positives, which we show is associated with better precision without a detriment to recall. The performance boost from this alone moves the F1 score from 0.50 to 0.57.
Our results lead us to believe that difficult, high-subjectivity labeling tasks warrant new recommendations separate from crowdsource design guidelines in previous work ([7, 21]). We recommend considering incorporating hierarchies into the labeling process, and show a few options for how to do so. This is especially true if optimizing for individual worker performance, while choice of labeling scheme plays less of a role if using aggregation methods across several copies of annotations.

2 Related Works

The reliance of supervised ML algorithms on labeled data has led to a great wealth of knowledge regarding efficient data labeling at large scale. Huge datasets have been constructed requiring immense human labeling time across many media. Among them are image and video datasets, generally containing thousands of classes, such as ImageNet (14M images) [13] and OpenImages (9M images) [30], as well as datasets with fewer classes, such as CelebFaces, which labels 40 facial attributes (200k images) [34]. Audio datasets, typically with hundreds of classes, include AudioSet (2M clips) [18], Free Music Archive (100k clips) [12], and OpenMIC-2018 (20k clips) [21]. Much work focuses on enabling this scale of data collection while maintaining high quality [8, 14, 29, 51] or protecting crowdsource workers [4, 23].
Often, this labeling is done on tasks with low ambiguity or subjectivity, and minimal required training – which makes them suitable for large scale collection. For example, in ImageNet [13], labels are the names of well-known objects such as “ambulance”, “folding chair” or “snail.” Even in more difficult audio-annotation tasks such as labeling noise categories in a busy city ([7]), the labels (“jackhammer”, “car horn”) have strong, objective definitions.
Given the clarity of such label definitions, previous studies on user interface design for crowdsource annotation have recommended increasing annotation throughput, or the rate at which labels are collected from the annotation platform [4, 16, 37]. Throughput can be high for some tasks (hundreds to thousands of annotations within minutes), while other tasks are much slower. Prior work found that single-pass methods have up to 9 times higher throughput when annotations are required to be fully labeled (assigning a value for every label) rather than sparse [7]. However, work in psychology has long established a tradeoff between speed and accuracy for any information processing task a human performs [53]. Other HCI work also studies this tradeoff [35, 57]. This suggests that optimizing for throughput could harm annotation quality, particularly when the task is difficult.
Cognitive load theory [48] suggests that tasks with high cognitive load (the amount of mental effort required) induce errors and mistakes at a higher rate than tasks with lower cognitive load. Work on user interfaces which require some level of accuracy often tries to minimize unnecessary cognitive load [19, 40, 51]. Similarly, work in crowdsourcing recommends chunking difficult tasks into smaller units of work [28]. Some work has shown that crowdsource platforms have great potential for rapidly collecting measurements in user studies [27]. Other work examines how long annotators remain on tasks, and characterizes differences between those that annotate few examples versus those that annotate many [15].
Recent efforts have also moved towards datasets for high-impact social issues such as: misinformation [10, 46], which attempts to classify common concerns regarding issues such as climate change or vaccines; fact-checking [49], which labels whether claims are verified by trusted sources; and claim review [2, 3], which determines if claims are worth fact-checking. Such labels inherently lend themselves to be a more difficult annotation task, given the subjective label definitions and necessary processing to parse written rationales or arguments in text.
In data labeling, it is common to collect multiple copies of annotations and aggregate them using a majority vote [5, 54], which is said to reduce the impact of low-quality annotations. Some work studies how to perform aggregation more effectively [52]. Earlier aggregation methods such as EM-based approaches weight annotations using estimates of worker skill [11], while other work incorporates question difficulty through parametric [26] or non-parametric [44] approaches.
However, recent trends in NLP have begun questioning aggregation, arguing that subjective labels should not be aggregated when multiple opinions are valid. Rather, this line of work ([38, 58]) suggests predicting the distribution of human opinions rather than the majority vote. One implication is that individual annotator performance becomes more important, since labeling error can no longer be averaged away by a simple majority vote.
Labels are not always flat lists. There has been a large amount of work on hierarchical multi-label annotation [7, 21, 45, 56], where the task is to select any relevant labels from a hierarchical structure. While most work employs a small group of experts to build the concept hierarchy before it is labeled by workers, some research attempts to build these hierarchies through crowdsourcing methods [6, 9].
In considering the performance of crowdsource workers, much effort has gone into gamifying the labeling task [22, 31, 36, 39, 43], but we note that this requires significant overhead to build the games, which may not be feasible when data collection is time-sensitive.

3 Approach

3.1 Data collection

We study interface designs when labeling against a taxonomy of vaccine concerns developed to promote high agreement among crowdsource workers [46]. The taxonomy is a hierarchy of labels with 5 top-level concerns such as Untrustworthy actors and Health risks, and 19 child labels such as Untrustworthy actors → Profit motives. See Appendix A for the full version of the taxonomy. The annotation task consists of annotating passages from known anti-vaccination blogs and websites, pre-filtered to ensure the articles are on the topic of vaccination, against multiple labels from both levels in the taxonomy. Articles are converted into paragraphs using existing markers in the HTML code to closely resemble the paragraphs rendered to readers. An example is shown in Figure 2.
Figure 2:
Figure 2: Example passage from an anti-vaccination blog. Here, the correct labels from the taxonomy include Untrustworthy actors → Profit motives since the mention of “selling” implies money is a corrupting motive, as well as Lack of benefits → Insufficient risk since “peddling doom” implies that the dangers of the disease are being exaggerated.
These blog articles often contain only vague mentions of these recurring themes of concern, and paragraphs are given to annotators without context about who wrote them or what paragraph came before. There is therefore substantial ambiguity in the input text which annotators must resolve. The authors initially disagreed on 45% of passages, an indication of the level of subjectivity in the task. This is not surprising, given that the labeling task revolves around a concept ripe with subjectivity: concerns. Passages may simply raise different concerns for different readers. Unlike the annotation of objects in images, for example, there are very few passages where the correct labels are immediately obvious. That being said, such passages do occur, particularly when there is high overlap between the vocabulary used to define labels and the vocabulary in the passage.

3.2 Annotator training

To maximize the chance of high-quality annotations, we look into a few methods to train annotators and ensure quality. These methods are implemented through the exact same process for all labeling schemes to ensure fairness. We collect all our annotations on Amazon Mechanical Turk (AMT).

3.2.1 Definitions.

We provide written definitions for all labels and set up a micro-task as the very first step so that workers interact with the definitions directly. The first screen annotators see is a list of all the labels they are expected to select from, with a written definition under each label. The task we ask workers to complete is to mark any definition they feel is unclear. This encourages fully reading and internalizing the definitions, and also collects data for further improving the training process. (See Appendix E for a screenshot of this step.)

3.2.2 Tutorial.

Next, workers walk through 10 examples, where they annotate passages just like they would in real annotation. However, for these 10 examples, they are given corrections after each submission. The corrections show which labels they got wrong and which they got correct. For incorrectly marked labels, there is a written explanation for why the label should have been applied (or not). Tutorial explanations are written ahead of time, and appropriate tutorial examples are given according to which labels are presented to the worker. We ensure that there is always a consistent ratio of different types of examples in each tutorial. For example, there are always two examples where none of the labels should be selected, one where the passage is clearly anti-vaccination but no specific argument is made (e.g., “vaccines are bad”) and one where the passage is off-topic.

3.2.3 Entrance exam.

After finishing the tutorial but before being allowed to annotate real data, we have workers complete a three-question entrance exam. To workers, this looks like regular annotation. Two of these passages are clearly off-topic, and a third passage clearly mentions one of the concerns being labeled. If workers fail any one of these three questions they are banned from labeling.

3.2.4 Quality checks.

While the annotations are being collected, we randomly include attention checks (with 5% probability) such as “Help us catch cheaters. Choose the first option and hit submit to show you are paying attention.” If workers fail such attention checks, we throw out all the annotations they gave us since the last passed attention check, and ban them from further annotations.
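To make the quality-check logic concrete, below is a minimal sketch of how such a policy could be implemented. The 5% probability comes from the text above, but the data structures and helper names are ours, not the authors' platform code.

```python
import random

# Illustrative sketch of the attention-check policy (hypothetical data
# structures; not the authors' platform code).
CHECK_PROBABILITY = 0.05  # each passage slot has a 5% chance of a check

def build_batch(passages, rng=None):
    """Interleave attention checks into a batch of passages."""
    rng = rng or random.Random(0)
    batch = []
    for passage in passages:
        if rng.random() < CHECK_PROBABILITY:
            batch.append({"type": "attention_check"})
        batch.append({"type": "passage", "text": passage})
    return batch

def handle_failed_check(annotations, last_passed_index):
    """Keep only annotations submitted before the last passed check;
    the worker is banned from further annotation (not shown here)."""
    return annotations[:last_passed_index]
```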

3.3 Ground-truth labels

To evaluate the different labeling schemes, we collect “expert” annotations from three authors of the paper. The sample size for evaluation spans 4,800 passage-label pairs (200 passages taken from 200 articles). First, the three authors annotate the passages separately, followed by a discussion phase in which they try to come to an agreement about diverging labels. We refer to these labels as the ground truth, and separate them into four categories: (1) labels which were agreed on immediately during individual, non-communicative annotation; (2) labels which were agreed on after re-annotating them individually without communicating, but asking for a written rationale for the given label; (3) labels which were agreed on after collaborative discussion; and (4) labels which never reached unanimous agreement, but rather a majority vote was taken. These categories can be seen as a proxy for difficulty, requiring increasing amounts of nuanced examination of the target passage. Further details on the construction of these sets can be found in Appendix M, and an analysis of the effect of difficulty in §4.3.2.
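As a concrete illustration of how these four categories can be derived, the sketch below classifies a passage-label pair by the first stage at which the three authors' labels agree. This is our reading of the procedure, not the authors' code; the exact bookkeeping is described in Appendix M.

```python
# Classify a passage-label pair into a difficulty proxy category based on
# the stage at which the three authors' labels first agree (illustrative).

def difficulty_category(stage1, stage2, stage3):
    """Each argument is the tuple of three authors' boolean labels for one
    passage-label pair at that stage of the process."""
    if len(set(stage1)) == 1:
        return "immediate agreement"                      # category 1
    if len(set(stage2)) == 1:
        return "agreement after rationale"                # category 2
    if len(set(stage3)) == 1:
        return "agreement after discussion"               # category 3
    return "no unanimous agreement (majority vote)"       # category 4

print(difficulty_category((True, True, False), (True, True, True), (True, True, True)))
# -> "agreement after rationale"
```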

3.4 Labeling schemes

In this section, we discuss the definitions of each interface design through the two formats we consider (multi-label and hierarchical multi-label) but also a third option which we do not include in experiments due to prohibitive costs (binary-label). We then explore the three pass-logic options (single-pass, multi-pass, and hierarchical multi-pass) and show our design approach for combining these options.

3.4.1 Formats.

Labels can be presented on an interface in many different formats (see Figure 3). Here, format refers to how the set of labels is organized in the user interface.
Binary-label format shows the label to annotators using a single yes/no question. A worker will focus on a single label across their time annotating, minimizing cognitive load.
Multi-label, which simply presents labels as a flat multiple choice list of options. The workers can select any/all/none of the labels. Depending on the pass-logic used, this list may be longer or shorter, but will generally only contain labels from the same level of the hierarchy.
Hierarchical multi-label, which presents a hierarchy directly such that choices in the top-level of the hierarchy prompt further choices in the next level. This option can be accomplished in two ways. In one version (v1), the hierarchy is given as checkboxes with child-level checkboxes that become enabled only if the parent category is selected. In the other (v2), the hierarchy is asked in a two-stage question. First there is a binary choice regarding the parent category. If the answer is yes, then a flat list of checkboxes for the child labels is presented to the worker.
Figure 3:
Figure 3: User interface designs for presenting labels to the user. In single-pass schemes, using majority vote and the hierarchical multi-label v1 performs the best with a mean F1 score of 0.70.

3.4.2 Pass logic.

Pass logic determines how many workers are brought together to work on the annotation of a single passage, and how to coordinate their efforts. Some decisions regarding pass logic will inform the look of the interface shown to users, while others will only affect which passages are shown to any given worker. We examine three options for pass logic (see Figure 1):
Single-pass, where one worker is asked to annotate the passage entirely on their own, and therefore must be presented all the labels at once. This option can be combined with both the multi-label and hierarchical multi-label formatting options. However, it is incompatible with binary-label since you cannot present multiple binary questions at once (that would be multi-label).
Multi-pass, which combines the annotations of multiple workers for a single passage by partitioning the labels into groups (letting each worker focus on a small subset of labels at a time). This option is compatible with all format versions. To accomplish this with the hierarchical multi-label format, we simply partition the hierarchy into sub-trees using the top-level labels.
Hierarchical multi-pass, in which there are different stages of annotation which determine whether child labels in the hierarchy need to be annotated. First, some worker is asked to annotate the passage with level-1 annotations. Based on their annotations, we create new tasks for any label the worker marked as positive. These new tasks are released in a second stage to annotate the child labels in level 2, and need not be labeled by the same worker as the level-1 labels. This option can be employed both in a binary-label setup, as well as in multi-label, but is incompatible with hierarchical multi-label formatting since it forces annotation to occur on distinct levels at a time.
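To make the hierarchical multi-pass logic concrete, the following sketch shows how stage-2 tasks could be generated from a stage-1 worker's level-1 annotations. The hierarchy and data structures are illustrative placeholders, not the authors' implementation.

```python
# Sketch of hierarchical multi-pass task generation (illustrative). A stage-1
# worker marks each level-1 label true/false; stage-2 tasks are created only
# for the children of labels marked positive.

# Hypothetical hierarchy: level-1 labels map to their level-2 children.
HIERARCHY = {
    "1": ["1.1", "1.2", "1.3"],
    "2": ["2.1", "2.2"],
    "3": ["3.1", "3.2", "3.3", "3.4"],
}

def stage2_tasks(passage_id, stage1_annotations):
    """Return one stage-2 task per positively annotated level-1 label."""
    tasks = []
    for parent, is_positive in stage1_annotations.items():
        if is_positive:
            tasks.append({
                "passage_id": passage_id,
                "labels": HIERARCHY[parent],  # child labels to annotate
            })
    return tasks

# Example: a worker marks labels 1 and 3 as present, label 2 as absent.
print(stage2_tasks("p-042", {"1": True, "2": False, "3": True}))
# -> two stage-2 tasks, covering 1.1-1.3 and 3.1-3.4; label 2's children are skipped
```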

3.4.3 Combinations.

When combining the format options with pass-logic options, we get the following possible labeling schemes:
Single-pass multi-label (single-pass multi) — A single worker annotates all level-2 labels at the same time, given in a flat list.
Single-pass hierarchical multi-label (single-pass hrchl) — A single worker annotates all labels (level-1 and level-2) at the same time. They are shown the hierarchy in its entirety using hierarchical multi-label v1 formatting (see Figure 3).
Multi-pass binary-label (multi-pass binary) — The labels are given one by one to multiple workers, who annotate the passage in parallel for that single label. Annotations from all workers are then combined.
Multi-pass multi-label (multi-pass multi) — Level-2 labels are partitioned into smaller groups, and these label-groups are then given to multiple workers who annotate the passage in parallel. Annotations from all workers are then combined.
Multi-pass hierarchical multi-label (multi-pass hrchl) — The labels are partitioned according to level-1 labels. One worker will be given label 1 and its children 1.1, 1.2,..., while another worker will be given label 2 and its children 2.1, 2.2,... This pattern continues for all level-1 labels in the hierarchy. They are shown their section of the hierarchy using hierarchical multi-label v2 formatting (see Figure 3).
Hierarchical multi-pass binary-label (hrchl-pass binary) — Level-1 labels are first given one by one to multiple workers, who annotate the passage in parallel for their single label. A second stage then looks at these annotations and determines which child labels need to be labeled (if a worker indicates a positive label for label 2, then we must annotate 2.1, 2.2,..., else we can skip them). The second stage then gives the child-labels one by one to multiple workers in parallel, just like the first stage.
Hierarchical multi-pass multi-label (hrchl-pass multi) — Level-1 labels are first given as a single group to one worker. A second stage then looks at this worker’s annotations and determines which child labels need to be labeled (according to the same logic as in the binary case). The second stage then gives the child-labels in groupings according to their parent category (so a single worker will be given 2.1, 2.2,... at once), just like in the first stage.
When partitioning the level-2 labels for multi-pass multi, we examine two possible choices: partitioning them using the groupings that already exist in the hierarchy, or partitioning them randomly into 5 groups (such that the number of groups is consistent with the other choice). We refer to these as multi-pass grouped multi and multi-pass random multi, respectively.
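The two partitioning strategies can be summarized with the following sketch; the label identifiers and helper functions are illustrative, not the authors' code. The taxonomy's five top-level categories yield five label groups in both cases.

```python
import random

# Illustrative partitioning of level-2 labels for the multi-pass multi-label
# schemes. Label identifiers are placeholders for the 19 child labels.
HIERARCHY = {
    "1": ["1.1", "1.2", "1.3", "1.4"],
    "2": ["2.1", "2.2", "2.3"],
    "3": ["3.1", "3.2", "3.3", "3.4"],
    "4": ["4.1", "4.2", "4.3", "4.4"],
    "5": ["5.1", "5.2", "5.3", "5.4"],
}

def grouped_partition(hierarchy):
    """Multi-pass grouped multi: one group per parent category."""
    return [children[:] for children in hierarchy.values()]

def random_partition(hierarchy, n_groups=5, seed=0):
    """Multi-pass random multi: the same labels shuffled into n_groups groups."""
    labels = [c for children in hierarchy.values() for c in children]
    random.Random(seed).shuffle(labels)
    return [labels[i::n_groups] for i in range(n_groups)]

print(grouped_partition(HIERARCHY))
print(random_partition(HIERARCHY))
```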

3.5 Controls

Beyond requiring annotator training, we explore several additional controls. This section outlines the controls we imposed, and which factors we examine through post-hoc analysis. §3.5.1 describes how we control cost, which is key to our experimental design.
Given the task difficulty, we limit access to the task to workers who (1) reside in the United States, (2) have completed at least 2,000 HITs, (3) have a HIT approval rate above 99%, and (4) hold a “Masters” qualification, indicating they produce high-quality annotations. These controls are facilitated by standard AMT tools, while most of the remaining controls are implemented through our custom annotation platform.
We use a between-subjects design, meaning that we do not allow any worker to submit annotations for more than one labeling scheme. This avoids workers who are trained twice on the task. Further, workers are not aware that there are multiple conditions. When publishing jobs on AMT, we start with HITs that send workers to the first labeling scheme. Once we have collected enough annotations for this scheme, the current workers are blocked from beginning any new HITs. The next labeling scheme is then linked from the posted HITs, and new workers (who did not interact at all with the first labeling scheme) may begin annotation. This ensures workers are not aware of multiple schemes, even if they have previously seen the HIT advertised in their list of tasks. The description of the task is the same, except for the reward, which fluctuates slightly to maintain a consistent budget (more details in §3.5.1), and we never inform workers that there are multiple schemes. Workers are allowed to submit annotations for at most 200 unique passages.
We do not control exactly when these HITs are submitted to the AMT marketplace. We simply launch the next HITs shortly (1-2 hours) after gathering enough annotations for the previous labeling scheme. When the previous labeling scheme finishes collection late in the evening or during the night, we wait until the morning to launch the next scheme. One could argue that the population of workers that click on tasks might vary meaningfully across the day. However, most annotations were collected during daytime in the United States, and we only allow workers from the United States. We also include this factor in our multiple regression analysis in §4.2, showing it does not significantly contribute to mean F1 score. Since we allow workers to complete any number of annotations they want (up to 200 unique passages), we cannot control how many workers are assigned to each condition. Instead, we allow annotation by new workers until we have 3 copies of each of the 4,800 passage-label pairs. See Table 1 for more details.
Table 1:
Table 1: Number of workers who began each stage of the data collection pipeline, broken down by labeling scheme. Since multi-pass schemes required more annotations, workers had more time to reserve HITs and submit annotations before collection was complete. This has some effect on our confidence intervals for single-pass schemes.

3.5.1 Cost.

To control for cost, we approach the task from the perspective of a research team that wants to collect data to train an ML model.
Note that annotating text passages is very different from annotating images. Whereas images can be cognitively processed near-instantaneously by a crowdsource worker, reading passages of text is more similar to annotating video or audio clips. Performing a full read-through of the passage takes time, forcing a delay before labels can be selected and thereby adding a significant temporal dimension to the annotation task. Therefore, we cannot compare the annotation of 10 passages with a single binary label each to the annotation of 1 passage with 10 labels: in the first case, the annotator spends roughly 10 times as long reading. This influences how to fairly pay workers. Instead, we set a minimum reward threshold per passage read-through and treat the cost of data collection as variable. For a toy example of why cost varies across labeling schemes, see Appendix L.
However, comparing labeling schemes without holding the total budget constant would not provide much value for ML researchers deciding which scheme to use. Research teams are heavily constrained by budgets, so how do they get the highest quality annotations for their money? Answering this question is the focus of our experiments (§4). In particular, we set the reward per passage read-through for each labeling scheme to fully utilize the budget, while ensuring it stays above a minimum of $0.10, which keeps worker pay above the United States minimum wage. This means that in some labeling schemes workers receive a larger reward per passage than in others, although these workers also have to consider more labels at the same time.
On AMT, workers are paid per HIT (one “unit” of work assigned to a worker) they complete. For each HIT, workers in our experiment will complete a small batch of passages (10-24 passages). Batching passages like this ensures the reward per HIT is not too small to attract workers. This also allows us to change the ratio of reward-to-number-of-passages, thereby controlling for the listed reward payment per HIT that workers see on the platform. We observe that changing this ratio (without changing the actual payment per work completed) causes noticeable differences in annotation throughput. This indicates a potentially large inefficiency in the AMT marketplace. For all labeling schemes, we launch the tasks with a ratio of reward-to-number-of-passages such that the reward is just above a dollar (as close to $1.01 as we can get while keeping the budget fixed). For all schemes we still keep the total budget spent constant for collecting the 14,400 labels needed (3 copies of 4,800 passage-label pairs).
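The following toy calculation illustrates why, under a fixed total budget, the reward per passage read-through must differ across schemes. All numbers are hypothetical and do not reflect the paper's actual budget or rewards.

```python
# Toy arithmetic for fixed-budget reward setting (hypothetical numbers only).
# Under a fixed total budget, schemes that need more passage read-throughs
# necessarily pay less per read-through.

TOTAL_BUDGET = 1000.00   # hypothetical budget in USD
N_PASSAGES = 200         # passages to annotate
COPIES = 3               # copies of annotations per passage-label pair

# Read-throughs per passage per copy: a single-pass scheme needs one,
# a 5-way multi-pass scheme needs five (one per label group).
READ_THROUGHS = {"single-pass": 1, "multi-pass": 5}

for scheme, per_passage in READ_THROUGHS.items():
    total_reads = N_PASSAGES * COPIES * per_passage
    reward = TOTAL_BUDGET / total_reads
    print(f"{scheme}: {total_reads} read-throughs, ${reward:.3f} per read-through")
```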

3.5.2 Payment broken down by each labeling scheme.

From the process described in §3.5.1, we end up with the following rates of pay for each condition: the listed reward for hrchl-pass multi was $1.01 for 10 passages, all multi-pass options were $1.03 for 24 passages, and single-pass options were $2.16 for 10 passages (after first trying $1.08 for 5 passages and finding throughput too slow). The reward we give workers amounts to approximately $7-10 per hour (USD), as self-reported through TurkOpticon [23]. We do not have access to more granular hourly-rate estimates due to limitations in distinguishing when workers are inactive (taking a break) from when they are taking longer than usual to read a question. However, we include an analysis of the distributions of time spent labeling each passage across the various schemes in Appendix D.

4 Analysis

We collected three copies of annotations for each passage, for each of the 6 labeling schemes (§3.4.3) through AMT. In this section we compare the labeling schemes against each other on performance and examine the reasons for why performance varied across labeling schemes.

4.1 Performance Comparison

We evaluate the performance of workers against the ground-truth labels (§3.3). Majority labels are often computed to mitigate labeling error [52], but recent work has also shown the utility of high-quality individual annotations in order to estimate the distributions of human opinion [58]. The latter is particularly relevant in our setting where workers are labeling often subjective concerns: being able to measure the degrees of concern across individuals is relevant towards reducing vaccine hesitancy. We compute the precision, recall and F1 score for each label of the vaccine concerns taxonomy, and report an unweighted mean across the labels.
We employ a macro-level average of F1, which is computed by first finding the F1 score on every taxonomy label, and then averaging across all these labels. Note that in any analysis where we give an individual F1 score for each worker, the macro-averaging process happens in parallel for each worker. That is, the worker would be evaluated separately for each taxonomy label, and then an average performance is computed for that worker. However, for most of our analyses, we look at a single F1 score across all workers. In this case, we first pool all the annotations and treat them as if a single worker had submitted them. We then follow the macro-averaging process across the taxonomy labels. For further details on the metrics we use, see Appendix G.
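A minimal sketch of this pooled, macro-averaged F1, under our reading of the metric (not the authors' evaluation code), is shown below. Note that the paper keeps undefined (NaN) per-label scores, whereas this simplified version falls back to 0.0.

```python
from collections import defaultdict

# Sketch of pooled macro-F1: pool all annotations as if from one worker,
# compute F1 per taxonomy label, then average with equal weight per label.

def pooled_macro_f1(annotations, gold):
    """annotations: list of (passage_id, label, predicted_bool) triples;
    gold: dict mapping (passage_id, label) -> true_bool."""
    per_label = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for passage_id, label, pred in annotations:
        truth = gold[(passage_id, label)]
        counts = per_label[label]
        if pred and truth:
            counts["tp"] += 1
        elif pred and not truth:
            counts["fp"] += 1
        elif not pred and truth:
            counts["fn"] += 1
    f1s = []
    for counts in per_label.values():
        tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0   # paper: NaN case
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```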
We generate a random baseline. For each passage, we draw 3 samples for each label from a binomial distribution with the probability p determined by the gold-labeled data. We enforce the same consistency corrections described in Appendix F. Note that since F1 is computed using a macro average, and since undefined (NaN) values arise when no positive labels are generated, the mean F1 will not necessarily lie between the mean precision and mean recall.
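A sketch of how such a baseline could be generated is given below. It assumes gold labels stored as a dictionary over passage-label pairs and uses the empirical positive rate of each label as the Bernoulli parameter; this illustrates the described procedure and is not the authors' code.

```python
import numpy as np

# Sketch of the random baseline: for each passage-label pair, draw 3
# simulated annotations with probability p equal to that label's empirical
# positive rate in the gold data.

def random_baseline(gold, n_copies=3, seed=0):
    """gold: dict mapping (passage_id, label) -> bool. Returns simulated
    annotations as (passage_id, label, predicted_bool) triples."""
    rng = np.random.default_rng(seed)
    labels = {label for _, label in gold}
    p = {
        label: np.mean([v for (_, lab), v in gold.items() if lab == label])
        for label in labels
    }
    samples = []
    for (passage_id, label) in gold:
        for _ in range(n_copies):
            samples.append((passage_id, label, bool(rng.random() < p[label])))
    return samples
```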
Table 2:
Table 2: Mean performance of crowdsource workers against ground truth labels. Hrchl-pass multi-label performs best on mean F1. We include full breakdowns of these F1 scores (and other performance metrics) by label in Appendix N, as well as confidence intervals for all metrics in Appendix K.
Workers annotating with single-pass hrchl had the highest precision of 0.51, while multi-pass grouped multi had the highest recall at 0.71. Hrchl-pass multi balanced these the best, with an F1 score of 0.56. Generally, the data indicates that single-pass options lead to higher precision, while multi-pass and hierarchical multi-pass options perform better on recall.
One possible explanation is that when workers focus on a smaller set of labels, they have a lower chance of forgetting about them while reading the passage. The tradeoff is that when workers see a longer list of labels, they have to be more certain the passage is speaking about a label in order to think of it and select it. It is also possible that workers “want” to select something on each passage: when the options are few they tend to over-annotate, and when the options are many they find the obvious ones more easily, producing fewer false positives. There is some evidence for this explanation. The mean number of selections per passage in the single-pass schemes was 0.9, while the mean number of selections in multi-pass options was 1.5, indicating that partitioning the labels into smaller categories may cause workers to annotate more positives than if all labels are given together.

4.2 Multiple Regression Analysis

We investigate the effects of various factors on worker F1 scores. In this section, we fit a multiple regression model to the F1 scores of each worker. See Table 1 for how many workers completed annotations in each labeling scheme. We consider several factors beyond the labeling scheme, including ones that were not controlled for in our experimental design (such as the time each labeling scheme was distributed on AMT) as well as factors which arise due to each worker’s “luck”: the percentage of passages they were given which were relatively easy, and how often they were shown a passage which should be labeled with some positive label.
The labeling scheme factor is a categorical variable encoding the 6 labeling schemes considered in this paper, with multi-pass random multi set as the baseline for this analysis. time started encodes when during the day a given worker began annotating passages, measured in seconds past midnight. The percentage easy/medium/hard/no agreement factors encode the percentage of labels presented to a given worker that fall into each of the 4 proxy difficulty levels. Some workers, by luck, are shown easier or harder passage-label pairs, and we hope to see what the effect of this is. Details on how we define these 4 difficulty levels are given in §4.3.2 and Appendix M. true pos freq encodes the percentage of labels shown to a given worker that should be labeled positively.
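The sketch below shows how a regression of this form could be fit with statsmodels. The synthetic data, column names, and the subset of factors included are ours and only illustrate the per-worker table layout; this is not the authors' analysis script.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative per-worker regression with a categorical labeling-scheme factor
# (baseline = multi-pass random multi) plus numeric factors. Synthetic data.
rng = np.random.default_rng(0)
schemes = ["multi-pass random multi", "multi-pass grouped multi",
           "single-pass multi", "single-pass hrchl", "hrchl-pass multi"]
n = 120
workers = pd.DataFrame({
    "f1": rng.uniform(0.2, 0.8, n),              # per-worker macro F1
    "labeling_scheme": rng.choice(schemes, n),
    "time_started": rng.uniform(0, 86400, n),    # seconds past midnight
    "percentage_easy": rng.uniform(0, 1, n),
    "true_pos_freq": rng.uniform(0, 0.2, n),
})

model = smf.ols(
    "f1 ~ C(labeling_scheme, Treatment(reference='multi-pass random multi'))"
    " + time_started + percentage_easy + true_pos_freq",
    data=workers,
).fit()
print(model.summary())
```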
F1 scores were significantly improved by three labeling schemes above the baseline: multi-pass grouped multi-label (estimate = 0.13, p-value < 0.001), single-pass hrchl-label (estimate = 0.08, p-value < 0.05), and single-pass multi-label (estimate = 0.07, p-value < 0.1).
For factors beyond the labeling scheme, we see that the time of day each worker began the task did not have a statistically significant effect on the data (p-value = 0.66), whereas both the rate of true positives that workers encounter during annotation (estimate = 0.54, p-value < 0.001) and the percentage of easy passages they encounter (estimate = 0.39, p-value < 0.05) do have a statistical significance. This is especially of interest to us since these factors can be indirectly manipulated through the labeling scheme. We analyse these factors in further detail in §4.3.2 and §4.3.3.
Table 3:
Table 3: Multiple regression analysis for factors influencing worker F1 score on a per-label basis. easy / medium / hard / no agreement refer to the 4 categories described in §3.3. Note that a coefficient of 0.07 can indicate the difference between (for example) 0.50 and 0.57 F1 scores, since labeling scheme factors are coded as binary against the baseline scheme.

4.3 Contributing factors toward performance differences

In this section we perform deeper analysis on potential reasons for performance differences across labeling schemes.

4.3.1 Grouping labels.

Overall, integrating the hierarchy into the labeling scheme seems to help with performance. One direct comparison we can make is between the two versions of multi-pass multi-label schemes. In one, multi-pass random multi, the level-2 labels are partitioned randomly and given to separate workers. In multi-pass grouped multi, we use the groupings that already exist due to the hierarchy. Comparing performance between these schemes helps us examine whether presenting conceptually similar categories together can boost performance.
In every measurement (accuracy, precision, recall, F1) and in every vote setting (sensitive, majority, unanimous), the grouped scheme outperforms the random partition. On individual workers' mean performance, multi-pass grouped multi scores 0.50 with a 95% confidence interval of [0.45, 0.55], while multi-pass random multi scores only 0.34 ([0.29, 0.43]). It therefore seems important, when partitioning the labels, to group related labels together. It is unclear exactly why, but one possibility is that having the context of similar labels increases workers' understanding of the nuances between different cases. If they are shown a passage containing criticism of research, it may be useful to be labeling both “Issues with vaccine research → poor quality” alongside “Issues with vaccine research → lacking quantity” rather than just one (without knowledge of the other).

4.3.2 Examining difficulty.

Even though our task generally contains more ambiguity and is higher in cognitive load than other crowdsourced annotation tasks, there are of course easy cases to label. For example, the passage below (Figure 4) should very clearly be labeled with “Health risks”.
Figure 4:
Figure 4: An easy-to-label example passage from an anti-vaccination blog.
We utilize the ground-truth label categories discussed in §3.3 and examine the difference in performance as difficulty varies. Importantly, we do not simply assign a difficulty measure to each passage, but rather to each passage-label pair. That means we can mark that it is easy to annotate “Health risks” for the passage in Figure 4, while also marking that it is difficult to annotate the label “High risk individuals” if that was a label the authors did not immediately agree on. This analysis is done post-hoc. Passages are given to workers in random order, so workers will in expectation see the same proportion of difficult passages.
Figure 5:
Figure 5: Worker mean F1 score versus increasingly difficult passage-label pairs. Note that some labeling schemes perform better on more difficult passages, largely driven by their precision. The fourth category is excluded from this analysis due to lack of data, and because of semantic issues with how to compare performance on passages where both binary labels are technically valid.
Appendix M shows similar plots for accuracy, recall, and precision.
Focusing on two comparable labeling schemes, the two single-pass versions, we see that performance on the labels diverges as difficulty increases. Performance on the easiest category (immediate agreement among authors) is almost identical (F1 score of 0.581 for single-pass multi and 0.582 for single-pass hrchl), while the difference is already +0.40 in favor of single-pass hrchl by the most difficult category where there is still author consensus. This generally supports the explanation that explicitly providing the hierarchy helps workers reason about difficult labels. It is unclear, however, exactly why performance increases with difficulty for the single-pass hrchl scheme. It is possible that the tradeoff between the helpful structure and the harmful interface complexity interacts such that this labeling scheme performs worse on easier passages. Alternatively, it may be an effect of correcting workers' priors for assigning a positive label.

4.3.3 True positive frequency.

Beyond the interface design format shown to workers, and the pass-logic used to combine annotations, there may be other factors that impact their performance. Does a worker who sees lots of positive examples perform differently from a worker who rarely sees any positives?
The results of the multiple linear regression indicate that there is a significant increase in F1 score due to higher true positive frequencies shown to workers. Knowing this, we may want to design annotation platforms which “filter out” negative examples, so that more workers have higher true positive frequencies during annotation. See Appendix I for a plot of the relationship between true positive frequency and F1 score among all the workers. Intuitively, one reason higher true positive frequencies may cause better performance is that workers expect to assign positive labels to some proportion of passages; when true positives are rare, this expectation causes them to over-assign positives.
We examine the performance differences between labels collected in multi-pass grouped multi and hrchl-pass multi. If we ignore the level-1 annotations collected in hrchl-pass multi, then the interface shown to workers in these two cases is identical. The only difference is which passages actually get shown. For multi-pass grouped multi, we show all available passages to the workers; there is no pre-filtering on relevant passages. For hrchl-pass multi, we only show passages that already have a positive annotation of the parent label, meaning there is a high chance of more labels being relevant. In fact, the frequency of true positives shown to workers jumps from 3% to 13% on average (a more than 4-fold increase) just from this pre-filtering. Below, we compare label performance on the passages that were annotated in both schemes.
Figure 6:
Figure 6: Worker F1 score as the frequency of seen true positives (TPs) varies. Labeling schemes determine which paragraphs get shown: all available (blue) or pre-filtered by the parent label (orange). Arrows link the same labels from one scheme to the other.
Note that while we see a statistically significant positive correlation overall (Table 3), when we aggregate across all workers and examine changes on a per-label basis the trend is more nuanced. Overall, for passages directly annotated by workers in both schemes, hrchl-pass multi achieves a mean F1 score of 0.57 on level-2 labels whereas multi-pass grouped multi scores only 0.50. This is driven mainly by an improvement in precision (+6.7%) rather than recall, which remains largely unaffected (+0.01). It therefore seems that better balancing the class priors seen by workers can help their performance on the task, which may warrant recommending a pre-filtering step to remove obvious true negatives. Ultimately, the hierarchical multi-pass scheme acts as a form of pre-filtering, and seems to have a positive influence on worker performance.

4.4 Voting schemes

If one’s primary goal is not to measure the distribution of judgments about a label, but rather to get a single binary answer for each passage, then employing a vote may still be beneficial. That is not the primary motivation of this work, but in order to give some guidance on the implications of our results for aggregation methods, we examine simple, threshold-based voting schemes.
We look at three possible vote settings to aggregate the three copies of annotations collected on each passage. In sensitive vote, only 1 positive vote (of 3) is required to mark a label as positive. In majority, 2 of 3 are required, and in unanimous, all 3 must be positive to mark it positive.
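A minimal sketch of these threshold-based settings, under our reading of the setup (not the authors' code):

```python
# Threshold-based aggregation of 3 worker votes per passage-label pair.
THRESHOLDS = {"sensitive": 1, "majority": 2, "unanimous": 3}

def aggregate(votes, setting="majority"):
    """votes: list of 3 booleans for one passage-label pair."""
    return sum(votes) >= THRESHOLDS[setting]

# Example: two of three workers marked the label positive.
votes = [True, True, False]
print({name: aggregate(votes, name) for name in THRESHOLDS})
# -> sensitive: True, majority: True, unanimous: False
```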
Figure 7:
Figure 7: F1 score of each vote setting. Consecutive vote settings become increasingly conservative on positive labels, requiring 1, 2, or all 3 votes (respectively) to mark a label positive. All labeling schemes maximize F1 score using majority vote, indicating it is the safest choice of vote setting.
Majority vote balances precision and recall the best across all labeling schemes (see Figure 7). Note that, as mentioned in §2, strong individual annotator performance may still be necessary for tasks where aggregation is not possible. Majority vote also outperforms individual workers' performance in all labeling schemes, and the differences in performance between labeling schemes become less pronounced as you aggregate using voting. The highest F1 score was achieved by single-pass hrchl at 0.70 using majority vote. Sensitive vote scores best on recall, with multi-pass grouped multi achieving 0.92, and unanimous vote settings perform best on precision: single-pass multi scores 0.93. Increasing the vote threshold results in more conservative labeling, which is why we see an increase in precision.

5 Discussion

A large body of work has studied crowdsource annotation: how to make it scalable, how to keep quality high, and how to increase throughput. There is also a deep wealth of knowledge and best practices regarding user interface design. Much of this work, however, has been done for labeling tasks that have low subjectivity, clearly defined label definitions, and low cognitive load.
We studied six candidate labeling schemes in annotation of a taxonomy of vaccine concerns. Our motivation was to study whether the hierarchy itself could aid workers as they perform the annotation task, and how to design the annotation task for high-quality labels under a fixed budget. We found that integrating the hierarchy into the labeling scheme helps with improving annotation quality, whether explicitly in the interface or through logical passes made behind the scenes. Our analysis showed that a hierarchical multi-pass multi-label scheme performs best when considering individual worker performance. We believe individual worker performance to carry more importance when the tasks are inherently subjective, since a growing body of work is interested in predicting label distributions. Much like in [7] and [45], we find that workers assign more labels per passage on average when they are in multi-pass schemes versus single-pass ones.
However, if the priority is to collect high confidence labels rather than distributions of human opinion, we found that employing single-pass hierarchical multi-label along with a majority vote achieves the highest performance. Unlike [7] and [21], we don’t see a drop-off in performance when using single-pass labeling schemes. Although we don’t conduct a qualitative study, we did not receive notably different numbers of complaints from workers in any of the labeling schemes. Largely, complaints were not about the interface designs at all, but rather concerned being allowed to annotate more data after finishing a batch, or being banned for failing attention checks. When using majority vote, the choice of labeling scheme matters less than it does for individual worker performance. The ease of setup with single-pass options should not be undervalued either. Such labeling schemes are already supported natively in AMT’s requester user interface, making them a strong option for smaller projects with a necessary quick turnaround.
Overall, we find that introducing the hierarchy helps almost universally across our experiments. In comparisons between partitioning the labels into groups randomly versus using the hierarchy's structure, we find that using the hierarchy dominates across all performance metrics. Exposing the hierarchy explicitly helped single-pass schemes by increasing worker performance on particularly difficult passages. We used a taxonomy specifically designed to achieve high agreement among crowdsource workers. While some previous work has indicated that integrating hierarchies into the data labeling process may harm performance ([51]), we find that it boosts it. The contexts differ: our task is more difficult, and therefore the hierarchy may aid in completing it, but the fact that we use a taxonomy specifically designed for high agreement may also indicate that a well-designed hierarchy is needed for these methods to work.

5.1 Limitations

Use of Amazon Mechanical Turk. We conduct our experiments on AMT, one of the biggest and most popular crowdsourcing platforms. While AMT is similar to many other crowdsourcing platforms, and while we use a custom annotation platform which limits annotators' interaction with AMT-specific UI, there are a few traits unique to AMT. First, the population of workers is hard to replicate on other platforms. We use several of AMT's built-in qualifications to filter out workers, and there is no clear translation for which qualifications to use on other crowdsourcing platforms. Further, AMT has different payment expectations than other crowdsourcing communities. Some crowdsourcing platforms are purely volunteer-based, while others attract short-term workers who complete only a few tasks. Ultimately, our choice to work on AMT is motivated by the size and popularity of the platform, making our results relevant to a large set of researchers.
Omission of binary-label formats. Due to cost constraints, we could not experiment with labeling schemes which involved presenting binary choices to workers. While this is representative of real-world scenarios for tasks similar to ours, it also leaves open the question of whether annotation quality would be higher with these methods. However, given the vast cost differences, we believe this choice is reasonable and closely represents decisions made by the researchers for whom this analysis is intended.
Generalizing to new hierarchies. While we have no clear reason to expect our results are specific to the vaccine concerns hierarchy we used, we do not show that these results generalize beyond it. For instance, as the size of the hierarchy grows, one might expect the single-pass options to become cognitively overbearing, and therefore multi-pass methods might begin to improve in relative performance. However, restricting our study to a single hierarchy was a necessary choice for offering useful analysis.
Budget implications. We set a single budget and examine how to best optimize annotation performance against ground-truth labels on AMT. However, it may be that the best labeling scheme for our budget shows a less significant improvement when the budget is much higher and the reward given to workers is increased, or there may be a different labeling scheme which performs better under a higher budget. One could imagine repeating our experiments at several budgets and examining the relationship between a particular labeling scheme's data quality and the relative expense of data collection. Some schemes may be more cost-efficient, showing small differences in worker performance across budgets, while others may only become viable at higher budgets. Unfortunately, running such experiments would make this research prohibitively expensive. Our budget was set based on previous experimentation regarding the minimum budget needed to achieve relatively high-quality data while confidently exceeding the United States minimum wage. Teams that wish to collect data would likely avoid paying more than necessary for the labels they are getting. For machine learning applications, it is well known that collecting additional data may yield more utility than increasing the quality of the labeled data [20].

6 Conclusion

We investigate various labeling schemes for crowdsourcing text annotation of difficult, high-subjectivity tasks and measure impact on worker performance against ground-truth labels. We find that integrating hierarchies into the labeling scheme helps with boosting performance.
Through analysis, we explore three potential indirect causes for improvement against ground-truth labels: (1) hierarchies group similar concepts together, improving F1 scores to 0.50 from 0.34 as compared to random groupings; (2) they allow relative increases in performance on difficult passages, with gains of as much as +0.40 in F1 on high-difficulty examples; and (3) they boost the true positive frequency, thereby increasing workers' precision without detriment to recall. We recommend considering incorporating hierarchies into the labeling process if optimizing for individual worker performance, and using a majority vote setting if solely optimizing for F1 score (achieving 0.70 with single-pass hierarchical multi-label).

Acknowledgments

We thank Pardis Emami-Naeini and anonymous reviewers for feedback. This work was supported by NSF award IIS-2211526 and an award from Google.

A The Vaccine Concerns (VaxConcerns) Taxonomy

Figure 8:
Figure 8: “VaxConcerns” taxonomy used to label all passages in the experiments

B Alternative Text for Figures

Figure 1: “Three diagrams are shown describing single-pass, multi-pass, and hierarchical multi-pass routing logic. For single-pass, the full set of labels is given to one worker. For multi-pass, the labels are partitioned into groups (1, 2,...) and given to separate workers (A, B,...). For hierarchical multi-pass, the top-level labels are given to one worker, whose annotations determine whether or not the child labels will be given as a new group to annotators downstream. This example shows the case where the first worker labels TFFTF, and downstream there are two tasks set up for new workers to label the children of label 1 and label 4, respectively.”
Figure 2: “The figure shows an example passage that reads: ’The minister of fear (the CDC) was working overtime peddling doom and gloom, knowing that frightened people do not make rational decisions — nothing sells vaccines like panic.’”
Figure 3: “Four diagrams are shown side by side. In each diagram there are a set of checkboxes or radio buttons indicating how the labels will be presented to the user. Binary-label (the leftmost diagram) contains a simple question ’1.1?’ and below it a radio button reading ’yes’ or ’no’. Multi-label contains a simple list of checkboxes labeled ’1.1, 1.2,...’. Hierarchical multi-label v1 contains staggered checkboxes where the leftmost checkboxes read ’1, 2’ and the boxes immediately under these are tucked under them, reading ’1.1,...’ for the parent label ’1’, and ’2.1,...’ for the parent label ’2’. Hierarchical multi-label v2 contains both the radio button setup from the leftmost diagram, as well as the checkboxes from multi-label underneath them.”
Figure 4: “A table is shown with column headers reading ’interface design’, ’greater than or equal to 1 tutorial Q’, ’greater than or equal to full tutorial’, ’greater than or equal to took exam’, and ’greater than or equal to 1 datapoint’. The table shows values for all 6 labeling schemes.”
Figure 5: “A table is shown with values for precision, recall, and F1 score. These metrics are given for each of the 6 labeling schemes, and a random baseline is shown at the bottom. Hrchl-pass multi has bold font at the F1 score indicating it is the highest value in that column: 0.56.”
Figure 6: “A two-part table is shown with column headers ’model factor’, ’estimate’, ’95% CI’, ’SE’, and ’p-value’. The first part of the table (top) has a subheading that reads ’labeling_scheme (baseline=multi-pass random multi-label)’, and the second part (bottom) has a subheading that reads ’additional numerical factors’. The first part includes 5 of the 6 labeling schemes, while the bottom includes new factors such as ’time_started’, ’percentage_easy’, and ’true_positive_freq’. Some values in the table are denoted ’*’, which represents a p-value below 0.001.”
Figure 7: “The figure shows an example passage that reads: ’Pregnant women given vaccine have babies with more health problems’”
Figure 8: “A bar chart is shown giving the value for worker F1 score on the X axis ranging from 0 to 1, and the difficulty on the Y axis. The labels, from top to bottom, on the Y axis read ’immediate author agreement’, ’author agreement after providing rationale’, and ’author agreement after discussion’.”
Figure 9: “A scatter plot shows orange and blue dots generally following a linearly positive relationship. The orange dots are labeled ’hrchl-pass’ and the blue dots ’multi-pass’. On the X axis: ’Frequency of True Positives shown to workers’. On the Y axis: Worker F1. Each blue dot is paired with an orange dot through an arrow which is drawn between them pointing towards the orange dot.”
Figure 10: “A line plot is shown with dashed lines between three dots. The dots are lined up at tick marks labeled ’sensitive’, ’majority’, and ’unanimous’. This is repeated for all 6 labeling schemes, leaving 6 connected dotted lines all in different colors. Every one of the lines follows an inverted V shape, with their highest point over the ’majority’ tick mark.”

C Degradation in Worker Task Performance Over Time

Figure 9: Mean performance of workers as they annotate passages. Worker performance fluctuates very little as they gain experience on the platform.

D Distributions of Time Spent Labeling Each Passage

Figure 10: Distributions of the amount of time spent labeling each passage, broken down by interface design. Values are capped at 2.5 minutes to account for behavior such as stepping away from the task to take a break.

E Screenshot of Definitions Task in Annotation Platform

Figure 11: Definitions task presented to workers before they annotate any passages. Note that we use a custom annotation platform to achieve a higher level of control than offered by AMT with respect to quality checks, batching multiple passages into a single HIT, and measuring worker behaviors, among other things.

F Methods for Pre-processing Data

We ignore all data collected for tutorial paragraphs, the entrance exam, and quality checks when assessing workers. In future work we may explore how to leverage information about worker performance on the tutorial to improve data collection quality.
We remove all data collected from a worker if they failed the entrance exam or a quality check. Such workers were also banned in real time during data collection to avoid spending any further research budget on their annotations. To be clear, when setting the budgets for each labeling scheme, we do not factor in the additional cost due to such workers; rather, we re-annotate all the passages for which they had been paid.
For certain labeling schemes, consistency between level 1 and level 2 is enforced by the annotation platform. For example, in single-pass hrchl-label, workers cannot submit a positive value for a level 2 label without also submitting a positive value for its parent node in the taxonomy. This is achieved through JavaScript in the annotation platform and is visually confirmed to workers while labeling. However, for other labeling schemes such as multi-pass multi-label, the separation of L1 and L2 labels onto separate screens means the collected data is not necessarily consistent. Since any real-world use of this latter type of labeling scheme would include corrections for consistency, we correct for this during post-processing. This ensures a fair comparison between the labeling schemes, so that multi-pass schemes are not disadvantaged. To further ensure the comparison is consistent, we also examine performance on level 2 labels on its own during our analysis.
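As an illustration only, the sketch below shows one plausible post-processing rule (propagating positive child labels up to their parent); the exact correction rule and the label keys used here are assumptions, not a description of our implementation.

```python
# One plausible parent-child consistency correction (illustrative assumption only).
# Labels are keyed by strings such as "1" (level 1) and "1.1" (level 2).

def enforce_consistency(annotation):
    """Propagate any positive level-2 label up to its level-1 parent."""
    corrected = dict(annotation)
    for label, value in annotation.items():
        if "." in label and value:        # a level-2 label marked positive
            parent = label.split(".")[0]  # e.g., "1.1" -> "1"
            corrected[parent] = True      # force the parent node to be positive
    return corrected

# Example: a multi-pass annotation where the L1 and L2 screens disagree.
raw = {"1": False, "1.1": True, "1.2": False, "2": True, "2.1": False}
print(enforce_consistency(raw))
# {'1': True, '1.1': True, '1.2': False, '2': True, '2.1': False}
```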

G Metrics

We measure annotator performance using the F1 score, the harmonic mean of precision and recall. This score is commonly used in machine learning tasks and is a well-understood metric in research communities such as natural language processing (NLP). It is also more useful than accuracy in our setting: the positively labeled examples are the most informative during model training, and one can achieve very high accuracy simply by marking all passages as negative.
Since each worker's performance can be evaluated on each label in the taxonomy individually, we must aggregate when presenting scores. We use a macro-level average of F1, computed by first finding the F1 score for every taxonomy label and then averaging across these labels. When we report an F1 score for each worker, this macro-averaging is done separately for each worker. When we report a single F1 score for all workers, we pool all annotations as if a single worker had produced them and then follow the same macro-averaging process. This method is preferable in our setting to a samples-based average, which would compute F1 over a single passage and then aggregate across passages. We choose macro-averaging because we also care about worker performance on each individual label and want to be able to inspect those values; performance on each individual passage is less important to us, since the passages are meant to be a sample of the types of passages annotated in any application of our work.
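To make the aggregation concrete, the following is a minimal sketch of the macro-averaging described above; the arrays and label layout are illustrative assumptions, and the scikit-learn call is shown only as one way to compute per-label F1.

```python
# Minimal sketch of macro-averaged F1 over taxonomy labels (illustrative data).
import numpy as np
from sklearn.metrics import f1_score

# Rows are passages, columns are taxonomy labels (e.g., "1", "1.1", "2", ...).
gold = np.array([[1, 0, 1],
                 [0, 0, 1],
                 [1, 1, 0]])
pred = np.array([[1, 0, 0],
                 [0, 1, 1],
                 [1, 1, 0]])

# Per-label F1, then the unweighted mean across labels (macro average).
per_label = [f1_score(gold[:, j], pred[:, j], zero_division=0) for j in range(gold.shape[1])]
macro_f1 = float(np.mean(per_label))
# Equivalent to: f1_score(gold, pred, average="macro", zero_division=0)
print(per_label, macro_f1)
```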

H Task Uptake

We train each worker before they complete any real annotations. Workers are paid for this training to ensure fair and non-predatory ([4]) employment. This adds cost for those paying for the data labeling, since some workers go through some or even all of the training yet submit no actual annotations. Such workers never become “productive.” Below we report task uptake, the percentage of workers who became productive out of all workers who completed at least one tutorial example. We also report an inefficiency number: the percentage of extraneous annotations collected from training and attention checks (discussed in detail in §3.2) relative to the total number of “useful” annotations collected. For multi-pass schemes, workers were allowed to complete one partition (group) of the labels and then begin a new partition. This occurred seamlessly, whereas there was a multi-day delay for the hrchl-pass multi scheme, meaning there were fewer returning workers for this task, which decreased its total task uptake.6
Task inefficiency (which is directly proportional to additional incurred cost) is lowest for hrchl-pass multi. Task uptake is highest for multi-pass random multi at 80%, which contrasts with the performance trends shown in §4.1. One potential explanation is that annotators underestimate the difficulty of the task when they are presented with randomly partitioned labels.
Table 4: Data collection inefficiencies due to training examples completed by workers. Uptake shows the percentage of workers who complete at least one “real” passage, while inefficiency shows the total percentage of extraneous annotations collected. Inefficiency is lowest for the hrchl-pass scheme.
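For clarity, a minimal sketch of how these two quantities are computed follows; the counts are made-up placeholders, not values from our study.

```python
# Illustrative computation of task uptake and inefficiency (placeholder counts).
workers_with_tutorial = 100   # workers who completed at least one tutorial example
productive_workers = 80       # workers who went on to submit at least one real annotation

extraneous_annotations = 450  # tutorial + attention-check annotations (paid, but discarded)
useful_annotations = 3000     # annotations that contribute to the final dataset

uptake = productive_workers / workers_with_tutorial        # e.g., 0.80 -> 80%
inefficiency = extraneous_annotations / useful_annotations  # e.g., 0.15 -> 15%
print(f"uptake={uptake:.0%}, inefficiency={inefficiency:.0%}")
```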

I True Positive Frequency Vs F1 Scores

Figure 12: Relationship between the true positive frequency of each worker and their F1 scores.

J Worker Agreement On Labels by Labeling Scheme

Table 5: Mean Scott’s Pi label agreement between crowdsource workers in each labeling scheme
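For reference, a minimal sketch of Scott's Pi for two annotators on a single binary label is shown below; the ratings are toy data, and our reported numbers are means over labels as described in the table caption.

```python
# Minimal sketch of Scott's Pi for two annotators on one binary label (toy data).
from collections import Counter

def scotts_pi(ratings_a, ratings_b):
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement uses the joint proportion of each category across both annotators.
    joint = Counter(ratings_a) + Counter(ratings_b)
    expected = sum((count / (2 * n)) ** 2 for count in joint.values())
    return (observed - expected) / (1 - expected)

a = [1, 0, 1, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(scotts_pi(a, b))  # 0.5 for this toy example
```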

K Bootstrap Confidence Intervals

To find the confidence intervals reported in the paper, we use bootstrap confidence intervals. This is done directly on the raw data we collected, where a datapoint is a single submission by a worker. That is, for single-pass schemes a datapoint includes all labels from the taxonomy, while for multi-pass schemes a datapoint includes only a subset of the labels. We draw N = 10,000 bootstrap samples with replacement from the original data and compute the performance metric for each sample using the process outlined in the paper (including all preprocessing). We then use these 10,000 measurements of each performance metric to compute 99% confidence intervals.
We perform this sampling only on the subset of annotations that ultimately contribute to the performance metric, rather than on all annotations we receive. This ensures we do not sample (for example) tutorial annotations, which are ignored in the final calculations anyway.
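The sketch below illustrates the percentile-bootstrap procedure described above; the metric function and toy scores are illustrative assumptions, not our actual evaluation pipeline.

```python
# Minimal sketch of a percentile bootstrap 99% confidence interval (toy data).
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(datapoints, metric_fn, n_boot=10_000, alpha=0.01):
    """Resample datapoints with replacement and take percentile bounds of the metric."""
    n = len(datapoints)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # one bootstrap resample
        stats.append(metric_fn([datapoints[i] for i in idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example with a toy metric: mean of per-submission F1 scores.
toy_f1_scores = [0.6, 0.7, 0.55, 0.8, 0.65, 0.72]
print(bootstrap_ci(toy_f1_scores, np.mean))
```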
Figure 13: Confidence intervals for worker mean performance metrics.

L Toy Example of Annotation Costs

Suppose a team wants to collect approximately 10,000 fully labeled passages to provide high-quality training data. How much will each labeling scheme cost?
If the team assumes that reading dominates labeling time and wants to guarantee a strong hourly wage for workers, they can fix the reward at $0.10 per passage (this is a toy example, but loosely in line with paying above minimum wage in the United States). The cost of data collection then depends on how many times workers must read a single passage to collect a full set of labels. Chunking the labels into smaller subsets (multi-pass) increases this cost; the extreme case is asking a new worker to read the passage once for every label in the taxonomy. See Figure 14 for a breakdown of the cost to carry out this annotation.
Figure 14: Cost for labeling 10,000 passages under various labeling schemes, assuming a reward of $0.10 per passage. Orange bars show the cost of training workers (further details in §H). Note that we do not show orange bars for the binary-label setups since we were not able to run these experiments and estimate the average overhead of training workers. Despite this, binary labeling is prohibitively expensive: 8.8x the cost of a hrchl-pass multi-label scheme.
We see that binary labeling schemes are prohibitively expensive. For this reason we do not include binary labeling schemes in further comparisons.
Overall, single-pass options are the cheapest (both below $2,000), including the cost of training the workers (paying them for completing the tutorial examples). Multi-pass options have a higher range, $5,000 to $8,000. Hrchl-pass multi balances cost (under $3,000) but still uses multiple workers to complete one passage’s annotations.
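As a rough illustration of how these costs scale, the sketch below computes total cost from the number of reads (passes) required per passage; the per-scheme pass counts are illustrative assumptions, not the exact numbers behind Figure 14.

```python
# Rough cost model: each pass asks a (possibly new) worker to re-read the passage.
# Pass counts per scheme are illustrative assumptions, not the paper's exact figures.
NUM_PASSAGES = 10_000
REWARD_PER_READ = 0.10  # dollars

reads_per_passage = {
    "single-pass multi-label": 1,   # one worker reads once and gives all labels
    "hrchl-pass multi-label": 3,    # one L1 pass plus follow-up passes for positive parents
    "multi-pass multi-label": 6,    # labels split into several groups, each read separately
    "binary-label": 20,             # one read per taxonomy label (extreme case)
}

for scheme, reads in reads_per_passage.items():
    cost = NUM_PASSAGES * reads * REWARD_PER_READ
    print(f"{scheme:28s} ${cost:,.0f}")
```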

M Details for Producing A Proxy for Difficulty

Instant agreement (“easy”): the passage-label pairs for which the entire lab (3 authors plus 3 more students) gave the same value while annotating individually. During this step, the team looked up any terms we were unfamiliar with or unclear about, just as the AMT workers are instructed to do.
Agreement after writing rationales (“medium”): each annotator highlights specific parts of the passage and justifies why a label should or should not apply. We only went through this process for passages on which we had disagreed in the first step. During this process, we include any level 1 label if there is disagreement within any of its children, even if all team members agreed on the level 1 label.
For the remaining disagreements, we held a brief (1-2 minute) discussion for every passage-label pair, reading everyone’s rationale to see whether one of us could convince the others. This included both reading the rationales written in the previous step and generating new rationales during discussion. The passage-label pairs agreed upon in this stage are referred to as “hard”.
See Figure 15 for a few plots examining the worker performance metrics as each difficulty category varies.
Figure 15: Worker mean performance metrics for each labeling scheme as difficulty increases (going down). Note that recall decreases steadily for all schemes, while precision largely improves in the medium category before experiencing a large drop-off again. Accuracy decreases steadily for most schemes. Some schemes have larger drop-offs in performance than others, possibly explaining the difference in F1 score.
The remaining passages are marked using each of our individual votes and placed aside (“no agreement”). We consider these passages too subjective to assign a gold label, and we do not evaluate workers on them, since any label could be valid so long as it has a strong justification. In future work we may consider looking at the justification/rationale of an AMT worker to assess their performance on highly subjective passage-label pairs.
We note which category each passage-label pair was resolved in so that we can analyze how the “difficulty” of passages affects labeling performance. We recognize that this is not a direct measurement of how difficult a passage is to label, but we assume that a passage requiring more thought or discussion to reach consensus is more difficult.

N Performance by Taxonomy Label

See the tables in Figure 16, which detail, for each labeling scheme, performance broken down by each individual label of the taxonomy. Final scores for each metric are computed by averaging across these taxonomy labels.
Figure 16: Performance breakdown by taxonomy label for each labeling scheme. Note that NaN values appear when no positive label is given: no positive passages in the ground truth means we cannot compute recall, and no positive passages in the worker annotations means we cannot compute precision.

Footnotes

6
This multi-day delay was due to the annotation platform we used not supporting an immediate switch at the time the first labeling scheme was deployed.
1
For this paper we do not alter the training process, in order to control for this step across all labeling schemes.
2
This approach has shown useful for high-quality data annotation for images [32], but has been less successful in video and audio [7, 51] due to its high cost when there is a temporal element in the annotation.
3
Binary label schemes are not included in experiments due to their high cost.
4
We should note that one possibility is to spend the same amount of money but vary the amount of data collected. This adds complexity beyond the scope of this paper, since it would be necessary to build and evaluate ML models to measure the tradeoff in data quantity.
5
The TurkOpticon page for our requester account shows a 5/5 rating in Fairness and a 5/5 rating in Fast payments. This requester account was created solely for these experiments.

Supplementary Material

MP4 File (3544548.3581431-talk-video.mp4)
Pre-recorded Video Presentation

References

[1]
Alberto Alemanno. 2018. How to counter fake news? A taxonomy of anti-fake news approaches. European journal of risk regulation 9, 1 (2018), 1–5.
[2]
Fatma Arslan, Josue Caraballo, Damian Jimenez, and Chengkai Li. 2020. Modeling Factual Claims with Semantic Frames. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 2511–2520. https://aclanthology.org/2020.lrec-1.306
[3]
Fatma Arslan, Naeemul Hassan, Chengkai Li, and Mark Tremayne. 2020. A Benchmark Dataset of Check-Worthy Factual Claims. Proceedings of the International AAAI Conference on Web and Social Media 14 (May 2020), 821–829. https://ojs.aaai.org/index.php/ICWSM/article/view/7346
[4]
Natã M. Barbosa and Monchu Chen. 2019. Rehumanized Crowdsourcing: A Labeling Framework Addressing Bias and Ethics in Machine Learning. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems(CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300773
[5]
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. https://doi.org/10.48550/arXiv.1508.05326 arXiv:1508.05326 [cs].
[6]
Jonathan Bragg, Mausam, and Daniel Weld. 2013. Crowdsourcing Multi-Label Classification for Taxonomy Creation. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 1 (Nov. 2013), 25–33. https://ojs.aaai.org/index.php/HCOMP/article/view/13091
[7]
Mark Cartwright, Graham Dove, Ana Elisa Méndez Méndez, Juan P. Bello, and Oded Nov. 2019. Crowdsourcing Multi-label Audio Annotation Tasks with Citizen Scientists. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems(CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–11. https://doi.org/10.1145/3290605.3300522
[8]
Xiangyu Chen, Yadong Mu, Shuicheng Yan, and Tat-Seng Chua. 2010. Efficient large-scale image annotation by probabilistic collaborative multi-label propagation. In Proceedings of the 18th ACM international conference on Multimedia(MM ’10). Association for Computing Machinery, New York, NY, USA, 35–44. https://doi.org/10.1145/1873951.1873959
[9]
Lydia B. Chilton, Greg Little, Darren Edge, Daniel S. Weld, and James A. Landay. 2013. Cascade: crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems(CHI ’13). Association for Computing Machinery, New York, NY, USA, 1999–2008. https://doi.org/10.1145/2470654.2466265
[10]
Travis G. Coan, Constantine Boussalis, John Cook, and Mirjam O. Nanko. 2021. Computer-assisted classification of contrarian claims about climate change. Scientific Reports 11, 1 (Nov. 2021), 22320. https://doi.org/10.1038/s41598-021-01714-4 Number: 1 Publisher: Nature Publishing Group.
[11]
A. P. Dawid and A. M. Skene. 1979. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics 28, 1 (1979), 20. https://doi.org/10.2307/2346806
[12]
Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. FMA: A Dataset For Music Analysis. https://doi.org/10.48550/arXiv.1612.01840 arXiv:1612.01840 [cs].
[13]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. 248–255. https://doi.org/10.1109/CVPR.2009.5206848
[14]
Jia Deng, Olga Russakovsky, Jonathan Krause, Michael S. Bernstein, Alex Berg, and Li Fei-Fei. 2014. Scalable multi-label annotation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems(CHI ’14). Association for Computing Machinery, New York, NY, USA, 3099–3102. https://doi.org/10.1145/2556288.2557011
[15]
Alexandra Eveleigh, Charlene Jennett, Ann Blandford, Philip Brohan, and Anna L. Cox. 2014. Designing for dabblers and deterring drop-outs in citizen science. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems(CHI ’14). Association for Computing Machinery, New York, NY, USA, 2985–2994. https://doi.org/10.1145/2556288.2557262
[16]
Leah Findlater, Joan Zhang, Jon E. Froehlich, and Karyn Moffatt. 2017. Differences in Crowdsourced vs. Lab-based Mobile and Desktop Input Performance Data. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems(CHI ’17). Association for Computing Machinery, New York, NY, USA, 6813–6824. https://doi.org/10.1145/3025453.3025820
[17]
Association for Computing Machinery. 2023. The 2012 ACM Computing Classification System. https://www.acm.org/publications/class-2012
[18]
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 776–780. https://doi.org/10.1109/ICASSP.2017.7952261 ISSN: 2379-190X.
[19]
Raafat George Saadé and Camille Alexandre Otrakji. 2007. First impressions last a lifetime: effect of interface type on disorientation and cognitive load. Computers in Human Behavior 23, 1 (Jan. 2007), 525–535. https://doi.org/10.1016/j.chb.2004.10.035
[20]
Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The unreasonable effectiveness of data. IEEE intelligent systems 24, 2 (2009), 8–12.
[21]
Eric Humphrey, Simon Durand, and Brian McFee. 2018. OpenMIC-2018: An Open Data-set for Multiple Instrument Recognition. In ISMIR. 438–444.
[22]
Ioanna Iacovides, Charlene Jennett, Cassandra Cornish-Trestrail, and Anna L. Cox. 2013. Do games attract or sustain engagement in citizen science? a study of volunteer motivations. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems(CHI EA ’13). Association for Computing Machinery, New York, NY, USA, 1101–1106. https://doi.org/10.1145/2468356.2468553
[23]
Lilly C. Irani and M. Six Silberman. 2013. Turkopticon: interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems(CHI ’13). Association for Computing Machinery, New York, NY, USA, 611–620. https://doi.org/10.1145/2470654.2470742
[24]
Robert M. Jacobson, Paul V. Targonski, and Gregory A. Poland. 2007. A taxonomy of reasoning flaws in the anti-vaccine movement. Vaccine 25, 16 (April 2007), 3146–3152. https://doi.org/10.1016/j.vaccine.2007.01.046
[25]
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics Human Action Video Dataset. https://doi.org/10.48550/arXiv.1705.06950 arXiv:1705.06950 [cs].
[26]
Ashish Khetan and Sewoong Oh. 2017. Achieving Budget-optimality with Adaptive Schemes in Crowdsourcing. http://arxiv.org/abs/1602.03481 arXiv:1602.03481 [cs, stat].
[27]
Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems(CHI ’08). Association for Computing Machinery, New York, NY, USA, 453–456. https://doi.org/10.1145/1357054.1357127
[28]
Aniket Kittur, Boris Smus, and Robert Kraut. 2011. CrowdForge: crowdsourcing complex work. In CHI ’11 Extended Abstracts on Human Factors in Computing Systems(CHI EA ’11). Association for Computing Machinery, New York, NY, USA, 1801–1806. https://doi.org/10.1145/1979742.1979902
[29]
Ranjay Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, and Michael S. Bernstein. 2016. Embracing Error to Enable Rapid Crowdsourcing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. 3167–3179. https://doi.org/10.1145/2858036.2858115 arXiv:1602.04506 [cs].
[30]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images Dataset V4. International Journal of Computer Vision 128, 7 (July 2020), 1956–1981. https://doi.org/10.1007/s11263-020-01316-z
[31]
Joey J. Lee, Eduard Matamoros, Rafael Kern, Jenna Marks, Christian de Luna, and William Jordan-Cooley. 2013. Greenify: fostering sustainable communities via gamification. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems(CHI EA ’13). Association for Computing Machinery, New York, NY, USA, 1497–1502. https://doi.org/10.1145/2468356.2468623
[32]
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2015. Microsoft COCO: Common Objects in Context. https://doi.org/10.48550/arXiv.1405.0312 arXiv:1405.0312 [cs].
[33]
Chris J. Lintott, Kevin Schawinski, Anze Slosar, Kate Land, Steven Bamford, Daniel Thomas, M. Jordan Raddick, Robert C. Nichol, Alex Szalay, Dan Andreescu, Phil Murray, and Jan van den Berg. 2008. Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society 389, 3 (Sept. 2008), 1179–1189. https://doi.org/10.1111/j.1365-2966.2008.13689.x arXiv:0804.4483 [astro-ph].
[34]
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In 2015 IEEE International Conference on Computer Vision (ICCV). 3730–3738. https://doi.org/10.1109/ICCV.2015.425 ISSN: 2380-7504.
[35]
I. Scott MacKenzie and Colin Ware. 1993. Lag as a determinant of human performance in interactive systems. In Proceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems(CHI ’93). Association for Computing Machinery, New York, NY, USA, 488–493. https://doi.org/10.1145/169059.169431
[36]
Akihiro Miyata, Yusaku Murayama, Akihiro Furuta, Kazuki Okugawa, Keihiro Ochiai, and Yuko Murayama. 2022. Gamification strategies to improve the motivation and performance in accessibility information collection. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems(CHI EA ’22). Association for Computing Machinery, New York, NY, USA, 1–7. https://doi.org/10.1145/3491101.3519783
[37]
Meredith Ringel Morris, Jeffrey P. Bigham, Robin Brewer, Jonathan Bragg, Anand Kulkarni, Jessie Li, and Saiph Savage. 2017. Subcontracting Microwork. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems(CHI ’17). Association for Computing Machinery, New York, NY, USA, 1867–1876. https://doi.org/10.1145/3025453.3025687
[38]
Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What Can We Learn from Collective Human Opinions on Natural Language Inference Data? arXiv:2010.03532 [cs] (Oct. 2020). http://arxiv.org/abs/2010.03532
[39]
May Honey Ohn and Khin-Maung Ohn. 2019. An evaluation study on gamified online learning experiences and its acceptance among medical students. Tzu-Chi Medical Journal 32, 2 (June 2019), 211–215. https://doi.org/10.4103/tcmj.tcmj_5_19
[40]
Sharon Oviatt. 2006. Human-centered design meets cognitive load theory: designing interfaces that help people think. In Proceedings of the 14th ACM international conference on Multimedia(MM ’06). Association for Computing Machinery, New York, NY, USA, 871–880. https://doi.org/10.1145/1180639.1180831
[41]
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[42]
Joni O. Salminen, Hind Almerekhi, Milica Milenkovic, Soon-Gyo Jung, Jisun An, Haewoon Kwak, and Bernard Jim Jansen. 2018. Anatomy of Online Hate: Developing a Taxonomy and Machine Learning Models for Identifying and Classifying Hate in Online News Media. In International Conference on Web and Social Media.
[43]
Eunjin Seong and Seungjun Kim. 2020. Designing a Crowdsourcing System for the Elderly: A Gamified Approach to Speech Collection. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems(CHI EA ’20). Association for Computing Machinery, New York, NY, USA, 1–9. https://doi.org/10.1145/3334480.3382999
[44]
Nihar B. Shah, Sivaraman Balakrishnan, and Martin J. Wainwright. 2021. A Permutation-based Model for Crowd Labeling: Optimal Estimation and Robustness. IEEE Transactions on Information Theory 67, 6 (June 2021), 4162–4184. https://doi.org/10.1109/TIT.2020.3045613 arXiv:1606.09632 [cs, math, stat].
[45]
Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Much Ado About Time: Exhaustive Annotation of Temporal Data. https://doi.org/10.48550/arXiv.1607.07429 arXiv:1607.07429 [cs].
[46]
Rickard Stureborg, Bhuwan Dhingra, Jun Yang, and Lavanya Vasudevan. 2023. Development and validation of VaxConcerns: a taxonomy for vaccine concerns with crowdsource-viability. http://rickard.stureborg.com/papers/vax_taxonomy
[47]
Alexandra Swanson, Margaret Kosmala, Chris Lintott, Robert Simpson, Arfon Smith, and Craig Packer. 2015. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Scientific Data 2, 1 (June 2015), 150026. https://doi.org/10.1038/sdata.2015.26 Number: 1 Publisher: Nature Publishing Group.
[48]
John Sweller. 2011. CHAPTER TWO - Cognitive Load Theory. In Psychology of Learning and Motivation, Jose P. Mestre and Brian H. Ross (Eds.). Vol. 55. Academic Press, 37–76. https://doi.org/10.1016/B978-0-12-387691-1.00002-8
[49]
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for Fact Extraction and VERification. arXiv:1803.05355 [cs] (Dec. 2018). http://arxiv.org/abs/1803.05355 arXiv:1803.05355.
[50]
Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. 2018. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8769–8778.
[51]
Carl Vondrick, Donald Patterson, and Deva Ramanan. 2013. Efficiently Scaling up Crowdsourced Video Annotation. International Journal of Computer Vision 101, 1 (Jan. 2013), 184–204. https://doi.org/10.1007/s11263-012-0564-1
[52]
Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier Movellan, and Paul Ruvolo. 2009. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems, Vol. 22. Curran Associates, Inc. https://papers.nips.cc/paper/2009/hash/f899139df5e1059396431415e770c6dd-Abstract.html
[53]
Wayne A. Wickelgren. 1977. Speed-accuracy tradeoff and information processing dynamics. Acta Psychologica 41, 1 (Feb. 1977), 67–85. https://doi.org/10.1016/0001-6918(77)90012-9
[54]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana). Association for Computational Linguistics, 1112–1122. http://aclweb.org/anthology/N18-1101
[55]
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. 2016. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. https://doi.org/10.48550/arXiv.1506.03365 arXiv:1506.03365 [cs] version: 3.
[56]
Jingpu Zhang, Zuping Zhang, Zixiang Wang, Yuting Liu, and Lei Deng. 2018. Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification. Bioinformatics 34, 10 (May 2018), 1750–1757. https://doi.org/10.1093/bioinformatics/btx833
[57]
Mingrui Ray Zhang, Shumin Zhai, and Jacob O. Wobbrock. 2019. Text Entry Throughput: Towards Unifying Speed and Accuracy in a Single Performance Metric. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems(CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300866
[58]
Xiang Zhou, Yixin Nie, and Mohit Bansal. 2021. Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning. arXiv:2104.08676 [cs] (April 2021). http://arxiv.org/abs/2104.08676 arXiv:2104.08676.
