Abstract
This paper introduces a novel crowdsourcing worker selection algorithm that enhances annotation quality and reduces costs. Unlike previous studies targeting simpler tasks, this study contends with the complexities of label interdependencies in sequence labeling. The proposed algorithm uses a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection together with a cost-effective human feedback mechanism. The challenge of imbalanced and small-scale datasets, which hinders offline simulation of worker selection, is tackled with an innovative data augmentation method termed shifting, expanding, and shrinking (SES). Rigorous testing on the CoNLL 2003 NER and Chinese OEI datasets showcases the algorithm's efficiency, reaching up to 100.04% of the expert-only baseline's \(\text {F}_1\) score alongside cost savings of up to 65.97%. The paper also includes a dataset-independent test that emulates annotation evaluation through a Bernoulli distribution, which still achieves 97.56% of the expert baseline's \(\text {F}_1\) score with 59.88% cost savings. Furthermore, our approach can be seamlessly integrated into Reinforcement Learning from Human Feedback (RLHF) systems, offering a cost-effective solution for obtaining human feedback. All resources, including source code and datasets, are available to the broader research community at https://github.com/blcuicall/nlp-crowdsourcing.
Y. Wang and C. Huang—Equal contribution.
Notes
- 1.
- 2. The augmentation procedure takes about 2 h on a computer with a 2.9 GHz Quad-Core Intel Core i7 CPU.
References
Biswas, A., Jain, S., Mandal, D., Narahari, Y.: A truthful budget feasible multi-armed bandit mechanism for crowdsourcing time critical tasks. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1101–1109 (2015). http://dblp.uni-trier.de/db/conf/atal/aamas2015.html
Casper, S., et al.: Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217 (2023). https://arxiv.org/abs/2307.15217
Chen, W., Wang, Y., Yuan, Y.: Combinatorial multi-armed bandit: general framework and applications. In: Proceedings of the 30th International Conference on Machine Learning, pp. 151–159 (2013). https://proceedings.mlr.press/v28/chen13a.html
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(76), 2493–2537 (2011). http://jmlr.org/papers/v12/collobert11a.html
Derczynski, L.: Complementarity, F-score, and NLP evaluation. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 261–266. European Language Resources Association (ELRA), Portorož, Slovenia (2016). https://aclanthology.org/L16-1040
Erdogan, H.: Sequence labeling: generative and discriminative approaches. In: Proceedings of the 9th International Conference on Machine Learning and Applications, pp. 1–132 (2010). https://www.icmla-conference.org/icmla10/
Garcelon, E., Avadhanula, V., Lazaric, A., Pirotta, M.: Top k ranking for multi-armed bandit with noisy evaluations. In: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 6242–6269 (2022). https://proceedings.mlr.press/v151/garcelon22b.html
Howe, J.: The rise of crowdsourcing. Wired Mag. 14(6), 1–4 (2006). http://www.wired.com/wired/archive/14.06/crowds.html
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985). https://doi.org/10.1016/0196-8858(85)90002-8
Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
Li, F., Zhao, J., Yu, D., Cheng, X., Lv, W.: Harnessing context for budget-limited crowdsensing with massive uncertain workers. IEEE/ACM Trans. Netw. 30(5), 2231–2245 (2022). https://doi.org/10.1109/TNET.2022.3169180
Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press (1995). https://doi.org/10.1017/CBO9780511814075
Nangia, N., Sugawara, S., Trivedi, H., Warstadt, A., Vania, C., Bowman, S.R.: What ingredients make for an effective crowdsourcing protocol for difficult NLU data collection tasks? In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1221–1235. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.98, https://aclanthology.org/2021.acl-long.98
Nguyen, A.T., Wallace, B., Li, J.J., Nenkova, A., Lease, M.: Aggregating and predicting sequence labels from crowd annotations. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 299–309. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/P17-1028, https://aclanthology.org/P17-1028
Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the International Conference on Multimedia Information Retrieval, pp. 557–566 (2010). https://doi.org/10.1145/1743384.1743478
Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995). https://aclanthology.org/W95-0107
Rangi, A., Franceschetti, M.: Multi-armed bandit algorithms for crowdsourcing systems with online estimation of workers’ ability. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1345–1352 (2018). http://dl.acm.org/citation.cfm?id=3237900
Rodrigues, F., Pereira, F., Ribeiro, B.: Sequence labeling with multiple annotators. Mach. Learn. 95(2), 165–181 (2014). https://doi.org/10.1007/s10994-013-5411-2
Rodrigues, F., Pereira, F.C.: Deep learning from crowds. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, pp. 1611–1618 (2018). https://doi.org/10.1609/aaai.v32i1.11506
Simpson, E., Gurevych, I.: A Bayesian approach for sequence tagging with crowds. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1093–1104. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1101, https://aclanthology.org/D19-1101
Song, Y., Jin, H.: Minimizing entropy for crowdsourcing with combinatorial multi-armed bandit. In: IEEE Conference on Computer Communications (INFOCOM), pp. 1–10 (2021). https://doi.org/10.1109/INFOCOM42981.2021.9488800
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147 (2003). https://aclanthology.org/W03-0419
Tran-Thanh, L., Stein, S., Rogers, A., Jennings, N.R.: Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artif. Intell. 214, 89–111 (2014). https://doi.org/10.1016/j.artint.2014.04.005
Venanzi, M., Guiver, J., Kazai, G., Kohli, P., Shokouhi, M.: Community-based Bayesian aggregation models for crowdsourcing. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 155–164 (2014). https://doi.org/10.1145/2566486.2567989
Zhang, X., Xu, G., Sun, Y., Zhang, M., Wang, X., Zhang, M.: Identifying Chinese opinion expressions with extremely-noisy crowdsourcing annotations. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2801–2813. Association for Computational Linguistics, Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.acl-long.200, https://aclanthology.org/2022.acl-long.200
İren, D., Bilgen, S.: Cost of quality in crowdsourcing. Hum. Comput. 1 (2014). https://doi.org/10.15346/hc.v1i2.14
Acknowledgements
This work was supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 23YJCZH264) and by funds from the Research Project of the National Language Commission (No. ZDA145-17).
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Appendices
A Feedback Simulator
The performance of crowd workers can vary across different types of annotation tasks. To evaluate the Exp.+MV worker selection method under more stable, task-independent conditions, we do not actually annotate the sentences; instead, we use a worker's average \(\text {F}_1\) score to simulate their score on each sentence they annotate. The simulated scores serve as the numerical feedback for worker selection. Specifically, for each worker w, we calculate in advance two average \(\text {F}_1\) scores over all of their annotations on the original dataset, one under expert evaluation and one under majority-vote (MV) evaluation, denoted \(\bar{\varphi }_w^{Exp.}\) and \(\bar{\varphi }_w^{MV}\) respectively. At each time step t, for every sentence \(s_i\) in the sentence set \(S_t\) to be annotated, we ask K different workers from the currently selected workers \(W_t\) to annotate it. We then draw a random value between 0 and 1 as the agreement level \(\kappa \). If \(\kappa \) exceeds the threshold \(\tau \) (set to 0.4 in Exp.+MV), we independently generate feedback for the K workers from a Bernoulli distribution with success probability \(\bar{\varphi }_w^{MV}\); otherwise, the feedback is generated from a Bernoulli distribution with success probability \(\bar{\varphi }_w^{Exp.}\). The span-level average \(\text {F}_1\) scores of the datasets annotated with different worker selection algorithms are shown in Table 5. Our feedback mechanism Exp.+MV achieved performance comparable to the expert-only mechanism Exp. (68.29 versus 69.78), while at the same time reducing expert involvement in evaluation by 59.88% under these dataset-independent conditions.
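For concreteness, the following is a minimal sketch of this simulator in Python. The variable and function names (simulate_feedback, avg_f1_exp, avg_f1_mv, tau, K) are ours for illustration and are not taken from the released code.

```python
import random

def simulate_feedback(selected_workers, avg_f1_exp, avg_f1_mv, K=4, tau=0.4):
    """Simulate the feedback for one sentence annotated by K workers.

    avg_f1_exp / avg_f1_mv map a worker id to that worker's average F1 under
    expert and majority-vote evaluation on the original dataset.
    Returns a dict {worker_id: 0 or 1} of Bernoulli feedback.
    """
    workers = random.sample(list(selected_workers), K)
    kappa = random.random()  # simulated agreement level in [0, 1)
    feedback = {}
    for w in workers:
        # High agreement -> MV-based evaluation; low agreement -> expert-based.
        p = avg_f1_mv[w] if kappa > tau else avg_f1_exp[w]
        feedback[w] = 1 if random.random() < p else 0
    return feedback
```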
B Regret Analysis
We provide a brief regret analysis of the worker selection framework, assuming that we use the \(\epsilon \)-greedy algorithm and that each worker's reward follows a Bernoulli distribution.
The main proof follows the proof of Theorem 1 in [7]. The key contribution here is that we need to show that the evaluation signal (generated by majority voting) is a generalized linear model of the workers' true reward signal (generated by the expert/oracle). To this end, we use the following form of the Chernoff bound, which applies to any random variables with bounded support.
Lemma 1
(Chernoff Bound [12]) Let \(X_1, X_2, \cdots , X_N\) be independent random variables such that \(x_l \le X_i \le x_h\) for all \(i \in \{1,2, \cdots , N\}\). Let \(X=\sum _{i=1}^{N}X_i\) and \(\mu =\mathbb {E}(X)\). Given any \(\delta >0\), we have the following result:
For the purpose of our discussion, let \(X_i \in \{0,1\}\) be a binary random variable, where \(X_i=0\) denotes that worker i provides an incorrect solution, and \(X_i=1\) denotes that worker i generates a correct solution. Define \(X=\sum _{i\in \mathcal {N}}X_i\).
We aim to approximate \(P_\textrm{MV}\), which is the probability that the majority of the N workers provide the correct estimate. We apply the Chernoff Bound in Lemma 1 to \(P_\textrm{MV}\). We can compute
Based on (9), we let \(\mu =\mathbb {E}(X), \;\delta =\frac{N(\bar{p}-\frac{1}{2})}{\frac{N}{2}+N(\bar{p}-\frac{1}{2})}\), \(x_l=0\), \(x_h=1\), and get the following result:
Through approximating \(P_\textrm{MV}\) by its lower bound in (14), we can see that the evaluation signal (represented by \(P_\textrm{MV}\)) is an increasing, twice-differentiable function of each worker's capability \(p_i\). That is, \(P_\textrm{MV}\) is a generalized linear function, which satisfies Assumption 3 in [7]. Therefore, one can follow the proof of Theorem 1 in [7] to show that the \(\epsilon \)-greedy algorithm yields sub-linear regret of order \(\tilde{O}(T^{2/3})\).
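Since the displayed bounds (9) and (14) are not reproduced above, the following shows, purely for intuition, what the stated substitutions give under a standard multiplicative Chernoff bound for independent \([0,1]\)-valued variables; this is our own reconstruction under those assumptions, not necessarily the exact expression used in the paper. With \(\mu = \mathbb {E}(X) = N\bar{p}\) and \(\delta =\frac{N(\bar{p}-\frac{1}{2})}{\frac{N}{2}+N(\bar{p}-\frac{1}{2})} = \frac{\bar{p}-\frac{1}{2}}{\bar{p}}\), so that \((1-\delta )\mu = \frac{N}{2}\), we would obtain
\[ P_\textrm{MV} = \Pr \Big (X > \tfrac{N}{2}\Big ) = 1 - \Pr \big (X \le (1-\delta )\mu \big ) \ge 1 - \exp \Big (-\tfrac{\mu \delta ^2}{2}\Big ) = 1 - \exp \Big (-\tfrac{N(\bar{p}-\frac{1}{2})^2}{2\bar{p}}\Big ), \]
a lower bound that is increasing in each \(p_i\) through \(\bar{p}\).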
C Case Study of Annotation Errors
Based on our statistical analysis of the Chinese OEI dataset, we find that 74.80% of annotations contain errors of some type, and that these annotation errors can be decomposed into three basic error types, namely Shifting, Expanding, and Shrinking (SES). In our data augmentation algorithm, we apply SES modifications and their combinations to the ground-truth annotations to synthesize annotations with the varying errors made by crowd workers. In this section, we provide a detailed characterization of the human-made errors observed in the annotated data, with real cases, to better motivate these modifications.
Shifting. Some crowd annotation spans are the same length as the expert ones, but their positions are wrong. Shifting simulates this type of error. As depicted in Fig. 5, both the expert span and the crowd span are three words long and of negative polarity; the difference is that the crowd span is shifted two words to the left of the expert span. This type of error can be generated by applying Shifting to the expert annotations.
Expanding. Expanding is used to generate error spans that are longer than the expert span. It might seem intuitive that annotators rarely make errors as extreme as expanding a span to a very long one. However, in the case illustrated in Fig. 6, the expert annotates five short spans separated by commas, while the crowd worker uses a single very long span covering the whole sentence, which is clearly inaccurate. To simulate such human-made errors, we expand an expert span to cover unnecessary words. Statistically, 4.03% of annotation errors are very long spans of more than 15 Chinese characters, so we do not set an upper bound on span length in Expanding.
Shrinking. Shrinking is useful since crowd workers often ignore some words when annotating. As shown in Fig. 7, the crowd worker failed to find all words expressing positive opinions.
Sometimes crowd workers miss a whole span present in the expert annotation. This is why we set the lower bound of span length to zero in Shrinking, meaning a span can be shrunk away entirely.
These three types of errors may occur separately or in combination in real crowd annotations; for instance, a span can be both shifted and shrunk. This is why we use combinations of these three modifications to simulate human-made errors in our data augmentation algorithm, as sketched in the code below.
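The following is a minimal sketch of the three modifications, assuming spans are represented as (start, end, label) token offsets with an exclusive end; the function names and offset convention are ours for illustration and do not necessarily match the released implementation.

```python
from typing import Optional, Tuple

Span = Tuple[int, int, str]  # (start, end) token offsets, exclusive end, plus label

def shift(span: Span, offset: int, n_tokens: int) -> Span:
    """Move the span left (negative offset) or right, clamped to sentence boundaries."""
    start, end, label = span
    return (max(0, start + offset), min(n_tokens, end + offset), label)

def expand(span: Span, left: int, right: int, n_tokens: int) -> Span:
    """Grow the span by `left` tokens on the left and `right` tokens on the right.
    No upper bound is placed on the resulting length."""
    start, end, label = span
    return (max(0, start - left), min(n_tokens, end + right), label)

def shrink(span: Span, left: int, right: int) -> Optional[Span]:
    """Cut tokens from both sides; may remove the span entirely (lower bound of zero)."""
    start, end, label = span
    start, end = start + left, end - right
    return (start, end, label) if start < end else None
```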
D A Running Example of Data Augmentation
Here we provide a running example to illustrate how an annotation for a certain worker on a certain sentence is generated with our proposed augmentation method. Suppose we have an English sentence:
Although he looked very depressed yesterday, he has already become much more cheerful now.
And an expert annotation:
Although he looked [NEGATIVE: very depressed] yesterday, he has already become [POSITIVE: much more cheerful] now.
If the crowd worker Sam has an annotation on this sentence in the original dataset, we use it directly in the augmented dataset. Otherwise, we generate an annotation for Sam with our data augmentation method.
When generating an annotation for Sam, we follow the steps below:

1. For each span in the expert annotation, we apply the Shifting, Expanding, and Shrinking (SES) modifications to it. After this step, we have several lists of annotations, each containing annotations with only one modified span:
   - List 1, modifications of the first span, containing \(N_1\) annotations:
     - Although he looked [NEGATIVE: very depressed] yesterday, he has already become much more cheerful now. # Unmodified, span-level proportional F1 = 1.0
     - Although he looked very [NEGATIVE: depressed yesterday], he has already become much more cheerful now. # Shifting, F1 = 0.5
     - Although he looked very depressed [NEGATIVE: yesterday ,] he has already become much more cheerful now. # Shifting, F1 = 0
     - \(\ldots \) # Other Shifting modifications
     - Although he [NEGATIVE: looked very depressed] yesterday, he has already become much more cheerful now. # Expanding, F1 = 1.0
     - \(\ldots \) # Other Expanding modifications
     - Although he looked very [NEGATIVE: depressed] yesterday, he has already become much more cheerful now. # Shrinking, F1 = 0.5
     - \(\ldots \) # Other Shrinking modifications
   - List 2, modifications of the second span, containing \(N_2\) annotations:
     - Although he looked very depressed yesterday, he has already become [POSITIVE: much more cheerful] now. # Unmodified, span-level proportional F1 = 1.0
     - Although he looked very depressed yesterday, he has already become much [POSITIVE: more cheerful now]. # Shifting, F1 = 0.6667
     - Although he looked very depressed yesterday, he has already become much more [POSITIVE: cheerful now .] # Shifting, F1 = 0.3334
     - \(\ldots \) # Other Shifting modifications
     - Although he looked very depressed yesterday, he has already [POSITIVE: become much more cheerful] now. # Expanding, F1 = 1.0
     - \(\ldots \) # Other Expanding modifications
     - Although he looked very depressed yesterday, he has already become much [POSITIVE: more cheerful] now. # Shrinking, F1 = 0.6667
     - \(\ldots \) # Other Shrinking modifications
2. We choose one annotation from each list and combine them to generate an annotation with two spans. This is done for all combinations of annotations from the two lists. Note that if the two spans overlap with each other, we merge them into one span. After step 2, we have one list of annotations (a code sketch of steps 2 and 3 is given at the end of this appendix):
   - Combined List, containing at most \(N_1 \times N_2\) annotations:
     - Although he looked [NEGATIVE: very depressed] yesterday, he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 1.0
     - Although he looked very [NEGATIVE: depressed yesterday], he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 0.75
     - Although he looked very depressed [NEGATIVE: yesterday ,] he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 0.5
     - \(\ldots \) # Other combinations with F1 ranging from 0 to 1.0
3. We choose one annotation from the combined list as Sam's annotation on this sentence, according to the following procedure:
   (a) Sam has an average F1 score of \(F_{\text {ori}} = 0.57\) on the original (real) dataset.
   (b) We have already generated 10 annotations for Sam in the augmented dataset, with an average F1 score of \(F_{\text {aug}\_\text {10}} = 0.54\).
   (c) We are now choosing an annotation for Sam on the 11th sentence.
   (d) We first select, as candidate annotations, the two annotations from the combined list whose F1 scores are closest to \(F_{\text {ori}}\), one higher than \(F_{\text {ori}}\) and one lower. In this case, the two candidates could have F1 scores of 0.58 and 0.52 respectively.
   (e) If \(F_{\text {aug}\_\text {10}} > F_{\text {ori}}\), we choose the candidate with the lower F1 score (0.52) as Sam's annotation on this sentence; otherwise, we choose the candidate with the higher F1 score (0.58). This ensures that the average F1 score of Sam's annotations over the whole augmented dataset, \(F_{\text {aug}}\), stays as close to \(F_{\text {ori}}\) as possible, which reflects Sam's reliability (i.e., performance). In this case, since \(F_{\text {aug}\_\text {10}} < F_{\text {ori}}\), we choose the annotation with an F1 score of 0.58.

By generating the missing annotations in the original dataset with the method above, we obtain an augmented dataset.
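As a reference for steps 2 and 3 above, the following is a minimal sketch in Python of the combine-and-select procedure, assuming each candidate annotation is stored together with its span-level proportional F1 score and that the sentence-level F1 is the average of the per-span F1s (consistent with the numbers in the example); the data structures and function names are ours for illustration and are not taken from the released code.

```python
from itertools import product
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]           # (start, end, label), exclusive end
Candidate = Tuple[List[Span], float]  # (spans, span-level proportional F1)

def merge_overlaps(spans: List[Span]) -> List[Span]:
    """Merge overlapping spans into one, keeping the label of the earlier span."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged.pop()
            merged.append((prev_start, max(prev_end, end), prev_label))
        else:
            merged.append((start, end, label))
    return merged

def combine(list1: List[Candidate], list2: List[Candidate]) -> List[Candidate]:
    """Step 2: pair every annotation from List 1 with every annotation from List 2."""
    combined = []
    for (spans1, f1_1), (spans2, f1_2) in product(list1, list2):
        spans = merge_overlaps(spans1 + spans2)
        combined.append((spans, (f1_1 + f1_2) / 2))  # average of per-span F1 scores
    return combined

def select(candidates: List[Candidate], f_ori: float, f_aug_so_far: float) -> Optional[Candidate]:
    """Step 3: pick the candidate that pulls the worker's augmented average toward f_ori."""
    higher = [c for c in candidates if c[1] >= f_ori]
    lower = [c for c in candidates if c[1] < f_ori]
    cand_high = min(higher, key=lambda c: c[1] - f_ori) if higher else None
    cand_low = max(lower, key=lambda c: c[1]) if lower else None
    if f_aug_so_far > f_ori:
        return cand_low if cand_low is not None else cand_high
    return cand_high if cand_high is not None else cand_low
```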
E Explanation of Worker Selection on Building New Datasets
When creating new datasets, we expect to have a few experts (e.g. five) and a relatively large group of crowd workers (e.g. a hundred) available for annotation.
At each time step, we select a group of crowd workers (e.g. 20) and request them to annotate a few sentences (e.g. 5), resulting in 4 crowd annotations on each sentence. We then calculate the agreement of the annotations on each sentence. If the agreement is high (e.g. greater than 0.4), we use the MV aggregation of the crowd annotations as the ground truth and calculate the F1 score of each worker's annotation against it. Otherwise, we ask an expert to annotate the sentence and calculate each worker's F1 score against the expert annotation. Note that the expert annotates only when the agreement is low. After this time step, we have crowd annotations on the sentences and their F-scores, which are used to update the average score of each worker. This procedure is repeated until we have enough annotations for every sentence.
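A minimal sketch of one such time step is shown below. It assumes the surrounding annotation system provides assign_workers, crowd_annotate, agreement, majority_vote, expert_annotate, and f1_score, and that worker_avg maps each worker to a running-average tracker with an update method; all of these names are ours for illustration.

```python
def expert_mv_step(selected_workers, sentences, worker_avg, tau=0.4, k_per_sentence=4):
    """One Expert+MV time step: annotate, evaluate (MV or expert), update worker averages.

    Returns the number of expert calls made, i.e. the expert cost of this step.
    """
    expert_calls = 0
    for sentence in sentences:
        workers = assign_workers(selected_workers, k_per_sentence)  # hypothetical helper
        annotations = {w: crowd_annotate(w, sentence) for w in workers}
        if agreement(annotations.values()) > tau:
            reference = majority_vote(annotations.values())  # cheap evaluation
        else:
            reference = expert_annotate(sentence)            # costly evaluation
            expert_calls += 1
        for w, ann in annotations.items():
            worker_avg[w].update(f1_score(ann, reference))   # running average of F1
    return expert_calls
```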
In other words, the Expert+MV approach works when both crowd workers and experts are available (e.g. on an online system) while building datasets. It is an iterative approach in which the expert annotates only when needed, and it saves expert annotation cost by using the MV aggregation of crowd annotations as the ground truth whenever possible. Our experimental results show that the Expert+MV approach saves 47.19% of the expert annotation cost on the Chinese OEI dataset and 65.97% on the CoNLL'03 dataset.
However, even when no expert is available, so that Expert+MV falls back to MV, we still observe that the MV approach outperforms the Random baseline (which is equivalent to the standard crowdsourcing procedure of randomly assigning an equal number of sentences to each worker) by a large margin. In this case, the MV approach saves 100% of the expert annotation cost while still producing crowd annotations of good quality. Please refer to Tables 3 and 4 for more detailed results.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Wang, Y. et al. (2025). Cost-Efficient Crowdsourcing for Span-Based Sequence Labeling: Worker Selection and Data Augmentation. In: Sun, M., et al. (eds.) Chinese Computational Linguistics (CCL 2024). Lecture Notes in Computer Science, vol. 14761. Springer, Singapore. https://doi.org/10.1007/978-981-97-8367-0_22
Print ISBN: 978-981-97-8366-3. Online ISBN: 978-981-97-8367-0.