

Cost-Efficient Crowdsourcing for Span-Based Sequence Labeling: Worker Selection and Data Augmentation

Conference paper in: Chinese Computational Linguistics (CCL 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14761)


Abstract

This paper introduces a novel crowdsourcing worker selection algorithm that enhances annotation quality and reduces costs. Unlike previous studies targeting simpler tasks, this study contends with the complexities of label interdependencies in sequence labeling. The proposed algorithm uses a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection together with a cost-effective human feedback mechanism. The challenge of imbalanced and small-scale datasets, which hinders offline simulation of worker selection, is tackled with an innovative data augmentation method termed shifting, expanding, and shrinking (SES). Rigorous testing on the CoNLL 2003 NER and Chinese OEI datasets showcases the algorithm's efficiency, reaching up to 100.04% of the expert-only baseline's \(\text {F}_1\) score alongside cost savings of up to 65.97%. The paper also includes a dataset-independent test that emulates annotation evaluation through a Bernoulli distribution, which still achieves 97.56% of the expert baseline's \(\text {F}_1\) score with 59.88% cost savings. Furthermore, our approach can be seamlessly integrated into Reinforcement Learning from Human Feedback (RLHF) systems, offering a cost-effective solution for obtaining human feedback. All resources, including source code and datasets, are available to the broader research community at https://github.com/blcuicall/nlp-crowdsourcing.

Y. Wang and C. Huang—Equal contribution.



Notes

  1. https://english.sina.com/weibo/.

  2. The augmentation procedure takes about two hours on a computer with a 2.9 GHz quad-core Intel Core i7 CPU.

References

  1. Biswas, A., Jain, S., Mandal, D., Narahari, Y.: A truthful budget feasible multi-armed bandit mechanism for crowdsourcing time critical tasks. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1101–1109 (2015). http://dblp.uni-trier.de/db/conf/atal/aamas2015.html

  2. Casper, S., et al.: Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217 (2023). https://arxiv.org/abs/2307.15217

  3. Chen, W., Wang, Y., Yuan, Y.: Combinatorial multi-armed bandit: general framework and applications. In: Proceedings of the 30th International Conference on Machine Learning, pp. 151–159 (2013). https://proceedings.mlr.press/v28/chen13a.html

  4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(76), 2493–2537 (2011). http://jmlr.org/papers/v12/collobert11a.html

  5. Derczynski, L.: Complementarity, F-score, and NLP evaluation. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 261–266. European Language Resources Association (ELRA), Portorož, Slovenia (2016). https://aclanthology.org/L16-1040

  6. Erdogan, H.: Sequence labeling: generative and discriminative approaches. In: Proceedings of the 9th International Conference on Machine Learning and Applications, pp. 1–132 (2010). https://www.icmla-conference.org/icmla10/

  7. Garcelon, E., Avadhanula, V., Lazaric, A., Pirotta, M.: Top k ranking for multi-armed bandit with noisy evaluations. In: Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, pp. 6242–6269 (2022). https://proceedings.mlr.press/v151/garcelon22b.html

  8. Howe, J.: The rise of crowdsourcing. Wired Mag. 14(6), 1–4 (2006). http://www.wired.com/wired/archive/14.06/crowds.html

  9. Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985). https://doi.org/10.1016/0196-8858(85)90002-8


  10. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

  11. Li, F., Zhao, J., Yu, D., Cheng, X., Lv, W.: Harnessing context for budget-limited crowdsensing with massive uncertain workers. IEEE/ACM Trans. Netw. 30(5), 2231–2245 (2022). https://doi.org/10.1109/TNET.2022.3169180


  12. Motwani, R., Raghavan, P.: Randomized Algorithms (1995). https://doi.org/10.1017/CBO9780511814075

  13. Nangia, N., Sugawara, S., Trivedi, H., Warstadt, A., Vania, C., Bowman, S.R.: What ingredients make for an effective crowdsourcing protocol for difficult NLU data collection tasks? In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1221–1235. Association for Computational Linguistics, Online (2021). https://doi.org/10.18653/v1/2021.acl-long.98, https://aclanthology.org/2021.acl-long.98

  14. Nguyen, A.T., Wallace, B., Li, J.J., Nenkova, A., Lease, M.: Aggregating and predicting sequence labels from crowd annotations. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 299–309. Association for Computational Linguistics, Vancouver, Canada (2017). https://doi.org/10.18653/v1/P17-1028, https://aclanthology.org/P17-1028

  15. Nowak, S., Rüger, S.: How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation. In: Proceedings of the International Conference on Multimedia Information Retrieval, pp. 557–566 (2010). https://doi.org/10.1145/1743384.1743478

  16. Ramshaw, L., Marcus, M.: Text chunking using transformation-based learning. In: Third Workshop on Very Large Corpora (1995). https://aclanthology.org/W95-0107

  17. Rangi, A., Franceschetti, M.: Multi-armed bandit algorithms for crowdsourcing systems with online estimation of workers’ ability. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1345–1352 (2018). http://dl.acm.org/citation.cfm?id=3237900

  18. Rodrigues, F., Pereira, F., Ribeiro, B.: Sequence labeling with multiple annotators. Mach. Learn. 95(2), 165–181 (2014). https://doi.org/10.1007/s10994-013-5411-2


  19. Rodrigues, F., Pereira, F.C.: Deep learning from crowds. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, pp. 1611–1618 (2018). https://doi.org/10.1609/aaai.v32i1.11506

  20. Simpson, E., Gurevych, I.: A Bayesian approach for sequence tagging with crowds. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1093–1104. Association for Computational Linguistics, Hong Kong, China (2019). https://doi.org/10.18653/v1/D19-1101, https://aclanthology.org/D19-1101

  21. Song, Y., Jin, H.: Minimizing entropy for crowdsourcing with combinatorial multi-armed bandit. In: IEEE Conference on Computer Communications (INFOCOM), pp. 1–10 (2021). https://doi.org/10.1109/INFOCOM42981.2021.9488800

  22. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147 (2003). https://aclanthology.org/W03-0419

  23. Tran-Thanh, L., Stein, S., Rogers, A., Jennings, N.R.: Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artif. Intell. 214, 89–111 (2014). https://doi.org/10.1016/j.artint.2014.04.005


  24. Venanzi, M., Guiver, J., Kazai, G., Kohli, P., Shokouhi, M.: Community-based Bayesian aggregation models for crowdsourcing. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 155–164 (2014). https://doi.org/10.1145/2566486.2567989

  25. Zhang, X., Xu, G., Sun, Y., Zhang, M., Wang, X., Zhang, M.: Identifying Chinese opinion expressions with extremely-noisy crowdsourcing annotations. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2801–2813. Association for Computational Linguistics, Dublin, Ireland (2022). https://doi.org/10.18653/v1/2022.acl-long.200, https://aclanthology.org/2022.acl-long.200

  26. İren, D., Bilgen, S.: Cost of quality in crowdsourcing. Hum. Comput. 1 (2014). https://doi.org/10.15346/hc.v1i2.14


Acknowledgements

This work was supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 23YJCZH264) and by funds from the Research Project of the National Language Commission (No. ZDA145-17).

Author information

Corresponding author: Liner Yang.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Appendices

A Feedback Simulator

The performance of crowd workers can vary across different types of annotation tasks. To evaluate the Exp.+MV worker selection method under more stable conditions, free of task-specific influence, we do not actually annotate the sentences; instead, we directly use a worker's average \(\text {F}_1\) score to simulate their score on each sentence they annotate. The simulated scores are used as the numerical feedback for worker selection. Specifically, for each worker w, we calculate in advance two average \(\text {F}_1\) scores over all of their annotations on the original dataset, computed with expert and majority vote (MV) evaluation respectively and denoted as \(\bar{\varphi }_w^{Exp.}\) and \(\bar{\varphi }_w^{MV}\). At each time step t, for every sentence \(s_i\) in the sentence set to be annotated \(S_t\), we ask K different workers from the currently selected workers \(W_t\) to annotate it. Then, we draw a random value between 0 and 1 as the agreement level \(\kappa \). If \(\kappa \) exceeds the threshold value \(\tau \) (set to 0.4 in Exp.+MV), we independently generate feedback for the K workers from a Bernoulli distribution with probability parameter \(\bar{\varphi }_w^{MV}\); otherwise, the feedback is generated from a Bernoulli distribution with probability parameter \(\bar{\varphi }_w^{Exp.}\). The span-level average \(\text {F}_1\) scores of the datasets annotated using different worker selection algorithms are shown in Table 5. Our Exp.+MV feedback mechanism for worker selection achieved performance comparable to the expert-only mechanism Exp. (68.29 versus 69.78), while at the same time reducing expert involvement in evaluation by 59.88% under these dataset-independent conditions.
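The following is a minimal Python sketch of this simulator. The worker names, average scores, and function names are hypothetical placeholders rather than names from the released code; only the decision rule described above (agreement threshold \(\tau = 0.4\), Bernoulli-distributed feedback with the MV or expert average as the parameter) is reproduced.

```python
import random

# Hypothetical per-worker average F1 scores measured once on the original
# dataset under expert and majority-vote (MV) evaluation respectively.
avg_f1_exp = {"w1": 0.72, "w2": 0.64, "w3": 0.58}   # \bar{\varphi}_w^{Exp.}
avg_f1_mv  = {"w1": 0.69, "w2": 0.61, "w3": 0.55}   # \bar{\varphi}_w^{MV}

TAU = 0.4  # agreement threshold used by Exp.+MV

def simulate_feedback(workers):
    """Return one simulated binary feedback per worker for a single sentence."""
    kappa = random.random()          # stand-in for the agreement level
    use_mv = kappa > TAU             # high agreement: evaluate against MV
    feedback = {}
    for w in workers:
        p = avg_f1_mv[w] if use_mv else avg_f1_exp[w]
        feedback[w] = 1 if random.random() < p else 0   # Bernoulli(p) draw
    return feedback, use_mv

feedback, used_mv = simulate_feedback(["w1", "w2", "w3"])
print(feedback, "MV-evaluated" if used_mv else "expert-evaluated")
```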

Table 5. The overall span-level proportional \(\text {F}_1\) scores of all methods with the feedback simulator.
Table 6. Span-level exact \(\text {F}_1\) scores of widely-accepted deep learning models on the Chinese OEI dataset. LSTM-Crowd-cat is from Nguyen et al. [14]. Bert-BiLSTM-CRF and Annotator-Adaptor are from Zhang et al. [25]. We provide results with and without our worker selection algorithm.
Algorithm 1. The worker selection algorithm with the Expert+MV metric.

B Regret Analysis

We provide a brief regret analysis of the worker selection framework assuming that we use the \(\epsilon \)-greedy algorithm and that each worker’s reward follows a Bernoulli distribution.

The main proof follows the proof of Theorem 1 in [7]. The key step here is to show that the evaluation signal (generated by majority voting) is a generalized linear function of the workers' true reward signal (generated by the expert/oracle). To this end, we utilize the following form of the Chernoff bound, which applies to any random variables with bounded support.

Lemma 1

(Chernoff Bound [12]) Let \(X_1, X_2, \cdots , X_N\) be independent random variables such that \(x_l \le X_i \le x_h\) for all \(i \in \{1,2, \cdots , N\}\). Let \(X=\sum _{i=1}^{N}X_i\) and \(\mu =\mathbb {E}(X)\). Given any \(\delta >0\), we have the following result:

$$\begin{aligned} P(X\le (1-\delta )\mu )\le e^{-\frac{\delta ^2 \mu ^2}{N(x_h-x_l)^2}}. \end{aligned}$$
(9)

For the purpose of our discussion, let \(X_i \in \{0,1\}\) be a binary random variable, where \(X_i=0\) denotes that worker i provides an incorrect solution, and \(X_i=1\) denotes that worker i provides a correct solution. Define \(X=\sum _{i=1}^{N}X_i\).

We aim to approximate \(P_\textrm{MV}\), which is the probability that the majority of the N workers provide the correct estimate. We apply the Chernoff Bound in Lemma 1 to \(P_\textrm{MV}\). We can compute

$$\begin{aligned} \mathbb {E}(X)=N\bar{p}=\sum _{i=1}^{N}p_i, \quad \text {where}\quad \bar{p}=\frac{\sum _{i=1}^{N}p_i}{N} . \end{aligned}$$
(10)

Based on (9), we let \(\mu =\mathbb {E}(X), \;\delta =\frac{N(\bar{p}-\frac{1}{2})}{\frac{N}{2}+N(\bar{p}-\frac{1}{2})}\), \(x_l=0\), \(x_h=1\), and get the following result:

$$\begin{aligned} &P_\textrm{MV}=P\left( X \ge \frac{N}{2}\right) \ge 1- P\left( X \le \frac{N}{2}\right) \nonumber \\ &\ge 1-e^{-\frac{\delta ^2 \mu ^2}{N}}\end{aligned}$$
(11)
$$\begin{aligned} &=1-e^{-\frac{\frac{N^2(\bar{p}-\frac{1}{2})^2}{[\frac{N}{2}+N(\bar{p}-\frac{1}{2})]^2}[\frac{N}{2}+N(\bar{p}-\frac{1}{2})]^2}{N}}\end{aligned}$$
(12)
$$\begin{aligned} &=1-e^{-\frac{N^2(\bar{p}-\frac{1}{2})^2}{N}}\end{aligned}$$
(13)
$$\begin{aligned} &=1-e^{-N\left( \frac{\sum _{i=1}^{N}p_i}{N}-\frac{1}{2}\right) ^2}. \end{aligned}$$
(14)

By approximating \(P_\textrm{MV}\) with its lower bound in (14), we see that the evaluation signal (represented by \(P_\textrm{MV}\)) is an increasing, twice-differentiable function of each worker's capability \(p_i\). That is, \(P_\textrm{MV}\) is a generalized linear function, which satisfies Assumption 3 in [7]. Therefore, following the proof of Theorem 1 in [7], the \(\epsilon \)-greedy algorithm yields sub-linear regret of order \(\tilde{O}(T^{2/3})\).
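To make the bound in (14) concrete, the short sketch below compares it with the exact majority-vote probability for the special case of homogeneous workers (all \(p_i = p\)). This is an illustrative check only, not part of the paper's experiments; the bound is loose for small N, but the argument above only relies on it being an increasing, smooth function of the workers' capabilities.

```python
import math

def exact_p_mv(n, p):
    """Exact P(X >= n/2) for n independent workers with identical accuracy p."""
    threshold = math.ceil(n / 2)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(threshold, n + 1))

def chernoff_lower_bound(n, p_bar):
    """Lower bound 1 - exp(-n * (p_bar - 1/2)^2) from Eq. (14)."""
    return 1 - math.exp(-n * (p_bar - 0.5) ** 2)

for n in (5, 11, 21, 51):
    p = 0.7
    print(n, round(exact_p_mv(n, p), 4), round(chernoff_lower_bound(n, p), 4))
```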

Algorithm 2. The annotation selection algorithm.

C Case Study of Annotation Errors

Based on our statistical analysis of the Chinese OEI dataset, we find that 74.80% of annotations contain errors of some type. These annotation errors can be decomposed into three basic error types, namely Shifting, Expanding, and Shrinking (SES). In our data augmentation algorithm, we apply SES modifications and their combinations to the ground-truth annotations to reproduce the varying errors made by crowd workers. In this section, we provide a detailed characterization of the human-made errors observed in annotated data, with real cases, to better motivate these modifications.

Shifting. Some crowd annotation spans are as long as expert ones, but their positions are wrong. Shifting simulates this type of error. As depicted in Fig. 5, both the expert span and the crowd span are three words long and of negative polarity. The difference is that the crowd span is shifted to the left by 2 words compared with the expert span. This type of error can be generated with Shifting on the expert annotations.

Fig. 5. A case in which the crowd worker annotates a span with correct length and polarity but incorrect position.

Expanding. Expanding is used to generate error spans that are longer than the expert span. One might assume that annotators rarely make errors such as expanding to a very long span. However, in the case illustrated in Fig. 6, the expert annotates five short spans separated by commas, while the crowd worker uses a single very long span that covers the whole sentence, which is clearly inaccurate. To simulate such human-made errors, we can expand an expert span to cover unnecessary words. Statistically, 4.03% of annotation errors are very long spans of more than 15 Chinese characters, so we do not set an upper bound on span length in Expanding.

Fig. 6. A case in which the crowd worker uses a very long span that covers the whole sentence.

Shrinking. Shrinking is useful since crowd workers often ignore some words when annotating. As shown in Fig. 7, the crowd worker failed to find all words expressing positive opinions.

Fig. 7. A case in which the crowd worker does not annotate all words with polarity.

Sometimes crowd workers miss a whole span present in the expert annotations. This is why we set the lower bound of span length to zero in Shrinking, meaning a span can be shrunk away entirely.

These three types of errors may occur separately or in combination in real crowd annotations; for example, a span may be both shifted and shrunk. This is why we use combinations of these three modifications to simulate human-made errors in our data augmentation algorithm. A minimal sketch of the three modifications follows.
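The sketch below is one possible realization of the three modifications for a span represented as half-open token indices (start, end). It is an illustration of the idea under these assumptions, not the released augmentation code; function names and the span representation are hypothetical.

```python
def shift(span, offset, n_tokens):
    """Shifting: move the span left/right by `offset` tokens, keeping its length."""
    start, end = span[0] + offset, span[1] + offset
    if start < 0 or end > n_tokens:
        return None                          # shifted out of the sentence
    return (start, end)

def expand(span, left, right, n_tokens):
    """Expanding: grow the span; no upper bound on the resulting length."""
    start, end = span
    return (max(0, start - left), min(n_tokens, end + right))

def shrink(span, left, right):
    """Shrinking: cut tokens from either side; may remove the span entirely."""
    start, end = span[0] + left, span[1] - right
    return (start, end) if start < end else None   # None = span shrunk to nothing

# Example: a 3-token span (tokens 3, 4, 5) in a 15-token sentence.
span = (3, 6)
print(shift(span, -2, 15))     # (1, 4): shifted left by two tokens
print(expand(span, 1, 0, 15))  # (2, 6): one extra token on the left
print(shrink(span, 0, 2))      # (3, 4): last two tokens dropped
```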

D A Running Example of Data Augmentation

Here we provide a running example to illustrate how an annotation by a certain worker on a certain sentence is generated with our proposed augmentation method. Suppose we have the English sentence:

Although he looked very depressed yesterday, he has already become much more cheerful now.

And an expert annotation:

Although he looked [NEGATIVE: very depressed] yesterday, he has already become [POSITIVE: much more cheerful] now.

If the crowd worker Sam has an annotation on this sentence in the original dataset, we use it directly in the augmented dataset. Otherwise, we generate an annotation for Sam with our data augmentation method.

When generating an annotation for Sam, we follow the steps below:

  1. For each span in the expert annotation, we apply the Shifting, Expanding, and Shrinking (SES) modifications to it. After this step, we have several lists of annotations, each containing annotations with only one modified span:

    • List 1, modifications of the first span, containing \(N_1\) annotations:

      • Although he looked [NEGATIVE: very depressed] yesterday, he has already become much more cheerful now. # Unmodified, span-level proportional F1 = 1.0

      • Although he looked very [NEGATIVE: depressed yesterday], he has already become much more cheerful now. # Shifting, F1 = 0.5

      • Although he looked very depressed [NEGATIVE: yesterday ,] he has already become much more cheerful now. # Shifting, F1 = 0

        \(\ldots \):

        # Other Shifting modifications

      • Although he [NEGATIVE: looked very depressed] yesterday, he has already become much more cheerful now. # Expanding, F1 = 1.0

        \(\ldots \):

        # Other Expanding modifications

      • Although he looked very [NEGATIVE: depressed] yesterday, he has already become much more cheerful now. # Shrinking, F1 = 0.5

        \(\ldots \):

        # Other Shrinking modifications

    • List 2, modifications of the second span, containing \(N_2\) annotations:

      • Although he looked very depressed yesterday, he has already become [POSITIVE: much more cheerful] now. # Unmodified, span-level proportional F1 = 1.0

      • Although he looked very depressed yesterday, he has already become much [POSITIVE: more cheerful now]. # Shifting, F1 = 0.6667

      • Although he looked very depressed yesterday, he has already become much more [POSITIVE: cheerful now .] # Shifting, F1 = 0.3334

        \(\ldots \):

        # Other Shifting modifications

      • Although he looked very depressed yesterday, he has already [POSITIVE: become much more cheerful] now. # Expanding, F1 = 1.0

        \(\ldots \):

        # Other Expanding modifications

      • Although he looked very depressed yesterday, he has already become much [POSITIVE: more cheerful] now. # Shrinking, F1 = 0.6667

        \(\ldots \):

        # Other Shrinking modifications

  2. We choose one annotation from each list and combine them to generate an annotation with two spans. This is done for all combinations of the annotations in the two lists. Note that if the two spans overlap with each other, we merge them into one span. After step 2, we have one list of annotations:

    Combined List, containing less than or equal to \(N_1 \times N_2\) annotations:

    • Although he looked [NEGATIVE: very depressed] yesterday, he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 1.0

    • Although he looked very [NEGATIVE: depressed yesterday], he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 0.75

    • Although he looked very depressed [NEGATIVE: yesterday ,] he has already become [POSITIVE: much more cheerful] now. # span-level proportional F1 = 0.5

      \(\ldots \):

      # Other combinations with F1 ranging from 0 to 1.0

  3. We choose one annotation from the combined list as Sam’s annotation on this sentence, according to the following procedure:

    (a) Sam has an average F1 score \(F_{\text {ori}} = 0.57\) on the original (real) dataset.

    (b) We have already obtained 10 annotations for Sam in the augmented dataset, which have an average F1 score \(F_{\text {aug}\_\text {10}} = 0.54\).

    (c) We are choosing an annotation for Sam on the 11th sentence.

    (d) We first select from the combined list the two annotations whose F1 scores are closest to \(F_{\text {ori}}\), one higher than \(F_{\text {ori}}\) and one lower than \(F_{\text {ori}}\), as candidate annotations. In this case, the two candidates could have F1 scores of 0.58 and 0.52 respectively.

    (e) If \(F_{\text {aug}\_\text {10}} > F_{\text {ori}}\), we choose the annotation with the lower F1 score (0.52) as Sam’s annotation on this sentence. Otherwise, we choose the annotation with the higher F1 score (0.58). This keeps the average F1 score of Sam’s annotations over the whole augmented dataset, \(F_{\text {aug}}\), as close to \(F_{\text {ori}}\) as possible, which reflects Sam’s reliability (i.e., performance). In this case, we choose the annotation with an F1 score of 0.58.

By generating the missing annotations in the original dataset with the method above, we obtain an augmented dataset. A minimal sketch of the candidate-selection rule in step 3 is given below.
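The following is a hypothetical implementation of the selection rule in step 3, reproducing the example above (\(F_{\text {ori}} = 0.57\), \(F_{\text {aug}\_\text {10}} = 0.54\)). The data structures and names are illustrative assumptions, not taken from the released code.

```python
def pick_candidate(candidates, f_ori, f_aug_running):
    """Pick one augmented annotation for a worker on a new sentence.

    candidates    : list of (annotation, f1) pairs from the combined list
    f_ori         : worker's average F1 on the original dataset
    f_aug_running : worker's current average F1 on the augmented dataset
    """
    higher = [c for c in candidates if c[1] >= f_ori]
    lower = [c for c in candidates if c[1] < f_ori]
    # Closest candidates above and below the worker's original average.
    best_high = min(higher, key=lambda c: c[1] - f_ori) if higher else None
    best_low = max(lower, key=lambda c: c[1]) if lower else None
    if best_high is None:
        return best_low
    if best_low is None:
        return best_high
    # Steer the running average on the augmented dataset back toward f_ori.
    return best_low if f_aug_running > f_ori else best_high

candidates = [("ann_a", 0.58), ("ann_b", 0.52), ("ann_c", 0.80), ("ann_d", 0.10)]
print(pick_candidate(candidates, f_ori=0.57, f_aug_running=0.54))  # ("ann_a", 0.58)
```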

E Explanation of Worker Selection on Building New Datasets

When creating new datasets, we expect to have a few (e.g. five) experts and a relatively large group of (e.g. a hundred) crowd workers available for annotation.

At each time step, we select a group of crowd workers (e.g. 20) and ask them to annotate a few sentences (e.g. 5), resulting in 4 crowd annotations on each sentence. We then calculate the agreement of the annotations on each sentence. If the agreement is high (e.g. greater than 0.4), we use the MV aggregation of the crowd annotations as the ground truth and calculate the F1 score of each worker’s annotation against it. Otherwise, we ask an expert to annotate the sentence and calculate each worker’s F1 score against the expert annotation. Note that the expert annotates only when the agreement is low. After this time step, we have crowd annotations on the sentences and their F1 scores, which are used to update the average score of each worker. This procedure is repeated until we have enough annotations on every sentence; a sketch of one such time step follows.
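The sketch below outlines one time step of this loop. The callables annotate, agreement, majority_vote, expert_annotate, and f1_score are hypothetical hooks that a concrete annotation platform would supply; only the control flow described above (agreement threshold of 0.4, expert fallback when agreement is low) is reproduced here.

```python
TAU = 0.4  # agreement threshold used by Expert+MV

def run_time_step(selected_workers, sentences, k, annotate, agreement,
                  majority_vote, expert_annotate, f1_score):
    """One time step of the Expert+MV loop sketched above.

    annotate(sentence, workers, k) -> {worker: annotation} collects k crowd
    annotations per sentence; agreement(...) measures inter-annotator
    agreement; majority_vote(...) aggregates annotations; f1_score(a, ref)
    scores one annotation against a reference. All are supplied by the caller.
    """
    feedback = {w: [] for w in selected_workers}
    expert_calls = 0
    for sentence in sentences:
        annotations = annotate(sentence, selected_workers, k)
        if agreement(list(annotations.values())) > TAU:
            reference = majority_vote(list(annotations.values()))  # crowd-only
        else:
            reference = expert_annotate(sentence)                  # expert fallback
            expert_calls += 1
        for worker, ann in annotations.items():
            feedback[worker].append(f1_score(ann, reference))
    # feedback updates each worker's running average score; expert_calls
    # tracks how much expert effort this time step actually consumed.
    return feedback, expert_calls
```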

In other words, the Expert+MV approach works when both crowd workers and experts are available (e.g. on an online system) for building datasets. Expert+MV is an iterative approach in which the expert annotates only when needed, and it saves expert annotation cost by using the MV aggregation of crowd annotations as the ground truth whenever possible. Our experimental results show that the Expert+MV approach saves 47.19% of the expert annotation cost on the Chinese OEI dataset and 65.97% on the CoNLL’03 dataset.

Table 7. Comparisons between different data augmentation methods on the span-level exact \(\text {F}_1\) score of every crowd worker. Ori. stands for the original score on the real datasets before any augmentation. Rnd. Gen. is a naive augmentation method with randomly generated annotations. SES Only is the shifting, expanding, and shrinking method we proposed. SES + Alg. 2 means SES combined with Algorithm 2, which is our final method.

However, even when no expert is available, in which case Expert+MV falls back to MV, the MV approach still outperforms the Random baseline (the equivalent of the standard crowdsourcing procedure, which randomly assigns an equal number of sentences to each worker) by a large margin. In this case, the MV approach saves 100% of the expert annotation cost while still producing crowd annotations of good quality. Please refer to Table 3 and Table 4 for more detailed results.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.


Cite this paper

Wang, Y. et al. (2025). Cost-Efficient Crowdsourcing for Span-Based Sequence Labeling: Worker Selection and Data Augmentation. In: Sun, M., et al. (eds.) Chinese Computational Linguistics. CCL 2024. Lecture Notes in Computer Science, vol. 14761. Springer, Singapore. https://doi.org/10.1007/978-981-97-8367-0_22


  • DOI: https://doi.org/10.1007/978-981-97-8367-0_22


  • Print ISBN: 978-981-97-8366-3

  • Online ISBN: 978-981-97-8367-0
