Abstract
With the development of Human-AI Collaboration in Classification (HAI-CC), integrating user and AI predictions becomes challenging due to the complexity of the decision-making process, which has three options: 1) AI autonomously classifies; 2) learning to complement, where AI collaborates with users; and 3) learning to defer, where AI defers to users. Despite their interconnected nature, these options have been studied in isolation rather than as components of a unified system. In this paper, we address this weakness with a novel HAI-CC methodology, called Learning to Complement and to Defer to Multiple Users (LECODU). LECODU not only combines learning-to-complement and learning-to-defer strategies, but also estimates the optimal number of users to engage in the decision process. The training of LECODU maximises classification accuracy and minimises the collaboration cost associated with user involvement. Comprehensive evaluations across real-world and synthesised datasets demonstrate LECODU's superior performance compared to state-of-the-art HAI-CC methods. Remarkably, even when relying on unreliable users with high rates of label noise, LECODU exhibits significant improvement over both human decision-makers alone and AI alone. Code is available at https://github.com/zhengzhang37/LECODU.git.
Supported by the Engineering and Physical Sciences Research Council (EPSRC) through grant EP/Y018036/1.
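To make the three-option decision process concrete, the sketch below shows one plausible way to set it up in PyTorch: a small gating network chooses among 2M+1 collaboration modes (AI alone, complement with k user predictions, or defer to k user predictions) via a Gumbel-softmax relaxation, and the training loss trades classification accuracy against the expected number of users consulted. All module names, the fusion rule (simple averaging), and the exact loss form are illustrative assumptions, not the authors' implementation; see the linked repository for the actual method.

```python
# Minimal sketch of a LECODU-style decision process, under the
# assumptions stated above. Not the official implementation:
# https://github.com/zhengzhang37/LECODU.git
import torch
import torch.nn as nn
import torch.nn.functional as F


class CollaborationGate(nn.Module):
    """Picks one of 2M+1 collaboration modes from the AI's logits:
    mode 0        -> AI predicts alone,
    modes 1..M    -> complement: fuse AI prediction with k user labels,
    modes M+1..2M -> defer: use k user labels only.
    (Gating on AI logits alone is a simplification for illustration.)"""

    def __init__(self, num_classes: int, max_users: int):
        super().__init__()
        self.max_users = max_users
        self.gate = nn.Linear(num_classes, 2 * max_users + 1)

    def forward(self, ai_logits: torch.Tensor) -> torch.Tensor:
        # Gumbel-softmax gives a differentiable, approximately one-hot
        # choice over the discrete collaboration modes.
        return F.gumbel_softmax(self.gate(ai_logits), tau=1.0, hard=False)


def collaboration_loss(mode_probs, ai_logits, user_onehots, targets, beta=0.1):
    """Cross-entropy of the mode-weighted prediction, plus a penalty
    proportional to the expected number of users consulted."""
    M = user_onehots.shape[1]
    ai_prob = ai_logits.softmax(dim=-1)
    preds = [ai_prob]                            # mode 0: AI alone
    for k in range(1, M + 1):                    # complement modes
        preds.append(0.5 * ai_prob + 0.5 * user_onehots[:, :k].mean(dim=1))
    for k in range(1, M + 1):                    # defer modes
        preds.append(user_onehots[:, :k].mean(dim=1))
    combined = (mode_probs.unsqueeze(-1) * torch.stack(preds, dim=1)).sum(dim=1)
    ce = F.nll_loss(combined.clamp_min(1e-8).log(), targets)
    # Users consulted per mode: 0, then 1..M for complement, 1..M for defer.
    users_per_mode = torch.tensor([0] + list(range(1, M + 1)) * 2,
                                  dtype=mode_probs.dtype,
                                  device=mode_probs.device)
    cost = (mode_probs * users_per_mode).sum(dim=1).mean()
    return ce + beta * cost


if __name__ == "__main__":
    B, C, M = 4, 10, 3  # batch size, classes, max users (toy values)
    gate = CollaborationGate(num_classes=C, max_users=M)
    ai_logits = torch.randn(B, C)
    user_onehots = F.one_hot(torch.randint(0, C, (B, M)), C).float()
    targets = torch.randint(0, C, (B,))
    loss = collaboration_loss(gate(ai_logits), ai_logits, user_onehots, targets)
    loss.backward()
    print(f"loss = {loss.item():.3f}")
```

At test time, the most probable mode would determine whether to query users at all and, if so, how many, corresponding to the estimate of the optimal number of users described in the abstract; the penalty weight beta controls the accuracy-versus-collaboration-cost trade-off.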
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Z., Ai, W., Wells, K., Rosewarne, D., Do, T.T., Carneiro, G. (2025). Learning to Complement and to Defer to Multiple Users. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_9
DOI: https://doi.org/10.1007/978-3-031-72992-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72991-1
Online ISBN: 978-3-031-72992-8
eBook Packages: Computer Science, Computer Science (R0)