DOI: 10.1145/3603165.3607368

Empowering MultiModal Models’ In-Context Learning Ability through Large Language Models

Published: 25 September 2023

Abstract

Pretrained vision-language models (VLMs) have driven progress on multimodal modeling and improved a variety of tasks. However, they lack reasoning and in-context learning (ICL) ability. Building on the success of large language models (LLMs) in general-purpose NLP tasks, researchers anticipate that VLMs can acquire similarly strong reasoning and ICL abilities through specific techniques, for example by benefiting from LLMs. To enable VLMs to solve vision-language problems from few-shot exemplars, we propose a vision-language model called MIC.
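To make the few-shot exemplar idea concrete, the sketch below assembles a multimodal in-context prompt by interleaving demonstration (image, question, answer) triples with a final query. This is a minimal, hypothetical illustration of in-context learning in general, not MIC's actual interface: the <image_*> placeholder tokens, the Exemplar class, and the prompt layout are all illustrative assumptions, not names from the paper.

# A minimal, hypothetical sketch of few-shot multimodal in-context prompting.
# The <image_*> placeholders, the Exemplar class, and the prompt layout are
# illustrative assumptions; the paper does not specify MIC's interface.
from dataclasses import dataclass
from typing import List

@dataclass
class Exemplar:
    image_path: str  # demonstration image (a real VLM would encode this to visual features)
    question: str    # question about the image
    answer: str      # ground-truth answer shown to the model

def build_icl_prompt(exemplars: List[Exemplar],
                     query_question: str) -> str:
    """Interleave (image, question, answer) demonstrations with the query.

    Each <image_i> token marks where the VLM's image encoder would inject
    the visual features for the corresponding exemplar image.
    """
    parts = [
        f"<image_{i}> Question: {ex.question} Answer: {ex.answer}"
        for i, ex in enumerate(exemplars)
    ]
    parts.append(f"<image_q> Question: {query_question} Answer:")
    return "\n".join(parts)

# Usage: two demonstrations, then the model completes the final answer.
demos = [
    Exemplar("cat.jpg", "What animal is shown?", "A cat."),
    Exemplar("bus.jpg", "What vehicle is shown?", "A bus."),
]
print(build_icl_prompt(demos, "What animal is shown?"))

In such a setup, the language model conditions on the interleaved demonstrations just as it would on textual few-shot examples, which is what the few-shot exemplar approach relies on.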




Published In

ACM TURC '23: Proceedings of the ACM Turing Award Celebration Conference - China 2023
July 2023
173 pages
ISBN:9798400702334
DOI:10.1145/3603165
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. In-Context Learning
  2. Large Language Model
  3. Vision-Language Model

Qualifiers

  • Poster
  • Research
  • Refereed limited

Funding Sources

  • The Talent Fund of Beijing Jiaotong University

Conference

ACM TURC '23


