DOI: 10.1145/3616855.3635772
research-article
K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization

Published: 04 March 2024

Abstract

Large language models (LLMs) have achieved great success in general domains of natural language processing. In this paper, we bring LLMs to the realm of geoscience with the objective of advancing research and applications in this field. To this end, we present the first-ever LLM in geoscience, K2, alongside a suite of resources developed to further promote LLM research within geoscience. For instance, we have curated the first geoscience instruction-tuning dataset, GeoSignal, which aims to align LLM responses to geoscience-related user queries. Additionally, we have established the first geoscience benchmark, GeoBench, to evaluate LLMs in the context of geoscience. In this work, we experiment with a complete recipe for adapting a pre-trained general-domain LLM to the geoscience domain. Specifically, we further train the LLaMA-7B model on a 5.5B-token geoscience text corpus, including over 1 million pieces of geoscience literature, and use GeoSignal's supervised data to fine-tune the model. Moreover, we share a protocol that can efficiently gather domain-specific data and construct domain-supervised data, even in situations where manpower is scarce. Meanwhile, we equip K2 with the ability to use tools so that it can serve as a naive geoscience aide. Experiments conducted on GeoBench demonstrate the effectiveness of our approach and datasets for geoscience knowledge understanding and utilization. We open-source all the training data and K2 model checkpoints at https://github.com/davendw49/k2
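The alignment step described above (supervised fine-tuning on GeoSignal) can be illustrated with a small data-formatting sketch. This is a hypothetical example, not K2's actual pipeline: the `instruction`/`input`/`output` field names follow the common Alpaca-style convention, and GeoSignal's real schema may differ.

```python
# Sketch: rendering one supervised instruction-tuning record into a single
# training string, Alpaca-style. Field names are assumed, not GeoSignal's
# published schema.

def build_prompt(sample: dict) -> str:
    """Render an instruction/input/output record as one training prompt."""
    header = "Below is an instruction that describes a geoscience task."
    if sample.get("input"):  # include the optional context block if present
        return (
            f"{header}\n\n### Instruction:\n{sample['instruction']}\n\n"
            f"### Input:\n{sample['input']}\n\n### Response:\n{sample['output']}"
        )
    return (
        f"{header}\n\n### Instruction:\n{sample['instruction']}\n\n"
        f"### Response:\n{sample['output']}"
    )

# Hypothetical GeoSignal-like record for illustration.
example = {
    "instruction": "What type of rock forms when magma cools at the surface?",
    "input": "",
    "output": "Extrusive igneous rock, such as basalt.",
}

print(build_prompt(example))
```

During fine-tuning, strings like this would be tokenized and used as targets for standard causal-language-modeling loss; the choice of a fixed prompt template simply keeps the supervised data consistent across tasks.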




      Published In

      WSDM '24: Proceedings of the 17th ACM International Conference on Web Search and Data Mining
      March 2024
      1246 pages
      ISBN:9798400703713
      DOI:10.1145/3616855
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. foundation model
      2. geoscience knowledge mining
      3. geoscience large language model

      Qualifiers

      • Research-article


      Conference

      WSDM '24

      Acceptance Rates

      Overall Acceptance Rate 498 of 2,863 submissions, 17%



      Article Metrics

• Downloads (last 12 months): 511
• Downloads (last 6 weeks): 62
Reflects downloads up to 12 Nov 2024


      Cited By

• (2024) Development of a Large-scale Korean Language Model in the Field of Geosciences. Economic and Environmental Geology 57(5), 539-550. DOI: 10.9719/EEG.2024.57.5.539. Online publication date: 29-Oct-2024
• (2024) The Combined Use of GIS and Generative Artificial Intelligence in Detecting Potential Geodiversity Sites and Promoting Geoheritage. Resources 13(9), 119. DOI: 10.3390/resources13090119. Online publication date: 27-Aug-2024
• (2024) Bibliometric Analysis on the Research of Geoscience Knowledge Graph (GeoKG) from 2012 to 2023. ISPRS International Journal of Geo-Information 13(7), 255. DOI: 10.3390/ijgi13070255. Online publication date: 16-Jul-2024
• (2024) GeoLocator: A Location-Integrated Large Multimodal Model (LMM) for Inferring Geo-Privacy. Applied Sciences 14(16), 7091. DOI: 10.3390/app14167091. Online publication date: 13-Aug-2024
• (2024) Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application. ACM Transactions on Intelligent Systems and Technology. DOI: 10.1145/3699518. Online publication date: 8-Oct-2024
• (2024) When geoscience meets generative AI and large language models: Foundations, trends, and future challenges. Expert Systems. DOI: 10.1111/exsy.13654. Online publication date: 11-Jun-2024
• (2024) Assessing named entity recognition efficacy using diverse geoscience datasets. 2024 International Conference on Machine Intelligence for GeoAnalytics and Remote Sensing (MIGARS), 1-3. DOI: 10.1109/MIGARS61408.2024.10544642. Online publication date: 8-Apr-2024
• (2024) Where to Move Next: Zero-shot Generalization of LLMs for Next POI Recommendation. 2024 IEEE Conference on Artificial Intelligence (CAI), 1530-1535. DOI: 10.1109/CAI59869.2024.00277. Online publication date: 25-Jun-2024
• (2024) PreparedLLM: effective pre-pretraining framework for domain-specific large language models. Big Earth Data, 1-24. DOI: 10.1080/20964471.2024.2396159. Online publication date: 8-Sep-2024
• (2024) Future-proofing geotechnics workflows: accelerating problem-solving with large language models. Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards, 1-18. DOI: 10.1080/17499518.2024.2381026. Online publication date: 25-Jul-2024
      • Show More Cited By
