DOI: 10.1145/3584684.3597263
Research article
Open access

Invited Paper: Common Public Knowledge for Enhancing Machine Learning Data Sets

Published: 20 June 2023

Abstract

In this study, we show the advantages of incorporating multi-source knowledge from publicly available sources, such as ChatGPT and Wikipedia, into existing datasets to enhance the performance of machine learning models on routine tasks such as classification. Specifically, we propose using supplementary data from external sources and demonstrate the utility of widely accessible knowledge on the Forest Cover Type Prediction task for the Roosevelt National Forest of Northern Colorado. Additionally, we show an improvement in classification accuracy on the Isolated Letter Speech Recognition (ISOLET) dataset when information on regional accents is incorporated into the prediction of spoken English letter names.
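To make the idea concrete, the following is a minimal sketch (not the authors' code) of how a feature derived from public knowledge could be joined onto the Covertype data before training a random forest. The per-soil-type lookup table is a placeholder: in practice its values would be gathered from public sources such as the USDA-NCSS soil survey, Wikipedia, or ChatGPT answers, and scikit-learn's `fetch_covtype` stands in for the original task files.

```python
# Sketch: join a placeholder "public knowledge" feature onto Covertype and
# compare a random-forest classifier with and without it. The lookup values
# are random, so this demo will NOT reproduce the paper's accuracy gain;
# only real externally sourced values could do that.
import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = fetch_covtype(as_frame=True)          # UCI Covertype (581,012 rows)
X, y = data.data, data.target

# Covertype encodes soil type as 40 one-hot Soil_Type columns; recover the
# soil-type index per sample so it can be joined with external knowledge.
soil_cols = [c for c in X.columns if c.startswith("Soil_Type")]
soil_index = X[soil_cols].to_numpy().argmax(axis=1)

# Hypothetical per-soil-type property (e.g., an erodibility or carbon score)
# that would come from a public source; random values stand in here.
rng = np.random.default_rng(0)
lookup = rng.uniform(0.0, 1.0, size=len(soil_cols))

X_aug = X.copy()
X_aug["soil_property_from_public_source"] = lookup[soil_index]

# Random subsample to keep the demo fast.
subset = rng.choice(len(X), size=30_000, replace=False)

for name, features in [("baseline", X), ("augmented", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features.iloc[subset], y.iloc[subset],
        test_size=0.2, random_state=0, stratify=y.iloc[subset])
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"{name} accuracy: {clf.score(X_te, y_te):.3f}")
```

The ISOLET experiment described in the abstract would follow the same pattern: accent-related information obtained from public phonetics resources is appended as additional columns before training.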

Supplementary Material

MP4 File (ApPLIED23_Dolev_Ilani.mp4)
Presentation video - short version



Published In

ApPLIED 2023: Proceedings of the 5th workshop on Advanced tools, programming languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems
June 2023
103 pages
ISBN: 9798400701283
DOI: 10.1145/3584684
This work is licensed under a Creative Commons Attribution International 4.0 License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. ontology
  2. machine learning
  3. random forests
  4. feature engineering
  5. world knowledge
  6. speech recognition
  7. isolated letter
  8. forest management
  9. tree cover type
  10. ChatGPT

Qualifiers

  • Research-article

Funding Sources

  • Israel Science Foundation (ISF)

Conference

ApPLIED 2023

Acceptance Rates

Overall Acceptance Rate 3 of 4 submissions, 75%
