2024
INDUS: Effective and Efficient Language Models for Scientific Applications
Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Nishan Pantha, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Michael M. Little, Elizabeth Fancher, Irina Gerasimov, Armin Mehrabian, Lauren Sanders, Sylvain V. Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, Michele Dolfi, Rafael Teixeira De Lima, Panagiotis Vagenas, S. Karthik Mukkavilli, Peter W. J. Staar, Sanaz Vahidinia, Ryan McGranaghan, Tsengdar J. Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) trained on general-domain corpora have shown remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained on domain-focused corpora perform better on specialized tasks. Inspired by this insight, we developed INDUS, a comprehensive suite of LLMs tailored for the closely related domains of Earth science, biology, physics, heliophysics, planetary sciences and astrophysics, and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address NLP tasks, (2) a contrastive-learning-based text embedding model trained using a diverse set of datasets to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation for applications with latency or resource constraints. We also created three new scientific benchmark datasets, Climate-Change NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. We show that our models outperform both general-purpose (RoBERTa) and domain-specific (SciBERT) encoders on these new tasks as well as existing tasks in the domains of interest. Furthermore, we demonstrate the use of these models in two industrial settings: as a retrieval model for large-scale vector search applications and in automatic content tagging systems.
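A minimal sketch of how a contrastive bi-encoder of this kind is typically used for dense retrieval: embed queries and documents with masked mean pooling and rank by cosine similarity. The checkpoint name is a hypothetical placeholder, not an actual INDUS model identifier, and the pooling choice is an assumption rather than the paper's exact recipe.

```python
# Dense retrieval sketch with a contrastive sentence encoder.
# "domain-encoder" is a hypothetical placeholder, not a real INDUS checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("domain-encoder")  # placeholder name
model = AutoModel.from_pretrained("domain-encoder").eval()

def embed(texts):
    """Mean-pool the last hidden states into one unit-norm vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)          # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

query = embed(["Which instruments measure sea surface temperature?"])
docs = embed(["MODIS retrieves sea surface temperature from thermal bands.",
              "The rover analyzed Martian regolith samples."])
print(query @ docs.T)  # cosine similarities; highest score = best match
```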
2023
Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
Young-Suk Lee, Md Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Astudillo
Findings of the Association for Computational Linguistics: EMNLP 2023
Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) it improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.
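A minimal sketch of the ensembling idea under stated assumptions: each LM contributes one candidate output, candidates are scored by average agreement with the other candidates, and only a high-consensus candidate is kept. The token-level F1 used here is a simple stand-in scorer, not necessarily the metric used in the paper.

```python
# Consensus selection over multiple LM outputs (assumes >= 2 candidates).
from collections import Counter

def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (stand-in agreement metric)."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def select_consensus(candidates, threshold=0.5):
    """Keep the candidate most similar on average to the others, or None
    if even the best one falls below the agreement threshold."""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(token_f1(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None

outputs = ["Paris is the capital of France.",
           "The capital of France is Paris.",
           "Berlin is the capital of France."]
print(select_consensus(outputs))
```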
2022
Zero-Shot Dynamic Quantization for Transformer Inference
Yousef El-kurdi, Jerry Quinn, Avi Sil
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure or require an additional calibration step, performed on a selected held-out dataset, to adjust parameters. Our method permits taking advantage of quantization without the need for these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.
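For context, a minimal sketch of conventional post-training dynamic INT8 quantization of a BERT-like model with stock PyTorch; this is the kind of baseline the paper's zero-shot run-time method is compared against, not the paper's method itself.

```python
# Standard dynamic INT8 quantization of a BERT-like model (baseline illustration).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

# Replace Linear layers with dynamically quantized INT8 versions; activations
# are quantized on the fly at inference time, so no calibration set is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Quantized inference sketch.", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits.shape)
```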
2020
Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers
Yousef El-Kurdi, Hiroshi Kanayama, Efsun Sarioglu Kayi, Vittorio Castelli, Todd Ward, Radu Florian
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track
We present scalable Universal Dependency (UD) treebank synthesis techniques that exploit advances in language representation modeling, which leverage vast amounts of unlabeled general-purpose multilingual text. We introduce a data augmentation technique that uses synthetic treebanks to improve production-grade parsers. The synthetic treebanks are generated using a state-of-the-art biaffine parser adapted with pretrained Transformer models, such as Multilingual BERT (M-BERT). The new parser improves LAS by up to two points on seven languages. The production models’ LAS performance improves as the augmented treebanks scale in size, surpassing the performance of production models trained on the originally annotated UD treebanks.
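A minimal sketch of the augmentation loop, assuming Stanza as a stand-in parser rather than the biaffine M-BERT parser used in the paper: parse unlabeled text and write the predictions as CoNLL-U so they can be appended to a training treebank.

```python
# Generate a synthetic treebank from unlabeled text (Stanza as stand-in parser).
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

unlabeled = ["The committee approved the proposal after a short debate.",
             "Synthetic treebanks can supplement manually annotated data."]

with open("synthetic.conllu", "w", encoding="utf-8") as out:
    for text in unlabeled:
        for sent in nlp(text).sentences:
            for w in sent.words:
                # 10-column CoNLL-U row: id, form, lemma, upos, xpos, feats,
                # head, deprel, deps, misc
                out.write("\t".join([str(w.id), w.text, w.lemma or "_",
                                     w.upos or "_", w.xpos or "_", w.feats or "_",
                                     str(w.head), w.deprel or "_", "_", "_"]) + "\n")
            out.write("\n")  # blank line ends each sentence
```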
2019
Multi-Granular Text Encoding for Self-Explaining Categorization
Zhiguo Wang, Yue Zhang, Mo Yu, Wei Zhang, Lin Pan, Linfeng Song, Kun Xu, Yousef El-Kurdi
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Self-explaining text categorization requires a classifier to make a prediction along with supporting evidence. A popular type of evidence is sub-sequences extracted from the input text which are sufficient for the classifier to make the prediction. In this work, we define multi-granular ngrams as basic units for explanation, and organize all ngrams into a hierarchical structure, so that shorter ngrams can be reused while computing longer ngrams. We leverage the tree-structured LSTM to learn a context-independent representation for each unit via parameter sharing. Experiments on medical disease classification show that our model is more accurate, efficient and compact than the BiLSTM and CNN baselines. More importantly, our model can extract intuitive multi-granular evidence to support its predictions.
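A minimal sketch of the multi-granular ngram hierarchy described above: every ngram up to a maximum length becomes a node whose children are its two constituent (n-1)-grams, so shorter units can be reused when composing longer ones. The tree-structured LSTM composition from the paper is not implemented here.

```python
# Build the ngram hierarchy; keys are (start_index, length) spans.
def build_ngram_hierarchy(tokens, max_n=3):
    nodes = {}  # (start, length) -> {"text": ..., "children": [...]}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Each ngram of length n > 1 reuses its two (n-1)-gram children.
            children = [] if n == 1 else [(i, n - 1), (i + 1, n - 1)]
            nodes[(i, n)] = {"text": " ".join(tokens[i:i + n]),
                             "children": children}
    return nodes

hierarchy = build_ngram_hierarchy("chest pain radiating to left arm".split())
node = hierarchy[(0, 2)]
print(node["text"], "->", [hierarchy[c]["text"] for c in node["children"]])
# "chest pain" is composed from the unigrams "chest" and "pain"
```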