
Day 1


Large Language Models:

Past, Present, and Future


Dr. Arpan Mandal
Research Associate, University of Southampton.

1
Brief Contents

• Brief overview of Natural Language Processing (NLP).
• What are large language models (or LLMs)?
• A brief history of language models: traditional language models, architectures of some popular LLMs, and what pre-training, fine-tuning and prompting are.
• Current state-of-the-art in LLMs.
• Live demo of a 13 billion parameter model.
• What are some of its major shortcomings? What future solutions can solve those problems?
• Conclusions and future directions for LLMs: How can AI be a potential danger for the society? What steps can we take?
• Question answering session.
2
Natural Language Processing (NLP)

A field of artificial intelligence (AI) that focuses on the interaction between computers and natural language.

Development of:
• algorithms,
• models,
• techniques

that enable computers to:
• understand,
• classify,
• extract,
• generate natural language.

3
Tasks in NLP

• Text Understanding: entailment; extracting dependency graphs based on grammar.
• Text Classification: sentiment analysis; detecting spam messages.
• Language Translation: real-time translation of natural language.
• Information Extraction: keyword extraction; entity extraction (dates, locations, persons, organizations, objects); question answering.
• Text Generation: abstractive summarization; dialogue systems; creative writing.

4
Language Models (LMs)

Definition: A language model is an algorithmic way of providing a probability distribution over words or word sequences, in a given context.

• Traditionally, we used statistical models: TF-IDF on n-grams, HMMs (Hidden Markov Models), CRFs (Conditional Random Fields).
• Recently, we use neural network models: LSTM, GRU, BERT, GPT, ChatGPT, Bard.

5
Brief history of language models -1

• Count-frequency-based probabilistic methods.
• Given many texts, we can count the frequency of each word in a context:
  • count how often the word “chocolate” appears after the word “like”,
  • P(“chocolate” | “like”) = count(“like chocolate”) / count(“like”),
  • do this for every word pair.
• Obvious drawback: the context used is only a single word.
• To tackle this problem, we can extend the context:
  • P(“chocolate” | “kids like”) = count(“kids like chocolate”) / count(“kids like”).
• We can further increase the context and call it an n-gram (where n denotes the number of words). A minimal code sketch follows below.
• Problem: increasing n increases space complexity.
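As a concrete illustration (not from the slides), here is a minimal sketch of such a count-based bigram model in Python, assuming a whitespace-tokenized toy corpus:

```python
from collections import Counter

# Toy corpus; in practice this would be a large text collection.
corpus = "kids like chocolate . adults like coffee . kids like games ."
tokens = corpus.split()

# Count single words and adjacent word pairs (bigrams).
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(word, context):
    """P(word | context) = count(context word) / count(context)."""
    if unigram_counts[context] == 0:
        return 0.0
    return bigram_counts[(context, word)] / unigram_counts[context]

print(bigram_prob("chocolate", "like"))  # 1/3 in this toy corpus
```

Extending the context to n-grams means counting longer tuples, which is exactly where the space complexity problem on the slide comes from.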

6
Brief history of language models -2

Example context: “I grew up in France. When I was younger, I used to regularly run to the shops to buy fresh baguettes for breakfast. I speak fluent __?__.”

• Recurrent Neural Networks (RNNs):
  • LSTM (Long Short-Term Memory units) in 1997.
  • GRU (Gated Recurrent Units) in 2014.
• Working principle:
  • Instead of n-grams, we use a hidden representation of the context.
  • The size of the hidden representation is fixed.
  • Sequential processing of the hidden representation at each step.
• Problems:
  • The range of context is still short.
  • Sequential processing is time consuming.
1. https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735 7
Brief history of language models -3

• Attention to the rescue, in 2014.

Example context: “I grew up in France. When I was younger, I used to regularly run to the shops to buy fresh baguettes for breakfast. I speak fluent __?__.”

• Working principle:
  • The size of the hidden representation is fixed.
  • Attention can be used with existing models.
  • This allows for a much larger context.
• Problems:
  • Serial processing still takes time to train and infer.
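For intuition, here is a minimal NumPy sketch of the attention idea: every query position looks at all context positions and takes a weighted mixture of their values. Note this is the scaled dot-product variant later popularized by Transformers, used here for brevity; the 2014 formulation scores query/key pairs with a small feed-forward network instead. Shapes and values are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight the values V by how well each query matches each key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the context
    return weights @ V  # context-dependent mixture of the values

# Toy example: 4 context positions, hidden size 8, 1 query position.
rng = np.random.default_rng(0)
K = V = rng.normal(size=(4, 8))  # hidden states of the context words
Q = rng.normal(size=(1, 8))      # hidden state of the current word
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 8)
```

Because every position can attend to every other position directly, the effective context is no longer limited to what fits in one fixed hidden state.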

8
Brief history of language models -4
• Transformers are the final answer, introduced in 2017.
• Initially built for translation.
• BERT (Bidirectional Encoder Representations from
Transformers)
• GPT (Generative Pretrained Transformers)
• Working principle:
• The size of hidden representation is fixed.
• Key component is the attention mechanism.
• This allows for a much larger context.
• Problems
• Limited input sequence length.

9
Vaswani et al. 2017 https://papers.neurips.cc/paper/7181-attention-is-all-you-need
Transformer Architecture

• The encoder processes inputs in parallel.
• But the decoder produces words sequentially.

https://jalammar.github.io/illustrated-transformer/ 10
BERT Architecture

1. BERT – Bidirectional Encoder Representations from Transformers, in 2019.
2. It uses a stack of 12 encoder layers.
3. We trained our own BERT using 25 GB of Indian legal documents; available on HuggingFace.
4. Just like transformer encoders, it processes the input in parallel.
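As a usage sketch (an illustration, not part of the original slide), the model mentioned above can be loaded from HuggingFace with the `transformers` library; the example sentence is made up:

```python
from transformers import AutoTokenizer, AutoModel

# Load the legal-domain BERT referenced on the slide from HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

inputs = tokenizer("The appellant filed a writ petition.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size of a BERT-base model)
```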

Devlin 2019 https://aclanthology.org/N19-1423 11


InLegalBERT, https://huggingface.co/law-ai/InLegalBERT
Pre-training BERT

1. Masked Language modeling (MLM).

2. Next Sentence Prediction (NSP).
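To make the MLM objective concrete, a minimal sketch using the `transformers` fill-mask pipeline with the standard `bert-base-uncased` checkpoint (the model choice and example sentence are assumptions, not from the slides):

```python
from transformers import pipeline

# BERT was pre-trained to recover tokens hidden behind [MASK].
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The court dismissed the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```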

https://towardsdatascience.com/understanding-masked-language-models-mlm-and-causal-language-models-clm-in-nlp-194c15f56a5
https://www.scaler.com/topics/nlp/bert-next-sentence-prediction/ 12
Architecture of GPT
1. GPT – Generative Pretrained Transformer, in 2018.
2. Uses a stack of 12 transformer decoder layers.
3. GPT is also pre-trained first and then fine-tuned for a task.
4. GPT does not have bidirectional context.
5. The pre-training task is Causal Language Modeling (CLM).
6. GPT-2, in addition to pre-training, employs meta-learning, where it is fine-tuned on several data-rich tasks.
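A corresponding sketch of the causal (left-to-right) objective at inference time, using the publicly available GPT-2 checkpoint (the model and prompt are illustrative assumptions):

```python
from transformers import pipeline

# A causal LM only sees tokens to its left, so it continues the text.
generator = pipeline("text-generation", model="gpt2")

print(generator("Large language models are", max_new_tokens=20)[0]["generated_text"])
```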

OpenAI Blog, https://openai.com/research/language-unsupervised 13


Model sizes follow Moore’s Law

HuggingFace, https://huggingface.co/blog/large-language-models 14
Prompting and Prompt-Engineering

• Since GPT-3, model sizes have been infeasible for fine-tuning.
• Hence we use something called prompting (aka priming).
• A prompt is an instruction to perform the job, and the model does it for us.
• Types of prompting (illustrative examples below):
  • Zero-shot prompt
  • One/few-shot prompt
  • Chain-of-thought prompt
  • Artificial prompt or prompt tuning
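Illustrative, made-up prompts for the first three types (the tasks and wording are assumptions, written here as plain Python strings):

```python
# Zero-shot: the instruction alone, no examples.
zero_shot = (
    "Classify the sentiment of this review as positive or negative:\n"
    "'Great battery life, terrible camera.'"
)

# Few-shot: a couple of worked examples before the real input.
few_shot = (
    "Review: 'Loved it.' Sentiment: positive\n"
    "Review: 'Waste of money.' Sentiment: negative\n"
    "Review: 'Great battery life, terrible camera.' Sentiment:"
)

# Chain-of-thought: ask the model to reason step by step before answering.
chain_of_thought = (
    "Q: A shop sells pens at 3 for 2 pounds. How much do 9 pens cost?\n"
    "A: Let's think step by step."
)
```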

15
Types of prompts (zero-shot)

16
Example of zero-shot prompts

17
Liu et al. https://dl.acm.org/doi/full/10.1145/3560815
Types of prompts (one-shot)

18
Types of prompts (Chain-of-thought)

19
Types of prompts (Auto-prompting)

• Here we do not provide any actual words from the vocabulary as the instruction.
• We provide pseudo-words, i.e., embeddings that are tuned to provide the best representation of the instruction.
• The rest is out of the scope of this talk.

20
History of LLMs

21
Public leak of Meta's LLaMA model

• The model was released in February 2023.
• Within a week it was leaked via BitTorrent.
• It was released in four sizes:
  • 7B
  • 13B
  • 33B
  • 65B
• For comparison, ChatGPT has 175B parameters.
22
Current State-of-the-art LLMs

• LLM performance, April 2023.
• Relative performance over 8 tasks compared to GPT-4 (leading commercial LLM).
• GPT-3 has 175B parameters; GPT-4 data not released but expected to be a lot larger.
• LLaMA 65B trained on 2,048 GPUs for 14 days on 1.4T tokens.
• Smaller LLMs perform similar to large LLMs as of 2023.
• Download on HuggingFace.
• No compute barrier to research LLMs.

[Bar chart: performance as a percentage of ChatGPT, with bars at roughly 68%, 76%, 92%, 93%, and 100%.]

Hu 2021 LoRA https://arxiv.org/abs/2106.09685


Alpaca LoRA https://github.com/tloen/alpaca-lora
Vicuna Team 2023 https://lmsys.org/blog/2023-03-30-vicuna/
Leaked internal Google memo 2023 https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
23
Future directions of improvement

• LLMs require huge memory to store parameters.
• They require a huge amount of energy to train and run.
• They require costly infrastructure to train and run.
• Answers:
  • Analog computing (startups such as Mythic AI have already created 1-inch chips that can perform the same number of matrix multiplications using just 3 watts of power, instead of 100 watts).
  • Quantization of models.

MythicAI, https://mythic.ai/ 24
Quantization of Large LMs

• Each parameter is a 32-bit floating-point number.
• We can reduce the precision of these numbers hugely, to as low as 4 bits.
  • GPTQ: https://arxiv.org/abs/2210.17323
• And they still perform quite well.
• Quantization can be used to deploy large language models to even small devices like a smartphone.
  • LLaMA on a Google Pixel 6 phone: https://twitter.com/thiteanish/status/1635678053853536256
  • LLaMA on a Raspberry Pi: https://github.com/ggerganov/llama.cpp/issues/58
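As a hedged sketch of what 4-bit loading can look like, here is the `transformers` + `bitsandbytes` route (one of several quantization paths; the GPTQ route above uses separate tooling). It assumes the `bitsandbytes` and `accelerate` packages and a CUDA GPU; the model id is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # illustrative; any causal LM on HuggingFace

# Ask transformers/bitsandbytes to store the weights in 4-bit precision.
quant_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on whatever GPU/CPU memory is available
)

inputs = tokenizer("Quantization lets large models", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```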

25
LIVE demo of VICUNA 4bit quantized

• The model was taken from HuggingFace:
  TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
• The original 13B model requires 64 GB of GPU memory.
• After 4-bit quantization, the memory requirement is reduced to 8 GB.

WebUI: https://github.com/oobabooga/text-generation-webui
Vicuna Model: https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g 26
LIVE demo of VICUNA 4bit quantized
(Already asked some questions)
• Define large language models.
• Large Language Models (LLMs) are neural network architectures designed for natural language processing tasks such as text generation, question
answering, and machine translation. These models are typically trained on vast amounts of data, often in the billions of words or more, which
allows them to capture complex patterns and relationships within language. LLMs use techniques like deep learning and recurrent neural
networks to analyze and generate text, enabling them to understand context and produce coherent, human-like responses. Some popular
examples of LLMs include GPT-3, BERT, and XLNet.

• Does himalayas have the highest peak in the world? And in which country is the peak located in?
• Yes, Mount Everest, which is located in Nepal, is considered the tallest mountain in the world and is part of the Himalayan range. It stands at a
height of 8,848 meters (29,029 feet).

• What season it is in Australia during January?


• In January, Australia experiences summer.

• What does Indian youth need to follow to make the country the new global economic leader?
• To help India become a new global economic leader, its youth should focus on developing skills in areas such as technology, innovation,
entrepreneurship, and creativity. They should also stay up-to-date with emerging trends and technologies, build strong networking connections
both domestically and internationally, and be open to collaborations and partnerships. Additionally, they should work towards building a strong
understanding of economics, finance, and business principles, and strive to create sustainable solutions for the future. Finally, they must maintain a
positive attitude, be resilient in the face of challenges, and always look for opportunities to grow and contribute positively to society.

27
Conclusions (1)

• LLMs are likely to become ubiquitous in the next year.
• LLM-based applications are being rolled out commercially now.
  • Big IT >> move fast and break things, fix the mess for society later.
• LLM government regulations will likely follow two or three years later.

28
Conclusions (2)

• LLMs will challenge the meaning of authorship.
  • High-quality generated text.
  • Even multimodal models can create audio-visual content indistinguishable from human-made content.
• On the positive side, LLMs can be an enabler for human-AI collaboration.
  • Microsoft Copilot is an example >> AI code + human review.
  • LLM-based search (already being used by Google and Bing).

29
Conclusions (3)

• LLMs will change education.


• We use good writing as a proxy for thinking and depth of knowledge.
• Any LLM has much more knowledge than any single human on the planet.
• State of the art 2023: MIT experiment using GPT-4 + manual prompting on 30 subjects (maths, electronics, computer science, etc.) for exam and coursework assignments.
  • Solve rate from 10% to 73% depending on assessment type.
  • Paper: https://arxiv.org/abs/2306.08997 (withdrawn by MIT citing reputation damage)
  • Withdrawal statement: https://people.csail.mit.edu/asolar/CoursesPaperStatement.pdf

30
Conclusions (4)

• LLMs' ability to generate human-quality content at scale will be a danger.
  • An order of magnitude higher number of scams, fake news, hard-to-detect spam messages, and bots running social media pages.
• Many are not aware of the implications of this fast-moving technology.
• So, it is our duty to inform those near to us, especially the elders, about the implications of this new technology.
31
Thank you for your attention!
Dr. Arpan Mandal

University of Southampton, School of Electronics and Computer Science

Email: A.Mandal@soton.ac.uk | ArpanM096@gmail.com

Website: https://sites.google.com/view/arpan-mandal/

32
