2.1. GPT-3 Model: Leveraging Transformers
The release of the GPT-3 model family by OpenAI set a very high bar for its direct competitors, namely Google and Facebook, and marked a major milestone in the development of natural language processing (NLP) models. The largest GPT-3 configuration comprises 175 billion parameters distributed over 96 attention layers and was trained with a batch size of 3.2 million tokens. In total, GPT-3 was trained on 300 billion tokens (usually sub-words) [9].
The training process of GPT-3 builds on the strategies that proved successful in its predecessor, GPT-2, including modified initialization, pre-normalization, and reversible tokenization. In addition, GPT-3 introduces a refinement based on alternating dense and sparse attention patterns [9].
GPT-3 is designed as an autoregressive framework that can address task-agnostic goals using a few-shot learning paradigm [9]. The model can adapt to various tasks with only a handful of task-specific examples, making it a versatile and powerful tool for NLP applications.
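To make the few-shot paradigm concrete, the sketch below builds a prompt containing a handful of in-context examples followed by a new query. The task, the example sentences, and the send_to_model call are hypothetical placeholders, since the actual API depends on the deployment; the point is that no weights are updated and the examples are consumed purely in-context:

# Minimal sketch of a few-shot prompt for sentiment classification.
# The examples and the model call are illustrative placeholders only.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regret buying this product.", "negative"),
    ("The service was acceptable but slow.", "neutral"),
]
query = "The documentation is clear and well organized."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"  # the model completes this line

# send_to_model(prompt) would return the continuation, e.g. " positive".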
To cater to different scenarios and computational resources, OpenAI has produced GPT-3 in various configurations. Table II-A5 summarizes these configurations, which range from a relatively small 125 million parameter model to the largest 175 billion parameter model. This allows users to choose a model that best fits their needs and resources.
All GPT (Generative Pre-trained Transformer) models, including the most recent GPT-3 model, are built on the core technology of Transformers. The Transformer architecture was first introduced in the seminal paper "Attention is All You Need" by Vaswani et al. in 2017 [10] and has profoundly influenced deep learning research, first in sequence modeling and later in computer vision. In the next sub-section, we provide a detailed overview of the Transformer technology and how it is leveraged in GPT-3 for text generation.
2.1.1. Transformers as Core Technology
Transformers are the revolutionary core technology behind ChatGPT. They have transformed how sequence-to-sequence models are processed, significantly outperforming traditional models based on recurrent neural networks. Although Transformers build on the classical encoder-decoder architecture, they differ dramatically by integrating self-attention modules, which excel at capturing long-term dependencies between the elements (i.e., tokens) of the input sequence. The self-attention mechanism computes a weight for each element based on its relevance to the other tokens in the sequence, allowing the model to efficiently determine each element's importance within the input. This enables Transformers to handle variable-length sequences and to capture complex relationships between sequence elements, improving performance on a wide range of natural language processing tasks. Another critical feature is positional embedding, which lets Transformers learn the positional information of tokens within the sequence. It allows the model to differentiate between tokens with identical content but different positions, providing a richer context representation that improves accuracy. These features are a significant strength of ChatGPT for accurate natural language generation compared to its peers, particularly given that it was trained on roughly 570 GB of Internet data.
In general, a transformer comprises three featured modules: (i) the Encoder-Decoder module, (ii) the Self-Attention module, and (iii) the Positional Embedding module.
In the following sub-sections, we present the core functionalities of these modules.
2.1.2. Encoder-Decoder Architecture
When the Transformer architecture was first designed in [10], as shown in Figure 4, it was applied to machine translation, where an input sequence in a source language is transformed into an output sequence in the target language. The Transformer follows an encoder-decoder model: the encoder maps a discrete representation of the input sequence (i.e., words, characters, or sub-words) to a continuous representation called an embedding vector (i.e., a vector of continuous values), and the decoder takes these embeddings as input and generates the output sequence one element at a time. As the transformer is an autoregressive generative model, it predicts the probability distribution of the next element in the output sequence given the previously generated sequence, which is reminiscent of Hidden Markov Models (HMMs). However, unlike transformers, HMMs cannot capture long-term dependencies across the entire sequence.
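As a brief illustration, the autoregressive objective can be written as the chained factorization of the sequence probability (standard notation, not quoted from [10]):
\[
p(y_1, \dots, y_T \mid x) \;=\; \prod_{t=1}^{T} p\!\left(y_t \mid y_{<t}, x\right),
\]
where $x$ is the encoded input sequence and $y_{<t}$ denotes the output tokens generated before step $t$; at inference time the decoder selects or samples $y_t$ from this conditional distribution and feeds it back as input for step $t+1$.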
Unlike models that use both an encoder and a decoder, or an encoder only, such as the BERT model family from Google [11], ChatGPT relies on a pure decoder architecture, illustrated in Figure 5, as defined in the first GPT paper [12]. Specifically, the underlying GPT model applies unidirectional attention, using a causal masking strategy to process the input sequence token by token. The decoder is trained to take the first token of the input sequence as a start token and then generate subsequent output tokens based on the input sequence and the previously generated tokens. This architecture is the standard model for language modeling, whose objective is to generate the sequence of tokens with the highest likelihood given a context (i.e., the input sequence and the previously generated tokens). ChatGPT's architecture does not rely on an encoder because the GPT models are trained on a large corpus of textual data, using unsupervised learning, to predict the next sequence of words given a context. These models are therefore trained to generate text rather than to map an input to an output, as in a typical encoder-decoder architecture. As such, the token embeddings are fed directly into the self-attention modules, which learn the complex relationships between tokens through a cascade of self-attention layers. The self-attention module is what makes transformers such a powerful tool, and it is explained further in the next section.
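The following sketch illustrates this decoder-only generation loop. The model object, the greedy decoding choice, and the end-of-sequence convention are assumptions for illustration, not the actual ChatGPT implementation:

# Minimal sketch of greedy autoregressive decoding with a decoder-only model.
# `model` is a hypothetical callable mapping a list of token ids to a
# probability distribution over the vocabulary for the NEXT token.
def generate(model, prompt_ids, max_new_tokens=50, eos_id=0):
    tokens = list(prompt_ids)                  # context = prompt so far
    for _ in range(max_new_tokens):
        next_probs = model(tokens)             # p(next token | all previous tokens)
        next_id = max(range(len(next_probs)), key=next_probs.__getitem__)
        tokens.append(next_id)                 # feed the prediction back as input
        if next_id == eos_id:                  # stop at the end-of-sequence token
            break
    return tokens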
2.1.3. Self-Attention Module
Self-attention is the core module that empowers transformers to achieve their remarkable performance. It captures complex dependencies between the tokens of an input sequence and efficiently determines the weight, and hence the relative importance, of each token. While the self-attention concept may look complex, it relies on the notion of semantic similarity between vectors (in our case, token embeddings) computed with the dot product. The dot product of two vectors depends on their magnitudes and the cosine of the angle between them, so it acts as a similarity measure. The higher the dot product between two embeddings, the more semantically similar they are, indicating their importance in the overall context of the input sequence.
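As a small illustration of this idea, the snippet below compares toy two-dimensional embeddings with a plain dot product; the vectors are made-up values, not real token embeddings:

import numpy as np

# Toy embeddings: "cat" and "dog" point in similar directions, "car" does not.
cat = np.array([0.9, 0.4])
dog = np.array([0.8, 0.5])
car = np.array([-0.7, 0.6])

print(np.dot(cat, dog))   # large positive value -> semantically close
print(np.dot(cat, car))   # small or negative value -> semantically distant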
Self-attention in transformers also relies on the concepts of Query (Q), Key (K), and Value (V). These concepts are not new; they are borrowed from the Information Retrieval literature and, more specifically, from query processing and retrieval, as in search engines. In Information Retrieval, a query is a set of tokens used to search for relevant documents in a collection of stored documents. A document's relevance score is calculated from the similarity between the query and the document: the query tokens are compared to the document's tokens (i.e., keys) and their corresponding weights (i.e., values). The dot product measures the similarity between these vectors and yields the relevance scores. This is exactly what happens in the self-attention modules of transformers, illustrated in Figure 6.
In transformers, the input sequence is converted into three sets of vectors, namely the query, the key, and the value vectors. Consider a sentence composed of a sequence of tokens (i.e., words, sub-words, or characters).
The Query (Q): This vector represents a single token (e.g., a word embedding) of the input sequence. This token is used as a query to measure its similarity to all other tokens in the input sequence, which play the role of the documents in the information-retrieval analogy.
The Key (K): This vector represents the other tokens of the input sequence against which the query is compared. The key vectors are used to measure the similarity to the query vector.
The Value (V): This vector represents all tokens in the input sequence. The value vectors are used to compute the weighted sum of the elements in the sequence, where the weights are the attention weights computed from the query and key vectors through a dot product.
In summary, the dot product between the query vectors and the key vectors yields the similarity scores. These scores are normalized into attention weights that determine how much each value vector contributes to the final output.
Formally, the self-attention module is expressed as follows:
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V = A\,V,
\]
where $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{M \times d_k}$, and $V \in \mathbb{R}^{M \times d_v}$ are the packed matrix representations of queries, keys, and values, respectively. $N$ and $M$ denote the lengths of queries and keys (or values), while $d_k$ and $d_v$ denote the dimensions of keys (or queries) and values, respectively. The dot products of queries and keys are divided by $\sqrt{d_k}$ to alleviate the softmax function's gradient vanishing problem, control the magnitude of the dot products, and improve generalization; this is known as the scaled dot product. The result of the attention mechanism is given by the matrix multiplication of $A$ and $V$, where $A = \mathrm{softmax}(QK^{\top}/\sqrt{d_k})$ is often called the attention matrix and the softmax is applied row-wise.
In some cases, a mask can be applied to the relevance-score matrix (the query-to-key dot product) to enforce specific dependency patterns between tokens. For example, in text generation, as in ChatGPT, the self-attention module uses a causal mask that keeps only the lower-triangular part of the score matrix, with the elements above the diagonal masked out. This ensures that each token attends only to the previous tokens and not to future ones, which is the pattern needed for text generation.
The algorithm of a self-attention module (with mask) is presented in Algorithm 1.
Algorithm 1: Self-Attention Module with Mask
Require: Q, K, and V matrices of dimensions N × d_k, M × d_k, and M × d_v, respectively
Ensure: Z matrix of dimension N × d_v
Step 1: Compute the scaled dot product of the Q and K matrices and apply the row-wise softmax: A = softmax(QK^T / √d_k)
Step 2: Apply the mask to the computed attention scores (if applicable):
if mask is not None then A = A ⊙ mask {Element-wise multiplication}
end if
Step 3: Compute the weighted sum of the V matrix using the A matrix as weights: Z = A V
Step 4: Return the final output matrix Z.
Note that the algorithm takes as input the matrices Q, K, and V (the query, key, and value matrices, respectively) and returns the output matrix Z. The softmax function is applied row-wise to the scaled dot product of the Q and K matrices, which is divided by the square root of the key dimension $d_k$. The mask is applied in Step 2, before the weighted sum is computed in Step 3: the element-wise multiplication of the mask and the attention scores sets the scores corresponding to masked tokens to zero, ensuring that the model does not attend to those positions. The resulting matrix A is then used as weights to compute the weighted sum of the V matrix, producing the output matrix Z.
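For readers who prefer code, the following NumPy sketch mirrors Algorithm 1; the function and variable names are our own. Instead of zeroing out scores after the softmax as in Algorithm 1, the sketch masks future positions with -inf before the softmax, which produces zero attention weights for those positions while keeping each row properly normalized:

import numpy as np

def masked_self_attention(Q, K, V, causal=True):
    """Scaled dot-product self-attention with an optional causal mask.

    Q: (N, d_k), K: (M, d_k), V: (M, d_v) -> returns Z of shape (N, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # Step 1: scaled dot product
    if causal:
        # Keep only positions j <= i: future tokens are masked out so their
        # attention weights become zero after the softmax.
        mask = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax to obtain the attention matrix A (Step 2).
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                                   # Step 3: weighted sum of V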
2.1.4. Multi-Head Attention
NLP applications, and ChatGPT in particular, deal with vast and complex data, so a single attention head may not be sufficient to capture all relevant information in a sequence. Multi-head attention applies the self-attention operation across multiple heads in parallel, which can result in faster training and inference than a single-head self-attention mechanism.
Multi-head attention also captures multiple types of relationships between the query and the key-value pairs in the input sequence, enabling the model to learn complex patterns and dependencies in the data and increasing its capacity to learn advanced relationships over large datasets.
Formally, the multi-head attention function can be represented as follows:
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O},
\]
where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, and $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$.
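A compact NumPy sketch of this computation is given below; it reuses the masked_self_attention function defined earlier and initializes the projection matrices randomly, purely for illustration, whereas in practice these matrices are learned:

import numpy as np

def multi_head_attention(X, num_heads, d_k, d_v):
    """Toy multi-head self-attention over token embeddings X of shape (N, d_model)."""
    rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
    d_model = X.shape[1]
    heads = []
    for _ in range(num_heads):
        # Per-head projection matrices (random here; learned parameters in practice).
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_v))
        heads.append(masked_self_attention(X @ W_q, X @ W_k, X @ W_v))
    W_o = rng.normal(size=(num_heads * d_v, d_model))   # output projection
    return np.concatenate(heads, axis=-1) @ W_o          # shape (N, d_model)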
2.1.5. Positional Embedding
The self-attention mechanism used in transformers is permutation invariant: if only the token contents are considered, without their order, all tokens are processed equally and their positional information is ignored. This may result in the loss of important semantic information, since the importance of each token with respect to the others also depends on where it occurs in the sequence. It is therefore necessary to leverage position information to capture the order and importance of tokens in the sequence.
To address the issue of losing important position information, the transformer model creates an encoding for each position in the sequence and adds it to the token before passing it through the self-attention and feedforward layers. This allows the model to capture the importance of each token concerning others, considering its position in the sequence.
In the transformer architecture of ChatGPT, the positional embedding is added to the input embeddings at the entrance of the decoder. The positional embedding is expressed as follows.
Given a sequence of inputs $X \in \mathbb{R}^{n \times d}$, the position embedding matrix $P \in \mathbb{R}^{n \times d}$ is calculated as:
\[
P(i, 2k) = \sin\!\left(\frac{i}{10000^{2k/d}}\right), \qquad
P(i, 2k+1) = \cos\!\left(\frac{i}{10000^{2k/d}}\right).
\]
These equations give the value of the position embedding at position $i$ and embedding dimension $2k$ (or $2k+1$), given the embedding size $d$. The parity of the dimension index determines whether the sine or the cosine function is used, and $d$ is typically an even number in transformer models.
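The sketch below computes this sinusoidal position embedding matrix with NumPy and adds it to a batch of token embeddings; the concrete shapes (5 tokens, embedding size 16) are chosen only for illustration:

import numpy as np

def positional_embedding(n, d):
    """Sinusoidal position embedding matrix P of shape (n, d), with d even."""
    positions = np.arange(n)[:, None]            # i = 0 .. n-1
    dims = np.arange(0, d, 2)[None, :]           # 2k = 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d)
    P = np.zeros((n, d))
    P[:, 0::2] = np.sin(angles)                  # even dimensions -> sine
    P[:, 1::2] = np.cos(angles)                  # odd dimensions  -> cosine
    return P

# Usage: add the position information to token embeddings X of shape (n, d).
X = np.random.default_rng(0).normal(size=(5, 16))
X_with_positions = X + positional_embedding(5, 16)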
Table II-A5. GPT-3 model configurations [9].

Model Name   | No. of Params | No. of Layers | Embedding Size | No. of Heads | Head Size | Batch Size (tokens) | Learning Rate
GPT3-Small   | 125M          | 12            | 768            | 12           | 64        | 0.5M                | 6.0 × 10⁻⁴
GPT3-Medium  | 350M          | 24            | 1024           | 16           | 64        | 0.5M                | 3.0 × 10⁻⁴
GPT3-Large   | 760M          | 24            | 1536           | 16           | 96        | 0.5M                | 2.5 × 10⁻⁴
GPT3-XL      | 1.3B          | 24            | 2048           | 24           | 128       | 1M                  | 2.0 × 10⁻⁴
GPT3-2.7B    | 2.7B          | 32            | 2560           | 32           | 80        | 1M                  | 1.6 × 10⁻⁴
GPT3-6.7B    | 6.7B          | 32            | 4096           | 32           | 128       | 2M                  | 1.2 × 10⁻⁴
GPT3-13B     | 13B           | 40            | 5140           | 40           | 128       | 2M                  | 1.0 × 10⁻⁴
GPT3-175B    | 175B          | 96            | 12288          | 96           | 128       | 3.2M                | 0.6 × 10⁻⁴
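For readers who want to work with these configurations programmatically, the short sketch below encodes two of the table rows as simple Python objects; the field names are ours, and the values are taken directly from the table:

from dataclasses import dataclass

@dataclass
class GPT3Config:
    name: str
    n_params: str        # reported as text (e.g., "175B") to match the table
    n_layers: int
    d_model: int         # embedding size
    n_heads: int
    d_head: int
    batch_size: str      # in tokens
    learning_rate: float

GPT3_SMALL = GPT3Config("GPT3-Small", "125M", 12, 768, 12, 64, "0.5M", 6.0e-4)
GPT3_175B  = GPT3Config("GPT3-175B", "175B", 96, 12288, 96, 128, "3.2M", 0.6e-4)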