How Does ChatGPT Work: From Pretraining to RLHF

Vikram M — Published On May 17, 2023 and Last Modified On May 18th, 2023

Welcome to the future of AI: Generative AI! Have you ever wondered how machines learn to understand human language and respond accordingly? Let’s take a look at ChatGPT, the language model developed by OpenAI. Built on GPT-3.5, ChatGPT has taken the world by storm, changing how we communicate with machines and opening up new possibilities for human-machine interaction. The race has officially begun with the recent launch of ChatGPT’s rival, Google Bard, powered by PaLM 2. In this article, we will dive into the inner workings of ChatGPT: the steps involved in training it, such as pretraining and RLHF, and how it can comprehend and generate human-like text.

“Generative AI opens up new creative possibilities that we never thought were possible before.”
Douglas Eck, Research Scientist at Google Brain

Get ready to explore the technology behind ChatGPT and discover the potential of this powerful language model.

The key objectives of the article are:

Discuss the steps involved in the model training of ChatGPT.
Find out the advantages of using Reinforcement Learning from Human Feedback (RLHF).
Understand how humans are involved in making models like ChatGPT better.

Overview of ChatGPT Training

ChatGPT is a Large Language Model (LLM) optimized for dialogue. It is built on top of GPT-3.5 using Reinforcement Learning from Human Feedback (RLHF), and the underlying model is trained on massive volumes of internet data.

There are three main steps involved in building ChatGPT:

Pretraining LLM
Supervised Finetuning of LLM (SFT)
Reinforcement Learning from Human Feedback (RLHF)

The first step is to pretrain the LLM (GPT-3.5) on unlabeled text to predict the next word in a sentence. This lets the LLM learn representations and the various nuances of the text.

In the next step, we finetune the LLM on the demonstration data: a dataset with the questions and answers. This optimizes the LLM for dialogue.

In the final step, we use RLHF to steer the responses generated by the LLM, so that the model prioritizes the responses that humans prefer.

Now, we will discuss each step in detail.

Pretraining LLM

Language models are statistical models that predict the next word in a sequence. Large language models are deep learning models trained on billions of words. The training data is scraped from multiple websites like Reddit, StackOverflow, Wikipedia, Books, ArXiv, Github, etc.
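To make "predicting the next word" concrete, here is a minimal sketch of the simplest possible statistical language model: a bigram model that counts which word follows which. This is only an illustration of the idea, not how GPT models are built; the toy corpus and the function names `train_bigram` and `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Toy corpus, invented purely for illustration.
corpus = [
    "roses are red and violets are blue",
    "roses are red and lilies are white",
    "violets are blue and roses are red",
]

counts = train_bigram(corpus)
probs = next_word_probs(counts, "are")
best = max(probs, key=probs.get)  # most likely word after "are"
print(best, probs)
```

An LLM replaces the count table with a deep neural network and looks at far more than one previous word, but the training signal is the same: given the text so far, assign a high probability to the word that actually comes next.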

Dataset and parameters in different LLMs. ChatGPT uses GPT-3

From the above image, we can get an idea of the size of the dataset and the number of parameters. Pretraining an LLM is computationally expensive, as it requires massive hardware and a vast dataset. At the end of pretraining, we obtain an LLM that can predict the next word in a sentence when prompted. For example, if we prompt it with “Roses are red and”, it might respond with “violets are blue.” The below image depicts what GPT-3 can do at the end of pretraining:

Pretraining GPT-3 model: what GPT-3 can do at the end of pretraining.

We can see that the model is trying to complete the sentence rather than answering it. But we need to know the answer rather than the next sentence. What could be the next step to achieve it? Let us see this in the next section.


Supervised Finetuning of LLM

So, how do we make the LLM answer the question rather than predict the next word? Supervised Finetuning of the model would help us solve this problem. We can tell the model the desired response for a given prompt and fine-tune it. For this, we can create a dataset of multiple types of questions to ask a conversational model. Human labelers can provide the appropriate responses to make the model understand the expected output. This dataset consisting of pairs of prompts and responses is called Demonstration Data. Now, let us see a sample dataset of prompts and their responses in the demonstration data.
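As a rough sketch of how demonstration data might be turned into training examples, the code below joins each prompt with its response and builds labels so that only the response tokens contribute to the loss. Everything here is an assumption for illustration: the sample pairs are invented, the whitespace "tokenizer" is a stand-in for a real one, and the `IGNORE = -100` marker follows a common convention rather than anything stated about ChatGPT.

```python
IGNORE = -100  # label value the loss skips (a common convention, assumed here)

def build_example(prompt, response, vocab):
    """Turn one (prompt, response) pair into token ids and training labels."""
    # Whitespace splitting stands in for a real tokenizer.
    prompt_ids = [vocab.setdefault(w, len(vocab)) for w in prompt.split()]
    response_ids = [vocab.setdefault(w, len(vocab)) for w in response.split()]
    input_ids = prompt_ids + response_ids
    # Mask the prompt: the model is trained only to produce the response.
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids, labels

# Hypothetical demonstration data written by human labelers.
demonstrations = [
    {"prompt": "What is the capital of France ?",
     "response": "The capital of France is Paris ."},
    {"prompt": "Give me a synonym for happy .",
     "response": "A synonym for happy is joyful ."},
]

vocab = {}
dataset = [build_example(d["prompt"], d["response"], vocab) for d in demonstrations]
for input_ids, labels in dataset:
    print(input_ids, labels)
```

The model is still trained with next-word prediction, but now on text where the "next words" are the answer a human wanted, which is what shifts it from completing sentences to answering questions.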

Reinforcement Learning from Human Feedback (RLHF)

Now, we are going to learn about RLHF. Before understanding RLHF, let us first see the benefits of using RLHF.

Why RLHF?

After supervised finetuning, our model should give us the appropriate responses for the given prompts, right? Unfortunately, no! Our model might still not properly answer every question that we ask it. It might still be unable to evaluate which response is good and which is not. It may also have overfit the demonstration data. Let us see what could happen if it overfits the data. While writing this article, I asked Bard this:

what RLHF is important in making model like GPT

I did not give it any link, article, or sentence to summarize. But it just summarized something and gave it to me, which was unexpected.

One more problem that might arise is toxicity. Though the answer might be correct, it might not be right ethically and morally. For example, look at the image below, which you might have seen before. When asked for websites to download movies from, it first responds that it is not ethical, but in the next prompt, we can easily manipulate it as shown.

Ok, now go ahead to ChatGPT and try the same example. Did it give you the same result?

Why are we not getting the same answer? Did they retrain the entire network? Probably not! There might have been a small fine-tuning with RLHF. You can refer to this beautiful gist for more reasons.

Reward Model

The first step in RLHF is to train a reward model. The model should be able to take a prompt and its response as input and output a scalar value that indicates how good the response is. For the machine to learn what a good response is, can we ask the annotators to annotate the responses with rewards? If we do this, different annotators might score the same response differently, and these biases would make it hard for the model to learn how to reward the responses. Instead, the annotators can rank the responses from the model, which reduces the bias in the annotations to a great extent. The below image shows a chosen response and a rejected response for a given prompt from Anthropic’s hh-rlhf dataset.

From this data, the model tries to distinguish between a good and bad response.
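One common way to learn from such chosen/rejected pairs is a pairwise ranking loss: the reward model is penalized whenever it scores the rejected response close to or above the chosen one. The article does not say which loss is used, so treat the sketch below as an assumption; the function name and the example scores are hypothetical.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)) stably."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical scalar scores from a reward model for one prompt.
good_ordering = pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=0.0)
bad_ordering = pairwise_ranking_loss(reward_chosen=0.0, reward_rejected=2.0)
print(good_ordering, bad_ordering)  # the correct ordering gives the smaller loss
```

Only the difference between the two scores matters, which is why rankings are enough: annotators never need to agree on an absolute score for any single response.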

Finetuning LLM with Reward Model Using RL

Now, we finetune the LLM with Proximal Policy Optimization (PPO). In this approach, the same prompt is passed to both the initial language model and the current iteration of the fine-tuned model, and the reward model scores the generated response. We compare the current language model with the initial language model so that it does not deviate too much from the initial model while chasing a higher reward, which keeps the output neat, clean, and readable. KL divergence is used to measure the difference between the two models, and it enters the fine-tuning objective as a penalty.
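The KL comparison can be sketched as a penalty subtracted from the reward model's score before PPO uses it. This is a simplified illustration, not the exact objective used for ChatGPT: the function name, the coefficient `beta`, and the log-probability values are all assumptions.

```python
def kl_penalized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    """Subtract a KL-style penalty from the reward model's score."""
    # Sum over generated tokens of log pi(token) - log pi_ref(token):
    # a sample-based estimate of the KL divergence between the two models.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward - beta * kl_estimate

# Hypothetical per-token log-probabilities for one generated response.
policy_logprobs = [-1.0, -0.5]     # current fine-tuned model
reference_logprobs = [-1.5, -1.0]  # initial (frozen) language model

shaped = kl_penalized_reward(2.0, policy_logprobs, reference_logprobs, beta=0.1)
print(shaped)
```

A larger `beta` keeps the fine-tuned model closer to the initial one; a smaller `beta` lets it chase the reward model's score more aggressively.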

Model Evaluation

The models were evaluated at the end of each step, at several model sizes (numbers of parameters). You can see the methods and their respective scores in the images below:

The above figure compares the performance of the LLMs at the different training stages across model sizes. As you can see, the results improve significantly after each training phase.

We can also replace the human in RLHF with artificial intelligence, which gives Reinforcement Learning from AI Feedback (RLAIF). This significantly reduces the cost of labeling and has the potential to perform better than RLHF. Let’s discuss that in the next article.

Conclusion

In this article, we saw how conversational LLMs like ChatGPT are trained. We saw the three phases of training ChatGPT and how reinforcement learning from human feedback has helped the model improve its performance. We also understood the importance of each step, without which the LLM would be inaccurate.

I hope you enjoyed reading it. Feel free to leave a comment below if you have any questions or feedback. Happy learning!

Frequently Asked Questions

Q1. How does ChatGPT get its data?

A. ChatGPT gets its data from a wide range of sources, including books, articles, websites, and other publicly available text on the internet. It uses this data to learn patterns, grammar, and facts.

Q2. How to earn money using ChatGPT?

A. ChatGPT itself does not provide a direct way to earn money. However, individuals or organizations can utilize the capabilities of ChatGPT to develop applications or services that can generate revenue, such as virtual assistants, customer support bots, or content generation tools.

Q3. How does ChatGPT neural network work?

A. ChatGPT’s neural network uses the transformer architecture. It processes input text by dividing it into smaller segments called tokens and applies attention mechanisms to capture contextual relationships between words, allowing it to generate coherent and contextually relevant responses.
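As a minimal sketch of what an attention mechanism computes, the code below implements scaled dot-product attention for a single query vector in plain Python. The vectors are toy values chosen for illustration; a real transformer learns them and runs many such computations in parallel.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a list of tokens."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors: the query matches the first key more closely than the second.
weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

The token whose key best matches the query receives the largest weight, so its value dominates the output. This is how each word's representation gets mixed with the words most relevant to it.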

Q4. What algorithm does ChatGPT use?

A. ChatGPT uses a variant of the transformer architecture called “GPT” (Generative Pre-trained Transformer). During pretraining it employs self-supervised learning, where the model predicts the next word in a sentence based on the previous words. This enables the model to generate human-like text when given a prompt.
