LLM Overview
This blog post provides an overview of LLMs (large language models), a type of artificial intelligence (AI) model trained on massive amounts of text data. LLMs are currently being discussed extensively, as they have the potential to revolutionize the way we interact with computers. Most of us have heard of or played around with ChatGPT, and I am personally a user of the service.
This blog post is for educational purposes only and assumes a basic understanding of machine learning. The information provided is not intended to be a substitute for professional advice. I hope it helps demystify LLMs.
How LLMs work
Most LLMs today are based on the transformer, a neural network architecture that is particularly well-suited for natural language processing. The original transformer consists of two main components: an encoder, which converts a sequence of text or code into a vector representation, and a decoder, which takes this vector representation and generates a new sequence of text or code. Both components are built from transformer blocks; earlier sequence models used recurrent neural networks (RNNs) for this, processing the text strictly one token after another.
Many modern LLMs, including the GPT family, keep only the decoder part: the model reads the preceding tokens and generates the continuation one token at a time, which is what allows it to produce grammatically coherent text. Encoder-decoder models are still used for tasks such as translation and summarization, where the encoder reads the whole input first and the decoder generates the output from it.
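To make the distinction concrete, here is a minimal sketch using the Hugging Face transformers library (my own addition, not something the post depends on). It contrasts a decoder-only model (GPT-2) with an encoder-decoder model (T5); the model names and generation settings are purely illustrative.

```python
# Minimal sketch: decoder-only vs. encoder-decoder generation with Hugging Face transformers.
# Model choices (gpt2, t5-small) are illustrative assumptions, not from the original post.
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Decoder-only: the model simply continues the prompt, token by token.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = gpt2_tok("The cat sits on a", return_tensors="pt")
continuation = gpt2.generate(**prompt, max_new_tokens=5)
print(gpt2_tok.decode(continuation[0], skip_special_tokens=True))

# Encoder-decoder: the encoder reads the full input, the decoder generates the output sequence.
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
inputs = t5_tok("translate English to German: The cat sits on a mat.", return_tensors="pt")
translation = t5.generate(**inputs, max_new_tokens=20)
print(t5_tok.decode(translation[0], skip_special_tokens=True))
```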
How LLMs are trained
There are two common training approaches, which we will discuss at a very high level:
Autoregressive Language Modeling
The model is trained to predict the next word in a sequence. During training, the model processes the text one token at a time and predicts the next token based on the context provided by the preceding tokens. As an example, the input may be “The cat sits on a” and the model is supposed to predict “mat”, which results in “The cat sits on a mat”.
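Here is a minimal sketch of what that looks like in practice, using GPT-2 via the Hugging Face transformers library (the model choice and the top-5 cut-off are illustrative assumptions on my side):

```python
# Minimal sketch: ask a causal (autoregressive) LM for the most likely next tokens.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The model scores every vocabulary entry as a candidate for the next token,
# based only on the preceding tokens.
inputs = tokenizer("The cat sits on a", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (batch, sequence_length, vocab_size)

next_token_logits = logits[0, -1]          # scores for the token that follows "a"
top5 = torch.topk(next_token_logits, k=5)  # the five highest-scoring continuations
print([tokenizer.decode(token_id) for token_id in top5.indices.tolist()])
```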
Masked Language Modeling
Some words are randomly masked, and the model is trained to predict the masked words based on the surrounding context. As an example, the input may be “The cat [xxxx] on a mat” and the model is supposed to predict “sits”, which ultimately results in “The cat sits on a mat”.
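Again a minimal sketch, this time with BERT and the fill-mask pipeline from Hugging Face transformers (the model choice is an illustrative assumption; BERT marks the gap with the literal string “[MASK]”):

```python
# Minimal sketch: let a masked language model fill in the blank.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Each prediction is a candidate word for the [MASK] position, with a probability score.
for prediction in fill_mask("The cat [MASK] on a mat."):
    print(prediction["token_str"], round(prediction["score"], 3))
```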
And then?
For conversational-style models, training is usually done in two phases:
- First, the model is pre-trained on a massive amount of data. In this phase it basically learns the knowledge and context needed to predict the next word (or the masked word).
- After this initial training, the model is fine-tuned on specific tasks, such as chat conversations, question answering or translation. This is the phase in which the model learns to respond in a conversational (chat) style; a sketch of what such fine-tuning data can look like follows after this list.
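As a rough idea of the fine-tuning data, here is a hypothetical record in an Alpaca/Dolly-like instruction format. The field names and the prompt template are illustrative assumptions; real datasets use their own schemas.

```python
# Hypothetical instruction-tuning record; field names and template are illustrative only.
chat_example = {
    "instruction": "Explain in one sentence what a large language model is.",
    "response": "A large language model is a neural network trained on huge amounts of "
                "text to predict the next token, which lets it generate natural language.",
}

# During fine-tuning, many such records are rendered into prompt/response text and the
# model is trained on them with the same next-token prediction objective as before.
prompt = f"### Instruction:\n{chat_example['instruction']}\n\n### Response:\n"
print(prompt + chat_example["response"])
```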
Tokenization
Tokenization is the process that turns text into something a neural network can actually work with. If you have ever worked with neural networks, you may know they take numbers as input. So how does this work with text?
Simply put, every word (or word piece) gets assigned a token ID; the exact IDs depend on the tokenizer’s vocabulary. The input “The quick brown fox jumps over the lazy dog” is passed to the model as a vector of numbers such as: [101, 1103, 4248, 2829, 4419, 3695, 6500, 2058, 1996, 11616, 3899, 102] (the sketch after the list below shows how to produce such IDs yourself)
[101] = [CLS] (Special token indicating the start of a sequence)
[1103] = “The”
[4248] = “quick”
[2829] = “brown”
[4419] = “fox”
[3695] = “jumps”
[6500] = “over”
[2058] = “the”
[1996] = “lazy”
[11616] = “dog”
[3899] = “.”
[102] = [SEP] (Special token indicating the end of a sequence)
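If you want to reproduce such IDs yourself, here is a minimal sketch using a BERT tokenizer from Hugging Face transformers (the tokenizer choice is an assumption; the IDs you get depend on its vocabulary and may differ from the list above):

```python
# Minimal sketch: turn text into token IDs and back into tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("The quick brown fox jumps over the lazy dog.")
print(encoded["input_ids"])                                   # token IDs, including [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the token each ID stands for
```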
Overview of instruction-following LLMs
The list is not exhaustive.
Model | Where to try it | Description
GPT-3.5 and later | ChatGPT (openai.com) | Closed-source LLM by OpenAI. Trained on a massive dataset consisting of books, web pages, etc.; the text corpus is approximately 570 GB. The model has around 175 billion parameters, which is massive. It is used in various products such as ChatGPT and GitHub Copilot (yes, it can produce code). More information can be found on ChatGPT (openai.com)
Bard | Bard (google.com) | Bard is Google’s experimental conversational AI chat service (closed source). The model has around 137 billion parameters and was trained on a dataset consisting of text and code (including books, articles, etc.). More information can be found on Bard (google.com)
Claude 2 | Anthropic \ Introducing Claude (access limited to the UK and US as of 8 August 2023) | Claude 2 seems to be very comparable to ChatGPT, with the major advantage that it can process a much larger context (100,000 tokens per prompt, compared to 8,192 for GPT-4). It would have been great to test the service, but the regional limitations make that tricky at the time of writing. Learn more on Anthropic \ Introducing Claude
Llama 2 (chat) | Chat with Open Large Language Models (lmsys.org), or use Web LLM if you want to run everything inside the browser: WebLLM | Home (mlc.ai) | Llama 2 is Meta AI’s LLM. It comes in several sizes (7, 13 and 70 billion parameters) and was released openly, free to use also for commercial purposes. Please note there is quite a discussion going on about how open the model actually is, as the license implies some restrictions: Llama and ChatGPT Are Not Open-Source – IEEE Spectrum. The dataset includes web pages, source code and books and contains 2 trillion tokens. Learn more on Llama 2 – Meta AI
Dolly V2 (based on the Pythia-12B foundation model) | No online playground found. | Databricks’ conversationally fine-tuned Pythia model. The company curated a dataset of 15k conversations to fine-tune the model; the creation of the dataset was set up as a contest at Databricks. Learn more on Free Dolly: Introducing the World’s First Open and Commercially Viable Instruction-Tuned LLM – The Databricks Blog
Falcon LLM | The 7B model can be tried directly on Hugging Face, although very limited: tiiuae/falcon-7b-instruct · Hugging Face. Another demo for the 40B model can be found here: The Falcon has landed in the Hugging Face ecosystem | Falcon is an LLM open-sourced by TII (Technology Innovation Institute) and comes in 7 and 40 billion parameter variants. It is free to use, also for commercial purposes. It was trained on TII’s own dataset named RefinedWeb, which requires around 3 TB of storage and consists of around 1 trillion tokens. Learn more on tiiuae (Technology Innovation Institute) (huggingface.co)
Vicuna | Try it on Chat with Open Large Language Models (lmsys.org) | A fine-tuned, instruction-following Llama model. The code is open source; use of the model is restricted to non-commercial purposes. Learn more on Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality | LMSYS Org
Alpaca | Try it on Chat with Open Large Language Models (lmsys.org) | A fine-tuned, instruction-following Llama model from Stanford. Learn more on Stanford CRFM
Conclusion
Understanding LLMs isn’t rocket science, and I hope this blog post has helped demystify them to some extent. The possibilities they offer are fascinating, and it’s encouraging to see more and more LLMs being released openly. With LLMs becoming more accessible, I look forward to the exciting developments and innovations that will shape the future of AI and human-computer interaction.