How do large language models work? What makes them so good at capturing complex patterns in language?
The very, very mysterious large language models (LLMs) are like a stack of bricks, and each brick is a decoder block.

Transformers
Put simply, transformers can be understood as a combination of encoders and decoders. Let's take the example of translating text from English to Spanish. The encoder's job is to take the English input and convert it into a format the decoder can work with, called embeddings.
Imagine you're translating a book from English to Spanish. Here's how a transformer does it:
Reading and understanding: The encoder first turns the English words into a special code the computer understands, the embeddings. It's like a secret language that captures the meaning and position of each word: an intermediate step, a bit like a human translator's notes.
Creating the Spanish version: Another part of the transformer, the decoder, reads these notes. It then writes the Spanish version, choosing each word based on two things:
What the original English sentence meant (from the embeddings).
What Spanish words it has already written.
Word by word: The decoder keeps choosing the most likely next Spanish word until it finishes the translation.
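To make this concrete, here is a minimal sketch of the whole encoder-decoder pipeline using the Hugging Face transformers library. The Helsinki-NLP/opus-mt-en-es checkpoint is just one publicly available English-to-Spanish model, picked here for illustration, and the library (plus its sentencepiece dependency) is assumed to be installed.

```python
# A minimal sketch of encoder-decoder translation with Hugging Face
# `transformers` (assumes the library and the publicly available
# "Helsinki-NLP/opus-mt-en-es" checkpoint).
from transformers import pipeline

# The pipeline wraps an encoder-decoder model: the encoder embeds the
# English input, and the decoder generates Spanish one token at a time.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

result = translator("The cat sat on the mat.")
print(result[0]["translation_text"])  # e.g. "El gato se sentó en la alfombra."
```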
Some models only focus on understanding language, not translating it. These are like the first part of our translation process: they contain only encoders and output embeddings. Examples are models like BERT and RoBERTa, which can then be trained to understand and process input sequences for sentiment classification, entity recognition, question answering with context, and many other applications.
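As a quick illustration (a sketch, assuming the transformers and torch libraries are installed), an encoder-only model like BERT turns a sentence into one embedding vector per token; a small classifier trained on top of those vectors is what handles tasks like sentiment classification.

```python
# A minimal sketch of an encoder-only model producing embeddings,
# assuming the `transformers` and `torch` libraries are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I loved this film!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per input token; a classifier head trained on top
# of these could do sentiment analysis, entity recognition, and so on.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 7, 768])
```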
When a model uses only stacked decoder layers, we call it an auto-regressive or generative large language model (LLM).
Generative LLMs are focused on understanding the input and generating new output sequences based on it. Auto-regressive models such as GPT, Llama, and Mistral can be used for chatting, question answering, creating synthetic data, and so on.
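Here is a minimal sketch of auto-regressive generation, using GPT-2 as a small, freely downloadable stand-in for larger decoder-only models (it assumes the transformers library is installed).

```python
# A minimal sketch of auto-regressive generation with a small decoder-only
# model (GPT-2 as a stand-in for larger LLMs), assuming `transformers`.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts the most likely next token, appends it to
# the sequence, and feeds the longer sequence back in.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```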
LLMs
The reason LLMs are called "large" is that they are made of a large number of parameters. Some of the things that influence the number of parameters are the size of the embeddings, the context length (the number of words, or tokens, a model can process at once), and the number of transformer bricks used (layers). You'll often hear model sizes described in billions of parameters. The parameters are numerical values that the model learns during training and then uses to make predictions. They include the values that connect neurons between layers (weights), learnable values added to each neuron's input to help the model fit the training data better (biases), the embedding vectors representing each token, as well as the attention parameters that focus the model on particular embeddings and the normalisation parameters that stabilise training.
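As a rough illustration of where the billions come from, the back-of-the-envelope count below uses made-up (but typical) values for vocabulary size, embedding size, and layer count; it ignores biases, normalisation, and other small terms, so it is not an exact figure for any real model.

```python
# A rough, back-of-the-envelope parameter count for a GPT-style decoder
# stack. The formulas are simplified illustrations, not exact figures.
vocab_size = 50_000   # number of tokens in the vocabulary (assumed)
d_model = 4096        # embedding size (assumed)
n_layers = 32         # number of decoder "bricks" (assumed)

embedding_params = vocab_size * d_model            # one vector per token
attention_params = 4 * d_model * d_model           # query, key, value, output weights
feed_forward_params = 2 * d_model * (4 * d_model)  # two projections, 4x hidden size
per_layer = attention_params + feed_forward_params

total = embedding_params + n_layers * per_layer
print(f"~{total / 1e9:.1f}B parameters")  # roughly 6.6B with these numbers
```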
The "language" in LLMs refers to how these systems learn to understand and use human language:
Learning from the internet: These models are trained on vast amounts of text from websites, books, and articles across the internet.
Pretraining: This is the model's "education phase." It learns patterns in language by reading and analysing all the text.
Understanding context: The model learns how words are used in different situations. For example, it learns that "bank" can refer to the side of a river or to a money-handling institution, depending on the context.
Connecting related ideas: The model starts to group words and concepts that are often used together or in similar ways. For instance, it might learn that "dog," "puppy," and "canine" are closely related.
Generating responses: After all this learning, the model can create responses that sound natural and relevant to a given question or prompt.
Mathematical representation: In the model's "mind," words with similar meanings are represented by similar patterns of numbers. This allows the model to understand relationships between words and ideas.
By going through this process, the model gains a broad understanding of language, enabling it to generate human-like text on a wide range of topics.
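As a toy illustration of "similar meanings, similar numbers", the sketch below compares made-up 4-dimensional word vectors with cosine similarity; real models use vectors with hundreds or thousands of dimensions, but the idea is the same.

```python
# A minimal sketch of comparing word vectors with cosine similarity.
# The 4-dimensional vectors are invented for illustration only.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog    = np.array([0.8, 0.1, 0.6, 0.2])
puppy  = np.array([0.7, 0.2, 0.5, 0.3])
banana = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(dog, puppy))   # high (~0.98): related concepts
print(cosine_similarity(dog, banana))  # low (~0.26): unrelated concepts
```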
LLMs might struggle with topics that aren't widely discussed online. If information on a subject is rare, private, or not readily available on the internet, these models may not have enough data to give accurate responses. They might require additional training to specialise in those particular subjects, a process called fine-tuning. The great thing about fine-tuning is that you don't need large datasets!
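As a rough sketch of what fine-tuning looks like in practice, the snippet below uses the Hugging Face Trainer on a small classification model; my_small_dataset.csv is a placeholder for your own small, domain-specific labelled dataset (assumed to have "text" and "label" columns), and the transformers and datasets libraries are assumed to be installed.

```python
# A minimal fine-tuning sketch with the Hugging Face Trainer API.
# "my_small_dataset.csv" is a placeholder for your own labelled data
# with "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

dataset = load_dataset("csv", data_files="my_small_dataset.csv")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
    train_dataset=dataset,
)
trainer.train()  # updates the pretrained weights on the small dataset
```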
Attention!
So, what is it exactly that makes these models “understand” and remember long inputs?
It's a mechanism called attention. Just as humans don't pay equal attention to every word while listening or reading, machines don't need to either. This lets them identify which words to focus on to understand the meaning of a sentence, even when the input is long. Attention also helps resolve ambiguity between words with multiple meanings, also known as polysemy, by using the context. For example, the word 'mouse' can be used in two contexts: "There's a huge mouse in the house!" and "My computer uses a wireless mouse". The two usages have different meanings that can only be told apart from context.
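Under the hood, the most common form of this is scaled dot-product attention. The sketch below implements it with NumPy on random stand-in vectors; in a real model, the query, key, and value matrices come from learned projections of the token embeddings.

```python
# A minimal sketch of scaled dot-product attention using NumPy only.
# Q, K, V are random stand-ins for a 5-token sentence with 8-dim vectors.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # how relevant each word is to each other word
    weights = softmax(scores)        # each row sums to 1: where to "pay attention"
    return weights @ V, weights      # weighted mix of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

output, weights = attention(Q, K, V)
print(weights.round(2))  # attention weights: one row per word in the sentence
```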
💡Even though LLMs seem to understand and generate almost flawless results, they should be used with some caution as they can hallucinate, be biased, and lack grounding.

A game-changer
The transformer architecture was a breakthrough for AI working with language and multimedia data.
It gave birth to LLMs that can process, while giving the impression of understanding, much longer texts by capturing relationships between words that are far apart.
The surge in LLM adoption also broadened horizons for reusing and adapting AI models for new tasks, just by fine-tuning them on a little targeted data. No more building massive datasets from scratch. Plus, LLMs can even learn from different data types together, like text, images, and audio. This ability has opened up tons of new opportunities across industries.
So, in short, with their long-range mastery, efficient adaptability, and multi-modal skills, the transformer architecture turbocharged what modern AI could do with human communication and multimedia. Game-changing stuff!
Further Resources
What are LLMs? by Kate Soule
A more mathematical explanation can be found in this blog post by Hugging Face
Let's build a GPT from scratch - step by step coding guide of a simple character-based LLM by Andrej Karpathy