Intro to LLMs
Generative AI has certainly made a big impact over the last couple of years. These are AI systems that take input, usually in the form of text, called a “prompt”, and use it to generate a piece of media - be it more text, an image, audio, or video - in response.
In this post, I’m going to be primarily focusing on LLMs (large language models). These systems are trained on vast quantities of text and, as a result, come to derive relationships between certain combinations of words (technically, ‘tokens’, as I’ll explain).
There are 4 basic steps that go into making most modern generative LLMs: tokenisation, encoding, pre-training, and fine-tuning.
Tokenisation: First, each string of text that makes up the LLM’s training data has to be converted into a series of tokens. A token is a string of characters that can be a word, a part of a word, or some other combination of characters. Each unique token string gets a unique token ID number and GPT-4 uses about 100,000 of them in total. Tokens tend to be case, space, and punctuation specific. In common English text, an average token is about 4 characters or about 3/4 of a word.
[Click here to play around and see how OpenAI’s GPT-3 and Codex LLMs break up text into tokens]
Embedding: Each token ID is then converted into a token vector in a high-dimensional space known as ‘embedding space’ (or ‘latent space’).
Pre-training: Despite the name, this is actually the first stage of training. Here, strings of text from the training data are fed into the model input, converted into tokens, and then converted into token vectors.
There are two main methods used during the pre-training stage - next token prediction (used by GPTs) and masked token prediction (where hidden tokens have to be predicted, given visible tokens before and after). Based on this training, the token vectors’ elements are updated. If successful, then after pre-training is complete, the model’s token vectors should come to occupy the embedding space in such a way that tokens that are commonly found in similar contexts tend to cluster “closer” together (though note: by “closer” we do not mean in the standard, euclidean sense - cosine similarity is often used). As a result, the model should now be able to competently continue strings of texts given as input.
Importantly, such systems are stochastic - this means that entering the same prompt multiple times (even when making sure to reset the AI’s context window) is not guaranteed to get you the same output. This is due to the model generating a probability distribution over the likelihood of each potential next token (dependent on their aforementioned “closeness”) and then randomly choosing a token from this distribution.
Fine-tuning: Often, we want our LLM to do more than simply extend strings of text in a coherent way. Fine-tuning typically involves training our LLM to perform better at specific tasks by updating its parameters in accordance with its performance on such tasks. Supervised learning, where labelled data are used, and Reinforcement Learning from Human Feedback (RLHF) - where a reward model is first trained by human judges and this reward model is then used as a reward function to train the main model - are two commonly used techniques. More recently, some kinds of prompting (more on this below) are sometimes also referred to as fine-tuning - though this does not involve updating the model’s parameters. Prompting, in essence, is just changing the context that the AI uses to determine which tokens to output.
As you can see, while there is often an attempt at fine-tuning for generally desired qualities such as honesty, at their core, LLMs are autocomplete predictors (albeit particularly advanced ones). It’s not surprising then that hallucination, the property of LLMs to confidently assert false information as fact (or otherwise outputting something incongruous with its context), is still relatively common - to say nothing of the potential for the fine-tuning process to increase the chance of hallucination.
Above: A Twitter user offers an interesting analogy. Here, ‘browsing’ the Library of Babel is akin to navigating through the LLM’s latent space
Text-based generative AIs such as OpenAI’s ChatGPT, Google’s Bard, Microsoft’s Bing Chat, and Anthropic’s Claude use pre-trained LLMs and then fine-tune them for specific tasks such as conversation and instruction-following, or to increase [decrease] generally desired [undesired] behaviour.
Prompt engineering
So if a ‘prompt’ is an input given to a generative AI, which it then uses to generate an output, ‘prompt engineering’ is the practice of crafting a prompt in such a way as to increase the chance of eliciting your desired response.
The term itself is not without its controversy, no doubt due in part to the subject coming to be associated with a certain kind of grifty content creator (see this post’s thumbnail). Despite this, I believe there is some genuine value to certain best practices - both in terms of increased productivity when working with such tools, and in ‘probing’ them to get a better understanding of how they work. Prompt injection seems particularly interesting as an area of study, due to its potential implications for future, more advanced models (more on this in a future post).
What follows is a short summary of some of the best advice I was able to find for prompt engineering. If you’re interested in reading further, I’ve included a series of links at the bottom of this post.
General tips:
Try being more specific about the kind of response you want. This may involve giving the AI some additional context.
Be patient: you may have to regenerate a response to your prompt multiple times to get the result you were looking for
Try iterating upon previous prompts/responses. You can request specific refinements yourself, or just ask the model to reflect on (or offer a critique of) its answer.
Try resetting the session if the model gets stuck: this is because, provided the LLM doesn’t exceed its token limit, everything that has previously been written in the conversation (both the user input and the model’s output) is used as input by the model.
If you’re dissatisfied with your results, perhaps try a different model
Tweak the model’s settings, if you have access to them. The higher the temperature of the LLM is set to, the greater the variability in the responses it will give you. Set it close to zero for more consistency or give it a high value if you want a wide range of responses. Other settings include top-K (where the model will only sample from the top ‘K’ most likely tokens) and top-p (where the model will only sample from the top ‘p’(x100)% of the cumulative probability mass of tokens). Generally, it’s advised to only alter one of these settings at a time.
Specific techniques:
Role prompting: this involves telling the AI to take on a given persona - be it a tour guide, a science tutor, or even a Linux terminal (click here for further examples). Role prompting can dramatically vary the kinds of responses you get and allow the model to perform significantly better at certain tasks than it would otherwise be able to.
Meta-prompting: this is an application of role prompting where the role we tell our AI to take on is as a prompt generator. This can be for the LLM you are currently using or for any other text-accepting generative AI.
Chain-of-thought (CoT) prompting: this involves prompting the AI to perform step-by-step reasoning. First, some background: zero-shot prompting refers to a prompt that provides zero examples of similar input-output pairs to what we desire, one-shot prompting provides one example, and few-shot prompting provides multiple examples. Few-shot CoT prompting then provides multiple examples, where each example demonstrates the process used to arrive at the correct answer. Zero-shot CoT, on the other hand, is as simple as ending your prompt with the sentence “Let’s think step by step.” Much like with role prompting, a simple change like this can produce dramatically better results.
Prompt injections: so-called due to being analogous to SQL injections, these involve ‘tricking’ the model into performing an action it was not intended to perform (and usually, its developers would prefer it not to perform). While many of the examples you may see are fairly trivial, as LLMs become more commonplace and integrated into our world, such vulnerabilities could quickly become a serious concern. For explorations of how this could prove to be a significant security risk in the future, see HERE and HERE.
prompt leaking: this is a common type of prompt injection that gets an LLM to output its prompt. This is particularly interesting in the case of apps built ‘on top of’ LLMs, by allowing you to see the instructions they are given by the developers.
jailbreaking: these types of prompt injections aim to bypass safeguards placed on models (or apps built on top of them) and gain access to functionality that would otherwise be unavailable. Commonly, role prompting (see above), is employed in some way to achieve this. New jailbreaking prompts are constantly being discovered and patched - for a living compilation of such prompts see HERE.
What was I saying again?:
One problem current LLMs have, specifically where ‘prompt-crafting as fine-tuning’ is concerned, is the short “memory” that these models have (though really, of course, this is not ‘memory’ in the conventional sense, but rather the context window that the model can use in order to generate new tokens). GPT-3.5 has a 4,000 token limit, with GPT-4 having either an 8,000 or a 32,000 token limit depending on the version used. While this issue will no doubt continue to become less of a concern with time, in the meantime, some workarounds can be used, such as summarising key details at the beginning of later prompts. [UPDATE: Anthropic just released a new version of its Claude model with a 100,000 token limit]
Conclusion: I’m certainly not the first to make this prediction, but I mostly expect “prompt engineering” to be a temporary phenomenon. In as little as perhaps 2 or 3 years’ time, these models will likely be skilled enough at interpreting/predicting the desires of their users so as to reduce the value of prompt craft as a skill to a small fraction of what it is currently. That said, I could well see some “prompt engineering” surviving beyond that if only in the form of ‘hacking’/security.
[Then again, as
notes, we have been overly optimistic about the near-term future pace of these kinds of AI developments before.]From a longer-term perspective however, the reverse - that is, machines that engineer their behaviour in such a way as to elicit a certain desired response from humans - presents a far more interesting possibility. However, I will resist further speculation for now.
Additional Resources:
About ChatGPT/LLMs:
https://www.lesswrong.com/posts/pHPmMGEMYefk9jLeh/llm-basics-embedding-spaces-transformer-token-vectors-are
https://medium.com/@atmabodha/pre-training-fine-tuning-and-in-context-learning-in-large-language-models-llms-dd483707b122
https://openai.com/blog/chatgpt
https://www.codecademy.com/learn/intro-to-chatgpt
https://arxiv.org/abs/2304.13712 - “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond”
https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
https://lena-voita.github.io/nlp_course/language_modeling.html
https://statmodeling.stat.columbia.edu/2023/04/26/llm-alignment-bias-cultural-consensus-theory/
https://www.assemblyai.com/blog/how-chatgpt-actually-works/
“How Chat GPT is Trained” by Ari Seff [below]
“Let's build GPT: from scratch, in code, spelled out.” by Andrej Karpathy [below]
Prompt engineering:
https://learnprompting.org/
https://www.promptingguide.ai/
https://huyenchip.com/2023/04/11/llm-engineering.html#prompt_engineering_challenges
https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
https://simonwillison.net/tags/promptengineering/
https://github.com/f/awesome-chatgpt-prompts
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=prompt%20engineering&sort=byPopularity&type=story