[0:00] ChatGPT is an amazing technology. You can use it for all sorts of things from generating
[0:04] code to writing poetry, but it has come in for some criticisms when it comes to factual
[0:09] information. It’s often accused of getting things wrong, making facts up, or even sometimes
[0:14] outright lying and misleading the user.
[0:17] First off, what are we actually talking about? Well, ChatGPT is built on top of a large language
[0:22] model. The particular model that is used in ChatGPT is a generative pre-trained transformer.
[0:27] The model is trained on a huge amount of text. In the case of GPT-3, 45 terabytes of text
[0:33] data was used.
[0:34] During this pre-training, the language model develops a broad set of skills and abilities.
[0:38] Once training is complete, it can use its abilities on new tasks. To use the trained
[0:42] model, you feed in a prompt consisting of a series of words. The model then predicts
[0:46] the next word. This process is repeated, with each new word fed back in, until the model decides it has finished.
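In code, that predict-and-append loop looks something like the sketch below. The predict_next_word function is purely a hypothetical stand-in for the real model; the point is the loop itself.

```python
# A minimal sketch of the loop described above: predict a word, append it,
# and feed the longer sequence back in. predict_next_word is a hypothetical
# placeholder for whatever model does the predicting.
def generate(prompt_words, predict_next_word, max_words=50, stop_word="<end>"):
    words = list(prompt_words)
    for _ in range(max_words):
        next_word = predict_next_word(words)   # pick the most likely next word
        if next_word == stop_word:             # the model signals it has finished
            break
        words.append(next_word)                # feed it back in and go again
    return " ".join(words)
```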
[0:51] We can actually pretend to be a language model. Imagine the passage 'the cat sat on
[0:55] the'. There are a couple of potential words that are likely to follow this. As humans,
[0:59] we know that it would probably be 'mat' or 'lap'. Maybe, if we're feeling whimsical, it could
[1:03] even be 'hat'.
[1:04] You could imagine how you could build a simple language model. First, you would need to know
[1:08] how to start a sentence. What is the most likely word that would be at the start of
[1:12] a sentence?
[1:13] You could take all the words in your vocabulary and count how many times each one is used
[1:17] at the start of a sentence. And that would let you pick the most likely word.
[1:21] For the second word, you could count how many times two words occur together. And for the
[1:25] third word, you could count how many times three words occur together. You could repeat
[1:29] this until you’ve built a large enough table of probabilities that you could generate a
[1:32] fairly long piece of text.
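Here's a toy version of that counting approach, using word pairs only so the table stays small. This is just an illustration of the idea, not anything from the GPT models themselves.

```python
from collections import Counter, defaultdict

# Count how often each word follows another, then always pick the most
# common continuation. A real n-gram model would use longer histories.
def train_bigram(text):
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word(counts, word):
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]   # the most likely follower

corpus = "the cat sat on the mat . the cat sat on my lap ."
model = train_bigram(corpus)
print(next_word(model, "the"))   # -> 'cat' (seen twice, vs 'mat' once)
print(next_word(model, "sat"))   # -> 'on'
```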
[1:34] The problem is that for any reasonably long piece of text, this table would be huge.
[1:38] Apparently most native speakers have a vocabulary that ranges from 20,000 to 35,000 words.
[1:44] Even taking the lower range for this, we’d end up with a table that explodes in size.
[1:48] Every time we add a new word, we need to multiply the size of our table by 20,000. It increases
[1:53] exponentially.
[1:54] After just 20 words, we would need a table containing more than 100 septemvigintillion
[1:59] entries. That's a 1 with 86 zeros after it. To put that into perspective, that's 10,000
[2:05] times more than all the atoms in the known universe. And that's just using our very
[2:09] conservative 20,000 word vocabulary.
[2:12] This exponential behaviour is why we can use three or four random words as a strong password.
[2:17] With three words, there are over 8 million million possible combinations that a hacker
[2:20] would need to try.
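The arithmetic behind both of those claims is easy to check yourself; here it is in a couple of lines, using the same conservative 20,000 word vocabulary:

```python
vocab = 20_000

# Size of the probability table for a 20-word sequence
table_entries = vocab ** 20
print(f"{table_entries:.2e}")   # ~1.05e+86, roughly a 1 followed by 86 zeros

# Three-word passphrases a hacker would need to try
passphrases = vocab ** 3
print(f"{passphrases:,}")       # 8,000,000,000,000, over 8 million million
```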
[2:21] There are some issues with using words in a large language model. There are simply too
[2:25] many of them.
[2:26] Although the average person may only use 20,000 to 35,000 words, there are actually
[2:30] many more: around 500,000 to 1 million.
[2:33] The GPT models get around this by using tokens instead of words. In total, there are 50,257
[2:40] tokens in the GPT vocabulary.
[2:42] There’s a really nice online tool that you can use to see how it breaks text up into tokens.
[2:47] For example, the phrase 'elephant carpaccio is not something you should eat' gets broken
[2:51] up into 11 tokens, even though it only contains eight words. The longer words 'elephant' and
[2:56] 'carpaccio' are turned into multiple tokens.
[2:59] Using tokens lets us encode a lot more words compared to just using a fixed vocabulary.
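If you want to try this locally rather than in the online tool, OpenAI's tiktoken library (my suggestion here, not something used in the video) exposes the same 50,257-token GPT-3 vocabulary:

```python
import tiktoken

# r50k_base is the 50,257-token vocabulary used by the GPT-3 models
enc = tiktoken.get_encoding("r50k_base")

tokens = enc.encode("elephant carpaccio is not something you should eat")
print(len(tokens))                        # eight words, but more tokens than that
print([enc.decode([t]) for t in tokens])  # shows how the longer words get split up
```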
[3:04] Under the hood, the model doesn’t actually work on tokens. Each token is actually turned
[3:08] into something called an embedding. An embedding is pretty simple: it's just a bunch of numbers
[3:12] that represent the token.
[3:14] For the most capable GPT-3 model, each token is represented by 12,288 numbers. These embeddings
[3:20] are learnt by the model during training and help to represent the meaning of each token.
[3:25] Tokens with similar meanings should end up with similar embeddings. These embeddings
[3:28] are actually really powerful just by themselves and there’s loads of interesting applications
[3:32] for them.
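To give a feel for what 'similar embeddings' means, here's cosine similarity on some invented three-dimensional vectors. Real GPT-3 embeddings have 12,288 dimensions and are learnt during training; the numbers below are made up purely for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    # Close to 1.0 when the vectors point the same way, lower when they don't
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented toy embeddings, not real model values
cat    = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car    = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, kitten))  # high: similar meanings
print(cosine_similarity(cat, car))     # much lower: unrelated meanings
```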
[3:33] However, even if we grouped similar tokens together, they are still different. We still
[3:37] have a very large number of inputs coming into our model, so we still have the problem
[3:41] of our probability table exploding exponentially.
[3:44] The GPT-3 model only has 175 billion parameters, and it can have an input of up to 4,096 tokens.
[3:51] There’s no way we could store every possible combination of tokens in it.
[3:55] The model has to learn an approximation of the probabilities. There’s a lossy compression
[4:00] of the real world happening.
[4:01] Now, we’re all familiar with lossy compression from looking at JPEG images. There’s a reason
[4:06] why professional photographers like to shoot their pictures in raw format. They want to
[4:10] avoid losing any information from their pictures.
[4:12] So we know that the model is actually learning an approximation of the real world. Obviously,
[4:17] the more parameters the model has, the better this approximation will be, but it will never
[4:21] be perfect. It’s also simply learning the approximate probabilities of combinations
[4:26] of words. It’s not storing facts or algorithms.
[4:30] This means that when you ask it a factual question, the information simply may not exist
[4:35] in the model.
[4:36] However, what does exist in the model is an approximation for what is the most likely
[4:41] answer.
[4:42] Strictly speaking, it’s the most likely combination of tokens that would follow the
[4:45] tokens in your question. You may be lucky, and it may be that the facts you are looking
[4:50] for are the most likely tokens. But you may also be unlucky, and the most likely tokens
[4:55] may just simply look correct.
[4:57] One of the amazing things about these models is that they can actually do anything useful
[5:01] at all. And this is why these large language models are such a breakthrough.
[5:05] Previously, to get something useful, you’d need to train a model for a particular use
[5:09] case.
[5:10] Now, with these very large models, you can just train the model on a whole bunch of
[5:13] text, and it can be used to solve multiple problems.
[5:16] What can we actually do about the hallucination and lying problem? The actual GPT-3 paper
[5:21] is surprisingly useful. It does go into great detail about how the model performs.
[5:26] All the people complaining about how bad ChatGPT is at arithmetic really should go and
[5:31] read the paper and see what the authors said it was capable of. It can just about do simple
[5:36] addition and subtraction on small numbers, but that's pretty much it.
[5:39] There are several ways to help the model behave in more useful ways. The default way of using
[5:43] the model is called zero-shot learning. We just give the model a prompt, e.g. "You are
[5:48] an AI assistant", and hope for the best. This does work surprisingly well.
[5:53] You can also make the prompt very detailed. One interesting approach is to look at the
[5:57] question the user is asking, and then find matching text from a database of facts, e.g.
[6:02] user manuals, technical documentation, or websites. You then feed these facts in as
[6:07] part of the prompt. If you do this right, the model will use your information to answer
[6:11] the question.
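A rough sketch of that approach is below. Both the retrieval step and the final call to the model are hypothetical placeholders; what matters is that the matching facts end up inside the prompt.

```python
def find_relevant_facts(question, database):
    # Hypothetical retrieval step: real systems might use keyword or
    # embedding search over manuals, documentation, or websites.
    words = question.lower().split()
    return [fact for fact in database
            if any(word in fact.lower() for word in words)]

def build_prompt(question, database):
    facts = find_relevant_facts(question, database)
    context = "\n".join(f"- {fact}" for fact in facts)
    return ("Answer the question using only the facts below.\n"
            f"Facts:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

docs = [
    "The Foo 3000 router's default admin password is 'changeme'.",
    "The Foo 3000 supports Wi-Fi 6 on the 5GHz band.",
]
prompt = build_prompt("What is the default password for the Foo 3000?", docs)
print(prompt)   # this prompt is what you'd send to the language model
```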
[6:12] Then we have one-shot learning. This is the same as zero-shot learning, but you provide
[6:16] an example to the model so it knows more about what you are trying to do.
[6:20] Following on from this, we have few-shot learning. Exactly the same as one-shot learning, except
[6:24] you give the model multiple examples. Then we have fine-tuning. This is more complicated
[6:29] and actually involves taking a trained model and then tweaking its parameters by training
[6:33] it on your own text.
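Going back to the prompting styles for a second, here's what a made-up few-shot prompt might look like. With one-shot you'd include a single example, and with zero-shot none at all.

```python
# A made-up few-shot prompt: two worked examples, then the new input.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day, brilliant."
Sentiment: positive

Review: "Broke after a week, very disappointed."
Sentiment: negative

Review: "Setup was painless and it just works."
Sentiment:"""

print(few_shot_prompt)   # the whole string is sent to the model to complete
```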
[6:34] So what’s coming next? Well, prompt engineering is a new field. We’re still learning how
[6:39] to get the most out of these large language models. What we’ve seen so far are just baby
[6:43] steps. There are also new models coming soon that have even more parameters. These have
[6:48] huge potential. It’s going to be a wild ride.