How ChatGPT works: the evolution of language models from T9 to a miracle, explained in simple words

Recently, hardly a day goes by without the news telling us about the latest heights conquered by language neural networks, and why they will definitely leave you unemployed within a month. Yet few people understand how neural networks like ChatGPT actually work on the inside. So make yourself comfortable: in this article we will finally explain everything in a way that even a six-year-old humanities major can understand! Our goal is to give readers a general understanding of the principles behind language neural networks at the level of concepts and analogies, rather than to dissect every deep technical nuance of the process down to the last cog.

T9: a séance of language magic, with the trick revealed

Let's start simple. To understand what ChatGPT is from a technical point of view, we must first understand what it definitely is not. It is not a "god from the machine", not an intelligent being, not the equivalent of a schoolchild (in terms of intelligence and problem-solving skills), not a genie, and not even a Tamagotchi that has gained the gift of speech. Prepare to hear the scary truth: ChatGPT is actually the T9 from your phone, just on steroids! Yes, really: scientists call both of these technologies "language models", and all they essentially do is guess which word should follow the existing text.

Well, to be more precise, in the truly old-school button phones of the late '90s (like the iconic, indestructible Nokia 3210), the original T9 technology only sped up typing by guessing the word currently being typed, not the next one. But the technology evolved, and by the smartphone era of the early 2010s it could already take context (the previous word) into account, insert punctuation, and suggest the word that might come next. That is exactly the kind of "advanced" T9/autocomplete our analogy is about.

So, both the T9 on a smartphone keyboard and ChatGPT are trained to solve a ridiculously simple task: predicting the single next word. This is language modeling: taking some existing text and inferring what should be written next. To make such predictions, language models under the hood have to operate with the probabilities of particular words continuing the text. After all, you would probably be unhappy if your phone's autocomplete simply threw completely random words at you, each with equal probability.
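
To make this concrete, here is a minimal sketch of the counting intuition behind next-word probabilities. The tiny corpus and the function name are made up for illustration; real models are vastly more sophisticated than counting pairs of words.

```python
from collections import Counter, defaultdict

# A tiny made-up "corpus", just to show where word probabilities can come from.
corpus = [
    "i am going to the bank",
    "i am going to the gym",
    "i am going to the bank",
    "i am going to sleep",
]

# Count which word follows which.
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev][nxt] += 1

def next_word_probs(word):
    """Estimate P(next word | previous word) from the toy corpus."""
    counts = followers[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("to"))   # {'the': 0.75, 'sleep': 0.25}
```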

To illustrate, imagine that you receive a message from a friend: "Shall we go out tonight?". You start typing back: "Nah, I've got stuff to do(( I'm going to...", and that's where T9 kicks in. If it suggested finishing the sentence with a completely random word, say "I'm going to the duck-billed platypus", then frankly you wouldn't need any clever language model for such gibberish. Real autocomplete models in smartphones suggest far more appropriate words (you can check for yourself right now).

So, how exactly does T9 understand which words are more likely to follow an already typed text, and which words should not be suggested? To answer this question, we will have to dive into the basic principles of the simplest neural nets.

Where do neural networks get word probabilities from?

Let's start with an even simpler question: how do we predict how one thing depends on another? Suppose we want to teach a computer to predict a person's weight from their height; how do we approach this task?

Common sense suggests that we should first collect the data in which we will look for the dependencies we are interested in (for simplicity, let's limit ourselves to one sex and take height/weight statistics for several thousand men), and then try to "train" some mathematical model to find regularities within this data.

For clarity, let's first draw our entire data set on a graph: the horizontal X-axis is height in centimeters, and the vertical Y-axis is weight.

[Figure: weight vs. height scatter plot]

Even the naked eye can see a certain dependence: taller men, as a rule, weigh more (thanks, Captain Obvious!). And this dependence is quite easy to express with the ordinary linear equation y = k*x + b that we all learned around fifth grade. In the picture, the line we need has already been drawn by a linear regression model: it picks the coefficients k and b of the equation so that the resulting line best describes the key relationship in our data set (you can plug your own height in centimeters in place of x and check how accurately our model guesses your weight).
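
Here is what that looks like in code: a minimal sketch that fits y = k*x + b to a handful of made-up height/weight points (the numbers are invented for illustration, not real statistics).

```python
import numpy as np

# A handful of invented (height in cm, weight in kg) pairs; a real dataset would hold thousands.
heights = np.array([165.0, 170.0, 175.0, 180.0, 185.0, 190.0])
weights = np.array([62.0, 68.0, 74.0, 80.0, 85.0, 92.0])

# Least-squares fit of y = k*x + b, which is exactly what a linear regression model does.
k, b = np.polyfit(heights, weights, deg=1)
print(f"weight = {k:.2f} * height + {b:.1f}")

# "Predict" the weight for a new height, like reading a point off the fitted line.
height = 178
print(f"predicted weight for {height} cm: {k * height + b:.1f} kg")
```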

You probably want to exclaim: "Okay, everything was intuitive with height and weight, but what does this have to do with language neural networks?". The point is that neural networks are a set of roughly the same kind of equations, only far more complex and built on matrices (but let's not get into that now).

We can say, simplistically, that T9 or ChatGPT are just cleverly chosen equations that try to predict the next word (the Y) from the set of previous words (the X's) fed into the model. The main task in training a language model on a data set is to find coefficients for those x's that actually capture some kind of dependency (as in our height/weight example). And by large models we will mean those with a very large number of parameters; in the AI field they are called LLMs, Large Language Models. As we will see a little further on, a "fat" model with lots of parameters is the key to generating cool texts!

By the way, if at this point you are wondering why we keep talking about "predicting one next word" while ChatGPT cheerfully replies with whole chunks of text, don't worry. Language models generate long texts without any difficulty, but they do it word by word. After generating each new word, the model simply re-runs the entire previous text through itself, together with the addition it has just written, and spits out the next word with that taken into account. The result is a coherent text.
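
In code, that word-by-word loop might look roughly like the sketch below, where model_probs is a stand-in for the whole language model: any function that maps the text written so far to a {word: probability} dictionary.

```python
import random

def generate(prompt, model_probs, max_words=20):
    """Generate text one word at a time, re-feeding everything written so far."""
    text = prompt
    for _ in range(max_words):
        probs = model_probs(text)                  # re-run the model on the whole text so far
        words, weights = zip(*probs.items())
        next_word = random.choices(words, weights=weights, k=1)[0]
        if next_word == "<end>":                   # a conventional "stop here" token
            break
        text += " " + next_word                    # append the new word and repeat
    return text
```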

Barack's paradox, or why language models need a creative streak

What language models actually try to predict as the "Y" in our equations is not so much the specific next word as the probabilities of the different words that could continue a given text. Why is this necessary? Why can't we always just look for the single "most correct" word to continue with? Let's look at a small game as an example.

The rules are as follows: you pretend to be a language model, and I ask you to continue the text "The 44th President of the USA (and the first African-American in this position) is Barack...". Fill in the word that should go in place of the ellipsis and estimate the probability that it will actually appear there.

If you just said that the next word should be "Obama" with 100% probability, then congratulations, you are wrong! And it's not that there is some other, mythical Barack: it's just that in official documents the president's name is often spelled out in full, with his middle name, Hussein. So a properly trained language model should predict that in our sentence "Obama" will be the next word with only about 90% probability, reserving the remaining 10% for the text continuing with "Hussein" (after which "Obama" will follow with a probability close to 100%).

And here we come to a very interesting property of language models: it turns out that creativity is not foreign to them! When generating each next word, such models pick it "at random", as if rolling a die. But not uniformly at random: the die is loaded so that the probabilities of different words "coming up" roughly match the probabilities suggested to the model by its equations (which were derived by training the model on a huge array of different texts).

It turns out that the very same model can give completely different answers to absolutely identical queries, just like a living person. Scientists did once try to make neural networks always choose the "most probable" next word, which sounds logical at first glance, but in practice such models somehow perform worse; a healthy element of randomness is strictly to their benefit (it increases the variability and, as a result, the quality of the answers).
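
Here is a tiny sketch of the difference, using the Barack example from above (the probabilities are the illustrative 90/10 split, not numbers from a real model).

```python
import random

# Probabilities a trained model might assign to the next word after "...is Barack".
probs = {"Obama": 0.90, "Hussein": 0.10}

# Greedy decoding: always take the single most probable word, so the output never varies.
greedy_choice = max(probs, key=probs.get)

# Sampling: roll a weighted die, so roughly 1 time in 10 the model writes "Hussein" instead.
sampled_choice = random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(greedy_choice, sampled_choice)
```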

In general, our language is a special structure with (sometimes) clear sets of rules and exceptions. Words in sentences do not appear out of nowhere; they are connected to one another. These connections are picked up by humans quite well "automatically" while growing up and going to school, through conversations, reading, and so on. At the same time, to describe the same event or fact, people come up with many different ways, in different styles, tones and shades. The way gopniks in a back alley approach verbal communication is likely to differ quite a bit from, say, that of elementary school pupils.

A good model has to accommodate all of this expressive variability of language. The more accurately the model estimates word probabilities depending on the nuances of the context (the preceding part of the text describing the situation), the better it can generate the answers we want to hear from it.

2018: GPT-1 transforms language models

Let's move on from antiquated T9s to more modern models: ChatGPT, which has made so much noise, is the most recent member of the GPT family of models. But to understand how it acquired its unusual ability to please people with its answers, we have to go back to the beginning.

GPT stands for Generative Pre-trained Transformer, that is, a "transformer trained to generate text". The Transformer is a neural network architecture invented by Google researchers back in 2017 (and "back" is no slip of the tongue: by the industry's standards, six years is an eternity).

The invention of the Transformer turned out to be so significant that virtually all areas of artificial intelligence (AI), from text translation to image, sound and video processing, began actively adapting and applying it. The AI industry got a powerful shake-up: it moved from the so-called "AI winter" to rapid development and overcame its stagnation.

Conceptually, the Transformer is a universal computing mechanism that is very simple to describe: it takes one set of sequences (data) as input and outputs another set of sequences, transformed according to some algorithm. Since text, pictures and sound (and almost everything else in this world) can be represented as sequences of numbers, the Transformer can be used to tackle almost any task.

But the main trick of the Transformer is its convenience and flexibility: it is assembled from simple modular blocks that are very easy to scale. Where old, pre-Transformer language models started wheezing and coughing (demanding too many resources) when you tried to make them "swallow" many words at once, Transformer neural networks cope with this task much better.

Earlier approaches had to process input data one piece at a time, i.e. sequentially. So when such a model worked with a one-page text, by the middle of the third paragraph it had already forgotten what was at the very beginning (just like people in the morning, before their first cup of coffee). But the Transformer's mighty paws let it look at EVERYTHING at once, and this leads to much more impressive results.

This is what made the breakthrough in neural network processing of texts (including their generation) possible. Now the model does not forget: it reuses what has already been written, holds the context better, and most importantly, it can build "every word with every word" connections over very impressive amounts of data.
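
For the curious, here is a heavily simplified sketch of the mechanism behind those "every word with every word" connections: a single head of so-called self-attention operating on random vectors. Real Transformers add learned projections and stack many such heads and layers, so treat this only as an illustration of the idea.

```python
import numpy as np

def self_attention(x):
    """One simplified attention head: every position looks at every other position at once."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                            # similarity of each word with each word
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)    # softmax: how much attention to pay where
    return weights @ x                                       # mix information from all positions at once

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 8))     # 5 "words", each encoded as an 8-dimensional vector
out = self_attention(sequence)
print(out.shape)                       # (5, 8): a sequence comes in, a transformed sequence comes out
```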

2019: GPT-2, or how to cram seven thousand Shakespeares into a language model

If you want to teach an image-recognition neural network to tell cute little Chihuahuas from blueberry muffins, you can't just tell it "here's a link to a giant archive with 100,500 photos of dogs and baked goods, figure it out yourself!". No, to train a model you first have to label the training dataset, that is, write under each photo whether it is fluffy or sweet.

Do you know what is great about training language models? They can be fed any text data, and this data does not need to be labeled in any special way beforehand. It's as if you could simply dump a suitcase full of all kinds of books on a schoolboy, without any instructions about what to learn and in what order, and he would work out some clever conclusions for himself in the process of reading!

If you think about it, it makes sense: we want to teach the language model to predict the next word based on the words that come before it, right? Well, any text ever written by a human being is a ready-made piece of training data, because it already consists of a huge number of sequences of the form "a bunch of words and sentences => the next word after them".
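
Here is a sketch of that idea: slicing a single human-written sentence into ready-made "previous words => next word" training examples, with no manual labeling involved (the sentence itself is just an arbitrary example).

```python
# Any human-written text already contains training examples of the form
# "a bunch of words => the word that actually comes next".
text = "the fish swallowed the bait because it was hungry"
words = text.split()

training_pairs = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]
for context, target in training_pairs:
    print(f"{context!r} => {target!r}")
# 'the' => 'fish'
# 'the fish' => 'swallowed'
# 'the fish swallowed' => 'the'
# ...and so on, all the way to the last word.
```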

And now recall that the Transformer technology, tried out in GPT-1, turned out to scale remarkably well: it handles large volumes of data and "massive" models (consisting of a huge number of parameters) much more efficiently than its predecessors. Are you thinking what I'm thinking? Well, the OpenAI scientists came to the same conclusion in 2019: "It's time to build really big language models!"

In short, it was decided to radically pump up GPT-2 along two key dimensions: the training data set (dataset) and the model size (the number of parameters).

At that time there were no special large, high-quality public text datasets for training language models, so every AI team had to improvise as best they could. So the folks at OpenAI decided to be clever: they went to Reddit, the most popular English-language online forum, and simply downloaded every hyperlink from every post with more than three upvotes (I'm not kidding: a very scientific approach!). In total there were about 8 million such links, and the texts downloaded from them weighed 40 gigabytes altogether.

Is that a lot or a little? Let's estimate: the complete works of William Shakespeare (all his plays, sonnets and poems) amount to about 850,000 words. The average page of a book holds about 300 English words, so roughly 2,800 pages of wonderful, if occasionally archaic, English from the greatest English-language writer would take up about 5.5 megabytes of computer memory. That is about 7,300 times less than the size of the GPT-2 training sample... Given that people read about a page a minute, even if you absorbed text 24 hours a day, with no breaks for food or sleep, it would take you almost 40 years to catch up with GPT-2 in erudition!
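
For the skeptical, here is the back-of-the-envelope arithmetic behind those numbers (rounded, of course).

```python
# Rough check of the estimates above.
shakespeare_words = 850_000
words_per_page = 300
pages = shakespeare_words / words_per_page          # roughly 2,800 pages of Shakespeare

shakespeare_mb = 5.5
dataset_mb = 40 * 1000                              # the 40 GB scraped via Reddit links, in round units
shakespeares = dataset_mb / shakespeare_mb
print(round(shakespeares))                          # roughly 7,300 "Shakespeares"

reading_minutes = pages * shakespeares              # at one page per minute
print(round(reading_minutes / (60 * 24 * 365)))     # about 39, i.e. almost 40 years of non-stop reading
```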

But the amount of training data alone is not enough to produce a cool language model: even if you sat a five-year-old down to read the complete works of Shakespeare alongside Feynman's lectures on quantum physics, it would hardly make the child much smarter. The model itself must also be complex and capacious enough to "swallow" and "digest" that amount of information. But how do we measure this complexity of a model, what is it expressed in?

Why are "Plus Size" models more valued in the world of language models?

Remember we said a little earlier that inside language models (in a super-simplified approximation) live equations of the form y = k*x + b, where the sought-after y is the next word whose probability we are trying to predict, and the x's are the input words on which we base that prediction?

So, what do you think: how many parameters were there in the equation describing the largest GPT-2 model in 2019? Maybe a hundred thousand, or a couple of million? Ha, aim higher: the formula contained about one and a half billion such parameters (that's 1,500,000,000 written out). Even just writing that many numbers into a file and saving it on your computer would take about 6 gigabytes! On the one hand, this is much less than the total size of the text data set we trained the model on (remember, the 40 GB collected from Reddit links); on the other hand, the model doesn't need to memorize the whole text, it only needs to find dependencies (patterns, rules) that can be extracted from texts written by people.
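
The 6 GB figure is easy to reproduce if we assume each parameter is stored as a standard 32-bit (4-byte) number.

```python
params = 1_500_000_000          # GPT-2's parameter count
bytes_per_param = 4             # assuming 32-bit floating-point storage
size_gb = params * bytes_per_param / 10**9
print(f"{size_gb:.0f} GB")      # 6 GB, versus the 40 GB of raw training text
```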

These parameters (also called "weights" or "coefficients") are obtained during model training, then saved, and never change again. That is, when the model is used, different x's (the words of the input text) are substituted into this giant equation each time, but the parameters of the equation (the numerical coefficients k attached to the x's) remain unchanged.

The more complex the equation inside the model (the more parameters it contains), the better the model predicts probabilities, and the more plausible the text it generates will be. And this largest model of its time, GPT-2, suddenly began generating texts so good that the researchers at OpenAI were even afraid to release the model publicly, for safety reasons. What if people rushed to generate realistic-looking fake texts, social media spam and the like on an industrial scale?

No, seriously, this was a major leap in quality! You remember: the earlier T9/GPT-1-level models could tell whether you were heading to the bank or the drugstore, and could guess that Sasha walking along the highway was sucking on a dried cracker rather than anything else. GPT-2, on the other hand, easily wrote an essay, on behalf of a teenager, answering the question "What fundamental economic and political changes are needed to respond effectively to climate change?" (even adults would struggle with a topic that serious). The text of the answer was submitted under a pseudonym to the jury of the corresponding contest, and they noticed no trickery. Okay, the essay's scores were not very high and it did not make the finals, but neither did anyone exclaim "what nonsense have you sent us, shame on you!".

The transition of quantity into quality (almost according to Marx)

In general, the idea that as a model grows in size it suddenly develops qualitatively new properties (for example, writing coherent, meaningful essays instead of merely suggesting the next word on your phone) is a pretty amazing thing. Let's look at GPT-2's newfound skills in a bit more detail.

There are special sets of tasks for resolving ambiguity in text, which help evaluate text comprehension (whether by a human or by a neural network). For example, compare two statements:

  1. The fish swallowed the bait. It was tasty.

  2. The fish swallowed the bait. It was hungry.

Which object does the pronoun "it" refer to in the first example: the fish or the bait? And in the second? Most people easily realize from the context that in one case "it" is the bait and in the other it is the fish. But to realize this, it is not enough just to read the sentence; you have to build a whole picture of the world in your head! After all, a fish can be both hungry and tasty (on a plate in a restaurant), depending on the situation. The conclusion that the fish is "hungry" in this particular example follows from the context and from its, forgive me, bloodthirsty actions.

Humans solve such problems correctly about 95% of the time, but early language models managed only about half of the time (i.e., they were guessing almost at random, "50/50", like in the old joke about "what is the probability of meeting a dinosaur on the street?").

You probably thought: "Well, we just need to collect a large database of such tasks (a couple of thousand examples) with answers, run it through a neural network, and train it to find the right answer." And with older models (with fewer parameters) people did try exactly that, but they only reached about 60% accuracy. GPT-2, however, was never specially taught such tricks, yet it unexpectedly and confidently surpassed its "specialized" predecessors, learning to identify the hungry fish correctly in 70% of cases.

This is exactly the transition of quantity into quality that old Karl Marx once told us about. And it happens in a completely non-linear way: for example, when the number of parameters triples from 115 to 350 million, there is no particular change in the model's accuracy on these "fishy" problems, but when the model roughly doubles again to 700 million parameters, a qualitative leap occurs: the neural network suddenly "sees the light" and begins to amaze everyone with its success on completely unfamiliar problems that it has never encountered and was never specifically trained on.

2020: GPT-3, or how to turn a model into the Incredible Hulk

After playing around with the fatter (and thus smarter) GPT-2, the folks at OpenAI thought: "Why not take the same model and make it 100 times bigger?". And indeed, the next numbered version, GPT-3, released in 2020, already boasted 116 times more parameters: a full 175 billion! At the same time, the neural network itself weighed an incredible 700 gigabytes.

The dataset for training GPT-3 was also pumped up, though not as radically: it grew about tenfold, to 420 gigabytes, stuffed with piles of books, Wikipedia, and many more texts from a wide variety of websites. Absorbing such a volume of information is definitely unrealistic for a living person: you would need something like a dozen Anatoly Wassermans reading literally non-stop for 50 years each.

An interesting nuance immediately catches the eye: unlike GPT-2, the model itself is now larger (700 GB) than the entire array of text used to train it (420 GB). This looks almost like a paradox: our "neuro-brain", while studying the raw data, generates information about the various interdependencies within it that exceeds the volume of the original information.

Such generalization ("comprehension"?) allows the model to extrapolate even better than before, i.e. to perform well on text-generation tasks that occurred rarely or not at all during training. Now you don't need to train the model to solve a specific problem; instead, it is enough to describe the problem in words and give a few examples, and GPT-3 will grasp what you want from it on the fly!

And then it turned out, once again, that the "universal Hulk" in the form of GPT-3 (which nobody had trained for any "narrow" tasks) easily beats many of the specialized models that existed before it: for example, GPT-3 immediately began translating texts from French or German into English more easily and better than neural networks built specifically for that purpose. How?! Let me remind you that we are talking about a language model whose sole purpose, really, was to try to guess the next word for a given text... Where did the ability to translate come from?

But these are just the first flowers: what's even more amazing is that GPT-3 managed to teach itself... math! The graph below (source: the original paper) shows the accuracy of neural networks with different numbers of parameters on tasks involving addition and subtraction, as well as multiplication of numbers of up to five digits. As you can see, when moving from models with 10 billion parameters to 100 billion, neural networks suddenly and sharply begin to "know" math.

[Figure: arithmetic accuracy vs. number of model parameters]

Once again, think about it: the language model was trained to continue texts with words, and it somehow figured out on its own that if someone types "378 + 789 =", it should answer "1167" and not some other number. Magic, by golly, magic! (Although some people say "it's just a neural network that has seen all the variants in the training data and memorized them", so the debate about whether it's magic or mere parroting is still going on.)

The most interesting thing in the graph above is that as the model size increases (from left to right), at first nothing seems to change, and then a qualitative leap occurs and GPT-3 starts to "understand" how to solve this or that problem. Nobody knows exactly how, what, or why it works. But it does work somehow; and not only in mathematics, but in a wide variety of other tasks as well!

The animation below illustrates how, as the number of model parameters increases, new abilities sprout in the model, which no one put there on purpose:

By the way, GPT-3 already solves the "hungry fish" problem we tormented GPT-2 with in the previous section with over 90% accuracy, just like a human. It does make you wonder, though: what new skills will a neural network acquire if its size is increased by another factor of a hundred, to, say, tens of trillions of parameters?

Prompts, or how to properly persuade a model

Let's make a small digression here and discuss what it actually means for "a model to solve problems". In essence, the process looks like this: we feed some text with a query into the model, and the model adds its own continuation to it. If that continuation (generation) matches our expectations, then the model has solved the task set before it.

The text that we give as input is called a prompt (essentially, a request or query). The more precisely it describes what we want, the better the model understands what it needs to do. And if we also give it a dozen examples, that's even better!

[Figure: an illustrated example of a prompt]

Even without a description of the goal and without examples in the prompt, the model usually still understands the task, but it offers solutions of noticeably lower quality. You could say that a detailed prompt allows GPT to better estimate the probabilities of the words it should generate as an answer, steering it in the "required direction".
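
Roughly, the difference looks like this; the translation examples below are similar to the few-shot demo in the GPT-3 paper, and generate is a hypothetical placeholder for whatever language model you actually call.

```python
# A bare, zero-shot prompt: just the task.
zero_shot = "Translate English to French: cheese =>"

# A few-shot prompt: the same task plus a couple of examples showing the desired format.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# The better-specified few-shot prompt usually steers the model toward a better completion:
# print(generate(zero_shot))
# print(generate(few_shot))
```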

But how complex does the prompt need to be? And how close is the model to us in understanding? You won't believe it, but researchers recently found that to significantly improve generation results, you just need to add one simple Soviet... to the prompt.

Okay, jokes aside: adding just one phrase before the answer to a question significantly improves the quality of the model's response. And that magic phrase is "let's think step by step". It turns out that it nudges the model to reason sequentially, draw conclusions from its own intermediate judgments, and arrive at the correct answer far more often than without it.

How does this work? Let's use the example of a child's problem:

Q: On average, the boxer John throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes each. How many punches did he throw?

Answer: 255

The text in bold is the answer generated by the language model. It is easy to check that it is, well, not quite right.

However, the same model can answer like this:

Q: On average, the boxer John throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes each. How many punches did he throw?

A: Let's think step by step. In one minute, John throws 25 punches. In three minutes, John throws 3 * 25 = 75 punches. In five rounds, John throws 5 * 75 = 375 punches.

Again, the text in bold is the model's answer. You can see that it has become longer and that the problem is now solved the way a schoolchild would, in three steps: clearly and sequentially, exactly as we asked. And the final number, 375, is the correct answer to the original question. Let me stress separately: we did not retrain the model after it answered incorrectly; it is exactly the same model. We simply added five extra words to the prompt, and a miracle happened!
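
In prompt form, the entire difference between the two runs above is those few extra words; generate is again a hypothetical placeholder for the same, unmodified model.

```python
question = (
    "Q: On average, the boxer John throws 25 punches per minute. "
    "A fight lasts 5 rounds of 3 minutes each. How many punches did he throw?\n"
    "A:"
)

plain_prompt = question                                   # tends to blurt out something like 255
cot_prompt = question + " Let's think step by step."      # walks through 25 * 3 = 75, 75 * 5 = 375

# print(generate(plain_prompt))
# print(generate(cot_prompt))
```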

Moreover, this "reasoning mode" is one of the qualitatively new abilities that appeared in the "big" GPT-3 model after it crossed the hundred-billion-parameter threshold. Older models with fewer parameters could not pull off such tricks, no matter how hard you begged them with special prompts along the lines of "come on, think about it, buddy!".

In general, composing competent prompts for a model is a separate science in its own right. Companies have already started hiring people with the job title "prompt engineer" (i.e. a person who writes queries for language models); I predict it won't be long before online courses appear promising to "Learn prompt engineering in 6 weeks and break into a promising industry with a salary of 300k a month!".

January 2022: InstructGPT, or how to teach a robot not to misbehave

In fact, increasing the size of language models does not by itself mean that they will answer queries the way the user wants. After all, when we formulate a query we often imply a lot of hidden conditions that are taken for granted in human communication. For example, when Masha asks her husband, "David, go take out the garbage", she is unlikely to think of adding "(not through the window, please!)". David understands this without being told, and all because their intentions and attitudes are reasonably well aligned with each other.

But language models, to be honest, are not very human-like, so things that seem obvious to people often have to be hinted at and spelled out for them. The phrase "let's think step by step" from the previous section is just one example of such a hint (though an average adult who went to school would have figured it out unaided: if we are given a word problem, we should solve it step by step). It would be great if models, first, could work out more detailed and relevant instructions from the query on their own (without making humans strain), and second, followed them more precisely, as if anticipating how a person would act in a similar situation.

Part of the reason such abilities are missing "by default" is that GPT-3 was trained simply to predict the next word over a giant collection of texts from the Internet, and the Internet, like a fence, has all sorts of things scrawled on it (and not always useful ones). At the same time, people would like the artificial intelligence born this way to give accurate and useful answers on demand; but those answers must also be harmless and non-toxic. Otherwise the model would quickly be shut down (things are strict these days), and its creators would be sued for many millions of dollars for insulting the dignity of us meatbags.

When researchers thought about this problem, it became clear rather quickly that the properties "accuracy/usefulness" and "harmlessness/non-toxicity" of a model quite often contradict each other. After all, an accurate model should honestly respond with instructions to the query "okay Google, how to make a Molotov cocktail, no registration or SMS required", while a model tuned for maximum harmlessness would, in the limit, reply to absolutely any prompt with "sorry, I'm afraid my answer might offend someone on the Internet".

There are a lot of complex ethical issues around this problem of AI alignment (OpenAI has been writing about it lately), and we are not going to discuss them all now (perhaps in a future article). The main difficulty is that there are a great many such controversial situations, and it is simply impossible to formalize them in any clear-cut way. What's more, people have not managed to agree among themselves on what is good and what is bad for the last several thousand years; never mind formulating rules that a robot could understand...

In the end, the researchers came up with nothing better than simply giving the model a lot of feedback. In a sense, this is how human children learn morality: from childhood they do all sorts of things and carefully watch the adults' reactions to figure out what is allowed and what gets a firm "no, yucky!".

In short, InstructGPT (also known as GPT-3.5) is simply GPT-3 that was additionally trained, using feedback, to maximize the ratings given by live humans. Literally: a bunch of people sat and scored heaps of neural network responses for how well they matched expectations given the query that had been asked. Well, actually it wasn't quite that simple (the instructions for the members of this "meat jury" ran to 26 densely written pages), but that's the gist. In effect, the language model learned to solve one more, additional task: "how do I change my generated answer so that it gets the highest score from a human?" (the details of this feedback-based training are described in this material).
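
To make the idea of "feedback" a bit more tangible, here is a toy illustration of what such human-preference data might look like as records. This is only the bookkeeping; the real InstructGPT pipeline then trains a separate reward model on rankings like these and fine-tunes GPT-3 with reinforcement learning to maximize that reward.

```python
# One record: a prompt, several candidate answers from the model, and a human ranking.
feedback_dataset = [
    {
        "prompt": "Explain the moon landing to a six-year-old.",
        "answers": [
            "People built a huge rocket, flew to the Moon, and walked on it.",
            "The moon landing was staged in a studio.",
        ],
        # The human labeler says answer 0 is better than answer 1.
        "ranking": [0, 1],
    },
]

for record in feedback_dataset:
    best = record["answers"][record["ranking"][0]]
    print("Preferred answer:", best)
```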

From the point of view of the overall training process, this final stage of additional training on live people accounts for no more than 1% of the work. But it was precisely this final touch that became the secret sauce that made the latest GPT-series models so amazing! It turns out that GPT-3 already had all the necessary knowledge: it understood different languages, remembered historical events, knew the differences between the styles of different authors, and so on. But only with the help of feedback from other people did the model learn to use this knowledge in the way that we (humans) consider "correct". In a sense, GPT-3.5 is a model "brought up by society".

November 2022: ChatGPT, or the little secrets of the big hype

ChatGPT was released in November 2022, about 10 months after its predecessor InstructGPT/GPT-3.5, and instantly made a splash around the world. It seems that for the last few months even grandmothers on the bench by the front door have been discussing just one thing: what this "ChatGPT" of yours has said, and who it is about to put out of work according to the latest forecasts.

From a technical point of view, it does not seem to differ all that much from InstructGPT (unfortunately, the OpenAI team has not yet published a scientific paper with a detailed description of ChatGPT, so we can only guess). We do know about some of the less significant differences: for example, that the model was additionally fine-tuned on a dialogue dataset. After all, some things are specific to an "AI assistant" working in a dialogue format: for instance, if the user's query is unclear, you can (and should!) ask a clarifying question, and so on.

But these are details; what matters here is that the main technical characteristics of the neural network (architecture, number of parameters...) did not change dramatically compared to the previous release. Which raises the question: how come? Why did we hear no hype about GPT-3.5 back in early 2022? Even Sam Altman (OpenAI's CEO) honestly admitted that the researchers themselves were surprised by ChatGPT's rapid success: after all, a comparable model had been quietly sitting on their site for more than ten months, and nobody cared.

Amazingly, it seems that the main secret of the new ChatGPT's success is simply a user-friendly interface! The same InstructGPT could only be accessed through a special API, that is, only by nerds rather than ordinary people. ChatGPT, on the other hand, was put into the familiar "dialog box" interface, just like the messengers everyone knows. And public access was opened to everyone, so people rushed to chat with the neural network, take screenshots and share them on social networks. Choo-choo, all aboard the hype train!

As with any technology startup, it's not just the technology itself that matters, but the wrapper it comes in. You can have the best model or the smartest chatbot, but nobody will be interested unless it comes with a simple and clear interface. In this sense, ChatGPT made a breakthrough by bringing the technology to the masses through an ordinary dialog box, in which a friendly robot "types" its answer right before your eyes, word by word.

Not surprisingly, ChatGPT has set absolute records for the speed of attracting new users: it reached 1 million users in the first five days after its release, and 100 million in just two months.

And where there is a record-breaking influx of hundreds of millions of users, big money immediately follows. Microsoft promptly struck a deal with OpenAI to invest tens of billions of dollars in the company, Google's engineers sounded the alarm and sat down to figure out how to protect their search service from competition with the neural network, and China hurried to announce the imminent release of its own chatbots. But all that, to be honest, is another story, one you can now follow "live" for yourself...

Let's summarize

This article was not exactly short, but we hope you found it interesting and that, after reading it, you have a better understanding of what exactly is going on under the hood of these neural networks.

Translated from here.