Preparing training datasets for large language models
Fine-tuning large language models on custom datasets makes the models much smarter. The Alpaca dataset is a success story here: it works wonders with models that at first did poorly at whatever they could do at all. We set out to understand how it's done, and more importantly, what the challenges are along the way and whether newbies like us can figure it out.

The backstory: making education better with language models

The story unfolded in a university course on intellectual property rights. During the semester the students took several tests. The first of them, on the topic "Copyright", consisted of detailed answers to questions, for example, "I scanned a textbook and shared a copy on the local network with fellow students, are my actions legal?" and so on.

At the same time, several large language models (LLMs) took the same test. The students collectively fed them the questions and graded the answers as if they had been given by an ordinary person taking an introductory course on copyright law.

The models did not answer the questions identically, and some questions received directly opposite answers. Going over their answers produced several ideas.

The first idea is a kind of challenge: can a weak LLM be boosted so that it answers no worse, and maybe even better, than the strong ones? We decided to try to reproduce the success of Alpaca and the other datasets on which LLMs are currently being tuned, taking a relatively small domain we had already dealt with: intellectual property rights.

The second idea was to see whether the collected LLM responses and user assessments could be refined into a full-fledged industry benchmark and used to measure the quality of models.

Step 1: Study examples of datasets for fine-tuning, extract and process data

To understand where to start in creating a dataset for fine-tuning, we began with the most successful examples: datasets with instructions. On these, an LLM learns to perform the actions a user needs and then transfers that experience to tasks that are not in the instructional dataset. So we started studying examples on Hugging Face.

Instructional datasets are structured in a similar way: there is an instruction, that is, a requirement to perform a certain action, and there is an example of how to fulfill it.
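As an illustration, here is a minimal sketch of inspecting such a dataset with the Hugging Face datasets library, using the original tatsu-lab/alpaca dataset:

```python
from datasets import load_dataset

# Load the original Alpaca instruction dataset from the Hugging Face Hub.
dataset = load_dataset("tatsu-lab/alpaca", split="train")

# Each record holds an instruction, an optional input, and an example output.
example = dataset[0]
print(example["instruction"])  # the action the model is asked to perform
print(example["output"])       # an example of how to fulfill the instruction
```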

We found examples of non-English datasets, for example in Vietnamese and Spanish; the most common ones we came across were translations of Alpaca from English into other languages. We also found datasets for specific knowledge domains, such as medicine and math. In short, the datasets covered the ground we needed, which cheered us up. Recall that in our case we wanted a dataset for one of the sections of civil law.

For data mining, we decided to take the examples closest to practice, where something similar to instructions and examples of their fulfillment can be found. First, we considered the question-and-answer structure found on legal forums, where people come with a problem and lawyers sort it out. The trouble is that this data has to be combed through literally by hand: the lawyers' replies are often unrelated to the subject of the question, since besides answering, lawyers also chat with each other, which of course has nothing to do with the topic.

The second option, which we ended up settling on, was court decisions. In our case this was an ideal dataset, because the vast majority of intellectual property disputes are heard by the same court, and the legislation in this area has not changed much. That is, the data is fairly consistent and well structured. In addition, the texts of court decisions are not subject to copyright, so with a sufficient level of anonymization they can be used freely.

In total, we found the texts of 4163 court decisions of the court we needed.

To avoid violating personal data legislation, we anonymized the texts of the decisions. To be safe, we performed strict anonymization, removing all references to people, organizations, and their addresses, although for the minimum necessary compliance with the personal data law we could have limited ourselves to people's names, since the names of organizations and their addresses are, strictly speaking, not personal data.
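A simplified sketch of such a replacement step; our actual pipeline is not shown here, and the patterns below are illustrative placeholders, not an exhaustive rule set:

```python
import re

# Illustrative patterns for strict anonymization of Russian court decisions:
# company names, surnames with initials, and postal addresses.
PATTERNS = {
    r"(?:ООО|АО|ПАО)\s+«[^»]+»": "[ORGANIZATION]",
    r"[А-ЯЁ][а-яё]+\s[А-ЯЁ]\.\s?[А-ЯЁ]\.": "[PERSON]",
    r"\d{6},\s*г\.\s*[А-ЯЁ][а-яё]+[^\n]*": "[ADDRESS]",
}

def anonymize(text: str) -> str:
    # Replace every match of every pattern with its placeholder.
    for pattern, placeholder in PATTERNS.items():
        text = re.sub(pattern, placeholder, text)
    return text
```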

We decided to divide each decision into a claim part and a resolution part. The idea is that the claim part is our instruction, since in it the plaintiff asks the court to sort out the situation, and the operative (resolution) part is an example of how to fulfill the instruction. Since the LLM needs to understand which part of the dataset contains the instruction, we marked up our data with the [INST] instruction [/INST] example syntax used in Alpaca-style instruction datasets.
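A minimal sketch of assembling one training record in this markup (the function and variable names are ours):

```python
def make_record(claim_part: str, resolution_part: str) -> str:
    # The claim part plays the role of the instruction; the operative
    # (resolution) part is the example of how to fulfill it.
    return f"[INST] {claim_part.strip()} [/INST] {resolution_part.strip()}"
```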

The problem that arose at this stage was how to fit the decision text into the model's context window, which is limited to a certain number of tokens.
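To gauge the scale of the problem, one can count tokens with the tiktoken library, which implements the tokenizer of the gpt-3.5-turbo models; the file name below is a placeholder:

```python
import tiktoken

# gpt-3.5-turbo-16k has a 16,384-token window shared by prompt and completion.
decision_text = open("decision.txt", encoding="utf-8").read()

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
n_tokens = len(encoding.encode(decision_text))
print(f"{n_tokens} tokens, fits: {n_tokens < 16_384}")
```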

We tried two options. The first, obvious one is to summarize the text. The second, less obvious one is to cut the text into chunks.

We tried summarization in several ways. The first attempt, asking the LLM to pull the plaintiffs' claims and the court's decisions out of the text, failed at first. The thing is that we needed the specifics of each case, for example, "you can put a textbook online if it is under a free license". At first, because of poorly worded prompts, instead of the specifics of the case the model gave us a brief retelling of the procedural part of the decision: what documents the plaintiff submitted, when the court accepted them, and so on. Here is an example of a bad request and the response of the gpt-3.5-turbo-16k model based on the text of this court decision:

request

Please formulate a response in which you find the plaintiff's claim in the text and paraphrase it in detail in plain and simple language, keeping the key details and overall meaning. Mark the resulting conclusion using the following syntax:

answer

response

the text of the plaintiff's demand states that the persons involved in the case are requested to submit written explanations in tabular form justifying their interest in early termination of the disputed trademarks. The plaintiff also requests that the documented information on the use of the disputed trademarks be set forth in tabular form, reflecting the details of evidence of the use of the trademarks in respect of each heading of goods specified in the claim. Statements, motions and other documents may be submitted through the My Arbitrator system or to the specified correspondence address.

That is, we extracted no knowledge about the subject matter of the litigation.

Through empirical prompt selection, a mixed prompt variant worked more or less well: creating a "personality" for the LLM, extracting the part we need, and summarizing it. The best result was given by an English-language query. Another contributing factor may be that this time we fed in text with a softer variant of anonymization, i.e., we removed only people's names.
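An approximation of that mixed prompt, using the OpenAI Python client; our real prompt differed in details, and the wording here is illustrative:

```python
from openai import OpenAI

client = OpenAI()

# "Personality" plus extraction plus summarization in one prompt.
SYSTEM = "You are an experienced intellectual property lawyer."
USER_TEMPLATE = (
    "Extract the plaintiff's substantive claim and the court's ruling "
    "from the decision below, then summarize both in plain language, "
    "keeping the specifics of the case.\n\n{decision}"
)

def summarize(decision_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-16k",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": USER_TEMPLATE.format(decision=decision_text)},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```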

Another solution we used was summarizing the entire text. The output, of course, is a text without an instruction, but in each case we got a more or less good description of the whole document.

The third option is to cut the claim and resolution parts into chunks and combine every claim chunk with every resolution chunk. We made several variants of such datasets with chunk lengths of up to 512, 1024, and 2048 characters; you can find them here.
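A sketch of how such a combination can be built (the names are ours):

```python
def chunk(text: str, size: int) -> list[str]:
    # Cut the text into fixed-size character chunks (512, 1024 or 2048).
    return [text[i:i + size] for i in range(0, len(text), size)]

def combine(claim_part: str, resolution_part: str, size: int) -> list[str]:
    # Pair every claim chunk with every resolution chunk.
    return [
        f"[INST] {c} [/INST] {r}"
        for c in chunk(claim_part, size)
        for r in chunk(resolution_part, size)
    ]
```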

Step 2: Make a verification dataset and automate the evaluation of responses

To create a verification dataset, we studied and compared different examples of industry datasets. An article about creating an instructional dataset based on patient medical records was very helpful. Volunteer physicians participated in its creation: they wrote the most relevant instructions for their field and gave examples of how best to follow them. The instructions were then fed to various LLMs, and their responses were compared to a reference example.

This is very similar to what we did in the beginning, except that we still had to formulate the reference examples ourselves.

The first idea was to take questions and answers from the internet. But we needed a guarantee that these examples were not in the training sample, so we decided to make up our own problems and write reasonable answers to them ourselves, with references to laws.

At first we started building the verification dataset according to the scheme "question - model answer - answer evaluation - checker's comments", but on reflection we replaced the comments with correct answers. Here is an example of what we got.
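For illustration, one record in this scheme might look as follows; the field names are ours and the wording is hypothetical:

```python
record = {
    "question": "I scanned a textbook and shared a copy on the university "
                "network with fellow students. Is this legal?",
    "model_answer": "...",  # filled in when a model is evaluated
    "is_correct": False,    # True / False verdict from the checker
    "score": 0,             # 0-4 for answer quality and topic coverage
    "reference_answer": "No. Scanning and distributing a copyrighted "
                        "textbook requires the rightsholder's permission.",
}
```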

Another important point that came in handy was the use of automated methods for evaluating model responses, mentioned in the article on the medical dataset. The idea is to have the response of one language model checked by another language model. In our case it is an OpenChat model deployed on a server, which evaluates whether the model answered the question correctly (True / False) and then gives a score from 0 to 4 for the quality of the answer and how well it covers the topic. The result of what we got is on GitHub.
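A sketch of such a judge, assuming OpenChat is served behind an OpenAI-compatible endpoint (for example via vLLM); the URL, model name, and prompt wording are assumptions:

```python
from openai import OpenAI

# OpenChat behind a local OpenAI-compatible endpoint.
judge = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n\n"
    "Line 1: is the model answer correct, True or False?\n"
    "Line 2: a score from 0 to 4 for answer quality and topic coverage."
)

def evaluate(question: str, reference: str, answer: str) -> tuple[bool, int]:
    reply = judge.chat.completions.create(
        model="openchat",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    ).choices[0].message.content
    lines = reply.strip().splitlines()
    return lines[0].strip() == "True", int(lines[1].strip())
```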

Step 3: Conduct the experiment and analyze the results

Finally, we got to the experiment. Since the time allotted for the course project was running out, we only had time for one experiment with the Llama2 model, following this scheme.

Using an A100 GPU, we fine-tuned the model on the dataset with decision summaries of up to 1024 tokens. The fine-tuned model is posted here. We then ran the model before and after fine-tuning on two hand-built benchmarks. For evaluation, we used the OpenChat model, which compared the model's response to the reference. Since our verification dataset was not ready by that time, the "gold standard" had to be taken from Russian-language questions and answers on the websites of the World Intellectual Property Organization and the Russian Copyright Society. We chose these sources because both organizations have expertise in the field of intellectual property rights.
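A condensed sketch of the fine-tuning step, following the common LoRA + SFTTrainer recipe (the tutorial code we used is similar); the dataset file, output directory, and hyperparameters are placeholders, and exact argument names vary across trl versions:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

# Records with the [INST] ... [/INST] markup in a "text" field.
dataset = load_dataset("json", data_files="summaries_1024.json", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=1024,
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05),
    args=TrainingArguments(
        output_dir="llama2-ip-law",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
    ),
)
trainer.train()
```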

The first benchmark dataset had questions and a closed list of answers, one or more of which were correct; ChatGPT passed the bar exam on roughly this kind of dataset. Here we obtained an increase in correct answers for the fine-tuned model over the default one: from 0.438 to 0.548 on correct/incorrect answers across 73 question-answer pairs, and from 2.82 to 2.85 on topic coverage across 31 question-answer pairs. Example:

Question

"Subjects of related rights are:

  1. directors and screenwriters;

  2. performers, phonogram producers, broadcasting or cable broadcasting organizations;

  3. only performing artists;

  4. heirs of copyright holders."

correct answer

performers, producers of phonograms, broadcasting or cable broadcasting organizations

fine-tuned model answer

Answer: 2) performers, producers of phonograms, broadcasting or cable broadcasting organizations. Subjects of related rights are: 1. Directors and scriptwriters who are entitled to related rights on the basis of their authorship in the field of literature and art. 2. Performers, phonogram producers, broadcasting or cable broadcasting organizations, which are entitled to related rights on the basis of their authorship in the field of literature and art. 3. Performing artists who have the right to

out-of-the-box model answer

Correct! The subjetcs of neighboring rights are: 1. Directors and screenwriters (Article 10(1)(a) of the Berne Convention) 2. Performers and producers of phonograms (Article 10(1)(b) of the Berne Convention) 3. Only performers (Article 10(1)(c) of the Berne Convention) 4. Successors in title of the owners of neighboring rights (Article 10(1)(d) of the Berne Convention) Neighboring rights are a set of rights that are related to the protection of intellectual property, but are distinct from the exclusive rights granted to creators under copyright law.

It is noticeable that the model has learned to speak Russian and give generally correct reasoning.

The second test dataset contained open-ended questions and answers. Here the fine-tuned model showed a decrease in the quality of answers compared to the default one: from 0.258 to 0.194 on correct/incorrect answers and from 2.58 to 2.52 on topic coverage.

We tried to understand why the long-form responses of the fine-tuned model degraded compared to those of the out-of-the-box model.

One hypothesis is that Llama2 was most likely not deliberately trained on a Russian-language corpus at the pretraining stage and picked the language up incidentally. That is, we tried to load the model with new knowledge that most likely was not contained in its pretraining, so fine-tuning gave neither a gain of new knowledge nor a refresh of old knowledge. This is partially confirmed by an article whose authors tried to create a similar dataset and, for that purpose, pretrained the Llama 2 model on a Chinese corpus of laws and court decisions. The way out: fine-tune a Russian-language model instead, or better yet, follow the example of the Chinese researchers and run our own pretraining on the set of texts we need.

Another hypothesis: the fine-tuned model coped well with formalized questions and poorly with non-formalized ones. The same ChatGPT took the bar exam on formalized Q&A. Perhaps we demanded more from the model than it was capable of and got a worse score on the non-formalized examples. In our case, the out-of-the-box model also showed degradation on non-formalized questions compared to formalized ones.

What's next

Due to circumstances, namely lack of time and resources, we only managed to try a small part of what we wanted to do, so we plan to develop the project further. For this, of course, we need to dive as deep as possible into the topic of model fine-tuning. For example, for the experiment we used the code from a DataCamp tutorial without any modifications, so we will continue by understanding the fine-tuning settings and experimenting with their different variants.

And we still have a bunch of other DataCamp options left to experiment with as well. So the main task is to get motivated, stock up on 8x A100 GPUs :) and move on.

Translated from here.