Technology

A 'Polish hat' for any language model was created as part of PLLuM

Author: Adam Klimowski, source: Wikipedia, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=124145165

PLLuM language models were trained on Polish texts, tuned on Polish instructions and aligned with the preferences of Polish users. As a result, in addition to the ready-made models, a 'Polish hat' has been created that can be put on any language model, says AI expert Jan Kocoń, PhD.

PLLuM (Polish Large Language Universal Model) is a family of large Polish language models. AI programs of this type enable text generation and natural language processing. Thanks to them, communication with a machine does not require rigid codes and mechanical commands; it can resemble a dialogue with a human.

Unlike commercial LLMs (such as ChatGPT, Claude or Gemini), PLLuM models specialize in Polish, although texts in other Slavic and Baltic languages, and of course English, were also used to train them. These training data can be considered an 'overlay' on the model, adapted to Polish culture.

As part of the project, the Ministry of Digital Affairs made 18 PLLuM models available in various versions. 'However, I think that these ready-made models are a less valuable resource than the sets of Polish texts, instructions and dialogues that were collected or created as part of the project. These collections are a kind of 'Polish hat' that we can put on any other open language model', said Jan Kocoń, PhD, from the Wrocław University of Science and Technology, the scientific director of the PLLuM project.

'We wanted to create a good language model from scratch, based only on Polish texts, but it turned out that the collection of Polish texts, the language corpus, was too small for this to succeed. The quality of such a model would have been too low. And the project budget was not sufficient to train a model from scratch on multilingual data', Kocoń explains.

There are already open language models on the market that speak Polish, for example Llama, published by the American company Meta AI, and Mistral, built by the French company Mistral AI. Researchers in the PLLuM project decided to adapt these models to Polish so that they could speak the language better, become better versed in Polish realities, and better respond to the needs of Polish users.

Language models learn to predict the next word in a sequence by training on huge collections of documents: they estimate the probability distribution of the words that could appear in a given context.
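What this means in practice is easy to show with a toy example. The Python sketch below turns invented scores ('logits') for three candidate words into a probability distribution with softmax; a real model does this over a vocabulary of tens of thousands of tokens, not three:

    import math

    # Toy next-word prediction: the model assigns a score (logit) to each
    # candidate word, and softmax turns the scores into probabilities.
    # The words and scores below are made up for illustration.
    context = "Stolica Polski to"  # "The capital of Poland is"
    logits = {"Warszawa": 6.1, "Kraków": 3.2, "Berlin": 0.4}

    total = sum(math.exp(v) for v in logits.values())
    probs = {w: math.exp(v) / total for w, v in logits.items()}

    for word, p in sorted(probs.items(), key=lambda item: -item[1]):
        print(f"P({word} | '{context}') = {p:.3f}")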

In the pre-training phase, the researchers continued the training of these ready-made open models on the Polish text database created in the project. In the version of the model intended for commercial use by private companies, the corpus contained 22 billion words (approx. 28 billion tokens); these had to be texts available under open licences or provided by publishers who consented to their use. The model for non-commercial use, in turn, was trained on a much larger corpus of about 100 billion words (approx. 150 billion tokens).
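A quick back-of-the-envelope check of the figures quoted above shows how finely Polish text is split into tokens; the ratios follow directly from the article's numbers:

    # Tokens-per-word ratios implied by the corpus sizes quoted above.
    corpora = {
        "commercial-use corpus": (22e9, 28e9),     # (words, tokens)
        "non-commercial corpus": (100e9, 150e9),
    }
    for name, (words, tokens) in corpora.items():
        print(f"{name}: {tokens / words:.2f} tokens per word")
    # Roughly 1.27 and 1.50 tokens per word. Morphologically rich Polish
    # tends to split into more tokens than English under typical tokenizers,
    # which is one reason language-specific adaptation helps.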

In this way, the researchers obtained a model that understood Polish and its nuances quite well. However, the model was still not 'tuned' for dialogue: it did not know what was expected of it during a conversation, for example the conventions for answering different types of questions. It also did not understand Polish culture, did not know which answers are polite, and which questions should be dismissed with a dry response.

The next phase was tuning the model on instructions. In a nutshell, these instructions resemble question-and-answer pairs or dialogues: the researchers fed the model a database of sample questions and the model answers they expected. There were 40 thousand such pairs, including 3.5 thousand longer dialogues. The model then not only understood the language, but also learned to use it actively. 'In this way, we infused the Polish soul into the model', Kocoń jokes.
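To make the idea concrete, an instruction pair and a short dialogue could be stored roughly like this; the field names and the Polish content below are illustrative assumptions, not PLLuM's actual data schema:

    # One illustrative instruction pair (hypothetical schema, not PLLuM's own).
    instruction_pair = {
        "prompt": "Wyjaśnij w jednym zdaniu, czym jest fotosynteza.",
        "response": "Fotosynteza to proces, w którym rośliny przetwarzają "
                    "światło, wodę i dwutlenek węgla w energię i tlen.",
    }

    # A longer, multi-turn dialogue in a chat-style format.
    dialogue = [
        {"role": "user", "content": "Jak napisać podanie o urlop?"},
        {"role": "assistant", "content": "Podanie powinno zawierać datę, "
                                         "dane pracownika i uzasadnienie."},
        {"role": "user", "content": "A jeśli to urlop bezpłatny?"},
        {"role": "assistant", "content": "Wtedy warto wyraźnie zaznaczyć, "
                                         "że chodzi o urlop bezpłatny."},
    ]
    # During this supervised tuning, the model is trained to reproduce the
    # assistant turns given everything that precedes them.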

The next stage of work on the model was to train it on the preferences of Polish users. The model generated several different answers to a single question, and Polish experts selected the one that suited them best and pointed out any errors. The experts also manipulated the so-called response temperature (in short, how creatively the model uses the language). At this stage, the model was taught specific aspects of Polish, such as idioms, slang and cultural contexts. It also learned which topics it should not discuss, so as not to help users break the law, for example.
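Temperature, mentioned above, has a simple mechanical meaning: the model's word scores are divided by a value T before sampling, so a low T concentrates probability on the safest word, while a high T flattens the distribution and makes the output more varied. A minimal sketch, with invented candidate words and scores:

    import math
    import random

    def sample_with_temperature(logits, temperature):
        # Divide each score by T; low T sharpens the distribution,
        # high T flattens it ('creativity').
        scaled = [score / temperature for score in logits.values()]
        peak = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scaled]
        return random.choices(list(logits), weights=weights, k=1)[0]

    # Made-up candidate words and scores, for illustration only.
    logits = {"oczywiście": 2.0, "tak": 1.2, "chyba": 0.3}
    print(sample_with_temperature(logits, 0.2))  # almost always 'oczywiście'
    print(sample_with_temperature(logits, 2.0))  # noticeably more varied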

'The model can sometimes still be fooled, but much less often than at the beginning. Initially it answered 60 percent of the "forbidden" questions, and now it falls for only 7 percent of such jailbreaks', the researcher reports.

Since PLLuM was made available, users have started using it, and scientists have gained access to several hundred thousand conversations between Poles and the chatbot. This will help them better understand how Polish users interact with AI, and will allow them to improve the Polish models in the future.

Ludwika Tomala (PAP)

lt/ bar/ mhr/

