
'We have made every effort to ensure that the Polish PLLuM language models are safe, adapted to the Polish language and "trained" on legally obtained data', says Szymon Łukasik, PhD, from NASK, a centre involved in the work on the AI models published in February.
PLLuM (Polish Large Language Universal Model) is a family of large Polish language models. Artificial intelligence (AI) programs of this type enable text generation and natural language processing. Thanks to them, communicating with a machine does not require rigid code or mechanical commands; it can resemble a dialogue with a human.
Unlike commercial LLMs (such as ChatGPT, Claude or Gemini), PLLuM models specialize in Polish, although texts in other Slavic and Baltic languages, as well as English, were also used to train them.
'The model is adapted to the Polish language and culture. The work within this project supports Polish experts and the development of competences in the field of artificial intelligence', project participant Szymon Łukasik, PhD, a professor at AGH University of Krakow and director of the NASK AI Safety Research Center, explains in an interview with PAP. That centre will now coordinate further work and the deployment of the models in public administration as part of the HIVE consortium.
The researcher stresses how important safety and an ethical approach were in building the AI. 'The philosophy guiding the construction of this model was to make data collection as transparent as possible, so that we would be certain in what areas the models built on that data could be used', Łukasik says. He points out that the project team, for example, concluded agreements with editorial offices and obtained formal consent to use archives of Polish texts.
The NASK expert explains that Polish is a low-resource language, meaning there is relatively little data available for building models. PLLuM models intended for commercial use were trained on about 30 billion tokens (a token is a piece of processed text, e.g. a word or a fragment of one), while models for non-commercial use, for which far more resources are available, were trained on about 150 billion tokens.
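For readers curious what a 'token' looks like in practice, the short Python sketch below splits a Polish sentence into tokens using the open-source Hugging Face transformers library. The model identifier is an assumption made purely for illustration, not something confirmed by the project; any open PLLuM checkpoint or other tokenizer can be substituted.

```python
# Minimal sketch: what "tokens" are in practice, using Hugging Face transformers.
# The repository id below is an assumption for illustration; replace it with the
# identifier of any open PLLuM checkpoint you have access to.
from transformers import AutoTokenizer

MODEL_ID = "CYFRAGOVPL/Llama-PLLuM-8B-instruct"  # assumed repo id, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

text = "Wlazł kotek na płotek i mruga."
tokens = tokenizer.tokenize(text)   # pieces of text: whole words or word fragments
ids = tokenizer.encode(text)        # the integer ids the model is actually trained on

print(len(tokens), tokens)          # even a short sentence yields several tokens
print(ids)
```

Counts like '30 billion tokens' refer to the total number of such pieces seen during training, not to whole documents or words.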
Ultimately, PLLuM models are to be used in state administration, which brings further safety challenges. The creators had to ensure that the model sets boundaries in its responses and that no illegal, false or controversial content appears in them.
The largest collection of queries in Poland, comprising 40,000 interactions, including about 3,500 longer dialogues between local trainers and the machine, was used to train the models. Thanks to this painstaking work by so-called annotators, the AI is expected to handle the specifics of the Polish language and Polish culture better.
According to its creators, PLLuM is being developed in accordance with national and EU guidelines on artificial intelligence. It also takes current data protection standards into account.
PLLuM models are available free of charge, in the form of a chat, to all interested parties. The Ministry of Digital Affairs has also published 18 open versions of the PLLuM model for programmers. Users have access both to lighter but less accurate versions that can be downloaded to a laptop, and to more powerful models that require multiple graphics cards for more advanced applications, e.g. research. Both types can be run on the user's own infrastructure, without sending queries to external parties.
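As an illustration of what running an open model on one's own infrastructure can look like, here is a minimal Python sketch using the Hugging Face transformers library. The repository id, precision setting and device mapping are assumptions for illustration; in practice they would be adjusted to the chosen checkpoint and the available hardware.

```python
# Minimal sketch: running an open checkpoint entirely on local hardware, so no
# query leaves the user's own infrastructure. The repo id is assumed for
# illustration; pick any of the published open PLLuM versions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "CYFRAGOVPL/Llama-PLLuM-8B-instruct"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # smaller memory footprint on a single GPU
    device_map="auto",           # spread layers across whatever devices are available
)

prompt = "Wyjaśnij w dwóch zdaniach, czym jest model językowy."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The lighter versions mentioned above can run this way on a single laptop GPU or even a CPU, while the larger ones need several graphics cards.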
The project also created generators, i.e. specialized RAG (Retrieval Augmented Generation) models. Such models can, for example, search and analyse local databases and power virtual assistants that work with a user's own document collections. The PLLuM team has built the smallest generator of this type (8 billion parameters), yet one that leads the rankings for the Polish language.
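The sketch below shows the general RAG pattern the article describes: first retrieve the local documents most relevant to a question, then hand them to the language model as context. It is not the project's own pipeline; the TF-IDF index from scikit-learn stands in for a production vector database, and the example documents are invented.

```python
# Minimal sketch of the Retrieval Augmented Generation (RAG) pattern:
# retrieve relevant local documents, then build a prompt for a generator model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example documents standing in for a user's own document collection.
documents = [
    "Regulamin urlopów: wniosek o urlop należy złożyć 14 dni przed jego rozpoczęciem.",
    "Instrukcja BHP: każdy wypadek przy pracy zgłaszamy niezwłocznie przełożonemu.",
    "Polityka sprzętowa: laptopy służbowe wymieniane są co cztery lata.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(question: str) -> str:
    """Combine the retrieved context with the user's question for the generator."""
    context = "\n".join(retrieve(question))
    return (
        "Odpowiedz na pytanie wyłącznie na podstawie kontekstu.\n\n"
        f"Kontekst:\n{context}\n\nPytanie: {question}"
    )

print(build_prompt("Kiedy trzeba złożyć wniosek o urlop?"))
# The resulting prompt would then be passed to a generator model,
# e.g. the local model.generate(...) call shown in the previous sketch.
```

Because the retrieval step narrows the answer down to the user's own documents, assistants built this way can stay grounded in local data rather than in the model's general training corpus.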
Łukasik also addresses the change of the project name from PLLuM to HIVE. 'Our models are called PLLuM, and their family will be further developed within the HIVE consortium. In this way, we wanted to refer to the idea of cooperation between many researchers, engineers and institutions, operating like bees in one ecosystem, exchanging knowledge and resources (e.g. data, code, models). However, perhaps one day we will want to release a new family of models - with a new name. We are talking about this with the Ministry of Digital Affairs', Łukasik says.
Ludwika Tomala (PAP)
PAP - Science in Poland
lt/ zan/ ktl/