PLLuM, a Polish large language model trained mainly on Polish-language content, and an intelligent assistant based on it will be developed by six Polish research units. 'We cannot afford to be left behind,' say the project's representatives.
The PLLuM (Polish Large Language Universal Model) consortium was established in November (on the eve of ChatGPT's first birthday). Its members are six of Poland's leading research units in the field of artificial intelligence and linguistics: the Wrocław University of Science and Technology (consortium leader), NASK National Research Institute, National Information Processing Institute - National Research Institute, the Institute of Computer Science of the Polish Academy of Sciences, the University of Łódź and the Institute of Slavic Studies of the Polish Academy of Sciences. Representatives of NASK announced the establishment of the consortium in a press release sent to PAP.
For about a year, generative large language models (LLMs) have been amazing us with previously unimaginable possibilities. However, the best-known models, such as ChatGPT or Google Bard, have their limitations: they are paid, they are closed (their algorithms cannot be inspected or modified) and they have been trained on too little Polish-language content (so answers in Polish contain more errors than answers in English).
Hence the idea to create an open, free model, trained mostly on Polish-language content, and to develop an intelligent assistant that will use this model. 'The entire project is to be carried out in accordance with good practices of ethical and responsible artificial intelligence, including the principles of representativeness, transparency and fairness of data', the project representatives say. The National Centre of Data Excellence based at NASK will play an important role in the project.
'Developed by leading research units in cooperation with public administration, in accordance with the principles of responsible development of AI systems, a transparent and fully accessible open model will be an innovation on a global scale in the sense of a project combining access to data, competences, technical resources and know-how of scientific units and government for the common purpose of supporting science and economy, including the competitiveness of Polish enterprises,' says Wojciech Pawlak, director of the NASK National Research Institute.
In addition to paid language models, there are already large language models under open licenses, but still none trained on representative Polish-language datasets. A small share of Polish texts in training, or mere fine-tuning for Polish, makes these models unsuitable for many commercial applications in Polish. Therefore, as we read in the release, PLLuM aims to support Polish entrepreneurs in the technological race by providing access to a model with enhanced Polish-language capabilities, under a free, open-source license, that will meet market requirements.
'Large language models have become universal, basic engines for natural language processing, but building or training them is beyond the capabilities of Polish businesses. That is why the creation of an open Polish large language model combined with the computing infrastructure for AI already available in Poland (e.g. at Wrocław University of Science and Technology) is so important, because it can support the development of science as well as small and medium-sized enterprises, which are the driving force of the Polish economy in the field of IT and AI,' says Professor Maciej Piasecki, project manager at the Wrocław University of Science and Technology (the consortium leader).
Dr. Jarosław Protasiewicz, director of the National Information Processing Institute, adds: 'The dynamic development of the IT industry and the scientific community in Poland is in the interest of us all. It is important to develop new IT tools and make them available to everyone for free. At the institute, we have developed, for example, the Polish RoBERTa large model, which, according to KLEJ Benchmark, is the best representative model for the Polish language. I am glad that our knowledge and experience will now be used to develop the PLLuM model. We need models trained on Polish-language texts, also for the analysis of the Polish Internet.'
Having an open model also means access to a research facility, the possibility of developing and testing methods for explaining this model, and looking inside the black box.
'PLLuM will stimulate the development of science in Poland not only in the area of AI development, but also in the field of explainable AI (XAI). And that is particularly worth betting on, because the topic of critical analysis is as important as the development of AI capabilities itself, and besides, Poland has a chance to take a leading position in the world,' says Dr. Inez Okulska, head of the Department of Linguistic Engineering and Text Analysis at NASK.
According to the consortium representatives, a significantly greater share of texts originally written in Polish and containing information about Poland (Polish science, art, history, law, economy and other topics) will increase the visibility of the Polish language and culture, which are noticeably marginalized in currently available models.
Its creators hope that PLLuM will serve not only scientists and businesses, but above all, Polish society, the recipient of innovative solutions based on this model. One of them is a Polish-language intelligent assistant, which will aim to increase the availability of public services, both digitally and during an in-person visit to an office or service point. By offering the possibility of formulating queries in natural language (as in the case of a conversation with an official), it also meets the needs of digitally excluded people. And this is only the beginning of the possibilities offered by this huge joint venture of Polish researchers, business and public administration, the creators announce. (PAP)
PAP - Science in Poland
lt/ bar/ kap/
tr. RL