A new version of Bielik has been launched. It is a large open-source language model that has learned to generate text responses from a huge database of Polish texts. Bielik was trained using the computing resources of the two fastest supercomputers in Poland - Helios and Athena at AGH UST in Kraków.
The model is designed to handle the Polish language more effectively than foreign language models and to navigate Polish realities better. In addition, it is open source, which means that it can be run on the user's own infrastructure and can therefore also be used to process non-public databases.
The latest version of the language model, Bielik-11B-v2, was launched at the end of August. Bielik has been developed through the joint efforts of the SpeakLeash Foundation and the AGH University Academic Computer Centre Cyfronet. It is a Polish large language model (LLM) with 11 billion parameters.
'The most challenging task was to obtain data in Polish. We must operate only on source data and we must know where it comes from', says Sebastian Kondracki, Bielik's originator, quoted in the AGH UST press release.
The members of SpeakLeash decided to create the largest Polish text database. And they achieved their goal: the resources of SpeakLeash are currently the largest, best described and documented collection of data in Polish.
Supercomputers from the AGH University Academic Computer Centre Cyfronet were used to train Bielik. According to the university, the Cyfronet team supported the optimisation and scaling of training processes, the work on data processing pipelines, and the development and operation of synthetic data generation methods, as well as model testing methods.
One result of this work is a Polish ranking of models (the Polish OpenLLM Leaderboard).
'Our role is to provide support with our expertise, experience, and above all with computational power in data cataloguing, collecting, and processing, as well as in teaching language models', says Marek Magryś, Deputy Director of AGH University Cyfronet for High Performance Computers. 'Thanks to the joint efforts of SpeakLeash and the AGH University, we have managed to create Bielik, an LLM which handles our language and cultural context very well and which may be a key element of text data processing pipelines for our language in scientific and business uses. High positions on ranking lists for Polish only confirm Bielik's quality'.
Magryś admits that even the largest Polish supercomputers cannot match the capabilities of the world's LLM leaders. 'However, our systems can complete in a few hours or days computations that could take years or even hundreds of years on regular computers', he says.
The computational power of Helios and Athena in traditional computer simulations amounts to over 44 PFLOPS (petaflops is a million billion flops), and for lower precision AI calculations it is even 2 EFLOPS (exaflops is a billion billion flops – ed. PAP).
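The scale of these figures can be illustrated with a short calculation. The sketch below uses only the unit definitions and values quoted above; the speed of a 'regular computer' (1 TFLOPS) is a hypothetical assumption made for illustration and does not come from the article or from Cyfronet.

```python
# Unit sizes as defined in the text.
PFLOPS = 10**15  # petaflops: a million billion floating-point operations per second
EFLOPS = 10**18  # exaflops: a billion billion floating-point operations per second

# Figures quoted for Helios and Athena combined.
traditional = 44 * PFLOPS      # "over 44 PFLOPS" in traditional simulations
low_precision_ai = 2 * EFLOPS  # "2 EFLOPS" in lower-precision AI calculations

# Hypothetical assumption (not from the article): a regular computer
# that sustains 1 TFLOPS, i.e. 10**12 operations per second.
ASSUMED_REGULAR_COMPUTER = 10**12

SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 3600 * 24 * 365


def regular_computer_years(hours, supercomputer_flops,
                           regular_flops=ASSUMED_REGULAR_COMPUTER):
    """Years the assumed regular computer would need for the work
    that the supercomputers complete in the given number of hours."""
    total_operations = hours * SECONDS_PER_HOUR * supercomputer_flops
    seconds_needed = total_operations / regular_flops
    return seconds_needed / SECONDS_PER_YEAR


# How much higher the quoted AI figure is than the traditional one.
ai_to_traditional_ratio = low_precision_ai / traditional

print(f"AI vs traditional throughput: about {ai_to_traditional_ratio:.0f}x")
print(f"1 hour at 44 PFLOPS: about {regular_computer_years(1, traditional):.0f} years")
print(f"24 hours at 44 PFLOPS: about {regular_computer_years(24, traditional):.0f} years")
```

Under this assumption, one hour of work at 44 PFLOPS corresponds to roughly five years on the regular computer, and one day to more than a century, which is in line with the comparison made by Magryś. A different assumed speed for the regular computer would change the result proportionally.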
BIELIK AND CHATGPT
'The collection of data powering Bielik is continuously growing, yet it will be difficult for us to compete with the resources used by other models which operate in English. Besides, the amount of Polish content online is substantially smaller than the amount of English content', the creators say.
The most popular product that uses a large language model is ChatGPT, developed by OpenAI. Even so, there are good reasons to develop language models for other languages.
Magryś says: 'While ChatGPT is able to speak Polish, it is saturated with content in English. So its understanding of Polish culture and the nuances of Polish literature is limited. It also does not really cope with understanding the logic of more complex texts, such as legal or medical ones. If we want to use it in specialist fields and have a language model that thinks well in Polish and responds using correct Polish, we cannot rely only on foreign language models'.
The publicly available version that users may test is free of charge and is still being improved. In addition to the full versions of the models, the authors have made available a range of versions in the most popular formats, so that the model can be run on the user's own computer.
'It is worth knowing that Bielik will perform well in terms of providing summaries or short descriptions. At this moment, our language model is useful in science and business. It may be used, for example, to improve communication with users when handling helpdesk requests', says Szymon Mazurek from Cyfronet.
WHO NEEDS A POLISH LLM?
The creators of Bielik explain that the AI services available online, such as the most popular one, ChatGPT, run on external servers. If a given company or industry develops a solution operating on specialist data, for example medical data, or on confidential texts which for various reasons may not leave the walls of a company, the only possibility is to launch such a model on site. The model may not be as capable as ChatGPT, but it does not have to be as broad.
An additional benefit of launching language models such as Bielik is that it strengthens Poland's position in terms of innovation in the AI sector. In addition, as Bielik's creators emphasise, building tools of our own makes it possible to become independent of external companies, which, in the event of market turbulence, regulations, or legal restrictions, may hinder access to their resources. By developing and improving tools in Poland, we are building a stable base and are able to secure many of our sectors - banking, administrative, medical, and legal.
'Intensified development of AI, language models, and other AI-based tools is in the best interest of all well-functioning economies. We observe increased efforts in the development of similar solutions in many countries', adds Jan Maria Kowalski from SpeakLeash.
Information on the current amount of data is available on THIS page.
PAP - Science in Poland, Ludwika Tomala
lt/ zan/ kap/
tr. RL