Preparing Polish-English translations, categorising content, searching documents and extracting important information from them - Polish companies can gain a lot if they skilfully use Bielik, its creators believe. Bielik is a free, open language model trained on the Polish language.
At the end of August, the SpeakLeash Foundation released the second version of its Polish large language model (LLM) Bielik (the name is the Polish word for white-tailed eagle). The model is the result of one of the largest projects of this type in Poland. It was developed by the foundation's volunteers with the support of the computing power of the largest Polish supercomputers at AGH UST (Cyfronet).
Bielik is a model specialising in the Polish language, trained on huge sets of Polish data (40 million text documents). It has been fine-tuned on instructions in Polish, and its style was polished using reinforcement learning methods, which was a unique approach in our country.
A demo version of the Polish chat bot is available on the foundation's website, and developers can download a free version of the program that works offline.
'Bielik is a relatively small language model with 11 billion parameters. Such models are highly accessible', says Remigiusz Kinas from SpeakLeash. To use the full potential of Bielik, all you need is a computer with a 24 GB graphics card. SpeakLeash has also provided a large number of quantised versions (with reduced computational precision), which makes it even easier to use such models on equipment with low computing power.
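The link between 11 billion parameters, a 24 GB graphics card and the quantised versions can be illustrated with a rough back-of-the-envelope calculation. The sketch below is not from SpeakLeash. It assumes that the full-precision model stores each parameter in 16 bits and that quantised variants use 8 or 4 bits, and it counts only the memory taken by the weights themselves, so real usage will be somewhat higher. The function name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored with the same number of bits.
    Memory used during inference for anything other than weights is ignored.
    """
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 11e9  # 11 billion parameters, as quoted in the article

# 16 bits is an assumed full precision; 8 and 4 bits are assumed quantisation levels.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit weights take about 22 GB, which is consistent with the 24 GB card mentioned above, while a 4-bit version would need roughly a quarter of that.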
BIELIK IS NOT ANOTHER CHATGPT
Bielik is not intended to compete directly with ChatGPT, because it belongs to a different class of AI models. Remigiusz Kinas explains that using commercial large language models, such as ChatGPT or Gemini, is impossible in many business applications. Confidential data, such as customer information or internal company documents, may not be sent to the servers of large companies for reasons of personal data protection.
Large models usually do not have open source code, and work on them is not transparent, so you never really know whether their authors use data obtained from users to further train their AI.
Many open AI systems that can operate offline are being created worldwide. They process data using only the computing power of a given computer, which means that you can use the capabilities of AI without giving it access to the Internet.
Bielik is one of those models. The texts it creates may not be as perfect as those produced by models from commercial companies, where billions of dollars are spent on training, and its knowledge base is not updated on an ongoing basis. But it is compact, fast and agile, and it makes efficient use of the Polish language and the Polish cultural context. Thanks to this, its creators hope, it will help Polish users work with documents in Polish.
WHAT CAN THE POLISH BIELIK DO?
According to Kinas, Bielik can help search and organise emails and documents, analyse content, categorise files, draft semi-automatic correspondence with clients, and proofread texts, all while maintaining data confidentiality.
'Bielik is also surprisingly good at translating texts from Polish to English and from English to Polish', Kinas says. He believes that the Polish AI works better than most online translators for our language.
Another issue is 'censorship'. For example, ChatGPT will avoid performing some tasks that it considers inappropriate or controversial. To some questions, the OpenAI chat bot responds that it cannot help with illegal activities. Bielik, by contrast, will answer the same question, even a controversial one, in a fairly detailed way after briefly noting that the activity is illegal (whether its answer is correct is a separate issue).
Kinas explains that Bielik has been deliberately designed not to protect users from accessing controversial content. 'We write in the model use regulations that it is an uncensored model (after all, this is an accepted standard in the world of open models). In Bielik, there is no content censorship, because it would weaken the model', he says.
He adds that because the content is not filtered, the AI is 'smarter', which allows it to be used in a broader range of business contexts.
'Our model has not been trained on secret documents. It responds based on knowledge that is available online, so anyone who wants to find such information will find it anyway. We assume that business will use language models responsibly', he says.
In future versions of Bielik, the foundation plans to add a safeguard, a plug-in that will remove inappropriate content. However, it will be up to the user whether to use it or not.
Bielik is not the only open Polish language model. The PLLuM consortium, which consists of six leading Polish research institutions in the field of artificial intelligence and linguistics, is also working on its own model. However, Kinas does not want other Polish models to be treated as Bielik's competition. 'My dream is for SpeakLeash and PLLuM to join forces to create the best modern Polish language model', he says.
FLY BIELIK, FLY
'We made Bielik for free, in the evenings and on weekends. We had the computing power of 450 Cyfronet graphics cards at our disposal. Meanwhile, OpenAI built its ChatGPT with a gigantic budget, dozens of world-class geniuses employed full-time, with the computing power of hundreds of thousands of graphics cards. So we are flattered when someone compares Bielik to ChatGPT. Because, yes, Bielik requires improvements, but we can already see how much we can do ourselves. As Poles, we should be proud of this project', Kinas says. (PAP)
PAP - Science in Poland, Ludwika Tomala