National Information Processing Institute to create neural language models in Polish


02.12.2021 update: 02.12.2021



The National Information Processing Institute develops neural language models that are used, among other things, in spam detection and anti-plagiarism systems. Two new models have been released this year: Polish RoBERTa v2 and GPT-2, which is intended for text generation tasks.

The popularity of neural language models has grown significantly over the past few years, and their size (number of parameters) is increasing rapidly. They are widely used, yet few people are aware of it. Thanks to them, Internet users can translate text into other languages, have spam filtered out, study public sentiment online, use automatic text correction and talk to chatbots.

Work on the development of neural language models continues in many IT centres and companies around the world. The IT industry has long seen their potential, and they are increasingly useful in every Internet user's life. Developing new neural models, however, requires high computational power and specialist infrastructure, so individuals and small organizations are not capable of training them. Large amounts of data are also necessary. As with other tools based on artificial intelligence (AI), the larger the data set used to train the model, the more precise the model will be.

Most of these models, however, are developed for the English language. That is why researchers from the National Information Processing Institute develop and share Polish language models. This year, they added two more: Polish RoBERTa v2 and GPT-2.

According to the National Information Processing Institute: “The models can be used, for example, for research on the detection and classification of hate speech on social media or of fake news. Models in Polish are essential for analysing the Polish Internet; it is not possible to analyse data on Polish phenomena using foreign-language tools.”

The base part of the models' data pool consists of high-quality texts (Wikipedia, documents of the Polish Parliament, social media content, books, articles, longer written forms). The web part of the data set, in turn, consists of filtered and properly cleaned extracts from websites (the CommonCrawl project).

Sławomir Dadas, deputy head of the Laboratory of Intelligent Information Systems at the National Information Processing Institute, said: “The models made available by the National Information Processing Institute are based on transformer networks. This architecture is relatively new, in use since 2017. Transformer networks do not rely on sequential data processing; instead, they process data simultaneously.”
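The difference between sequential and simultaneous processing mentioned in the quote can be illustrated with a minimal sketch of self-attention, the core operation of transformer networks. This is a simplified illustration, not the Institute's code: the function names and the toy embeddings are hypothetical, and the learned projections and multiple attention heads used in real models are omitted. The point it shows is that each output position is computed from all input positions directly, without waiting for the result of the previous position, so the positions could be computed in parallel.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Assumption for illustration: queries, keys and values are the
    token vectors themselves (real transformers learn projections).
    """
    dim = len(tokens[0])
    outputs = []
    # Each iteration reads only the inputs, never an earlier output,
    # so the iterations are independent and could run in parallel.
    for query in tokens:
        weights = softmax(
            [dot(query, key) / math.sqrt(dim) for key in tokens]
        )
        mixed = [
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional embeddings for a three-token sentence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(embeddings)
```

A recurrent network, by contrast, would have to finish processing the first token before it could start on the second, which is what makes transformer training easier to spread across parallel hardware.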

Training one model takes approximately three to four months. All neural language models developed at the National Information Processing Institute have been tested with the Comprehensive List of Language Evaluations (KLEJ benchmark) developed by Allegro. The benchmark evaluates a model on nine tasks, such as sentiment analysis and the analysis of semantic similarity of texts. (PAP)

uka/ zan/ kap/

tr. RL

The PAP Foundation allows free reprinting of articles from the Nauka w Polsce portal provided that we are notified once a month by e-mail about the fact of using the portal and that the source of the article is indicated. On the websites and Internet portals, please provide the following address: Source: www.scienceinpoland.pl, while in journals – the annotation: Source: Nauka w Polsce - www.scienceinpoland.pl. In case of social networking websites, please provide only the title and the lead of our agency dispatch with the link directing to the article text on our web page, as it is on our Facebook profile.
