AI-based software prepared by Polish scientists will help moderators to clean the Internet of illegal content. The software is designed to detect child pornography videos, images and texts.
The program developed as part of the APAKT project will limit the moderators' contact with such materials.
The task of the Dyżurnet.pl team at the NASK - National Research Institute is to block illegal online content, in particular related to the sexual abuse of children. Any such content found online should be reported there.
'Dyżurnet.pl moderators spend many hours every day reviewing illegal content - either reported by users or flagged by scrapers, algorithms that search the web for materials with specific parameters,' says Dr. Inez Okulska from NASK's Artificial Intelligence Division.
She explains that a lot of such content gets reported, and someone always has to review it to assess whether the material really is illegal, whether it should be blocked, and whether the person who shared it should be prosecuted.
Martyna Różycka, head of Dyżurnet.pl, explains that the most urgent task is to identify materials that depict the sexual abuse of children and have not been reported before. If such materials were created recently, this could mean that a child is still being harmed. It is then necessary to find the perpetrator as soon as possible and protect the potential victim. 'Most of the content that needs to be blocked, however, is material that was created years ago but is still copied and shared in other places,' says Różycka.
To improve the work of moderators and protect them from contact with mentally burdensome content, NASK, in cooperation with the Warsaw University of Technology, decided to use artificial intelligence. The algorithm developed by researchers as part of the APAKT project is designed to automatically analyse illegal content sent to moderators and propose the order in which the reports should be reviewed, starting with those requiring immediate intervention.
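The ordering idea can be illustrated with a minimal sketch. It is not the APAKT implementation, which the article does not describe; the `Report` fields and the `urgency_score` value are hypothetical names chosen for illustration. The only rule taken from the text is that previously unreported material comes first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    # All fields are illustrative assumptions, not APAKT's data model.
    report_id: str
    previously_unseen: bool   # True if no match was found among known material
    urgency_score: float      # hypothetical model score in [0, 1]


def review_order(reports):
    """Return reports sorted for review: unseen material first,
    then by descending urgency score within each group."""
    return sorted(
        reports,
        key=lambda r: (not r.previously_unseen, -r.urgency_score),
    )


queue = review_order([
    Report("a", previously_unseen=False, urgency_score=0.9),
    Report("b", previously_unseen=True, urgency_score=0.4),
    Report("c", previously_unseen=True, urgency_score=0.8),
])
print([r.report_id for r in queue])
```

In this sketch a report about known material is always reviewed after every report about new material, regardless of its score. A real system might weigh the two signals differently.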
For example, the program will be able to indicate with 90% certainty that a given file resembles previously known material. Thanks to this, the moderator will be able to quickly assess and confirm whether they agree with the program's evaluation. This will not only save time, but also protect the mental condition of the moderators, who will not have to check for themselves whether something similar has already appeared in the database.
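One generic way to compare a file with known material is to compare compact fingerprints rather than the files themselves. The sketch below assumes 64-bit integer fingerprints and a similarity defined through Hamming distance. The article does not say how APAKT computes its 90% figure, so the fingerprint format, the `similarity` and `best_match` functions and the 0.9 threshold are all assumptions made for illustration.

```python
def similarity(fp_a: int, fp_b: int, bits: int = 64) -> float:
    """Fraction of matching bits between two fingerprints (1.0 = identical)."""
    distance = bin(fp_a ^ fp_b).count("1")  # Hamming distance
    return 1.0 - distance / bits


def best_match(candidate: int, known: dict, threshold: float = 0.9):
    """Return (item_id, similarity) for the closest known fingerprint,
    or (None, similarity) if nothing reaches the threshold."""
    best_id, best_sim = None, 0.0
    for item_id, fingerprint in known.items():
        sim = similarity(candidate, fingerprint)
        if sim > best_sim:
            best_id, best_sim = item_id, sim
    if best_sim >= threshold:
        return best_id, best_sim
    return None, best_sim


# Hypothetical fingerprints; real ones would come from a hashing step
# that is outside the scope of this sketch.
known_items = {"k1": 0xFFFF0000FFFF0000, "k2": 0x0}
print(best_match(0xFFFF0000FFFF0001, known_items))
```

The moderator would then only need to confirm or reject the suggested match instead of searching the database by hand.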
Unlike many AI systems, this model is not just a 'black box' that spits out answers without any way to trace how it reached its decision. APAKT will be able to explain why it classified the given material as paedophilic. The system is capable of classifying not only video materials and photos, but also narrative texts describing the sexual abuse of children.
Okulska says: 'Moderators have no doubts that such texts are socially harmful. Such content promotes paedophilia, behaviour that is absolutely unacceptable and should never be normalized.
'Not only are such texts badly written and graphomaniac, but the story is usually built gradually. Before the moderator can figure out whether a given text is innocent or promotes paedophilia, he or she has to read a lot and get to know the characters. What has been read stays in the memory. And these are unpleasant, heavy topics. Moderators are under psychological care, but it was natural to want to introduce models that would make the moderators' work easier.'
The researcher explains that the APAKT software will be able to flag to the moderator specific fragments of the text that prove that the material actually describes sexual scenes involving minors. The program is supposed to identify harmful elements by itself.
According to Dr. Okulska, the work on the APAKT program was complicated because the algorithm had to be trained on actual paedophile materials, the storage of which is illegal or controversial. And the only team in Poland that can directly analyse child pornography content is Dyżurnet.pl (in accordance with the Act on the National Cybersecurity System).
'Scientists who create models could not and did not want to have access to the Dyżurnet.pl data on which the algorithms were trained. As you can guess, creating algorithms that classify certain objects without being able to see these objects is very difficult. It is like working blindfolded,' Okulska says.
She adds that this limitation led, in the part concerning written materials, to an innovative way of representing text for the purposes of AI. According to Dr. Okulska, the expertly configured StyloMetrix vectors allow for high-quality classification, but at the same time do not focus solely on meaning. They are explainable mainly at the grammatical and statistical level, which makes them 'safe' for the researcher in the context of such topics.
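To give a sense of what a representation at the grammatical and statistical level can look like, here is a toy sketch. It is not the StyloMetrix library and does not reproduce its metrics; the `style_vector` function and its four features are assumptions chosen only to show that a text can be described by numbers that do not reveal what it is about.

```python
import re


def style_vector(text: str) -> dict:
    """Describe a text with a few surface statistics instead of its meaning."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "comma_rate": 0.0,
        }
    return {
        # Mean number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Share of distinct words, a rough measure of vocabulary variety.
        "type_token_ratio": len(set(words)) / len(words),
        # Commas per word, a rough proxy for syntactic complexity.
        "comma_rate": text.count(",") / len(words),
    }


print(style_vector("The cat sat. The dog ran, then slept."))
```

A researcher inspecting such a vector sees only sentence lengths and ratios, not the text itself, which is the sense in which features of this kind can be called 'safe' to work with.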
She explains that artificial intelligence does not work at the data collection stage yet, but at the stage of selecting materials initially flagged for evaluation. The research is carried out as part of a grant from the National Centre for Research and Development and the program is expected to be ready for use in a few months. Its effectiveness is currently estimated at about 80 percent.
Researchers also hope that the APAKT software will be of interest to Internet providers and owners of large websites, who - under the new bill - will be responsible for blocking minors' access to pornography. But the program can also be useful to the police and forensic experts.
Foreign institutions dealing with the removal of paedophile content from the Internet may also be interested in the program. APAKT can detect paedophilia in video and photos regardless of language. When it comes to detecting content in texts, the program works only in Polish for now, but the RoBERTa model and StyloMetrix vectors it uses are currently available in both English and Ukrainian. (PAP)
PAP - Science in Poland, Ludwika Tomala
lt/ bar/ kap/
tr. RL