Researchers from Wrocław University of Science and Technology have tested whether ChatGPT understands jokes, detects language errors, sarcasm and aggression, and can recognise spam. They made more than 38,000 queries to the programme.
The study’s co-author Dr. Jan Kocoń said: “For a programme that wasn't trained specifically in this area, ChatGPT did quite well.”
NATURAL LANGUAGE IN HUMAN-COMPUTER COMMUNICATION
“People have already gotten used to using forms, commands and keywords instead of natural language when communicating with computers. The main goal of developing ChatGPT was to take a step towards natural human-computer interaction in the form of a conversation. We believe that in this area, ChatGPT is revolutionary.”
However, he added that ChatGPT quickly began to be used for purposes unforeseen by its creators: solving various problems that often require a deep understanding of language and of the contexts in which it is used. So the question is how well ChatGPT handles these areas.
HOW TO MAKE 38K QUERIES
Scientists from the CLARIN-PL team - involved in research on artificial intelligence and natural language processing - decided to check this systematically. They subjected the new chatbot to rigorous tests, making more than 38,000 queries to the AI.
Dr. Kocoń said: “It was very laborious, because at the time there was no API (interface) for asking so many questions. There was also only a free version with a limit of about 50 queries per hour per user. Twenty people from the team lent their ChatGPT accounts, which made it possible to automatically make about 2,000 queries per day.”
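For illustration only - the sketch below is not the team's code, and the `ask_chatbot` helper is a hypothetical stand-in for whatever interface each lent account used - this is roughly how a batch of prompts could be spread across several accounts without exceeding a per-account hourly limit:

```python
import time

# Throughput constraint described above: about 50 free queries per hour
# per account, with 20 lent accounts giving roughly 2,000 queries per day.
QUERIES_PER_HOUR_PER_ACCOUNT = 50

def ask_chatbot(account, prompt):
    # Hypothetical stand-in: the study predates the public API, so queries
    # went through the web interface of accounts lent by team members.
    return f"[answer from {account} to: {prompt}]"

def run_batch(prompts, accounts):
    """Round-robin prompts over accounts, pausing so no account exceeds its hourly limit."""
    delay = 3600 / QUERIES_PER_HOUR_PER_ACCOUNT  # seconds between queries per account
    answers = []
    for i, prompt in enumerate(prompts):
        answers.append(ask_chatbot(accounts[i % len(accounts)], prompt))
        if (i + 1) % len(accounts) == 0:
            time.sleep(delay)  # every account has been used once in this round
    return answers
```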
This is one of the largest studies of ChatGPT to date. The study has not yet been peer-reviewed; the researchers have made a preprint available.
ChatGPT AND ITS COMPETITION
The researchers wanted to compare ChatGPT to the best available automatic language processing models, including systems for analysing sentiment. These programmes allow marketing companies to analyse the emotions evoked by a given piece of information, service or brand.
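As an illustration of the kind of single-purpose model ChatGPT was compared against, a specialised sentiment classifier can be run in a few lines. The checkpoint below is an arbitrary public example, not one of the models evaluated in the study:

```python
from transformers import pipeline

# Load a small, single-purpose sentiment model (an arbitrary public checkpoint,
# not one used in the study) and classify one example text.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

print(classifier("The new update made the app much slower."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99}]
```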
Kocoń said: “We were getting inquiries from companies whether it would be viable to discontinue using these specialized programs and rely only on ChatGPT.”
The conclusion? For now, ChatGPT performs worse than those programs. The worse the specialised models coped with a given task, the further ChatGPT fell behind them. It made mistakes that most people would notice.
Dr. Kocoń added: “The jack of all trades turned out to be a master of none.”
DATABASES OF HUMAN ASSESSMENTS
The researchers analysed 25 thematic categories linked to large databases of various texts, where each text had already been manually assessed by people. For example, a database of almost 40,000 tweets from Twitter was used, where each tweet had already been rated by a number of people as sarcastic or not. The researchers also used the Wikipedia Detox project database where Wikipedians voted on whether a given comment was aggressive or not. Another database contained tens of thousands of Reddit entries tagged by experts as containing specific emotions.
The researchers asked ChatGPT the same questions that people had already answered. For example, they asked whether the quoted text was a spam message, or whether it contained sarcasm, was humorous, aggressive, or grammatically correct. There were also requests to identify emotions in the text, draw conclusions from information embedded in the text, or solve simple mathematical word problems.
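A minimal sketch of the evaluation idea - not the authors' code - is to compare the chatbot's answer for each text with the label that human annotators had already assigned and report the share of matches:

```python
# Compare model answers with human-assigned labels and report the share of matches.
def accuracy(model_answers, human_labels):
    assert len(model_answers) == len(human_labels)
    return sum(m == h for m, h in zip(model_answers, human_labels)) / len(human_labels)

# Hypothetical mini-example for the sarcasm task.
chatgpt_answers = ["sarcastic", "not sarcastic", "sarcastic"]
human_labels = ["sarcastic", "sarcastic", "sarcastic"]
print(f"Accuracy: {accuracy(chatgpt_answers, human_labels):.2f}")  # 0.67
```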
Dr. Kocoń said: “In all of the 25 analysed categories, the OpenAI chatbot was significantly inferior to its competition. The best currently available SOTA (state-of-the-art) natural language processing models were much better at assessing grammatical correctness, user emotions and word meanings, and answered questions and solved mathematical problems more accurately. Nevertheless, ChatGPT's results are impressive, taking into account that the model was not previously directly trained to solve most of the tested tasks.”
He added that ChatGPT was on average 25 percentage points behind other models. ChatGPT was worst at evaluating emotions and at pragmatic tasks that required knowledge about the world. It handled semantic tasks better, where the answer to the question could be drawn from the analysed text.
The researchers point out that the specialised natural language processing models available on the market are designed for one purpose, such as automatically detecting aggressive comments. They are smaller and faster, trained on specific datasets that respond to the user's needs.
Although its answers were correct less often than those of other models, ChatGPT had strengths in which it beat the competition. One advantage was that it was able to explain why it gave a particular answer. It was also creative in its responses - when asked the same question several times, the answers varied (which unfortunately also means that sometimes the bot answered correctly and sometimes it did not).
Kocoń said: “For now, ChatGPT will not replace those specialised models, but it opens up new opportunities for us, it shows how the world will develop.”
According to the researcher, there are many professions that ChatGPT can replace. Dr. Kocoń warns that the possibilities offered by the bot will probably reduce the demand for call-centre employees. He said: “However, other professions will appear that have not existed before, such as a prompt engineer - a person specialising in writing good commands for a chatbot.”
He added that for other professions, the chatbot could be a significant support - it will be useful in programming, education, proofreading or translation of texts.
CLARIN-PL is the largest publicly funded artificial intelligence development project in Poland. Six institutes and over 20 companies are involved in its implementation. Most of the team members work at the Wrocław University of Science and Technology. The main goal of the project is to develop natural language processing (NLP) tools for the automatic processing of huge text data sets, mainly in Polish.
PAP - Science in Poland, Ludwika Tomala
lt/ agt/ kap/
tr. RL