A group of scientists from the HSE Faculty of Computer Science has conducted a study on the ability of neural networks to detect humour. They found that more reliable recognition requires a different approach to creating the datasets on which neural networks are trained. The scientists presented these results at one of the world's most important conferences on natural language processing — EMNLP 2023. The paper is available here.
The research was carried out as part of a project by the Laboratory for Models and Methods of Computational Pragmatics. Alexander Baranov, a doctoral student at the HSE Faculty of Computer Science, presented the paper at the conference in Singapore; his participation was funded by HSE University.
Currently, voice assistants can only tell ready-made jokes; they are not able to produce their own jokes or reliably detect a joking tone. At the same time, users of AI-based voice assistants want them to behave in a more human-like way — to recognize a joke and make a joke in response.
Since the mid-2000s, scientists have been classifying texts as 'funny or not funny' and collecting and annotating datasets within the same framework. A group of scientists from HSE University proposed changing the approach to creating such datasets: making them more diverse rather than ever larger.
The task of detecting humour is difficult in part because there are no formal criteria for determining what is funny and what is not. Currently, most datasets for training and evaluating humour detection models contain puns. Sarcasm and irony are much more complex, as is situational humour, which requires knowledge of a broader context or cultural code.
We wanted to evaluate the transferability and robustness of models trained on different datasets. Transferability determines how well a model trained on a dataset with one type of humour recognises another type of humour. It was not obvious how well the training would transfer, because humour can be quite different.
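The transferability check described above can be sketched as a simple cross-dataset loop: train on one humour dataset, then evaluate on every other one. The toy keyword "model" and the two tiny corpora below are illustrative assumptions, not the actual models or datasets used in the study.

```python
# Hypothetical sketch of a cross-dataset transferability evaluation.
# The "model" here is a trivial keyword collector, used only to show
# the train-on-one / test-on-all loop structure.

def train(dataset):
    """Toy 'training': collect words that appear only in funny examples."""
    funny, unfunny = set(), set()
    for text, label in dataset:
        (funny if label else unfunny).update(text.lower().split())
    return funny - unfunny  # words seen exclusively in funny texts

def accuracy(cue_words, dataset):
    """Predict 'funny' if any learned cue word occurs; return accuracy."""
    correct = 0
    for text, label in dataset:
        pred = any(w in cue_words for w in text.lower().split())
        correct += (pred == label)
    return correct / len(dataset)

# Illustrative mini-corpora (label 1 = funny, 0 = not funny).
datasets = {
    "puns": [("what do you call a fish with no eyes a fsh", 1),
             ("the train arrives at noon", 0)],
    "one_liners": [("i used to be indecisive now i am not sure", 1),
                   ("please close the door", 0)],
}

# Transferability matrix: rows = training set, columns = evaluation set.
for src, train_data in datasets.items():
    cues = train(train_data)
    scores = {tgt: accuracy(cues, d) for tgt, d in datasets.items()}
    print(src, scores)
```

Even this toy version shows the typical pattern: high accuracy on the training dataset's own humour type and a sharp drop on another type.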
The scientists evaluated robustness with so-called 'adversarial attacks': attempts to force the neural network to see humour where it does not exist. For example, the neural network was presented with unfunny phrases in which one word was replaced with a similar-sounding but semantically different word, which could look like an attempted pun. The less often the network falls into such traps, the more robust it is.
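A minimal sketch of this kind of adversarial probe: take an unfunny sentence and swap one word for a similar-sounding one, so the text superficially resembles a pun. The homophone table below is an illustrative assumption, not the substitution list used in the study.

```python
# Generate pun-like variants of an unfunny sentence by swapping one word
# for a homophone. A robust humour classifier should still label these
# variants as unfunny. The homophone map is illustrative only.

HOMOPHONES = {
    "plane": "plain",
    "whether": "weather",
    "night": "knight",
    "sea": "see",
}

def pun_like_variants(sentence):
    """Yield copies of the sentence with one word swapped for a homophone."""
    words = sentence.lower().split()
    for i, w in enumerate(words):
        if w in HOMOPHONES:
            yield " ".join(words[:i] + [HOMOPHONES[w]] + words[i + 1:])

for variant in pun_like_variants("The plane lands at night"):
    print(variant)  # e.g. "the plain lands at night"
```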
The researchers trained the models on standard datasets with different humour types. In addition, the models were tested on dialogues from Lewis Carroll's Alice in Wonderland, Charles Dickens's The Old Curiosity Shop, Jerome K. Jerome's Three Men in a Boat (To Say Nothing of the Dog), the TV series The Walking Dead and Friends, and a collection of ironic tweets. It turned out that some models overfit and classify everything as funny.
We showed different models Dickens's The Old Curiosity Shop, which is a very sad story, and asked them to evaluate what was happening. It turned out that some models believe that all dialogue from 19th-century literature is funny. Moreover, everything that differs from 21st-century news is perceived as humorous.
Models trained on puns are more likely to make mistakes when a word in an unfunny text is replaced with a similar-sounding one. It also turned out that neural networks trained on small parts of different datasets recognize humour better than those trained on a large amount of the same type of data. The authors conclude that the existing datasets are too narrow: the humour diversity in each of them is quite limited, which reduces the quality of joke recognition.
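The data-mixing idea can be sketched as follows: instead of training on one large homogeneous corpus, draw a small fixed-size sample from each humour dataset and combine them. The dataset names and sizes below are illustrative assumptions, not the corpora used in the paper.

```python
# Build a mixed training set from small samples of several humour corpora.
import random

def mixed_training_set(datasets, per_dataset, seed=0):
    """Sample `per_dataset` examples from each corpus and shuffle the union."""
    rng = random.Random(seed)
    mixed = []
    for name, examples in datasets.items():
        mixed.extend(rng.sample(examples, min(per_dataset, len(examples))))
    rng.shuffle(mixed)
    return mixed

# Illustrative placeholder corpora (label 1 = funny).
datasets = {
    "puns": [(f"pun {i}", 1) for i in range(1000)],
    "one_liners": [(f"one-liner {i}", 1) for i in range(1000)],
    "ironic_tweets": [(f"tweet {i}", 1) for i in range(1000)],
}

subset = mixed_training_set(datasets, per_dataset=200)
print(len(subset))  # 600 examples spanning three humour types
```

The point of the design is that a 600-example mixed set covers three humour types, whereas 600 examples from a single pun corpus cover only one.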
The researchers proposed changing the approach to training and evaluating humour detection models. We need new datasets that are more diverse and closer to ordinary conversations and natural communication. Large language models, such as ChatGPT, trained on massive amounts of different types of data, do a relatively good job of recognizing humour, and the scientists suggest that this is a result of the variety of data they learn from.
Now we are talking only about binary humour detection: funny or not funny. This is a long way from defining shades of humour, distinguishing sarcasm and irony, and detecting contextual humour. Our voice assistants still produce filtered one-line jokes triggered by the user's words. Such programmed responses feel unnatural. The demand for more naturally behaving AI assistants is understandable, but it will not be easy to satisfy.