Alex Neznanov, Senior Research Fellow, ISSA Lab
Dmitry Ignatov, Research Fellow, ISSA Lab
More than 80% of the data available to criminal investigators is in textual form, such as patrol officer reports. Such texts are entered into databases and form massive amounts of unstructured data that are hard to analyse. Jointly with the Amsterdam-Amstelland police and researchers of the Catholic University of Leuven, a group of HSE mathematicians has developed a system for analysing large amounts of textual data and automatically extracting information which can help investigators.
It is one of the four projects implemented by the Catholic University of Leuven, HSE mathematicians and Amsterdam-Amstelland criminal police.
This cooperation between the Russian researchers and Dutch police started back in 2010, when Jonas Poelmans, a young researcher at the Catholic University of Leuven, invited Sergey Kuznetsov and his colleagues to help analyse the crime situation in Amsterdam. According to Kuznetsov, "data processing software and methods of using formal concept lattices were our contribution to the project. While Poelmans had just started exploring this approach, we already had considerable experience with it." This approach enables visual representation of object-attribute correspondences and is widely used, for example, in computational linguistics for analysing the content of blogs and finding similarities between texts.
Over the four years of this collaboration, HSE researchers have taken part in projects dealing with domestic violence, human trafficking, terrorism and spotting pedophiles in internet chat rooms. In the last of these, the objective was to detect adults seeking underage sexual partners via the internet.
Typically, a pedophile would visit an online chat room popular with children and adolescents, establish contact with a potential victim, get them to talk about sex, and finally persuade them to meet offline. Such chat rooms generate enormous amounts of information, and processing all of it manually would take far too much time. Therefore, the police needed a software solution to automate the detection of potential pedophiles.
The Dutch police categorise pedophiles into three groups based on the potential danger they may pose:
The team of experts from the Netherlands, Russia and Belgium were asked to come up with a software solution capable of scanning the web for suspicious communications in chat rooms and ranking their potential danger for the child, and more specifically:
"The main objective was to help the police find evidence that a particular chat is taking place between a pedophile and their potential victim," Neznanov explains, "and distinguish such interactions from those involving innocent people." According to the researcher, a system based on formal concept lattices can quickly identify chats likely to pose danger to a child.
A large collection of texts was used to generate source data for the analysis. The criminal police experts were responsible for providing the texts and dealing with other practical issues, while the Russian researchers, according to Neznanov, "worked on identifying the types of attributes which indicate, with a high degree of probability, a chat involving a pedophile."
The Russian researchers put together a list of suspicious words and phrases, such as those related to describing a person's looks, arranging a date, discussing sexual preferences, etc., including their shorthand, distorted and misspelled versions, and those using numbers instead of words (e.g. 2 as ‘to’, 4 as ‘for’).
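The idea of matching normalised chat text against a term list can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the function names, the tiny category lexicon and the tokenisation rule are all hypothetical, and only the two shorthand substitutions (2 as "to", 4 as "for") come from the description above.

```python
import re

# Only "2" -> "to" and "4" -> "for" are taken from the article;
# a real shorthand table would be far larger (assumption).
SHORTHAND = {"2": "to", "4": "for"}

# Hypothetical, deliberately tiny lexicon grouped by attribute category.
CATEGORIES = {
    "appearance": {"look", "photo", "pic"},
    "meeting": {"meet", "address", "alone"},
}

def normalise(message):
    """Lower-case the message, split it into tokens and expand shorthand."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return [SHORTHAND.get(token, token) for token in tokens]

def attributes(message):
    """Return the set of category names whose terms occur in the message."""
    tokens = set(normalise(message))
    return {name for name, terms in CATEGORIES.items() if tokens & terms}

if __name__ == "__main__":
    print(normalise("Want 2 meet 4 lunch?"))
    print(sorted(attributes("send a pic, want 2 meet")))
```

Each chat then becomes an object described by a set of attribute names, which is the kind of object-attribute input that lattice-based analysis works on. Handling misspelled and distorted variants would need more than an exact lookup, which this sketch does not attempt.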
In turn, their colleagues from the Catholic University of Leuven came up with CORDIET (Concept Relation Discovery and Innovation Enabling Technology), which sorts information iteratively, continuously processing the results and making adjustments. The data analysis cycle repeats at each phase, eventually producing a semantic lattice.
Neznanov describes this process as working at the intersection of classical computational linguistics and machine learning, combining the classical problem of constructing ontologies with the practical task of generating a specific type of knowledge.
Based on the results, a software solution was created to automate the process of spotting pedophiles on the web.
First, the researchers input a collection of texts with relevant metadata, such as tags, and then the program automatically generated an object-attribute description of the data and constructed formal concept lattices and other analytical artifacts enabling the analyst to interactively visualise them and draw conclusions. The resulting visual representations, according to Neznanov, are so simple that "learning to work with them should not take more than one day." The only outstanding problem is scaling up the interactive mode, since above a certain lattice size, its fragments require intelligent visualisation techniques, which is the current focus of the ISSA laboratory.
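To make the step from an object-attribute description to a concept lattice more concrete, here is a small Python sketch that enumerates the formal concepts of a toy context. It is a naive illustration under stated assumptions, not the algorithm used in the project's software: the function name `formal_concepts`, the chat identifiers and the attribute names are hypothetical. A formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing every attribute in the intent, and the intent is exactly the set of attributes common to every object in the extent.

```python
def formal_concepts(context):
    """Enumerate all formal concepts of a small object-attribute context.

    context maps each object to its set of attributes. The result is a set
    of (extent, intent) pairs, both frozensets. Naive approach: every
    concept intent is an intersection of object attribute sets (or the full
    attribute set), so we close the family of intents under intersection.
    """
    all_attributes = frozenset().union(*context.values())
    intents = {all_attributes}
    for attrs in context.values():
        # Intersect every intent found so far with this object's attributes.
        intents |= {intent & attrs for intent in intents}

    concepts = set()
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(obj for obj, attrs in context.items() if intent <= attrs)
        concepts.add((extent, intent))
    return concepts

if __name__ == "__main__":
    # Hypothetical toy context: three chats and two attribute categories.
    toy = {
        "chat1": {"appearance", "meeting"},
        "chat2": {"appearance"},
        "chat3": {"meeting"},
    }
    for extent, intent in sorted(formal_concepts(toy), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
```

Ordering the concepts by inclusion of their extents gives the lattice that an analyst would browse. The number of concepts can grow exponentially with the size of the context, which is consistent with the scaling difficulty mentioned above for large lattices.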
The software has been empirically tested and the results described in a series of academic papers.
However, since the actual chat data collected by the Dutch police cannot be legally made public, the system's operation has been demonstrated using data available from Perverted-Justice, a U.S.-based anti-predator NGO working to detect pedophiles on the internet by using adult volunteers posing as children on chat sites and waiting for potential pedophiles to approach them. While the 'victims' in this case are adults and therefore their behaviour is not representative, the adult predators who believe they are communicating with minors behave realistically.
The Amsterdam police are currently using software based on the HSE researchers’ theoretical framework.
Prepared by Vladislav Grinkevich