This Is Robot Speaking

Whether we like it or not, talking with a robot on the phone is now part of our everyday lives. What is the right way to respond to a mechanical voice, who should we blame when communication fails, and could AI make an ideal conversationalist? HSE sociologist Alisa Maximova answers these questions and more, based on findings from her study of interactions between humans and robots presented in the book Adventures in Technology: Barriers to Digitalisation in Russia.

Alisa Maximova
Junior Research Fellow, HSE Poletayev Institute for Theoretical and Historical Studies in the Humanities, Associate Professor, HSE Vysokovsky Graduate School of Urbanism

Alisa, what is the book – and your contribution to it – all about?

The book is a collective effort financed by a grant from the Russian Science Foundation's Presidential Programme to support fundamental and exploratory research. A team of social scientists from Moscow and St. Petersburg worked on our part of the study for three years.

The resulting monograph discusses barriers to the design, implementation and uptake of digital technology from the social science perspective. Using a variety of empirical cases – from the work of software testers, aka QA engineers, to problems with self-driving cars or online dating applications – the authors examine what and why things can go wrong with AI.

My part of the research focused on AI voice bots handling customer calls at a public service call centre by providing information on available services and answering questions on housing and utilities, paperwork, and other matters.

I was lucky to reach an agreement with a company (which asked us to keep its name confidential) that made more than 200 call records available to us. Using this data, our team analysed how humans interact with robots when callers know in advance that they will be talking to a machine but have no choice about it.

What is this telephone bot like?

It is a conversational agent, or a chatbot which can carry on a conversation, recognise speech, including certain pre-defined cues, and respond appropriately.

The history of voice assistants dates back to 1961, when IBM introduced ‘Shoebox’, a computer capable of recognising 16 spoken words and numbers from 0 to 9. In 2011, Apple's iPhone 4S was launched with a new voice-assistant feature. Since then, new voice recognition products for the mass market have been cropping up at a fast pace. Among the most popular virtual assistants worldwide are Siri, Cortana and Alexa, while in Russia the best known are Alice, launched by Yandex in 2017, followed more recently by Joy, Sber and Athena from Sber (formerly Sberbank).

When we began our study three years ago, AI-driven voice recognition services were rare, but today they are part of everyday life. People no longer perceive robots answering customer calls to be something out of the ordinary.

It is particularly interesting to observe how AI adapts to users who, in turn, try to accommodate the robot by being patient, slowing down and making an effort to speak clearly and concisely.

Do most people immediately realise that they are speaking to a computer? If not, how do they change the way they communicate once they know there is a robot on the other end?

The call records we used for the study were anonymised and not linked to specific callers, so we could not go back and ask a caller what was on their mind at the time of the interaction, e.g. whether they were frustrated with a stupid robot or an inept human operator.

Curiously, people seem to switch instinctively to a 'robotic mode' once they realise there is a computer on the other end and tend to speak loudly in fragmented sentences. But at some point, they might say, 'young lady, you just don't seem to understand me' – meaning that they still associate the AI with certain human characteristics and address it accordingly.

How are robots trained to maintain conversation with a human?

Training AI to recognise human speech is the most important thing. An AI-driven voice bot must have an integrated speech recognition system and a set of rules to correlate what it hears with a database of keywords. People can use different words for the same thing, such as 'payment document' or 'bill'. This must be anticipated when training the voice bot.
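The idea of correlating recognised speech with a keyword database can be illustrated with a minimal sketch. It is not the bot studied here: the topic names, the `SYNONYMS` table and the `match_topics` function are hypothetical, and only the 'payment document' / 'bill' pair comes from the interview.

```python
import re

# Hypothetical synonym table: each canonical topic lists phrases callers might use.
# Only 'bill' / 'payment document' comes from the interview; the rest is illustrative.
SYNONYMS = {
    "bill": ["bill", "payment document"],
    "fee": ["fee", "charge"],
}


def normalise(text: str) -> str:
    """Lowercase the text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).strip()


def match_topics(utterance: str) -> list[str]:
    """Return the canonical topics whose synonyms occur as whole words or phrases."""
    # Pad with spaces so 'bill' does not match inside 'billing'.
    padded = " " + " ".join(normalise(utterance).split()) + " "
    return [
        topic
        for topic, phrases in SYNONYMS.items()
        if any(f" {phrase} " in padded for phrase in phrases)
    ]
```

With this table, "I lost my payment document" and "Where is my BILL?" both resolve to the same topic, while an utterance containing none of the listed phrases matches nothing and would need a fallback prompt.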

Certain settings of a voice assistant, including the quality and clarity of speech, can be modified, e.g. its speech recognition sensitivity can be adjusted to fit certain categories of users, such as the elderly.

Where the developers expect good sound quality and clear speech, they lower the bot's speech recognition sensitivity to increase its efficiency and accuracy and reduce errors such as mistaking noise for a meaningful statement.
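One way to picture this trade-off is as a confidence threshold. The sketch below assumes a recogniser that attaches a confidence score to each hypothesis and treats 'sensitivity' as the complement of the acceptance threshold; both are illustrative assumptions, since the interview does not describe how the setting is implemented, and the names `Hypothesis` and `accept` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical recogniser


def accept(hypothesis: Hypothesis, sensitivity: float) -> bool:
    """Treat the input as a meaningful statement only if it clears the threshold.

    Assumption: threshold = 1 - sensitivity, so lowering the sensitivity
    raises the bar and filters out more low-confidence input such as noise.
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must be between 0 and 1")
    return hypothesis.confidence >= 1.0 - sensitivity
```

Under these assumptions, a low-confidence hypothesis produced by background noise is accepted at a high sensitivity setting and rejected at a lower one, while a clearly spoken keyword passes in both cases.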

Robots also use a set of standard phrases and dialogue scripts, which can be tested for their potential to lead to successful interaction.

Have such tests been conducted on your virtual operator?

One of the challenges was to give the bot the right phrases encouraging users to ask their questions. The phrase tried at first was 'please, formulate your question clearly', but it caused callers to hesitate before responding. The developers suspected that the word 'clearly' was confusing and replaced the instruction with the more casual 'what question do you have for me?' assuming that the user already had a question which the bot was capable of recognising.

They also tried out phrases such as 'formulate your query, e.g...', but it appeared confusing as well, because callers often said, 'perhaps I’ve dialled the wrong number, because my question is about something else'.

So even a seemingly basic instruction turns out to be quite difficult to get right, but this makes the process of designing user interface technology fascinating to observe.

Looking from the user’s perspective, how do people enter into this type of conversation? Are our actions and words spontaneous or predetermined?

Scholars' opinions differ on the matter. Some argue that people tend to have assumptions as to what technology is capable of and whether it can be trusted. As soon as we hear a voice robot, we expect it to behave in a certain way and try to act accordingly. A user's digital literacy is another factor influencing their behaviour. Technology savvy people who know how things work will not bother explaining things to a bot but will try to use keywords which it might be trained to recognise.

Another approach focuses on the interaction per se rather than user characteristics. In other words, whatever a user may think about the technology, there are certain aspects of the interaction which they are guided by.

I have chosen this approach, and my analysis is based on ethnomethodology, the study of how social order works, and on conversation analysis that looks at conversation as a structured interaction whose elements (i.e. words or phrases) must follow certain sequences.

From this perspective, one can observe how a person interacts with a robot: they tend to ask a question, determine whether or not the robot has understood them, note the length of a pause in conversation, draw conclusions about the virtual operator's responsiveness and modify their own behaviour, e.g. by trying to use simpler phrases.

User: My name is Ivan Ivanovich, I live outside Moscow. So, I went to that hospital where I'd been on treatment for an eye injury and my doctor had been seeing me for a while without any problems. But this time, I showed up and they said I had to pay six hundred and fifty rubles to see a doctor. They had not charged me for follow-up visits before. That's all I wanted to say.

Robot: Sorry, I don't really understand you. Please formulate your question clearly once again and speak after the beep.

User: A HOSPITAL IN ... Serpukhov ... district hospital in Serpukhov, they charge you a fee for SEEING YOUR DOCTOR. IS IT ALLOWED?

How common are the methods you use in studying human interaction with AI systems? How else are they studied?

There are plenty of methods in use, such as staged experiments, simulated situations where participants interact with robots in games, museum tours and conversations with chatbots.

But in reality, people are likely to act differently. Plus, our reality is changing rapidly. Technology is no longer something we deal with from time to time, but literally a part of our everyday life, such as smart speakers in our homes or virtual assistants in our smartphones – always present, always available.

Today, researchers need to use new study methods. In recent years, the focus of research has increasingly been on data collected from naturally occurring interactions. Ironically, the key challenges remain the same as in earlier studies.

Little has changed since the 1980s, when sociologist Lucy Suchman studied people's encounters with photocopiers. While technology is getting more sophisticated, human operators face the same issues interacting with it, such as feedback, transparency, comprehension and how tasks can be achieved.

In practical terms, how should one talk to a smart machine to be understood? What is important in terms of intonation and tempo? Are there any special techniques?

The main thing is to be patient and not to give up too soon. If the robot does not respond for a while, it is not necessarily frozen – maybe it is searching for the right answer. Also, be prepared to repeat or modify your query.

You wrote about an adapted way of speaking which people develop for communicating with robots. Does it mean that people adapt to how robots comprehend speech?

When human-to-machine communication fails or does not work as expected, users try to adapt to the machine by adjusting the volume of their speech, changing their intonation, making pauses or repeating certain phrases.

However, this approach is not specific to interacting with robots: we do the same to be understood by someone who, for example, has a hearing problem or a poor command of our language. Actually, whether we are talking to a robot, a foreigner, a hearing-impaired person or to anyone in a noisy room or via a poor telephone connection, we make an effort to speak in a way which is easier to comprehend.

Do you think that AI developers take this into account?

They probably do, but I cannot give you any supporting evidence. Speaking simply and clearly – rather than too fast or in jargon – makes it easier for robots to understand us, but they are not likely to ask us to make it easier for them, because they are designed to appear smart.

Do people respond differently depending on how they perceive the voice assistant's gender? Also, does the gender of potential users play a role in technology design?

I am not sure it makes a difference at the design stage. I believe that robots are designed to communicate in the same way regardless of user gender.

As for robots' voices, the answer is yes, people respond differently depending on whether they perceive the robot to be a male or a female. In a study by Ekaterina Khonineva, Siri is used as an example to demonstrate when and how users act differently towards a 'female' assistant, which includes insulting it or flirting with it, choosing topics for conversation or explaining the machine's errors.

By perceiving virtual assistants to be of a certain gender, people tend to attribute specific characteristics to the machine. It is believed that virtual cockpit assistants speak in female voices because they are perceived by pilots as calm, confident and helpful without being bossy.

Today, most AI-powered assistants speak in women's voices by default, and their female gendering is often criticised for reinforcing stereotypes by assigning female AI a submissive, servile role.

Are digital assistants trained to respond to obscene or inappropriate language?

I don't think they really recognise such language; instead, they may redirect the call to a human operator. Some people even say that using profanity early on is a sure way to be redirected to a human, but I believe that this is an urban myth.

What are the most common failures in conversation? Who is usually the source of errors – humans or computers?

I would not put the blame on either side, because we share the responsibility, which is often related to the fundamental difference between humans and machines. People do not normally think of their actions as commands for a computer, while computers, in turn, ignore as noise any human utterances which do not contain any input the machine can recognise.

Another thing a robot often does is to be silent for a few seconds after the user finishes speaking. Users expecting a faster response may start worrying and adding phrases in an effort to explain themselves. The robot may begin responding to the original statement while processing the more recent input at the same time. This often results in a funny sequence of silence and both parties speaking at once, because robots, like humans, tend to stop when the other party is talking.

How does it usually end?

Someone gives up! I'm joking, of course. Eventually, they resume communication: the person will at some point stop talking and wait, assuming that a response is coming.

As a social scientist, do you think there is anything that AI developers have overlooked? Are there any recommendations as to what they might consider adding to their creations?

I would recommend focusing more on feedback mechanisms. We tend to perceive voice robots as communication partners rather than inanimate objects, so it is important to make them more predictable and easier to understand.

Users need to hear or see signals indicating what is going on at a given moment: whether the technology is listening to them or searching for an answer and how soon it might respond. Now the entire communication process is extremely non-transparent.
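As a rough illustration of such signals, the sketch below has a bot announce its current state whenever it has been silent for longer than a set limit. The states, the cue wording, the `cue_for` function and the two-second default are all hypothetical choices made for this example, not features of the system studied.

```python
from enum import Enum
from typing import Optional


class BotState(Enum):
    LISTENING = "listening"
    SEARCHING = "searching"
    RESPONDING = "responding"


# Hypothetical spoken cues telling the caller what the bot is doing right now.
CUES = {
    BotState.LISTENING: "I'm listening.",
    BotState.SEARCHING: "One moment, I'm looking that up.",
    BotState.RESPONDING: "Here is what I found.",
}


def cue_for(state: BotState, seconds_silent: float, max_silence: float = 2.0) -> Optional[str]:
    """Return a cue once the bot has been silent for max_silence seconds, else None."""
    if seconds_silent < max_silence:
        return None
    return CUES[state]
```

A caller who hears "One moment, I'm looking that up" after a couple of seconds of silence has less reason to start rephrasing the question while the bot is still processing the original one.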

Sooner or later, humans will learn to communicate effectively with robots, especially if forced to deal with them in important life situations. But developers can make the learning easier for users by placing ‘beacons’ along the way.

I like the idea of revealing the limitations of technology rather than making robots appear perfect, omnipotent and quick-witted. Users need to be aware of what robots can and cannot do. Otherwise, we may hear what sounds like human speech and talk to the robot as if it were a human receptionist who can understand our fragmented and unclear utterances. But a natural-sounding robot’s inability to live up to such expectations can cause users even more frustration than a machine speaking in a mechanical voice with flat intonation would.

Is it possible to design technology which is an ideal partner in conversation? And what would 'ideal' mean in this situation?

My study examined a specific type of interaction between a person and a machine, where the ‘ideal’ AI partner would understand the question and respond promptly and accurately.

But the ideal partner in spontaneous free conversation is not a machine but another person. Only someone with a vivid imagination or a tech optimist could imagine an AI companion being as capable as a human of responding promptly and appropriately, bringing joy and expressing sympathy.

While a number of social robots, such as AI-driven companions, friends and helpers, have been created, they are only pale imitations of what people can do. And more importantly, why even try to design machines which appear to be human?

There is an element of deceit in such close resemblance. In 2018, there was a debate around Google Duplex, a new technology capable of imitating natural human speech. In particular, this system stumbles and makes errors in speech, changes intonation, stops in mid-sentence and corrects itself. This is clearly an attempt to pass it off as something which it is not.

Is there any practical application today for research of human-robot interaction? Are there any orders coming from companies or government?

Some prototypes have been tested with real users, sometimes with a variety of users to make the final product more inclusive and accessible. However, these have mostly been one-off projects aimed at improving the technology used by a particular company, and their results rarely make it into the public domain.

While there is an interest in working on it together with scientists, the actual collaboration is often limited, because academic researchers, businesses and public services all follow different schedules and performance criteria; research may take time while commercial companies require fast and clear results.

You started your research three years ago. Was this the point when you first became interested in the topic?

No, I had been curious about this topic long before our research team got hold of the data on the telephone robot.

So where will your curiosity take you next?

I don’t know… I hope to some beautiful place. The sphere of social interaction is so rich in discoveries hiding behind often overlooked things. There is plenty of data available, not necessarily technology-related. How we respond to each other and the world around us is an inexhaustible source of study material. One only needs to observe, set goals and understand where to apply knowledge and expertise both academically and practically.

Author: Svetlana Saltanova, January 12, 2021
