Alexa, Siri, and Google Assistant are household names today. Advances in Artificial Intelligence (AI), more specifically machine learning and natural language processing, have empowered us to use voice to interact with machines. This adds a new dimension to existing textual and gestural interfaces and potentially overcomes the complexity of user interface design when navigating deep or complex menus. The simplest example is using Alexa to play a song: a user interface on iTunes or a similar application would require you to type the name of the song in the search field, apply filters to narrow the search, and finally press play on one of the results. Your voice replaces this sequence of user inputs, thereby speeding up the interaction. But perhaps the most important aspect of voice-based interaction is that it eliminates the need for proximity to the machine and/or its physical user interface.

From a psychological perspective, voice plays into human nature - we primarily communicate with other humans through voice, body language, and gestures. Why can't we do the same with machines? Note that the word "communicate" is very different from the word "interact" - and it is here that we begin to encounter barriers with all aspects of AI related to Natural Language Processing (NLP). To better understand the difference between interaction and communication, try asking your favorite voice assistant to order a coffee: "Siri, can you place an order for a cappuccino using the Starbucks app?" The results will likely be the locations of Starbucks stores in your vicinity.
More specifically, that result would have been appropriate had your question been "Find the nearest Starbucks." To achieve your intended goal (ordering a cappuccino), you must continue issuing voice commands or interacting with your phone until the task is complete, and the whole process feels transactional in nature: short, simple constructs, with each subsequent instruction having a definitive outcome. But what if you had this same exchange with a friend or co-worker? You would likely begin with a discussion of what the outcome should be, exchange complex constructs during the interaction, and never worry about the order in which the information was exchanged. In addition, you would have no problem dealing with variations in accent, language, or grammar, or even if the task was completely intertwined with another. It is obvious from this very simple example that you can interact with Siri and other voice assistants but cannot really communicate with them (at least not today).

So what causes this barrier, and can it be overcome? Communication is a very complex construct. When we communicate as humans, we perform several conscious and subconscious tasks - reading the other person's facial expressions, gestures, body language, and tone of voice in the context of the interaction - so we can best reciprocate to achieve a desired outcome. Machines have no such knowledge and, in particular, are devoid of contextual information. Several researchers are investigating the possibility of giving machines the ability to perform some of these tasks, but the challenge lies in how a machine acquires this ability. Over the years, this pursuit has been termed "Artificial Intelligence" and encompasses several sciences, of which machine learning has gained popularity since it provides machines with the capability to generalize taught constructs.
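To make the "transactional" feel of the exchange concrete, here is a minimal sketch of keyword-based intent matching - a toy illustration, not how Siri or any real assistant actually works. The intent names, keyword sets, and fallback behavior are all invented for this example: an utterance the system has never been taught ("order a cappuccino") simply falls through to a default intent, such as finding the nearest store.

```python
# Toy intent matcher illustrating transactional voice interaction.
# All intents and keywords below are hypothetical examples.

INTENTS = {
    "find_store": {"find", "nearest", "where", "locate"},
    "play_song": {"play", "song", "music"},
}

def match_intent(utterance: str) -> str:
    """Return the intent whose keyword set best overlaps the utterance."""
    words = set(utterance.lower().replace("?", "").split())
    scores = {name: len(words & kws) for name, kws in INTENTS.items()}
    best = max(scores, key=scores.get)
    # Nothing matched at all: fall back to a default intent.
    return best if scores[best] > 0 else "find_store"

# "Order a cappuccino" matches no taught intent, so the assistant falls
# back to the store finder - it interacts, but it does not communicate.
print(match_intent("can you place an order for a cappuccino"))  # find_store
print(match_intent("find the nearest Starbucks"))               # find_store
print(match_intent("play my favorite song"))                    # play_song
```

The point of the sketch is the fallback line: without data covering a pathway, the system can only map the request onto something it already knows.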
There are, however, two things to bear in mind:

- The word "Artificial" refers to the fact that machines can be made to "appear" intelligent; they are not truly intelligent.
- Machines need a lot of clean data in order to "learn".

To illustrate: if a relationship A <> B and a relationship B <> C are taught, humans are capable of deducing the relationship A <> C. Not only that, humans are also capable of deducing the reverse relationships B <> A, C <> B, and C <> A (see the example to the left). In fact, it is this ability to learn via extrapolation and deduction that makes us the dominant species on the planet. For a machine to learn all of these relationships, humans must provide data that includes both positive and negative examples of each one. In other words, humans must provide a machine with sufficient data for it to learn to "compute" all of these relationships. And this is the AI effect: if a machine can merely compute a relationship, it is not truly intelligent. The advantage of machine learning is best seen when a problem is complex and non-linear and hidden relationships between the input and output need to be discovered.

And this is where the biggest barrier in teaching a machine how to communicate exists: the space of inputs and outputs is infinite. Take, for example, the common situation where an employee engages in a 5-minute conversation with a manager about a workplace dispute. My previous blog post addresses why soft skills are so critical to increasing productivity and efficiency in the workplace and how training can help people better handle such situations. The workplace dispute could be around any topic: budgets, processes, methodology, or even matters personal in nature.
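The A <> B, B <> C example above can be sketched in a few lines of code. The contrast below is illustrative, with made-up relation pairs: a deductive procedure derives the reverse and transitive relationships from the two taught pairs, while a lookup-table "learner" knows only the literal examples it was given and would need explicit data for every additional pair.

```python
# Sketch contrasting human-style deduction with example-driven learning.
# The relation pairs are invented for illustration.

def deductive_closure(pairs):
    """Infer reverse and transitive relations from taught pairs,
    the way a human extrapolates A<>C from A<>B and B<>C."""
    known = set(pairs) | {(b, a) for a, b in pairs}   # reverse relations
    changed = True
    while changed:                                     # transitive closure
        changed = False
        for a, b in list(known):
            for c, d in list(known):
                if b == c and a != d and (a, d) not in known:
                    known.add((a, d))
                    changed = True
    return known

taught = [("A", "B"), ("B", "C")]

# A lookup-table "learner" only knows the literal examples it was given:
lookup_learner = set(taught)
print(("A", "C") in lookup_learner)             # False -- needs more data
print(("A", "C") in deductive_closure(taught))  # True  -- deduced
```

From two taught pairs, deduction recovers all six relationships; the lookup learner recovers only the two it was shown, which is the data burden the bullet points describe.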
One can only imagine the number of unique ways in which this conversation may play out in real life, depending on factors including, but not limited to, personality types, the context surrounding the conversation, and the history and relationship between the people - reiterating that infinite space of inputs and outputs during communication. Let us, for the sake of argument, assume that AI can be programmed to hold an extremely engaging conversation within constrained boundaries for 5 minutes. This readily lends itself to a training application where learners can practice their soft skills. Even in this case, the machine will need sufficient data from real conversations in these areas to communicate effectively with a learner. More importantly, the machine will require contextual information and must read the learner's body language, tone of voice, and other non-verbal cues to effectively engage them in the training. How is it possible to provide the machine with this learning data unless we have a database of such conversations readily available to us? Every time there is a new pathway of conversation (data that the machine has not seen and from which it cannot deduce a relationship), a human will need to provide the machine with both positive and negative examples to incorporate this pathway into its repertoire. In other words, there is always a human in the loop. The question we ask ourselves is this: where do we want to place the human in the loop?

At Mursion, we combine the computational power of AI with that of human reasoning to create seamless, engaging, and customizable simulations that allow learners to practice soft skills. By placing the human at the end of the chain, we avoid having to teach the machine the complex art of communication; instead, we focus on harnessing the power of artificial intelligence to drive digital representations of the human (avatars), thereby reducing the cognitive load of the humans who inhabit the avatars.
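The "human at the end of the chain" idea can be sketched as a simple routing rule. This is a toy illustration of the general design choice, not Mursion's actual architecture; the pathway labels and responses are invented. The machine handles the conversation pathways it has data for, and anything unseen is deferred to a human operator rather than being forced into taught examples.

```python
# Minimal sketch of placing the human in the loop: the machine handles
# conversation pathways it has seen; unseen pathways go to a human.
# Pathway names and responses are hypothetical.

MACHINE_PATHWAYS = {
    "greeting": "Hello! What would you like to discuss?",
    "budget_dispute": "Let's review the budget figures together.",
}

def respond(pathway: str, human_operator) -> str:
    """Answer from learned pathways, else defer to the human operator."""
    if pathway in MACHINE_PATHWAYS:
        return MACHINE_PATHWAYS[pathway]
    return human_operator(pathway)   # the human drives the avatar here

operator = lambda p: f"[human takes over for unseen pathway: {p}]"
print(respond("greeting", operator))
print(respond("personal_dispute", operator))
```

The design choice is in the final branch: instead of requiring positive and negative examples for every new pathway up front, the human absorbs the open-ended part of communication while the machine handles what it can compute.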
The end result is a simulation platform where subject matter can be rapidly customized without the need for re-programming. When combined in this manner, virtual environments and avatars reinstate the constructs of situational plausibility and place illusion (Slater, 2009), leading to the suspension of disbelief in learners. This creates a very realistic, effective, and safe training environment. To learn more about how this training can be applied to organizations and how the results translate to real-world performance, come visit us at Mursion - and as an added benefit, you can walk to the Golden Gate Bridge after your visit!
About
Arjun is an entrepreneur, technologist, and researcher working at the intersection of machine learning, robotics, human psychology, and learning sciences. His passion lies in combining technological advancements in remote operation, virtual reality, and control system theory to create high-impact products and applications.

Archives
December 2024