
Conversational AI — Key technologies and Challenges — Part 1

Source: https://towardsdatascience.com/conversational-ai-key-technologies-and-challenges-part-1-a08345fc2160?gi=aa30f229cc11

1. Conversational System

What is a conversational system, or a virtual agent? One of the best-known fictional agents is Jarvis from Iron Man. It can think independently and help Tony do almost anything: running errands, processing massive data sets, making intelligent suggestions, and providing emotional support. The most impressive feature of Jarvis is its chat capability: you can talk to him like an old friend, and he understands you without ambiguity. The technology behind the scenes is conversational AI.

The core of conversational AI is a smartly designed voice user interface (VUI). Compared with the traditional GUI (Graphical User Interface), a VUI frees users' hands by allowing them to perform nested queries via simple voice control (rather than ten clicks on a screen).

However, I have to admit that there is still a big gap between a perfect virtual agent like Jarvis and the capabilities of existing conversational AI platforms.

Human-machine conversation has received a great deal of traction from academia and industry over the past decade. In the research lab, we have seen the following shifts:

  1. Natural language understanding has moved from manual annotation and linguistic analysis to deep learning and sequence language modeling.
  2. Dialog management has moved from rule-based policies to supervised learning and reinforcement learning.
  3. Language generation has moved from pre-defined templates and syntax parsing to end-to-end transformers and attention mechanisms.

In addition, conversational products have sprung up across markets. All the big players have their signature virtual agent or platform, for instance, Siri for Apple, Alexa for Amazon, Cortana for Microsoft, and Dialogflow for Google. (The diagram below is out of date; please use it as a reference only.)

[Image: Report from Recast.AI]

2. Key Components of a Conversational System

There are five main components in a conversational platform: 1) ASR: Automatic Speech Recognition; 2) NLU: Natural Language Understanding; 3) Dialog Management; 4) NLG: Natural Language Generation; 5) TTS: Text-to-Speech. (Additional components could include public APIs, an integration gateway, action-fulfillment logic, a language-model training stack, versioning, chat simulation, etc.)

For simplicity, let’s explore the basics now.

[Image: Simple Dialog System, by Catherine Wang]

2.1. ASR: Automatic speech recognition is a model trained on speakers' voice recordings and transcripts, then fine-tuned to recognize unseen voice queries. Most conversational platforms offer this feature as an embedded element, so developers can leverage state-of-the-art ASR in their products (e.g., voice input, voice search, real-time translation, and smart home devices). A minimal sketch with an off-the-shelf library follows.
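Here is a minimal sketch using the open-source SpeechRecognition package (one option among many; the file name and the use of the free Google Web Speech backend are illustrative assumptions, not what any particular platform does internally):

```python
# Minimal ASR sketch with the SpeechRecognition package.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("query.wav") as source:      # "query.wav" is a placeholder file
    audio = recognizer.record(source)          # read the whole clip into memory

try:
    text = recognizer.recognize_google(audio)  # free Google Web Speech backend
    print("Transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible")
```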

2.2. NLU: Indisputably the most important part of a conversational system. ASR only transcribes what you said; NLU works out what you actually meant. Natural Language Understanding can be seen as a subset of Natural Language Processing, and the relationship can be loosely described as below.

[Image: the relationship between NLP and NLU, by Sciforce]

Both NLP and NLU are broad topics, so instead of going too deep, I will explain the high-level concepts using practical examples from the virtual-agent use case.

Generally speaking, NLU and NLP are structured around the following problems:

  • Tokenisation and Tagging. These are text preprocessing techniques. Tokenisation is the first step in both traditional linguistic analysis and deep learning models: it splits a sentence into words (or n-grams), which are later used to build the vocabulary or train a word-embedding algorithm. Tagging is sometimes optional; it labels each token (word) with a lexical category (e.g., ADJ, ADV, NOUN, NN). A minimal sketch follows.
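A minimal sketch with spaCy, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

```python
# Tokenisation and POS tagging with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book a table for two at seven tonight")

for token in doc:
    print(token.text, token.pos_, token.tag_)  # e.g. "Book VERB VB"
```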
  • Dependency and Syntactic Parsing. A popular technique in linguistic analysis that parses a sentence into its grammatical structure. Before the age of deep learning, these syntax trees were used to constitute new sentences or sequences of words. A sketch follows the figure.

[Image: dependency parse tree, from Stanford NLP]
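A minimal sketch of the same idea with spaCy's dependency parser (again assuming en_core_web_sm):

```python
# Dependency parsing: each token points to its syntactic head.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("What is the weather for Melbourne tomorrow?")

for token in doc:
    print(f"{token.text:10} --{token.dep_}--> {token.head.text}")
```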
  • Named Entity Recognition. NER is used to extract or identify a set of predefined word entities. Its output can look quite similar to POS tagging, and the results are likewise stored as Python tuples, e.g. ('US', 'GPE'). The main differences are: 1) an NER model can be trained with new annotations to pick up domain-specific entities; 2) NER focuses more on semantic meaning, whereas POS tagging is more about grammatical structure. A minimal sketch follows.
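A minimal sketch using spaCy's pretrained pipeline (the exact entity labels vary by model):

```python
# Named entity recognition with a pretrained spaCy pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Book me a flight from Melbourne to Tokyo next Friday")

print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Melbourne', 'GPE'), ('Tokyo', 'GPE'), ('next Friday', 'DATE')]
```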
  • Phrase and Pattern Matching. The simplest implementation of phrase matching is a rule-based regular expression. Don't get me wrong: regular expressions are still beneficial on unlabeled datasets, and an adequately defined fuzzy pattern can match hundreds of similar sentences. However, this rule-based method is hard to maintain and scale up. A more advanced approach uses POS tags or dependency labels as the sequence to match, or uses vector distances. Both styles are sketched below.

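Here is a sketch of both styles: a fuzzy regular expression, and spaCy's token-level Matcher, which can match on POS tags instead of raw text (the patterns themselves are illustrative):

```python
# 1) Rule-based regex: matches "order ... pizza" with anything in between.
import re
pattern = re.compile(r"\border\b.*\bpizza\b", re.IGNORECASE)
print(bool(pattern.search("I'd like to order a large pizza")))  # True

# 2) Token-level patterns: a verb, an optional determiner, then a noun.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("VERB_NOUN", [[{"POS": "VERB"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]])

doc = nlp("Play a song and order pizza")
for _, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "Play a song", "order pizza"
```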

  • Word Vectorization and Embedding. Word embedding marks the dawn of modern NLP by introducing the distributed representation of a word. Before deep learning, linguists used sparse representations (e.g., one-hot or count vectors) to capture the structure of text, and statistical models to understand the relationships. The drawback of that method is its inability to represent contextual meaning and word inference. Word embedding offers a solution: learn the dense vector that best represents a word across its contexts. For practical use, you can find pre-trained word embedding models like Word2Vec or GloVe, or, if needed, fine-tune those models on your new vocabulary and training corpus. A toy training sketch follows the figure.

[Image: Word Embedding, by Catherine Wang]
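A toy training sketch with gensim (version 4+, where the dimensionality argument is called vector_size); real systems train on far larger corpora or simply load pre-trained vectors:

```python
# Train a tiny Word2Vec model and inspect the learned vectors.
from gensim.models import Word2Vec

corpus = [
    ["play", "the", "song"],
    ["play", "the", "music"],
    ["pause", "the", "music"],
]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["music"][:5])          # first 5 dimensions of the word vector
print(model.wv.most_similar("song"))  # nearest neighbours in vector space
```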
  • Sequence Vectorization and Embedding. A similar concept, but instead of vectorizing every single word, sequence embedding focuses on finding the best representation for a longer text as a whole. This technique improves NLP tasks that need to understand longer chunks of text, for instance text translation, text generation, reading comprehension, and natural questions with longer answers. A naive baseline is sketched after the figure.

[Image: Sequence Modeling, by Catherine Wang]
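A naive baseline, assuming a toy embedding table: average the word vectors of a sentence to get one vector for the whole span. Modern systems use transformer encoders instead, but mean pooling still illustrates the idea:

```python
# Mean-pooled sequence embedding over toy 4-dimensional word vectors.
import numpy as np

vectors = {  # stand-ins for a pretrained embedding table
    "play":  np.array([0.9, 0.1, 0.0, 0.2]),
    "the":   np.array([0.1, 0.1, 0.1, 0.1]),
    "music": np.array([0.8, 0.2, 0.1, 0.3]),
}

def embed(sentence: str) -> np.ndarray:
    tokens = [t for t in sentence.lower().split() if t in vectors]
    return np.mean([vectors[t] for t in tokens], axis=0)

print(embed("Play the music"))
```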
  • Sentiment Analysis. The task of analyzing whether an expression is positive or negative (it can be understood as binary classification: 1 = positive, 0 = negative). It is one of the most common tasks in NLP. In conversational AI, sentiment analysis can give a virtual customer agent a benchmark for identifying the customer's emotion and intention, and for suggesting emotionally appropriate responses. A minimal classifier sketch follows.
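A minimal sketch: TF-IDF features plus logistic regression on a toy labelled set (1 = positive, 0 = negative); production systems would use far more data or a pretrained model:

```python
# Binary sentiment classification with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["great service, thank you", "this is terrible",
          "I love it", "awful experience"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["thank you, that was great"]))  # -> [1]
```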
  • Topic Modeling. It leverages unsupervised ML techniques to find groups of topics in a broad set of unlabeled documents, which helps us quickly understand the themes of an unseen corpus. In conversational AI, topic modeling acts as a first filter that triages user queries into higher-level topics, which are then mapped to more granular intents and actions. A minimal LDA sketch follows.
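A minimal sketch with Latent Dirichlet Allocation in scikit-learn (the toy queries and topic count are illustrative):

```python
# Unsupervised topic discovery with LDA.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "refund my order please",
    "track my order delivery",
    "reset my account password",
    "cannot log in to my account",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"topic {i}:", [words[j] for j in topic.argsort()[-3:]])
```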
  • Text Classification and Intent Matching. Both tasks use supervised learning, and model quality largely depends on how you prepare the training data. Compared with topic modeling, text classification and intent matching are more granular and deterministic. You can understand the relationship from the image below: when facing an unseen customer query, the conversational AI system uses topic modeling to filter the query into a broad topic, then uses text classification and intent matching to map it to a specific action. A sketch follows the figure.

[Image: Intent Matching, by Catherine Wang]
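A minimal intent-matching sketch, framed as supervised text classification over a toy set of labelled utterances:

```python
# Map unseen queries to the closest trained intent.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = ["play some jazz", "put on my playlist",
              "what's the weather", "will it rain tomorrow"]
intents    = ["play_music", "play_music",
              "check_weather", "check_weather"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(utterances, intents)

print(clf.predict(["is it going to rain in Melbourne"]))  # -> ['check_weather']
```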
  • Language Modeling. A trendy topic in deep learning and NLP. All the state-of-the-art models you have heard of are based on this concept (the BERT family: ALBERT, RoBERTa; the multitask and few-shot learners: GPT-2). To let machines understand human language better, scientists train them to build vocabularies and statistical models that predict the likelihood of each word in context. A toy counting version is sketched after the figure.

[Image: Language Model, by Catherine Wang]
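A toy counting version of the same objective: estimate P(next word | current word) from bigram counts. BERT- and GPT-style models learn this kind of distribution with neural networks over huge corpora:

```python
# Bigram language model from raw counts.
from collections import Counter, defaultdict

corpus = "play the music . play the song . pause the music".split()

counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def prob(w2: str, w1: str) -> float:
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

print(prob("music", "the"))  # P(music | the) = 2/3 on this toy corpus
```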
  • Multi-Turn Dialog Systems. This is an advanced topic in NLU and conversational AI. It refers to techniques that track and identify changes of topic or intent across a conversation: how to pick up the information in each dialog turn and draw comprehensive logic from the user's compound intent.

[Image: Modeling Multi-turn Conversation with Deep Utterance Aggregation [1]]

In the conversational AI use case, NLU aims to resolve linguistic confusion and ambiguity, generalize verbal understanding, identify domains and intents in human-to-machine dialog, and then extract the critical semantic information.

Apart from the key technologies mentioned above, the AI system needs a useful semantic representation of user queries. The most successful one is “frame semantics,” which uses Domain, Intent, Entity, and Slot to formulate semantic results.

  • Domain: Linked to topic modeling; it groups queries and knowledge resources into different business categories, goals, and corresponding services, for example, “Pre-sale”, “Post-sale”, or “Order and Transaction”.
  • Intent: Linked to intent matching and classification. It refers to a particular task or business process within a domain and is usually written as a verb-object phrase, e.g. “search for songs”, “play the music”, or “favorite the playlist” in the music-player domain.
  • Entity and Slot: Used as parameters to extract critical information within a domain and intent, e.g. “song name”, “singer”.

A sentence such as “What is the weather for Melbourne tomorrow?” can be translated into the structure below:

- Domain: “Weather”

- Intent: “Check the Weather”

- Entity and Slot: (“City”: “Melbourne”, “Date”: “Tomorrow”)

Then the follow-up actions are fulfilled by parsing this structured data; a possible in-code representation is sketched below.
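One way to hold that result in code is a small frame object (the class and field names here are illustrative, not any specific platform's schema):

```python
# A minimal frame-semantics container.
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)

frame = SemanticFrame(
    domain="Weather",
    intent="Check the Weather",
    slots={"City": "Melbourne", "Date": "Tomorrow"},
)
print(frame)
```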

2.3. Dialog Management: another critical part of a conversational AI system, it controls the flow of the dialog between user and agent. In its simplest version, a DM engine remembers the dialog history context, tracks the state of the current dialog, and then applies a dialog policy.

  • Dialog Context: During a user-agent conversation session, all the back-and-forth dialog is remembered in the context. Critical information like domain, intent, entity, and slot is saved in a message queue for in-memory search and retrieval. After the conversation, the dialog context can be persisted in a database for further analysis.
  • Dialog State Tracking: The dialog state tracker remembers the logical flow of the conversation. It makes the agent more intelligent and flexible by tracking the logical turning points across dialogs, then suggesting responses based on long-term memory.
  • Dialog Policy: Based on the context and logical flow of the conversation, the agent needs to prioritize services, trigger certain events, and request fulfillment. Fulfillment actions could include retrieving user information from a database, searching for content in a knowledge-base system, or triggering a third-party API.

For example:

Q: I want to order pizza delivery. (intent=order_pizza, entity_time=null, entity_address=null, entity_type=null)
A: What type of pizza do you want to order? (slot=type, slot=date, slot=address)

Q: Margherita. (intent=order_pizza, entity_time=null, entity_address=null, entity_type=Margherita)
A: What time do you want your pizza to be delivered? (slot=date, slot=address)

Q: ASAP. (intent=order_pizza, entity_time=ASAP, entity_address=null, entity_type=Margherita)
A: Is there anything else you would like to order with your {Margherita} pizza? (follow_up_intent: additional_product)

Q: A bottle of Coke. (intent=order_pizza, entity_time=ASAP, entity_address=null, entity_type=Margherita, additional=coke)
A: What is the address for us to deliver your pizza? (slot=address)

Q: xx.xxx. (intent=order_pizza, entity_time=ASAP, entity_address=xx.xxx, entity_type=Margherita, additional=coke)
A: Thanks, so you ordered a {entity_type} pizza with {additional}, delivered to {entity_address} {entity_time}. (fulfillment: update_order, call_delivery_services)

As you can see, slots and entities are filled in during the conversation, the parent intent can trigger a follow_up intent, and action fulfillment is activated based on the state of the conversation. A minimal sketch of this slot-filling loop follows.
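Here is a minimal sketch of that slot-filling control loop (prompts and slot names are illustrative, not a particular DM engine's API):

```python
# Ask for the next empty slot until the frame is complete, then fulfil.
REQUIRED_SLOTS = ["type", "time", "address"]
PROMPTS = {
    "type":    "What type of pizza do you want to order?",
    "time":    "What time do you want your pizza delivered?",
    "address": "What is the delivery address?",
}

def next_action(state: dict) -> str:
    for slot in REQUIRED_SLOTS:
        if state.get(slot) is None:
            return PROMPTS[slot]  # request the first missing slot
    return f"Confirmed: {state['type']} pizza to {state['address']}, {state['time']}."

state = {"intent": "order_pizza", "type": None, "time": None, "address": None}
print(next_action(state))      # asks for the pizza type
state["type"] = "Margherita"
print(next_action(state))      # asks for the delivery time
```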

2.4. NLG: The natural language generation engine has different implementations and technology stacks depending on the type of chat system. For a task-oriented, closed-domain conversation system, NLG is implemented via response templates whose interchangeable parameters are filled with the “slots” and “entities” extracted from the conversation session (a minimal sketch follows). For an open-domain chat system, text generation would instead be based on information retrieval, machine comprehension, knowledge graphs, etc.
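A minimal template sketch for the closed-domain case (the template text and slot names are illustrative):

```python
# Slot values from the session are substituted into a canned response.
TEMPLATES = {
    "order_confirmation":
        "You ordered a {type} pizza, delivered to {address} {time}.",
}

slots = {"type": "Margherita", "address": "42 Example St", "time": "ASAP"}
print(TEMPLATES["order_confirmation"].format(**slots))
```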

2.5. TTS: The text-to-speech engine performs exactly the opposite task to ASR: it transforms plain text into an audio waveform and plays it back to the end-user in a synthetic voice. A minimal offline sketch follows.
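A minimal offline sketch with the pyttsx3 package (one assumed option among many; cloud platforms offer higher-quality synthetic voices):

```python
# Speak a response out loud with a local synthetic voice.
import pyttsx3

engine = pyttsx3.init()
engine.say("Your Margherita pizza is on its way.")
engine.runAndWait()
```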

Based on the above discussion, the image below offers a more comprehensive and realistic view of a conversational AI system.

[Image: Conversational AI System, by Catherine Wang]

3. Voice User Interface and User Experience Design

GUI (Graphical User Interface) dominates human-machine interaction today. It was the game-changer in the PC world and catalyzed the massive adoption of digital devices in everyday life. We now face screens and interact with them all the time.

But in the next decade, with the advances of AI, human-machine interaction will shift to voice. The Voice User Interface will be the new entry point for smart and IoT devices. For example, when you say “Hey Google,” a Google Home wakes up and starts a conversation with you. In this sense, voice becomes the new mouse and finger.

[Image: photo by Kelly Sikkema, from Unsplash]

In GUI design, all user interactions are pre-defined and guided by a series of clicks or swipes on the screen. In a VUI system, by contrast, users' behaviors are unpredictable and can diverge from the main storyline. Moreover, in open conversation a user might change the topic at any time, and a request might carry compound intents that all need to be fulfilled. Lastly, voice interaction requires constant attention from both user and agent, because both parties need to remember what was said in previous turns.

The most successful conversational AI systems treat voice and graphics as complementary in their UI and UX design. A mature system should combine both to offer end-users a richer, more immersive experience.

Reference

  1. Modeling Multi-turn Conversation with Deep Utterance Aggregation. arXiv:1806.09102 [cs.CL]
