
Multi-lingual Chatbot Using Rasa and Custom Tokenizer

source link: https://www.tuicool.com/articles/IfUJzyb


Oct 2 · 5 min read

Tips and tricks to enhance the Rasa NLU pipeline with your own custom tokenizer for multi-lingual chatbot.


Photo by Alex Knight on Unsplash

I have covered the basic guide to creating your own Rasa NLU server for intent classification and named-entity recognition in the previous tutorial. In this article, we will be looking at the necessary steps to add a custom tokenizer to the Rasa NLU pipeline. This article consists of 5 sections:

  1. Setup
  2. Tokenizer
  3. Registry File
  4. Train and Test
  5. Conclusion

1. Setup

By default, the Rasa framework provides us with four built-in tokenizers:

  • Whitespace Tokenizer
  • Jieba Tokenizer (Chinese)
  • Mitie Tokenizer
  • Spacy Tokenizer

Built-in Tokenizers

If you are testing it on the Chinese language, you can simply change the tokenizer name in the config.yml file to the following and you are good to go.

language: zh
pipeline:
  - name: "JiebaTokenizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "EmbeddingIntentClassifier"

Do not be alarmed if you notice that there are two instances of CountVectorsFeaturizer. According to the official website:

The pipeline uses two instances of CountVectorsFeaturizer . The first one featurizes text based on words. The second one featurizes text based on character n-grams, preserving word boundaries. We empirically found the second featurizer to be more powerful, but we decided to keep the first featurizer as well to make featurization more robust.

Custom Tokenizer

For other languages, we need to modify a few things. You can test this out with any tokenizer, but I will be using a Japanese tokenizer called SudachiPy. I have covered this Python module in the previous article as well. Feel free to check it out. Set up a virtual environment with the necessary modules for the Rasa NLU server. Once you are done, go to the following link and install SudachiPy based on the instructions given. Then, modify the config.yml file.

language: ja
pipeline:
  - name: "JapaneseTokenizer"
  - name: "RegexFeaturizer"
  - name: "CRFEntityExtractor"
  - name: "EntitySynonymMapper"
  - name: "CountVectorsFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: "EmbeddingIntentClassifier"

I have only modified the tokenizer name and the language. You can name it anything you like, but you have to keep it consistent. We will be modifying other files later on, so make sure that you use the same name everywhere. Test the SudachiPy installation by running the following code:

You should be able to see the following result:

['国家', '公務', '員']

2. Tokenizer

We will need to create a tokenizer python file. Go to the directory of your virtual environment and find the following directory:

Lib/site-packages/rasa/nlu/tokenizers/

You should be able to see the following files (minus japanese_tokenizer.py, which we are about to create).


Create a new .py file and name it anything that you prefer (I named it japanese_tokenizer.py). Open it up and add the following code:

import re
from typing import Any, Dict, List, Text

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData

All of the imports are important except for re (regex), which you can omit if your tokenizer does not use it. Let’s continue on to create the main class (JapaneseTokenizer).

class JapaneseTokenizer(Tokenizer, Component):

Inside this class, you will need to have the following methods:

  • __init__
  • train
  • tokenize

You can actually duplicate any existing file and modify it according to your needs. The tokenization should be based on the language and module that you have. One important thing to note is that you need to return a list of Token objects. The Token class consists of the tokenized word and its character offset. Kindly refer to jieba_tokenizer.py and whitespace_tokenizer.py for more information on the structure of a tokenizer class.
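To make the offset bookkeeping concrete, here is a framework-free sketch of how a whitespace tokenizer pairs each word with its starting character offset (plain tuples stand in for Rasa's Token objects):

```python
def whitespace_tokens(text):
    """Pair each whitespace-separated word with its character offset."""
    tokens, search_from = [], 0
    for word in text.split():
        offset = text.index(word, search_from)  # find the word at or after the last match
        tokens.append((word, offset))           # Rasa wraps these as Token(word, offset)
        search_from = offset + len(word)
    return tokens

print(whitespace_tokens("hello rasa nlu"))
# → [('hello', 0), ('rasa', 6), ('nlu', 11)]
```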

My final code looks like this:

3. Registry File

The next step is to modify all the configuration files. Go to the directory of your virtual environment and find the following directory:

Lib/site-packages/rasa/nlu/

You should have the following folders and files


Open up the registry.py file and start to edit the content. At the import area, add the following code:

from rasa.nlu.tokenizers.japanese_tokenizer import JapaneseTokenizer

  • japanese_tokenizer : name of the .py file.
  • JapaneseTokenizer : name of the class. You have to make sure that this name is exactly the same as the one used in config.yml and component_classes.

In the same registry.py file, find component_classes list and add the JapaneseTokenizer to it (modify it according to your own class name).

component_classes = [
    # utils
    SpacyNLP,
    MitieNLP,
    # tokenizers
    JapaneseTokenizer,
    MitieTokenizer,
    SpacyTokenizer,
    WhitespaceTokenizer,
    JiebaTokenizer,
...

Save the file and we are now ready to test it.

4. Train and Test

Open up a terminal in the virtual environment and point it to the base directory of your config.yml file (the one that we modified in the first section). Make sure that you have a data folder with your training data in it. Run the following command:

rasa train nlu

The training might take some time, depending on the number of intents that you have. You should be able to see the following output:


Once it is complete, you should have a tar.gz file in the models folder. Run the following command to test it (modify the name of the model accordingly):

rasa shell -m models/nlu-20190924-144445.tar.gz

It will run an interactive shell mode inside the terminal. You can input your sentence to test the result.

5. Conclusion

Let’s recap the basic steps to set up a Japanese tokenizer for Rasa NLU. First and foremost, we need to modify the config.yml file and install the SudachiPy module.

Then, we created a japanese_tokenizer.py file with the necessary code for initialization, training and tokenization.

We moved on to modify the registry.py file by adding the import statement and a JapaneseTokenizer reference to the component_classes list.

Finally, we trained the model and tested it via the interactive shell mode provided by Rasa.

Hope you enjoyed the article and see you next time!

