Multi-lingual Chatbot Using Rasa and Custom Tokenizer
Oct 2 · 5 min read
Tips and tricks to enhance the Rasa NLU pipeline with your own custom tokenizer for multi-lingual chatbot.
I covered the basic steps to create your own Rasa NLU server for intent classification and named-entity recognition in the previous tutorial. In this article, we will be looking at the steps required to add a custom tokenizer to the Rasa NLU pipeline. This article consists of 5 sections:
- Setup
- Tokenizer
- Registry File
- Train and Test
- Conclusion
1. Setup
By default, the Rasa framework provides us with four built-in tokenizers:
- Whitespace Tokenizer
- Jieba Tokenizer (Chinese)
- Mitie Tokenizer
- Spacy Tokenizer
Built-in Tokenizer
If you are testing it on the Chinese language, you can simply change the tokenizer name in the config.yml file to the following and you are good to go.
```yaml
language: zh
pipeline:
- name: "JiebaTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "EmbeddingIntentClassifier"
```
Do not be alarmed if you notice that there are two instances of CountVectorsFeaturizer. According to the official website:

> The pipeline uses two instances of CountVectorsFeaturizer. The first one featurizes text based on words. The second one featurizes text based on character n-grams, preserving word boundaries. We empirically found the second featurizer to be more powerful, but we decided to keep the first featurizer as well to make featurization more robust.
Custom Tokenizer
For other languages, we need to modify a few things. You can try this with any tokenizer, but I will be using a Japanese tokenizer called SudachiPy. I have covered this Python module in the previous article as well; feel free to check it out. Set up a virtual environment with the necessary modules for the Rasa NLU server. Once you are done, go to the following link and install SudachiPy based on the instructions given. Then, modify the config.yml file.
```yaml
language: ja
pipeline:
- name: "JapaneseTokenizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "CountVectorsFeaturizer"
- name: "CountVectorsFeaturizer"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: "EmbeddingIntentClassifier"
```
I have only modified the tokenizer name and the language. You can name it anything you like, but you have to keep it consistent. We will be modifying other files later on, so make sure that you use the same name. You should also verify that SudachiPy is installed correctly by running the following code:
You should be able to see the following result:
['国家', '公務', '員']
2. Tokenizer
We will need to create a tokenizer Python file. Go to the directory of your virtual environment and find the following directory:

Lib/site-packages/rasa/nlu/tokenizers/
You should be able to see the following files (minus japanese_tokenizer.py). Create a new .py file and name it anything you prefer (I named it japanese_tokenizer.py). Open it up and add the following code:
```python
import re
from typing import Any, Dict, List, Text

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.tokenizers import Token, Tokenizer
from rasa.nlu.training_data import Message, TrainingData
```
All of the imports are important except for re (the regex module). Let's continue by creating the main class (JapaneseTokenizer).
```python
class JapaneseTokenizer(Tokenizer, Component):
```
Inside this class, you will need to implement the following methods:
- __init__
- train
- tokenize
You can actually duplicate any existing file and modify it according to your needs. The tokenization should be based on the language and module that you have. One important thing to note is that tokenize needs to return a list of Token objects. The Token class holds the tokenized word and its offset in the original text. Kindly refer to jieba_tokenizer.py and whitespace_tokenizer.py for more information on the structure of a tokenizer class.
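To make the word-plus-offset idea concrete, here is a tiny standalone sketch (plain whitespace splitting, not tied to any particular language) that produces the same kind of (word, offset) pairs a Rasa Token stores:

```python
import re
from typing import List, Tuple

def tokens_with_offsets(text: str) -> List[Tuple[str, int]]:
    # Each token records its surface form and its start offset in the
    # original string, mirroring Rasa's Token(word, offset) structure.
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]

print(tokens_with_offsets("add custom tokenizer"))
# → [('add', 0), ('custom', 4), ('tokenizer', 11)]
```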
My final code looks like this:
3. Registry File
The next step is to modify all the configuration files. Go to the directory of your virtual environment and find the following directory:
Lib/site-packages/rasa/nlu/
You should have the following folders and files
Open up the registry.py file and start to edit the content. In the import section, add the following code:
from rasa.nlu.tokenizers.japanese_tokenizer import JapaneseTokenizer
- japanese_tokenizer: the name of the .py file.
- JapaneseTokenizer: the name of the class. You have to make sure that this name is exactly the same as the one used in config.yml and component_classes.
In the same registry.py file, find the component_classes list and add JapaneseTokenizer to it (modify it according to your own class name).
```python
component_classes = [
    # utils
    SpacyNLP,
    MitieNLP,
    # tokenizers
    JapaneseTokenizer,
    MitieTokenizer,
    SpacyTokenizer,
    WhitespaceTokenizer,
    JiebaTokenizer,
    ...
```
Save the file and we are now ready to test it.
4. Train and Test
Open up a terminal in your virtual environment and change to the base directory containing your config.yml file (the one that we modified in the first section). Make sure that you have a data folder with your training data in it. Run the following command:
```shell
rasa train nlu
```
The training might take some time depending on the number of intents that you have. You should be able to see the following output:
Once it is complete, you should have a tar.gz file in the models folder. Run the following command to test it (modify the name of the model accordingly):
```shell
rasa shell -m models/nlu-20190924-144445.tar.gz
```
It will run an interactive shell mode inside the terminal. You can input a sentence to test the result.
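If you prefer to test the model programmatically instead of through the shell, Rasa 1.x also exposes an Interpreter class. A sketch, assuming the archive name produced above and that it unpacks to an nlu/ directory:

```python
import tarfile
from rasa.nlu.model import Interpreter

# Interpreter.load expects an unpacked model directory, so extract
# the archive produced by `rasa train nlu` first.
with tarfile.open("models/nlu-20190924-144445.tar.gz") as tar:
    tar.extractall("models/current")

interpreter = Interpreter.load("models/current/nlu")
result = interpreter.parse("国家公務員")
print(result["intent"], result["entities"])
```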
5. Conclusion
Let’s recap the basic steps to set up a Japanese tokenizer for Rasa NLU. First and foremost, we modified the config.yml file and installed the SudachiPy module.
Then, we created a japanese_tokenizer.py file with the necessary code for initialization, training and tokenization.
We moved on to modify the registry.py file by adding the import statement and the JapaneseTokenizer reference to the component_classes list.
Finally, we trained the model and tested it via the interactive shell mode provided by Rasa.
Hope you enjoyed the article and see you next time!
References
- https://towardsdatascience.com/a-beginners-guide-to-rasa-nlu-for-intent-classification-and-named-entity-recognition-a4f0f76b2a96
- https://towardsdatascience.com/sudachipy-a-japanese-morphological-analyzer-in-python-5f1f8fc0c807
- https://gist.github.com/wfng92/831b47df29de687c8ea3264ffb9134ee
- https://github.com/WorksApplications/SudachiPy
- https://rasa.com/docs/rasa/nlu/choosing-a-pipeline/