Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code
source link: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c?gi=74c7875af369
The difference between Unicode normalization forms
Recently I have been working on an NLP task in Japanese, and one problem is converting special characters to a normalized form. So I did a little research and wrote this post for anyone with the same need.
Japanese text mixes different forms of the same character. Latin letters, for example, have two forms: a full-width form and a half-width form. The full-width form is hard to read and also hard to use in downstream processing, so we need to convert it to a normalized form.
TL;DR
Use the NFKC method.

>>> from unicodedata import normalize
>>> s = "株式会社ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ Ｐｕｂｌｉｓｈｉｎｇ"
>>> normalize('NFKC', s)
'株式会社KADOKAWA Future Publishing'
Unicode normalization forms
There are 4 kinds of Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.
First, look at the results below for an intuitive understanding.
アイウエオ ==(NFC)==> アイウエオ
アイウエオ ==(NFD)==> アイウエオ
アイウエオ ==(NFKC)==> アイウエオ
アイウエオ ==(NFKD)==> アイウエオ

パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ

ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFC)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFD)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKC)==> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKD)==> パピプペポ

ａｂｃＡＢＣ ==(NFC)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFD)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFKC)==> abcABC
ａｂｃＡＢＣ ==(NFKD)==> abcABC

１２３ ==(NFC)==> １２３
１２３ ==(NFD)==> １２３
１２３ ==(NFKC)==> 123
１２３ ==(NFKD)==> 123

＋－．～）｝ ==(NFC)==> ＋－．～）｝
＋－．～）｝ ==(NFD)==> ＋－．～）｝
＋－．～）｝ ==(NFKC)==> +-.~)}
＋－．～）｝ ==(NFKD)==> +-.~)}
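A table like this can be reproduced with the standard library's `unicodedata.normalize`; here is a minimal sketch (the sample strings are just illustrations):

```python
from unicodedata import normalize

# A few sample strings: half-width katakana, full-width Latin, full-width digits
samples = ["ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ", "ａｂｃＡＢＣ", "１２３"]

for s in samples:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(f"{s} ==({form})==> {normalize(form, s)}")
```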
There are two classification methods for these 4 forms.
# 1 original form changed or not
- A (not changed): NFC & NFD
- B (changed): NFKC & NFKD

# 2 length of the original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD
1 Whether the original form is changed or not
ａｂｃＡＢＣ ==(NFC)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFD)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFKC)==> abcABC
ａｂｃＡＢＣ ==(NFKD)==> abcABC

# 1 original form changed or not
- A (not changed): NFC & NFD
- B (changed): NFKC & NFKD
The first classification method is based on whether the original form is changed or not. More specifically, the group A forms do not contain a K, while the group B forms do. What does K mean?

D = Decomposition
C = Composition
K = Compatibility

K means compatibility: the K forms additionally replace compatibility characters (such as full-width Latin letters) with their standard equivalents, which is why they change the original form.
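To see what "compatibility" means in practice, here is a small extra illustration (the circled digit ① is my own example, not from the article): a compatibility character survives the K-less forms but is folded by the K forms.

```python
from unicodedata import normalize

s = "①"  # U+2460 CIRCLED DIGIT ONE, a compatibility character
print(normalize("NFC", s))   # ① (unchanged: no compatibility folding)
print(normalize("NFD", s))   # ① (unchanged)
print(normalize("NFKC", s))  # 1 (compatibility decomposition applied)
print(normalize("NFKD", s))  # 1
```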
2 Whether the length of the original form is changed or not
パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ

# 2 length of the original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD
The second classification method is based on whether the length of the original form is changed or not. The group A forms contain C (Composition), which does not change the length. The group B forms contain D (Decomposition), which does change the length.

You might be wondering why the length changes. See the test below.
>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> len(s)
5
>>> len(normalize('NFC', s))
5
>>> len(normalize('NFKC', s))
5
>>> len(normalize('NFD', s))
10
>>> len(normalize('NFKD', s))
10
We can see that the decomposition forms double the length here.
(figure from the article "Unicode正規化とは")
This is because NFD & NFKD decompose each composed character into its base character plus a combining mark. For example, ポ (U+30DD) = ホ (U+30DB) + combining semi-voiced sound mark (U+309A), so the length changes from 5 to 10. NFC & NFKC compose separated Unicode characters back together, so the length is not changed.
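You can verify this decomposition by inspecting the code points (a quick check of my own, not from the article):

```python
from unicodedata import normalize

s = "ポ"
for form in ("NFC", "NFD"):
    out = normalize(form, s)
    print(form, [f"U+{ord(c):04X}" for c in out])
# NFC keeps the single code point U+30DD,
# while NFD yields U+30DB followed by the combining mark U+309A.
```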
Python Implementation
You can use the unicodedata module from the standard library to get the different forms.
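`normalize(form, text)` simply returns a new, normalized string, so it is easy to compare all four forms side by side (a small sketch):

```python
from unicodedata import normalize

s = "パピプペポ"  # five precomposed katakana characters
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = normalize(form, s)
    print(f"{form}: {out!r} (length {len(out)})")
```

Note that the NFD/NFKD results look identical when printed, even though they contain twice as many code points.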
Take Away
Usually, we can use either NFKC or NFKD to get the normalized form. The length difference only matters if your NLP task is length-sensitive. I usually use the NFKC method.
Check out my other posts here!
GitHub: https://github.com/BrambleXu
LinkedIn: www.linkedin.com/in/xu-liang