
Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code

source link: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c?gi=74c7875af369

The difference between Unicode normalization forms


Nov 14 · 4 min read

(Photo by Joel Filipe on Unsplash)

Recently I have been working on an NLP task in Japanese, and one problem is converting special characters to a normalized form. So I did a little research and wrote this post for anyone who has the same need.

Japanese text mixes different forms of the same character; Latin letters, for example, have both a full-width form (ＫＡＤＯＫＡＷＡ) and a half-width form (KADOKAWA).

The full-width form is hard to read and hard to use in downstream processing, so we need to convert it to a normalized form.

TL;DR

Use NFKC method.

>>> from unicodedata import normalize
>>> s = "株式会社ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ Ｐｕｂｌｉｓｈｉｎｇ"
>>> normalize('NFKC', s)
'株式会社KADOKAWA Future Publishing'

Unicode normalization forms

(Figure: table of the four Unicode normalization forms, from Wikipedia)

There are 4 kinds of Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.

First, let's look at the results below for an intuitive understanding; the inputs are half-width katakana and full-width "compatibility" characters.

ｱｲｳｴｵ ==(NFC)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFD)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFKC)==> アイウエオ
ｱｲｳｴｵ ==(NFKD)==> アイウエオ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFC)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFD)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKC)==> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKD)==> パピプペポ
パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ
ａｂｃＡＢＣ ==(NFC)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFD)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFKC)==> abcABC
ａｂｃＡＢＣ ==(NFKD)==> abcABC
１２３ ==(NFC)==> １２３
１２３ ==(NFD)==> １２３
１２３ ==(NFKC)==> 123
１２３ ==(NFKD)==> 123
＋－．～）｝ ==(NFC)==> ＋－．～）｝
＋－．～）｝ ==(NFD)==> ＋－．～）｝
＋－．～）｝ ==(NFKC)==> +-.~)}
＋－．～）｝ ==(NFKD)==> +-.~)}
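A table like the one above can be reproduced with a short loop over the four forms. The sample strings here are my own half-width and full-width examples, chosen to match the pattern of the table:

```python
from unicodedata import normalize

# Half-width katakana, full-width katakana, full-width Latin, full-width digits
samples = ["ｱｲｳｴｵ", "パピプペポ", "ａｂｃＡＢＣ", "１２３"]

for s in samples:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # normalize() returns the string converted to the given form
        print(f"{s} ==({form})==> {normalize(form, s)}")
```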

There are two ways to classify these 4 forms.

# 1 original form changed or not
- A (not changed): NFC & NFD
- B (changed): NFKC & NFKD
# 2 length of original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD

1 Whether the original form is changed or not

abcABC ==(NFC)==> abcABC
abcABC ==(NFD)==> abcABC
abcABC ==(NFKC)==> abcABC
abcABC ==(NFKD)==> abcABC
# 1 original form changed or not
- A(not changed): NFC & NFD
- B(changed): NFKC & NFKD

The first classification is based on whether the original form is changed or not. More specifically, the forms in group A do not contain K, while those in group B do. What does K mean?

D = Decomposition 
C = Composition
K = Compatibility

K means compatibility: the compatibility forms replace "compatibility equivalent" characters (such as full-width letters) with their preferred counterparts, which changes the original form.
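For example, the K forms fold a compatibility character such as a full-width Latin letter into its plain counterpart, while the non-K forms leave it alone. A minimal check:

```python
from unicodedata import normalize

s = "Ａ"  # FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)

# The non-K form keeps the original character
print(normalize("NFC", s))   # Ａ
# The K form folds it to plain LATIN CAPITAL LETTER A (U+0041)
print(normalize("NFKC", s))  # A
```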

2 Whether the length of the original form is changed or not

パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ
# 2 length of original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD

The second classification is based on whether the length of the original form is changed or not. The group A forms end in C (Composition), which does not change the length; the group B forms end in D (Decomposition), which can change the length.

You might be wondering why the length changes. See the test below.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> len(s)
5
>>> len(normalize('NFC', s))
5
>>> len(normalize('NFKC', s))
5
>>> len(normalize('NFD', s))
10
>>> len(normalize('NFKD', s))
10

We can see that the decomposition forms double the length here.

(Figure: composition and decomposition of ポ, from "Unicode正規化とは" (What is Unicode normalization))

This is because NFD & NFKD decompose each precomposed character into two Unicode characters. For example, ポ (U+30DD) = ホ (U+30DB) + combining semi-voiced sound mark (U+309A), so the length changes from 5 to 10. NFC & NFKC compose separated Unicode characters together, so the length is not changed.
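We can confirm this decomposition by printing the code point and official name of each character in the NFD result, using `unicodedata.name` from the standard library:

```python
from unicodedata import name, normalize

# Decompose ポ (U+30DD) and inspect the resulting characters
for c in normalize("NFD", "ポ"):
    print(f"U+{ord(c):04X} {name(c)}")
# U+30DB KATAKANA LETTER HO
# U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
```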

Python Implementation

You can use the built-in unicodedata library to get the different forms.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> normalize('NFC', s)
'パピプペポ'
>>> normalize('NFKD', s)
'パピプペポ'
>>> len(normalize('NFKD', s))
10

Take Away

Usually, we can use either NFKC or NFKD to get the normalized form. The length difference only causes trouble if your NLP task is length-sensitive. I usually use NFKC.
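In practice this usually boils down to a one-line preprocessing helper. A minimal sketch (the function name `normalize_text` is mine, not from any library):

```python
from unicodedata import normalize

def normalize_text(text: str) -> str:
    """Normalize text for NLP preprocessing with NFKC, which folds
    full-width characters to their plain forms and composes kana
    with their sound marks."""
    return normalize("NFKC", text)

print(normalize_text("株式会社ＫＡＤＯＫＡＷＡ"))  # 株式会社KADOKAWA
```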

Check out my other posts here !

GitHub: https://github.com/BrambleXu

LinkedIn: www.linkedin.com/in/xu-liang
