
Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code

source link: https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c?gi=74c7875af369

The difference between Unicode normalization forms


Nov 14 · 4 min read

(Photo by Joel Filipe on Unsplash)

Recently I have been working on an NLP task in Japanese, and one problem is converting special characters to a normalized form. So I did a little research and wrote this post for anyone who has the same need.

Japanese text mixes different forms of the same character; Latin letters, for example, have both a full-width form (ＫＡＤＯＫＡＷＡ) and a half-width form (KADOKAWA).

The full-width form is hard to read and hard to use in downstream processing, so we need to convert it to a normalized form.

TL;DR

Use NFKC method.

>>> from unicodedata import normalize
>>> s = "株式会社ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ Ｐｕｂｌｉｓｈｉｎｇ"
>>> normalize('NFKC', s)
'株式会社KADOKAWA Future Publishing'

Unicode normalization forms

(Figure: table of the four Unicode normalization forms, from Wikipedia)

There are 4 kinds of Unicode normalization forms. This article gives a very detailed explanation, but I will explain the differences in a simple, easy-to-understand way.

First, let's look at the results below for an intuitive understanding; the inputs are half-width katakana and full-width "compatibility" characters.

ｱｲｳｴｵ ==(NFC)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFD)==> ｱｲｳｴｵ
ｱｲｳｴｵ ==(NFKC)==> アイウエオ
ｱｲｳｴｵ ==(NFKD)==> アイウエオ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFC)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFD)==> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKC)==> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ ==(NFKD)==> パピプペポ
パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ
ａｂｃＡＢＣ ==(NFC)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFD)==> ａｂｃＡＢＣ
ａｂｃＡＢＣ ==(NFKC)==> abcABC
ａｂｃＡＢＣ ==(NFKD)==> abcABC
１２３ ==(NFC)==> １２３
１２３ ==(NFD)==> １２３
１２３ ==(NFKC)==> 123
１２３ ==(NFKD)==> 123
＋－．～）｝ ==(NFC)==> ＋－．～）｝
＋－．～）｝ ==(NFD)==> ＋－．～）｝
＋－．～）｝ ==(NFKC)==> +-.~)}
＋－．～）｝ ==(NFKD)==> +-.~)}
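A table like the one above can be reproduced with a short loop over the four forms. The sample strings here are my own half-width and full-width examples, chosen to match the pattern of the table:

```python
from unicodedata import normalize

# Half-width katakana, full-width katakana, full-width Latin, full-width digits
samples = ["ｱｲｳｴｵ", "パピプペポ", "ａｂｃＡＢＣ", "１２３"]

for s in samples:
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        # normalize() returns the string converted to the given form
        print(f"{s} ==({form})==> {normalize(form, s)}")
```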

There are two ways to classify these 4 forms.

# 1 original form changed or not
- A (not changed): NFC & NFD
- B (changed): NFKC & NFKD
# 2 length of original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD

1 Whether the original form is changed or not

abcABC ==(NFC)==> abcABC
abcABC ==(NFD)==> abcABC
abcABC ==(NFKC)==> abcABC
abcABC ==(NFKD)==> abcABC
# 1 original form changed or not
- A(not changed): NFC & NFD
- B(changed): NFKC & NFKD

The first classification is based on whether the original form is changed or not. More specifically, the forms in group A do not contain K, while those in group B do. What does K mean?

D = Decomposition 
C = Composition
K = Compatibility

K means compatibility: the compatibility forms replace "compatibility equivalent" characters (such as full-width letters) with their preferred counterparts, which changes the original form.
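For example, the K forms fold a compatibility character such as a full-width Latin letter into its plain counterpart, while the non-K forms leave it alone. A minimal check:

```python
from unicodedata import normalize

s = "Ａ"  # FULLWIDTH LATIN CAPITAL LETTER A (U+FF21)

# The non-K form keeps the original character
print(normalize("NFC", s))   # Ａ
# The K form folds it to plain LATIN CAPITAL LETTER A (U+0041)
print(normalize("NFKC", s))  # A
```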

2 Whether the length of the original form is changed or not

パピプペポ ==(NFC)==> パピプペポ
パピプペポ ==(NFD)==> パピプペポ
パピプペポ ==(NFKC)==> パピプペポ
パピプペポ ==(NFKD)==> パピプペポ
# 2 length of original form changed or not
- A (not changed): NFC & NFKC
- B (changed): NFD & NFKD

The second classification is based on whether the length of the original form is changed or not. The group A forms end in C (Composition), which does not change the length; the group B forms end in D (Decomposition), which can change the length.

You might be wondering why the length changes. See the test below.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> len(s)
5
>>> len(normalize('NFC', s))
5
>>> len(normalize('NFKC', s))
5
>>> len(normalize('NFD', s))
10
>>> len(normalize('NFKD', s))
10

We can see that the decomposition forms double the length here.

(Figure: composition and decomposition of ポ, from "Unicode正規化とは" (What is Unicode normalization))

This is because NFD & NFKD decompose each precomposed character into two Unicode characters. For example, ポ (U+30DD) = ホ (U+30DB) + combining semi-voiced sound mark (U+309A), so the length changes from 5 to 10. NFC & NFKC compose separated Unicode characters together, so the length is not changed.
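We can confirm this decomposition by printing the code point and official name of each character in the NFD result, using `unicodedata.name` from the standard library:

```python
from unicodedata import name, normalize

# Decompose ポ (U+30DD) and inspect the resulting characters
for c in normalize("NFD", "ポ"):
    print(f"U+{ord(c):04X} {name(c)}")
# U+30DB KATAKANA LETTER HO
# U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
```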

Python Implementation

You can use the built-in unicodedata library to get the different forms.

>>> from unicodedata import normalize
>>> s = "パピプペポ"
>>> normalize('NFC', s)
'パピプペポ'
>>> normalize('NFKD', s)
'パピプペポ'
>>> len(normalize('NFKD', s))
10

Take Away

Usually, we can use either NFKC or NFKD to get the normalized form. The length difference only causes trouble if your NLP task is length-sensitive. I usually use NFKC.
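In practice this usually boils down to a one-line preprocessing helper. A minimal sketch (the function name `normalize_text` is mine, not from any library):

```python
from unicodedata import normalize

def normalize_text(text: str) -> str:
    """Normalize text for NLP preprocessing with NFKC, which folds
    full-width characters to their plain forms and composes kana
    with their sound marks."""
    return normalize("NFKC", text)

print(normalize_text("株式会社ＫＡＤＯＫＡＷＡ"))  # 株式会社KADOKAWA
```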

Check out my other posts here !

GitHub: https://github.com/BrambleXu

LinkedIn: www.linkedin.com/in/xu-liang
