
Proper String Normalization for Comparison Purposes

 3 years ago
source link: https://dzone.com/articles/proper-strings-normalization-for-comparison-purpos


This article exposes some of the most important points to take into consideration when comparing strings from different systems.


TL;DR

In Java, do:

String normalizedString = Normalizer.normalize(originalString, Normalizer.Form.NFKD)
        .replaceAll("[^\\p{ASCII}]", "")
        .toLowerCase()
        .replaceAll("\\s{2,}", " ")
        .trim();

Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters with diacritical signs/accents (like ö, é, À) or ligatures (like æ or ʥ). Characters can be stored in UTF-8 (for instance) and associated glyphs can be displayed properly if the font supports them.

However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.

The human brain is a gap-filling machine. Hence it has absolutely no problem reading or typing 'e' instead of 'ê'.

But what if the word 'tête' ('head' in French) is correctly stored in a UTF-8 encoded database, but you have to compare it with text typed by an end-user who left out the accent?

We also have to deal with legacy systems or modern ones filled up with legacy data that doesn't support the Unicode standard.

Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some descriptions contain ligatures (several letters joined together into a single character, like 'Œuf', egg in French). Like most French people, I have no idea how to produce such a character, even on a French keyboard. I would search the item descriptions using 'oeuf'. Obviously, our code has to take care of ligatures if we want the results to include 'Œuf'.

How can we fix that mess?

Rule #1: Don't Even Compare Human Text if You Can

When you can, never compare strings coming from heterogeneous systems. It is surprisingly tricky to do properly (even if most cases can be handled, as we will see below). Instead, compare sequences, UUIDs, or any other ASCII-based identifiers free of spaces and 'special' characters. Strings that come from different information systems will probably store data differently (lower/upper case, with/without diacritics, etc.). Good IDs, on the contrary, are made of plain ASCII and are thus free from encoding issues.

Example:

System 1 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œuf brouillé"}

System 2 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}

If you compare IDs, everything is simple and you can go home early. If you compare descriptions, you'll have to normalize them first or you'll be in big trouble.

Character normalization is the computation of a canonical form of a string. To avoid false mismatches when comparing strings coming from several information systems, normalize both strings and compare the results of their normalization.

In the previous example, we would compare normalize("Œuf brouillé") with normalize("OEUF BROUILLE"). With a proper normalization function, both sides reduce to 'oeuf brouille', but if the normalization function is buggy or partial, the strings will mismatch. For example, if normalize() doesn't handle ligatures properly, you would get a false negative by comparing 'œuf brouille' with 'oeuf brouille'.
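As a sketch, the comparison above could look like this in Java (the class and method names are illustrative; note the manual 'Œ' -> 'OE' replacement, needed because 'Œ' has no Unicode decomposition mapping and would otherwise be dropped by the ASCII filter):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    // Sketch of a normalize() function: manual ligature mapping, then NFKD
    // decomposition, diacritic removal, lower-casing, and space cleanup.
    static String normalize(String s) {
        return Normalizer.normalize(s.replace("Œ", "OE").replace("œ", "oe"),
                        Normalizer.Form.NFKD)
                .replaceAll("[^\\p{ASCII}]", "") // drop the combining accents
                .toLowerCase()
                .replaceAll("\\s{2,}", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Œuf brouillé"));  // oeuf brouille
        System.out.println(normalize("OEUF BROUILLE")); // oeuf brouille
    }
}
```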

Rule #2: Normalize in Memory

It is better to compare strings at the last possible moment, in memory, rather than normalizing them at storage time. There are at least two reasons for this:

  1. If you only store a normalized version of your string, you lose information. You may need the proper diacritics later, for display purposes or other reasons. As an IT professional, one of your tasks is never to lose information humans have provided.

  2. What if some items have been stored before the normalization routine has been set up? What if the normalization function changed over time?

To avoid these common pitfalls, simply compare normalize(<data system 1>) with normalize(<data system 2>) in memory. The CPU overhead is negligible unless you compare thousands of items per second.
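A minimal sketch of such an in-memory comparison helper (the class and method names are illustrative, not from a library):

```java
import java.text.Normalizer;

// Normalization happens in memory, at comparison time, so the stored
// values keep their original accents and casing.
public final class TextComparator {
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("[^\\p{ASCII}]", "")
                .toLowerCase()
                .replaceAll("\\s{2,}", " ")
                .trim();
    }

    // Compare two raw strings coming from different systems.
    static boolean sameText(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameText("tête", "TETE"));  // true
        System.out.println(sameText("tête", "tente")); // false
    }
}
```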

Rule #3: Always Trim Externally and Internally

Another common trap when dealing with strings typed by humans is the presence of spaces at the beginning or in the middle of a sequence of characters.

As an example, look at these strings: ' William' (note the space at the beginning), 'Henry ' (note the space at the end), 'Gates  III' (see the double space in the middle of this family name; did you notice it at first?).

Appropriate solution:

  1. Trim the text to remove spaces at the beginning and at the end of the text.
  2. Remove superfluous spaces in the middle of the string.

In Java, one way to achieve this is:

s = s.replaceAll("\\s{2,}", " ").trim();
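A quick demonstration on the strings above (the class name is illustrative):

```java
public class TrimDemo {
    // Collapse inner whitespace runs to a single space,
    // then strip leading and trailing spaces.
    static String clean(String s) {
        return s.replaceAll("\\s{2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean(" William Henry  Gates  III ") + "]");
        // prints [William Henry Gates III]
    }
}
```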

Rule #4: Harmonize Letter Casing

This is the most well-known and straightforward normalization method: simply put every letter into lower or upper case. As far as I know, neither choice is preferred over the other. Most developers (myself included) use lower case.

In Java, just use toLowerCase():

s = s.toLowerCase();

Rule #5: Transform Characters With Diacritical Signs to ASCII

When typed out, diacritical signs are often omitted in favor of their ASCII version. For example, one can type the German word 'schon' instead of 'schön'.

Unicode defines four normalization forms that may be used for this purpose (NFC, NFD, NFKC, and NFKD). Check out this enlightening illustration.

Detailing all these forms is beyond the scope of this article, but, basically, some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, 'é' can be encoded as the single code point \u00e9 or as the decomposed form '\u0065' (the letter 'e') followed by '\u0301' (the combining acute accent '◌́').
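A small demonstration of the composed vs. decomposed forms (assuming a UTF-8 source file; the class name is illustrative):

```java
import java.text.Normalizer;

public class DecomposeDemo {
    public static void main(String[] args) {
        String composed = "\u00e9";    // 'é' as a single, precomposed code point
        String decomposed = "e\u0301"; // 'e' followed by the combining acute accent
        // The two render identically but are different code point sequences:
        System.out.println(composed.equals(decomposed)); // false
        // After NFD, the precomposed form is expanded and the strings match:
        String nfd = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(nfd.equals(decomposed)); // true
    }
}
```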

We will apply the NFD ("Canonical Decomposition") normalization form to the initial text to make sure that every accented character is converted to its decomposed form. Then, all we have to do is drop the combining diacritics and keep only the 'base' characters.

In Java, both operations can be done this way:

s = Normalizer.normalize(s, Normalizer.Form.NFD)
    .replaceAll("[^\\p{ASCII}]", "");

Note: even though this code covers the issue, I prefer the NFKD transformation, which also deals with compatibility ligatures (see below).

Rule #6: Decompose Ligatures to a Set of ASCII Characters

The other thing to understand is that Unicode maintains compatibility mappings between about 5,000 'composite' characters (like ligatures or precomposed Roman numerals) and lists of regular characters.

For instance, the Roman numeral Ⅻ (U+216B) can be decomposed by NFKD normalization into an 'X' and two 'I's.

For these kinds of 'Siamese twin' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form, which both decomposes characters (see 'Rule #5') and maps compatibility ligatures to several 'base' characters. You can then drop the remaining diacritics. Be aware, however, that some ligature-like characters such as 'œ' and 'æ' are classified as ordinary letters in Unicode and have no decomposition mapping at all, so they must be replaced by hand (e.g. 'œ' -> 'oe').

In Java, use:

s = Normalizer.normalize(s, Normalizer.Form.NFKD)
    .replaceAll("[^\\p{ASCII}]", "");
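A short demonstration of what NFKD does, and does not, decompose (the class name is illustrative):

```java
import java.text.Normalizer;

public class LigatureDemo {
    public static void main(String[] args) {
        // Compatibility characters decompose under NFKD:
        System.out.println(Normalizer.normalize("\u216B", Normalizer.Form.NFKD)); // Ⅻ -> XII
        System.out.println(Normalizer.normalize("\uFB01", Normalizer.Form.NFKD)); // ﬁ -> fi
        // Caution: 'œ' (U+0153) is an ordinary letter with no decomposition
        // mapping, so NFKD leaves it alone and the ASCII filter drops it:
        System.out.println(Normalizer.normalize("\u0153uf", Normalizer.Form.NFKD)
                .replaceAll("[^\\p{ASCII}]", "")); // "uf", not "oeuf"!
    }
}
```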

Rule #7: Beware Punctuation

This is a more minor issue, but depending on the context you may want to normalize some special punctuation characters as well.

For example, in a literary context, like a text-revision software, it may be a good idea to map the em/long dash ('—') character to the regular ASCII hyphen ('-').

As far as I know, Unicode doesn't provide a mapping for this, so you'll have to do it yourself the good old way:

s = s.replaceAll("—", "-");
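If several punctuation characters need mapping, a small hand-made table may be clearer. This is an illustrative sketch; the entries are arbitrary examples, not an exhaustive list:

```java
import java.util.Map;

public class PunctuationMap {
    // Hand-made mapping for characters that Unicode normalization
    // does not decompose (example entries only).
    static final Map<String, String> REPLACEMENTS = Map.of(
        "\u2014", "-",  // em dash
        "\u2013", "-",  // en dash
        "\u2019", "'",  // right single quotation mark (curly apostrophe)
        "\u00AB", "\"", // « left guillemet
        "\u00BB", "\""  // » right guillemet
    );

    static String replacePunctuation(String s) {
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(replacePunctuation("l\u2019art \u2014 vite")); // l'art - vite
    }
}
```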

Final Word

String normalization is very helpful when comparing strings issued from different systems or when performing fuzzy comparisons. Even fully English-localized projects can benefit from it, for instance to take care of casing and stray spaces, or when dealing with foreign words with accents.

This article exposes some of the most important points to take into consideration, but it is far from exhaustive. For instance, we omitted Asian character handling and the cultural normalization of semantically equivalent items (like 'St' as an abbreviation of 'Saint'), but I hope it is a good start for most projects.
