0

isspace()

 11 months ago
source link: https://www.evanjones.ca/isspace_locale.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

[ 2023-June-06 09:41 ]

This is a post for myself, because I wasted a lot of time understanding this bug, and I want to be able to remember it in the future. I expect close to zero others to be interested. The C standard library function isspace() returns a non-zero value (true) for the six "standard" ASCII white-space characters ('\t', '\n', '\v', '\f', '\r', ' '), and any locale-specific characters. By default, a program starts in the "C" locale, which will only return true for the six ASCII white-space characters. However, if the program changes locales, it can return true for other values. As a result, unless you really understand locales, you should use your own version of this function, or ICU4C's u_isspace() function. An implementation of isspace() for ASCII is one line:

/* Returns true for the 6 ASCII white-space characters: \t \n \v \f \r ' '. */
int isspace_ascii(int c)
{
  return c == '\t' || c == '\n' || c == '\v' || c == '\f' || c == '\r' || c == ' ';
}

I ran into this because On Mac OS X, Postgres switches to the system's default locale, which is something that uses UTF-8 (e.g. en_US.UTF-8, fr_CA.UTF-8, etc). In this case, isspace() returns true for Unicode white-space values, which includes 0x85 = NEL = Next Line, and 0xA0 = NBSP = No-Break Space. This caused a bug in parsing Postgres Hstore values that use Unicode. I have attempted to submit a patch to fix this (mailing list post, commitfest entry).

For a program to demonstrate the behaviour on different systems, see isspace_locale on Github.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK