
C++ Unicode UTF-16 encoding




I have a wide-char string, L"hao123--我的上网主页", and it must be encoded as "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "\uNNNN" format for encoding Unicode UTF-16 code points. A website tells me these are JavaScript escapes, but I don't know how to produce this encoding in C++.

Is there any library that can do this? Or can you give me some tips?

Thanks, my friends!


Embedding Unicode in wide string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits or that its encoding will be UTF-16. While this may be the case on Windows with Microsoft Visual C++ (one particular C++ implementation), wchar_t is 32 bits under GCC on OS X (another implementation). If you have some sort of localized string constants, it is best to keep them in a configuration file in a particular encoding and to interpret them as having been encoded in that encoding. The International Components for Unicode (ICU) library provides good support for interpreting and handling Unicode. Another good library for converting between (but not interpreting) encodings is libiconv.
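To make the portability point concrete, here is a minimal sketch (standard C++ only) that simply reports how wide wchar_t is on the current platform; the point is that the value it prints is implementation-dependent:

#include <cstdio>

int main() {
    // Typically prints 16 on Microsoft Visual C++ (UTF-16 code units)
    // and 32 on GCC for OS X/Linux (UTF-32 code points).
    std::printf("wchar_t is %u bits wide\n",
                static_cast<unsigned>(sizeof(wchar_t) * 8));
    return 0;
}

Any code that silently assumes the first answer will break when compiled against an implementation that gives the second.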

Edit
It is possible I am misinterpreting your question... if the problem is that you already have a string in UTF-16 and you want to convert it to "unicode-escaped ASCII" (i.e. an ASCII string where Unicode characters are represented by "\u" followed by the numeric value of the character), then use the following pseudo-code:

for each codepoint represented by the UTF-16 encoded string:
    if the codepoint is in the range [0, 0x7F]:
        emit the codepoint cast to a char
    else:
        emit "\u" followed by the hexadecimal digits representing the codepoint

Now, to get the codepoint there is a very simple rule: each element in the UTF-16 string is a codepoint, unless it is part of a "surrogate pair", in which case it and the element after it together encode a single codepoint. In that case, the Unicode standard defines a procedure for combining the "leading surrogate" and the "trailing surrogate" into a single codepoint. Note that UTF-8 and UTF-16 are both variable-length encodings; a codepoint requires 32 bits if it is not represented with a variable-length encoding. The Unicode Transformation Format (UTF) FAQ explains the encodings as well as how to identify surrogate pairs and how to combine them into codepoints.
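Here is one way the pseudo-code and the surrogate-pair rule above might look as real C++. This is a minimal sketch assuming a C++11 compiler, with the input already held in a std::u16string (on Windows a 16-bit wchar_t string could simply be copied into one); the function name escape_utf16 is just for illustration, not from any library:

#include <cstddef>
#include <cstdio>
#include <string>

// Convert a UTF-16 string to ASCII with "\u"-style escapes, following
// the pseudo-code above: ASCII codepoints are emitted as-is, everything
// else as "\u" plus the codepoint's hexadecimal digits.
std::string escape_utf16(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a leading/trailing surrogate pair into a single
        // codepoint, per the rule from the Unicode standard.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            char32_t lead = cp;
            char32_t trail = in[++i];
            cp = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        }
        if (cp <= 0x7F) {
            out += static_cast<char>(cp);   // plain ASCII, emit directly
        } else {
            char buf[12];
            std::snprintf(buf, sizeof(buf), "\\u%04X",
                          static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

int main() {
    std::u16string s = u"hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875";
    std::printf("%s\n", escape_utf16(s).c_str());
    // prints: hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
}

One thing to watch: codepoints above U+FFFF come out as five or six hex digits here, exactly as the pseudo-code specifies; strict JavaScript-style escapes would instead keep the surrogate pair as two separate four-digit "\u" escapes.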

