UNICODE charsets and widechars
Vadim Zeitlin
vadim at wxwindows.org
Tue Oct 3 10:21:16 PDT 2006
On Tue, 03 Oct 2006 18:25:56 +0200 Manuel Martín <mmartin at ceyd.es> wrote:
MM> So I decided to write some clarifying text, but I also think an
MM> "expert" should confirm it.
It's globally correct but a few remarks are in order:
MM> IMHO the docs use somewhat dated vocabulary: when they say "converts between
MM> the UTF-8 encoding and Unicode", I think this is not fully valid,
MM> because UTF-8 is part of UNICODE.
No, UTF-8 is just one possible encoding of Unicode, as you say yourself
below. In general, "Unicode" in relation to wxWidgets means "wchar_t".
While this is not totally correct either (especially under Windows, which
uses a 16-bit wchar_t), it's more or less true as there is a one-to-one
mapping between at least the BMP (the Basic Multilingual Plane) or, in the
case of Unix systems where wchar_t is 32 bits, the entire Unicode code space,
and the wide characters.
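
To make the difference between a Unicode code point and one of its encodings
concrete, here is a minimal sketch in standard C++ (no wxWidgets involved).
The function name EncodeUtf8 is made up for this example; it turns a single
code point into the byte sequence that UTF-8 uses for it.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a wxWidgets function): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string EncodeUtf8(unsigned long cp)
{
    // Valid code points are 0..0x10FFFF, excluding the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {
        // ASCII range: 1 byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Rest of the BMP: 3 bytes.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Outside the BMP: 4 bytes.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, the code point U+20AC (the euro sign) is a single character in
Unicode terms but becomes the three bytes E2 82 AC in UTF-8, which is why
"converting between UTF-8 and Unicode" is a meaningful operation.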
MM> (E) 'UTF-16' : An UNICODE encoding that uses 2 bytes for each char.
At least 2 bytes. Characters outside the BMP are encoded as surrogate pairs
and need 4.
MM> This allows more than 65000 chars, but may be not
MM> enough (i.e. Chinese needs more graphs).
Not the usual Chinese ideograms though, they're part of the BMP. So "most"
of the commonly used symbols need only 2 bytes in UTF-16.
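
As an illustration of the "2 bytes or more" point, the following is a minimal
sketch of UTF-16 encoding in standard C++. EncodeUtf16 is a name invented for
this example, and it assumes that unsigned short is 16 bits wide, which holds
on the usual desktop platforms.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a wxWidgets function): encode one Unicode
// code point as one or two 16-bit UTF-16 code units.
std::vector<unsigned short> EncodeUtf16(unsigned long cp)
{
    // Valid code points are 0..0x10FFFF, excluding the surrogate range.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::vector<unsigned short> units;
    if (cp < 0x10000) {
        // Inside the BMP: the code point is its own single code unit.
        units.push_back(static_cast<unsigned short>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<unsigned short>(0xD800 | (cp >> 10)));
        units.push_back(static_cast<unsigned short>(0xDC00 | (cp & 0x3FF)));
    }
    return units;
}
```

A common ideogram such as U+4E2D fits in one unit (2 bytes), while U+20000,
the first code point of CJK Extension B, comes out as the pair D840 DC00
(4 bytes).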
MM> (H) 'widechar': A name for how an OS manages UTF-16 or UTF-32 chars. You
MM> use 'wchar' in your code and Windows XP understands it
MM> as a 2-byte char and some Unices as a 4-byte char.
It's wchar_t and, AFAIK, all Unices use a 32-bit wchar_t.
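
If you'd rather check than rely on my "AFAIK", a small sketch like the one
below reports what the compiler at hand actually does. It uses only standard
C++; the function names are invented for this example.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Number of bits in wchar_t for the platform this is compiled on.
std::size_t WideCharBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: how many wchar_t units one code point occupies
// here, assuming a 16-bit wchar_t holds UTF-16 and a wider one holds
// the code point directly.
std::size_t WideCharUnitsNeeded(unsigned long cp)
{
    assert(cp <= 0x10FFFF);
    // Unicode needs 21 bits; anything narrower must use surrogate pairs
    // for code points outside the BMP.
    if (WideCharBits() >= 21 || cp < 0x10000)
        return 1;
    return 2;
}
```

With a 16-bit wchar_t, WideCharUnitsNeeded(0x20000) returns 2; with a 32-bit
one it returns 1.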
MM> (I) 'multibyte': A sequence of bytes that is supposed to be an UNICODE
Not necessarily, there are many multibyte non-Unicode encodings.
MM> 1) If you want to use UNICODE chars, compile wx and your app passing
MM> _UNICODE to compiler.
There should rarely be a need to define this directly. If you use the VC IDE
project files, you just select one of the "Unicode" build configurations. If
you're under Unix, configure the library with the --enable-unicode switch.
MM> 2) Use macro wxT() for all strings. Use wxString instead of C style
MM> strings.
Maybe not quite all, but it's surely good advice to do it by default.
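
The reason the advice works is that a macro of this kind lets the same source
compile in both build modes. The sketch below is not the wxWidgets definition:
MY_UNICODE_BUILD, MY_T, MyChar and MyString are all made-up names standing in
for the build switch, wxT(), wxChar and wxString respectively, so that the
example compiles with nothing but the standard library.

```cpp
#include <cassert>
#include <string>

// Made-up stand-ins for illustration only; the real wxWidgets headers
// define their own switch, macro and types.
#ifdef MY_UNICODE_BUILD
    typedef wchar_t MyChar;
    #define MY_T(s) L##s   // wide literal in the "Unicode" build
#else
    typedef char MyChar;
    #define MY_T(s) s      // plain literal otherwise
#endif

typedef std::basic_string<MyChar> MyString;

// The same line of user code yields a wide or a narrow string
// depending only on how the program was built.
MyString Greeting()
{
    MyString s = MY_T("Hello");
    assert(s.size() == 5); // 5 characters in either build
    return s;
}
```

Compiling with -DMY_UNICODE_BUILD switches Greeting() to a wchar_t-based
string without touching the function body.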
MM> 3) If you need an ASCII-E char in one literal string don't write it
MM> directly (your compiler may raise an error). You can pass a charset
MM> parameter to compiler, but it is preferred to tell wxString the charset
MM> to use (see wxString constructors with conversion).
Yes. You can also encode wide chars inside the program using the \uxxxx escape
sequence, although older compilers don't support this.
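
A minimal sketch of the \uxxxx form in standard C++ follows, assuming a
compiler that accepts universal character names in wide literals. The
function names are invented for the example.

```cpp
#include <cassert>
#include <cwchar>

// "cafe" with an acute accent on the final e, written without putting
// any non-ASCII byte into the source file: \u00e9 is U+00E9.
const wchar_t* CafeWide()
{
    return L"caf\u00e9";
}

// Checks that the escape produced exactly one wide character whose
// value is the code point itself.
bool LastCharIsEAcute()
{
    const wchar_t* s = CafeWide();
    std::size_t n = std::wcslen(s);
    assert(n == 4);
    return s[n - 1] == 0xE9;
}
```

Because the source stays pure ASCII, the result does not depend on which
charset the compiler assumes for the file.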
Regards,
VZ
--
TT-Solutions: wxWidgets consultancy and technical support
http://www.tt-solutions.com/