[wx-dev] Improving wxString proposal (Re: #9672: wx 2.9
Ansi/Unicode Combi-wxString)
Jeff Tupper
tupperlists at gmail.com
Tue Jul 8 16:20:29 PDT 2008
On Tue, Jul 8, 2008 at 3:01 PM, Vadim Zeitlin <vadim at wxwidgets.org> wrote:
> On Tue, 8 Jul 2008 14:49:12 -0700 Jeff Tupper <tupperlists at gmail.com> wrote:
>
> JT> Should it be noted that different indexing semantics are used for
> JT> UTF-16 (per-code unit) than for UTF-8 (per-code point)?
>
> Do you mean the absence of surrogate support under Windows or something
> else here?
What do you mean by "absence of surrogate support"? Per-code unit
semantics for indexing are common, just not what was chosen for wx and
utf-8. With utf-16 per-code unit indexing, each half of a surrogate
pair gets its own, consecutive index position.
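To make the two indexing semantics concrete, here is a minimal sketch in
plain standard C++. It is not wx code: every function name in it is made
up for illustration, and it assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Per-code unit indexing: O(1), but index i may land on half of a
// surrogate pair.
char16_t CodeUnitAt(const std::u16string& s, std::size_t i)
{
    assert(i < s.size());
    return s[i];
}

// True if the UTF-16 code unit is one half of a surrogate pair.
bool IsSurrogate(char16_t u)
{
    return u >= 0xD800 && u <= 0xDFFF;
}

// Per-code point counting over UTF-8: every byte that is not a
// continuation byte (10xxxxxx) starts a new code point.
std::size_t CountCodePointsUtf8(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// 'a' followed by U+1D11E (a character outside the BMP), in both encodings.
std::u16string SampleUtf16() { return u"a\U0001D11E"; }
std::string SampleUtf8() { return "a\xF0\x9D\x84\x9E"; }
```

With per-code unit indexing the UTF-16 sample has three index positions
(positions 1 and 2 are the two surrogate halves). With per-code point
indexing the same text has two positions, even though its UTF-8 form
occupies five bytes.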
> JT> > and OS X (in contrast to UCS4) with an identical interface in
> JT> > either case.
> JT>
> JT> Why is utf-8 being proposed as the standard on wxmac?
>
> For the same reasons as with wxGTK: it's used internally by the system.
Used "internally"? I guess what you mean is that it's used by /
available via an API.
> JT> > d) For people requiring O(1) access to Unicode strings under Linux
> JT> > and OS X, the library can still be compiled in wchar_t/UCS4 mode.
> JT> > Alternatively, strings can be converted to std::wstring and then
> JT> > processed further (and later hopefully also to char16_t based
> JT> > strings).
> JT>
> JT> Even if O(1) random access isn't required, O(1) character modification
> JT> may be.
>
> It is much more rare to modify the string character by character than to
> examine it in this way.
Huh? Even assuming it is, so what? How much rarer? (For those new to
this discussion, see my earlier post regarding average case analysis
and the design of libraries.)
> JT> Index caching might help with the former, but not the latter.
>
> Why do you say this? If the old and new character have the same length
> when encoded in UTF-8 I don't see why not.
Of course. But what if the new character and the old character do not
have the same length?
I've given examples before; for those new to this discussion, see my
earlier posts for other examples. (If you have 1,000 non-ASCII
characters in a 200,000 character string [perhaps heading to
wxHtmlWindow], it wouldn't be surprising to see in-place modification
slow down by a factor of 1,000 when moving from wchar_t to utf-8.)
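A minimal sketch of why the lengths matter, in plain standard C++ over a
std::string holding utf-8. This is not how wxString is implemented. The
function names are invented for illustration, and the input is assumed
to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte length of the UTF-8 sequence introduced by lead byte `lead`.
// Assumes `lead` really is a lead byte of well-formed UTF-8.
std::size_t Utf8SeqLen(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    return 4;
}

// Replaces the code point at index cpIndex with `encoded`, which must be
// the UTF-8 encoding of a single code point. Returns the number of
// trailing bytes that have to be shifted: 0 when the old and new
// encodings have the same length, otherwise the whole tail of the string.
std::size_t ReplaceCodePointUtf8(std::string& s, std::size_t cpIndex,
                                 const std::string& encoded)
{
    // Without an index cache, finding the byte offset is a linear scan.
    std::size_t pos = 0;
    for (std::size_t i = 0; i < cpIndex; ++i) {
        assert(pos < s.size());
        pos += Utf8SeqLen(static_cast<unsigned char>(s[pos]));
    }
    assert(pos < s.size());

    const std::size_t oldLen = Utf8SeqLen(static_cast<unsigned char>(s[pos]));
    const std::size_t tail = s.size() - (pos + oldLen);

    // std::string::replace moves the tail when the lengths differ.
    s.replace(pos, oldLen, encoded);
    return encoded.size() == oldLen ? 0 : tail;
}
```

An index cache can remove the scan, but not the shift. Each replacement
that changes the encoded length moves everything after it, so k such
replacements in a string of n bytes cost on the order of k*n byte moves.
With a fixed-width encoding the same k replacements are k stores.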
> JT> If one doesn't want to write and maintain two copies of
> JT> string algorithms, should one work only with std::wstrings and then
> JT> convert back and forth all the time (even for short strings)? If we
> JT> have a large wxString to work with on linux, we've gone from one
> JT> conversion (wchar_t -> utf-8) done automatically by wx to two (utf-8
> JT> -> wchar_t -> utf-8) not done automatically by wx?
>
> First of all, it is done automatically by wx: std::wstring is implicitly
> convertible to wxString.
That's good (for the most / better part --- I've had to change some of
my code due to the automatic conversion constructors of wxString, but
I don't mind doing that as long as it's not excessive).
What about the reverse direction, from wxString to std::wstring?
> Second, if you count UTF-8 conversion now why
> didn't you count it before -- just because it happened inside wxWidgets and
> not at the boundary of wxWidgets API?
I was counting it, and also assuming that the big wxString is going to
a utf-8 consumer. Let's consider a big wxString being passed to a
wxHtmlWindow.
Before: conversion from wchar_t to utf-8 is done somewhere within wx.
One conversion. (One conversion required; perhaps it's done on each
redraw - I haven't checked; I wouldn't be surprised if the conversion
is done in chunks either way...)
Now: conversion from utf-8 to wchar_t for work, then conversion back
to utf-8 for passing to wxHtmlWindow. Two conversions. (Assuming wx
doesn't convert away from utf-8 within wxHtmlWindow.)
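The "Now" case can be sketched in plain standard C++ as follows. Nothing
here is wx API: the converters are hand-written stand-ins, char32_t is
used in place of a 32-bit wchar_t, the input is assumed to be well-formed
utf-8, and the "work" step is an arbitrary per-character edit chosen for
illustration.

```cpp
#include <cstddef>
#include <string>

// Conversion 1: utf-8 -> fixed-width code points.
std::u32string DecodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Conversion 2: fixed-width code points -> utf-8.
std::string EncodeUtf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Hypothetical application routine: decode, modify in place, re-encode.
std::string ProcessViaUtf32(const std::string& utf8)
{
    std::u32string work = DecodeUtf8(utf8);   // first conversion
    for (char32_t& c : work)
        if (c == U'e')
            c = U'\u00E9';                    // O(1) in-place store
    return EncodeUtf8(work);                  // second conversion
}
```

Both conversions are linear in the length of the string, and both now
sit in application code rather than inside the library.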
I guess what you'd argue is that not enough application code has been
re-written. Application developers using in-place string modification
algorithms should attempt to keep big strings that will be used with
in-place modification routines in wchar_t encoding as much as possible
and convert to wxString just before passing to wxWidgets. (i.e.
attempt to divide all strings in the application into two classes
based on size and use; rather than deal with this issue once in
wxWidgets, deal with it on a per-application, per-string basis.)
> Third, obviously you're not forced to
> use std::wstring for short strings and normally you won't do it.
Then where do the string algorithms come from? I don't want to write
and maintain two sets of string algorithms.
> I do understand that you're against using UTF-8 in wx
I'm not against using utf-8 in wx. I'm against it being forced on us
as the only encoding used for wxStrings. I use utf-8 as a method for
passing data between programs and libraries. Having some support for
utf-8 in wx is certainly necessary.
> as you have repeatedly stated it and so I don't have much hope of changing your
> opinion.
Believe it or not, I'm quite open to changing my mind. If utf-8
support is forced on us:
- I'll likely continue to use wx and will rewrite my string algorithms
so that in-place modification is avoided. The end result of the change
will be that string operations will be somewhat slower than they are
now, but as my applications don't do much string processing it's not
that big a deal for me. (Without rewriting, in-place modification on
large strings freezes my application on trivial tasks. A task that
used to take a fraction of a second takes more than a minute with
utf-8.)
- Hopefully not many wx users and developers will leave and the
project won't fork.
The existence of --disable-utf8 makes most of the changes good changes.
> But maybe you should look at this issue again more closely because
> the above paragraph contains several factual errors, which shows that it's
> possibly not as bad as you're willing to believe.
It's not my first utf-8 post with factual errors. Utf-8 wxStrings are
certainly not as bad as I'm willing to believe, although they're worse
than I expected.