[wx-dev] Improving wxString proposal (Re: #9672: wx 2.9
Ansi/Unicode Combi-wxString)
Vadim Zeitlin
vadim at wxwidgets.org
Tue Jul 8 17:06:35 PDT 2008
On Tue, 8 Jul 2008 16:20:29 -0700 Jeff Tupper <tupperlists at gmail.com> wrote:
JT> On Tue, Jul 8, 2008 at 3:01 PM, Vadim Zeitlin <vadim at wxwidgets.org> wrote:
JT> > On Tue, 8 Jul 2008 14:49:12 -0700 Jeff Tupper <tupperlists at gmail.com> wrote:
JT> >
JT> > JT> Should it be noted that different indexing semantics are used for
JT> > JT> UTF-16 (per-code unit) than for UTF-8 (per-code point)?
JT> >
JT> > Do you mean the absence of surrogate support under Windows or something
JT> > else here?
JT>
JT> What do you mean by "absence of surrogate support" ? Per-code unit
JT> semantics for indexing is common, just not what was chosen for wx and
JT> utf-8. With utf-16 per-code unit indexing, each half of a surrogate
JT> pair gets separate, consecutive index positions.
I mean exactly this by lack of surrogate support (do you imply that this
behaviour could be desirable?). I don't think the current situation is
ideal, but it's no worse than before and I have neither the time nor the
motivation to fix it.
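To make the distinction concrete, here is a minimal sketch in standard C++11 (not wx code) of what per-code-unit indexing does to a character outside the BMP. The function names and the choice of U+1D11E as the test character are only illustrative assumptions, and both counters assume well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-16 by skipping the low (trailing)
// half of each surrogate pair.
std::size_t CountCodePointsUtf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF)  // not a low surrogate
            ++n;
    }
    return n;
}

// Count code points in well-formed UTF-8 by skipping continuation bytes
// (those of the form 10xxxxxx).
std::size_t CountCodePointsUtf8(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// Illustration: "a", U+1D11E, "b" is 3 code points but 4 UTF-16 code units.
void SurrogateDemo()
{
    const std::u16string s = u"a\U0001D11Eb";
    assert(s.size() == 4);                  // per-code-unit length
    assert(CountCodePointsUtf16(s) == 3);   // per-code-point length
    assert(s[1] == 0xD834);                 // index 1: high surrogate
    assert(s[2] == 0xDD1E);                 // index 2: low surrogate
}
```

With per-code-unit indexing, positions 1 and 2 each hold half of one character, which is the behaviour referred to above as the lack of surrogate support.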
JT> > JT> > and OS X (in contrast to UCS4) with an identical interface in
JT> > JT> > either case.
JT> > JT>
JT> > JT> Why is utf-8 being proposed as the standard on wxmac?
JT> >
JT> > For the same reasons as with wxGTK: it's used internally by the system.
JT>
JT> Used "internally"? I guess what you mean is that it's used by /
JT> available via an API.
I'm not a Mac expert but people who are seem to have arrived at the
conclusion that OS X does use mostly UTF-8 in the GUI layer.
JT> > It is much more rare to modify the string character by character than to
JT> > examine it in this way.
JT>
JT> Huh? Even assuming it is, so what?
So most of the performance problems will be fixed by caching. I never
said that all problems will be magically fixed, if I could claim that it is
possible to implement O(1) indexing for UTF-8 strings I'd work in
marketing. But if the problem rarely arises in practice, it's good enough
for me.
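As a rough illustration of why caching helps read-mostly access, here is a minimal sketch in standard C++11. It is an assumption-laden toy and not wxString's actual implementation: the class name Utf8Index and its members are hypothetical, the input is assumed to be well-formed UTF-8, and only a single last position is remembered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a code point index to a byte offset in a UTF-8
// string, remembering the last lookup so that forward scans are cheap.
class Utf8Index
{
public:
    explicit Utf8Index(const std::string& utf8) : m_str(utf8) {}

    // Byte offset of the code point with the given 0-based index.
    std::size_t ByteOffset(std::size_t index)
    {
        std::size_t cur = 0, pos = 0;
        if (index >= m_lastIndex) {   // resume from the cached position
            cur = m_lastIndex;
            pos = m_lastOffset;
        }                             // otherwise rescan from the start
        while (cur < index) {
            assert(pos < m_str.size());
            pos += SeqLength(static_cast<unsigned char>(m_str[pos]));
            ++cur;
        }
        m_lastIndex = index;
        m_lastOffset = pos;
        return pos;
    }

private:
    // Number of bytes in the sequence introduced by this lead byte.
    static std::size_t SeqLength(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    std::string m_str;                // private copy, never modified here
    std::size_t m_lastIndex = 0;      // cached code point index
    std::size_t m_lastOffset = 0;     // byte offset of that code point
};
```

A loop that calls ByteOffset(0), ByteOffset(1), ... advances one character per call instead of rescanning from the beginning. Any modification of the string would have to invalidate the cache, which is why it matters that modification is rarer than examination.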
JT> > First of all, it is done automatically by wx: std::wstring is implicitly
JT> > convertible to wxString.
JT>
JT> That's good (for the most / better part --- I've had to change some of
JT> my code due to the automatic conversion constructors of wxString, but
JT> I don't mind doing that as long as it's not excessive).
JT>
JT> What about the reverse direction, from wxString to std::wstring?
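What about it? It works, if this is what you mean. Try compiling this:

    #include <string>
    #include "wx/crt.h"      // wxPuts()
    #include "wx/init.h"     // wxInitializer
    #include "wx/string.h"

    int main(int argc, char **argv)
    {
        wxInitializer init;  // initializes wx for the lifetime of main()
        std::string s("world");
        // std::string can be passed to Format() and the resulting wxString
        // converts implicitly to std::wstring.
        std::wstring ws(wxString::Format("hello, %s!", s));
        wxPuts(ws);          // std::wstring is accepted where wx expects a string
        return 0;
    }
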
If you don't find this seamless interoperation of all kinds of strings
convenient, I don't know what we have to do in order to impress you.
JT> > Second, if you count UTF-8 conversion now why
JT> > didn't you count it before -- just because it happened inside wxWidgets and
JT> > not at the boundary of wxWidgets API?
JT>
JT> I was counting it,
Then you need to explain to me where the extra conversion comes from...
JT> and also taking the assumption that the big wxString is going to a
JT> utf-8 consumer. Let's consider a big wxString being passed to a
JT> wxHtmlWindow.
JT>
JT> Before: conversion from wchar_t to utf-8 is done somewhere within wx.
But where did the original string come from? If it's large, it probably
was read from a file. And the file was almost surely in UTF-8 and not in
UTF-32. I.e. you still forgot one conversion.
JT> Now: conversion from utf-8 to wchar_t for work, then conversion back
JT> to utf-8 for passing to wxHtmlWindow. Two conversions.
If you somehow had your string in wchar_t from the start before, surely
you still have it in wchar_t now. I.e. there is no first UTF-8 to wchar_t
conversion.
Of course, realistically there were 2 conversions before and 2 now.
JT> > Third, obviously you're not forced to
JT> > use std::wstring for short strings and normally you won't do it.
JT>
JT> Then where do the string algorithms come from? I don't want to write
JT> and maintain two sets of string algorithms.
The algorithms in application-specific code are usually specific to
something that the application does (sorry for the tautology, but I'm trying
to be complete here). And this application-specific thing either can be done
with long strings or can't. If it can, you do need to (re)write your code
using iterators (if access to the string is read-only) or convert the string
to wstring and operate on it if you really, really need to modify it in
place (personally I never do this but YMMV). But I don't see any
circumstances in which you'd have to have 2 versions of the same algorithm.
Do you have any realistic examples?
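For what it's worth, here is a minimal sketch in standard C++ of a read-only algorithm written once against iterators. CountWords and WordsDemo are hypothetical names used only for illustration. The sketch is exercised with std::string and std::wstring only; that a wxString const_iterator range could be passed in the same way is an assumption and is not tested here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts whitespace-separated words using only forward, read-only iteration,
// so the same code serves any string type exposing such iterators.
template <typename Iter>
std::size_t CountWords(Iter first, Iter last)
{
    std::size_t words = 0;
    bool inWord = false;
    for (; first != last; ++first) {
        const bool isSpace =
            (*first == ' ' || *first == '\t' || *first == '\n');
        if (!isSpace && !inWord)
            ++words;              // first character of a new word
        inWord = !isSpace;
    }
    return words;
}

// One algorithm, two string types.
void WordsDemo()
{
    const std::string narrow = "hello  wide world";
    const std::wstring wide = L"hello  wide world";
    assert(CountWords(narrow.begin(), narrow.end()) == 3);
    assert(CountWords(wide.begin(), wide.end()) == 3);
}
```

Because the function never indexes into the string, it does not depend on how cheaply the underlying representation supports operator[].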
JT> > as you have repeatedly stated it and so I don't have much hope of changing your
JT> > opinion.
JT>
JT> Believe it or not, I'm quite open to changing my mind. If utf-8
JT> support is forced on us:
It's not forced on you. I've written a dozen times already that you're
perfectly free to use the wchar_t build but you just seem to ignore it. I
won't correct this "forced" misconception again, but that doesn't mean I
agree with it when you write it the next time.
JT> - Hopefully not many wx users and developers will leave and the
JT> project won't fork.
So far I haven't noticed a mass exodus of developers.
VZ