[wx-dev] RFC: UTF-8 build mode for wx
Armel Asselin
asselin.armel at wanadoo.fr
Tue Feb 13 10:57:04 PST 2007
> Armel Asselin wrote:
> > I am talking about the design of UTF-8 itself (the inner bits...); you
> > may have missed the part explaining at length why it is designed that
> > way: UTF-8 is designed to encode Unicode (more precisely UCS-4 code
> > points, i.e. ISO/IEC 10646 codes) in a manner which is compatible
> > with ASCII-oriented programs, and all the bits in a UTF-8 sequence
> > were cleverly chosen to get this result.
>
> I am well aware of that, but I don't see what's the relevance of it
> for us.
Up to now, wx code has been rather ASCII-oriented in terms of parsing,
and UTF-8 is designed for exactly that: "don't touch anything and it
will continue to work as before" (no more, no less).
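To make that concrete, here is a minimal sketch (plain std::string,
not wx code; split_ascii is a hypothetical helper name). It splits on
an ASCII delimiter byte by byte and never decodes anything. It stays
correct on UTF-8 input because every byte of a multi-byte sequence has
its high bit set, so it can never be mistaken for an ASCII delimiter:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a UTF-8 encoded string on an ASCII delimiter, one byte at a time.
// Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they can never
// compare equal to an ASCII delimiter and sequences are never cut in half.
std::vector<std::string> split_ascii(const std::string& utf8, char delim)
{
    // Precondition: the delimiter must be plain ASCII.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> parts;
    std::string current;
    for (char c : utf8) {
        if (c == delim) {
            parts.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    parts.push_back(current); // last (possibly empty) field
    return parts;
}
```

The same loop written years ago for Latin-1 or ASCII data would not
need a single change.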
> > yes. to be honest, in typical user code you almost never iterate over
> > strings, you just use strings that came from somewhere in the
> > framework (UI, files, db, tokenizer...). And there, you mostly just
> > want the ASCII simplicity. Almost no code in wxWidgets itself
> > really needs to handle UTF-8 as Unicode (I mean, to really exploit
> > the code points individually to do something clever with them).
>
> That's because there's not much code processing text in wx to begin
> with, but e.g. wxStringTokenizer needs it.
Yes, iterators will probably be put to good use there.
> > > Worst case, yes. But a simple optimization would make it behave
> > > much better in the typical use case: if we just remember the last
> > > index used and its corresponding position in UTF-8 encoded char*,
> > > then the typical for loop over all characters would still be
> > > O(1). At the cost of higher memory usage per-string, so we may
> > > prefer to not do it (or not).
> >
> > keeping a cache in an object is a no-no, unless it is extremely
> > heavy (connection pool, hard disk...) or seldom used from a
> > synchronization point of view. what about multi-thread programs?
> > two threads would then not be allowed to _read_ a string.
>
> Note that wxString is not MT-safe anyway and while reading data from a
> string by two threads at the same time happens to work as long as
> nobody is writing to the string, e.g. copying the string -- another
> common read-only access -- doesn't.
I proposed a patch for that some time ago (wx/atomic.h); I have it
applied here, and wxString is thread-safe here.
> Anyway, this too can be solved e.g. using TLS and out-of-wxString
> storage of the cache, the point I was making was that a) O(n^2) is
> not necessarily the typical cost for UTF-8 and b) optimizing it is a
> matter of trade-offs.
a) Yes, but with an O(n) operator[] and length, quadratic behaviour will
become the rule in every place where nobody has time to improve the code.
If you keep operator[] and length constant-time, you only have to
improve the code which actually needs it, so you focus on a few places.
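A rough sketch of the difference, again on plain std::string and with
hypothetical helper names (cp_length, cp_offset and the two counting
functions are mine, not wx API). The first loop is what existing
index-based code turns into when indexes mean code points; the second
is the same loop with byte indexes:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A byte starts a code point unless it is a 10xxxxxx continuation byte.
inline bool is_lead_byte(unsigned char b) { return (b & 0xC0) != 0x80; }

// O(n): number of code points in a (valid) UTF-8 string.
std::size_t cp_length(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if (is_lead_byte(b)) ++n;
    return n;
}

// O(n): byte offset of the index-th code point (s.size() for the end).
std::size_t cp_offset(const std::string& s, std::size_t index)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_lead_byte(static_cast<unsigned char>(s[i]))) {
            if (seen == index) return i;
            ++seen;
        }
    }
    assert(seen == index); // only index == length may reach this point
    return s.size();
}

// Code point indexes: an O(n) lookup done n times, so O(n^2) overall.
std::size_t count_spaces_indexed(const std::string& s)
{
    std::size_t spaces = 0;
    const std::size_t len = cp_length(s);
    for (std::size_t i = 0; i < len; ++i)
        if (s[cp_offset(s, i)] == ' ') ++spaces;
    return spaces;
}

// Byte indexes: O(1) per access, O(n) overall, same result for ASCII tests.
std::size_t count_spaces_bytes(const std::string& s)
{
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == ' ') ++spaces;
    return spaces;
}
```

Both functions return the same count; only the cost differs, and
nobody had to touch the second one.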
> > > > - the positions, lengths... must be in bytes (what i call 'base
> > > > coding entity' in local jargon, which is uint8 for UTF8, uint16
> > > > for UTF16/UCS2, uint32 for UTF32/UCS4) so as to keep O(1) time
> > > > when using them.
> > > > - inserting/prepending/concatenating must work with
> > > > code points (so uint32, wxU32Char?), *iterator as well.
> > >
> > > Pick one, you can't realistically have both of these. If we
> > > started using two different kinds of indexes in wxString, now
> > > *that* would cause some Unicode confusion!
> >
> > which two different kinds? everywhere I talk only about byte
> > positions.
>
> Ah, so you're proposing that you could insert a code point (or valid
> string) in the middle of UTF-8 sequence, thus breaking the string, as
> opposed to being able to insert only between code points? That
> doesn't sound like sane API to me.
In fact it is sane, because you insert at a known position. You don't
insert at random places, do you?
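Here is a small sketch of what I mean, with hypothetical names
(encode_utf8, is_boundary, insert_code_point) on plain std::string.
The position is a byte offset, the inserted value is a code point, and
the offset comes from a search, so it is on a code point boundary by
construction. The assert is only there to catch a caller who really
does pass a random offset:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one code point (assumed valid, <= 0x10FFFF) as UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// A byte offset is a boundary if it is the end of the string or does
// not point at a 10xxxxxx continuation byte.
bool is_boundary(const std::string& s, std::size_t pos)
{
    if (pos == s.size()) return true;
    return pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
}

// Insert a code point at a byte offset obtained from find() or similar.
void insert_code_point(std::string& s, std::size_t byte_pos, std::uint32_t cp)
{
    assert(is_boundary(s, byte_pos)); // guards against "random" offsets
    s.insert(byte_pos, encode_utf8(cp));
}
```

Typical use: pos = s.find(' ') followed by insert_code_point(s, pos,
0xE9). The offset returned by find() is in bytes and costs nothing to
reuse.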
> > > > - as the Unicode standard explains also, a glyph must always be
> > > > represented as a string (char */wxString),
> > >
> > > I think you meant "abstract character", not "glyph", here. But if
> > > your app needs to process string at this level, it's going to do
> > > context specific things and wx can't help with it anyway.
> >
> > not sure what you call "abstract character", a glyph "é" can be
> > represented by two code points "e + diacritic acute" which is
>
> The thing Unicode standard calls that, i.e. what you call "glyph"
> here.
ok
Regards
Armel