[wx-dev] UTF-8 development plans

Vadim Zeitlin vadim at wxwindows.org
Mon Mar 5 15:40:25 PST 2007


On Mon, 05 Mar 2007 17:52:34 +0000 Julian Smart <julian at anthemion.co.uk> wrote:

JS> >  I've updated
JS> >
JS> > http://www.wxwidgets.org/wiki/index.php/Development:_UTF-8_Support
JS> >   
JS> Many thanks for the notes, and best of luck with the implementation!

 Thanks!

JS> A few questions about the Plan:
JS> 
JS> (1) Will we be able to compile under Unix using old-style Unicode, as a 
JS> fallback for legacy apps? Or would that be too hard to maintain?

 I'm not sure about this. It should definitely be possible to keep
supporting the current Unicode build under Unix, but it likely won't be
"free", i.e. it will cost us some additional effort, and it's not really
obvious to me that the gain is worth it. I guess it all depends on how
(in)compatible the new UTF-8 build turns out to be with existing code using
the Unicode build of wx. We hope that there will be relatively few
incompatibilities, but clearly if our hopes turn out to be wrong, we'd need
to continue supporting the Unicode build, at least for compatibility.

JS> Similarly, could we compile in UTF-8 mode on Windows so we can
JS> replicate and debug UTF-8-related bugs on that platform?

 I hadn't thought about this at all, but maybe it would indeed make sense to
do this for testing. OTOH this won't come for free either, and the benefit
of supporting a UTF-8 build under Windows is even less clear, as it is hardly
useful when the CRT doesn't support UTF-8 locales (as is the case under
Windows).

JS> (2) What about code that uses a wxString to store arbitrary binary data? 

 We'll have some way to directly access the internal storage of wxString
anyhow (I'm tempted to call the function doing this wx_str(): even though
it already exists, it's marked as being for wx 1.6x compatibility
and so I think we can safely reuse it 10 years later). So it should be
possible to work with binary data in this way. It's less clear how this
data would find its way into wxString in the first place. Presumably via
wxStringBuffer?

JS> (3) Some string-manipulation code may look ahead or behind a few 
JS> characters, e.g. s[i+1]. I'm not sure if your optimizations will cope
JS> with that,

 If we know the offset of character #i in the string, we can find the
next one quickly, so if we do implement this optimization (personally I'd
wait for profiling results first) it should take care of this. And, of
course, it is always possible to rewrite the loop using iterators.
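
 To illustrate why a known offset makes s[i+1] or s[i-1] cheap, here is a
minimal sketch in plain standard C++. It is not wx code: Utf8SeqLen() and
Utf8IndexCache are hypothetical names invented for this example, and the
sketch assumes the string holds valid UTF-8. It remembers the byte offset of
the last character looked up and walks forwards or backwards from there
instead of rescanning from the start.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence whose lead byte is `b`.
// Assumes valid UTF-8; any unexpected byte is treated as a 1-byte sequence.
inline std::size_t Utf8SeqLen(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Hypothetical helper (not a wx class): maps character indices to byte
// offsets in a UTF-8 string, caching the last position that was looked up.
class Utf8IndexCache
{
public:
    explicit Utf8IndexCache(const std::string& utf8) : m_str(utf8) {}

    // Byte offset of character number `index`, which must not exceed the
    // number of characters in the string.
    std::size_t ByteOffset(std::size_t index)
    {
        // Walk forwards from the cached position, one character at a time.
        while (m_lastIndex < index)
        {
            assert(m_lastOffset < m_str.size());
            m_lastOffset +=
                Utf8SeqLen(static_cast<unsigned char>(m_str[m_lastOffset]));
            ++m_lastIndex;
        }

        // Walk backwards: skip continuation bytes (10xxxxxx) until the
        // previous lead byte is reached.
        while (m_lastIndex > index)
        {
            do
            {
                --m_lastOffset;
            }
            while (m_lastOffset > 0 &&
                   (static_cast<unsigned char>(m_str[m_lastOffset]) & 0xC0) == 0x80);
            --m_lastIndex;
        }

        return m_lastOffset;
    }

private:
    std::string m_str;             // private copy of the UTF-8 bytes
    std::size_t m_lastIndex = 0;   // character index of the cached position
    std::size_t m_lastOffset = 0;  // byte offset of the cached position
};
```

 With such a cache a loop that touches s[i], s[i+1] and s[i-1] only ever
moves a character or two from the cached position, so each lookup costs a
few bytes of scanning rather than a pass over the whole string. Whether this
is worth doing at all is, as said above, a question for the profiler.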

JS> in which case, how about having a simple wxString-like class
JS> that allows you to quickly adapt existing code, e.g. change:
JS> 
JS> wxString s(somestring);
JS> for (size_t i = 0; i < s.Length(); i++)
JS> {
JS>     wxChar ch = s[i];
JS>     ...
JS> }
JS> 
JS> to:
JS> 
JS> wxIndexedString s(somestring);
JS> for (size_t i = 0; i < s.Length(); i++)
JS> {
JS>     wxChar ch = s[i];
JS>     ...
JS> }

 Currently (the code is not checked in, of course, but it does exist) this
can be done using the same wxCStrData class which is returned by c_str(),
by extracting a wchar_t* pointer from it (which results in the allocation
of a temporary wchar_t buffer internally), so I hope we won't need yet
another string-related class just for this. But you're right, this should
be an easy way to optimize existing code without changing it too much.
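
 The general idea of converting once and then indexing a fixed-width buffer
can be sketched in plain standard C++ as follows. This is only an
illustration, not the wxCStrData implementation: DecodeUtf8() is a
hypothetical function, it uses char32_t rather than wchar_t so that it
behaves the same on every platform, and it assumes valid UTF-8 input with
no error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 string into one code point per
// element, so that indexing the result afterwards is O(1).
// Assumes the input is valid UTF-8; malformed bytes are not diagnosed.
std::u32string DecodeUtf8(const std::string& utf8)
{
    std::u32string out;
    out.reserve(utf8.size());  // upper bound: one code point per byte

    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);

        // Determine the sequence length and the payload bits of the lead byte.
        std::size_t len = 1;
        char32_t cp = b;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

 An existing index-based loop can then run over the decoded buffer
unchanged, paying for one conversion and one temporary allocation up front
instead of a scan on every s[i].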

 Thanks,
VZ




