[wx-dev] Improving wxString proposal (Re: #9672: wx 2.9 Ansi/Unicode Combi-wxString)

Hajo Kirchhoff mailinglists at hajo-kirchhoff.de
Mon Jul 7 00:28:39 PDT 2008


Hi Vadim,

I propose using three different string classes rather than rolling all 
character representations into one complex wxString class: wxCString, 
wxUString and wxUtf8String. The wxWidgets library will use wxUString only.

I think we can achieve the following:

+ wxWidgets uses Unicode internally: wchar_t strings on Windows/Mac and
UTF-8 on Linux. This reduces build complexity.

+ Application programmers can still choose their string representation.

+ The performance penalty for applications using ANSI should be a lot
smaller.

+ c_str() returns char* for ANSI applications and wchar_t* for Unicode
applications, like it always did. No changes on the application side
are necessary.

+ wxCString derives from std::string, wxUString can derive from 
std::wstring on Windows/Mac.

+ The wxString class becomes a lot less complex, easier to maintain.

+ Template argument deduction will work again, because there is no need 
for implicit conversion operators.

+ Template specializations for ctype<> and potentially others are no
longer necessary, because there is no need for wxUniChar.




IOW, we can have the present enhancements without breaking c_str(), 
template argument deduction and performance.
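
To illustrate the template argument deduction point, here is a minimal
sketch. ProxyString, DerivedString and Length are hypothetical names
made up for this example, not wxWidgets API. It assumes one string type
that only offers an implicit conversion operator and one that derives
from std::string, as proposed for wxCString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical string type that is NOT a std::basic_string and relies
// on an implicit conversion operator instead.
struct ProxyString {
    explicit ProxyString(const char* s) : data(s) {}
    operator std::string() const { return data; }
    std::string data;
};

// Hypothetical string type derived from std::string.
struct DerivedString : std::string {
    DerivedString(const char* s) : std::string(s) {}
};

// Generic function whose character type has to be deduced.
template <typename CharT>
std::size_t Length(const std::basic_string<CharT>& s) {
    return s.size();
}

std::size_t LengthOfDerived() {
    DerivedString s("hello");
    return Length(s);        // OK: deduction finds the std::string base
}

std::size_t LengthOfProxy() {
    ProxyString p("hello");
    // return Length(p);     // error: deduction ignores conversion operators
    return Length<char>(p);  // compiles only with an explicit argument
}
```

The commented-out call is the kind of code that stops compiling once a
string class depends on conversion operators.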

The idea rests on the following two assumptions:

The wxWidgets library uses one and only one wxString character
representation throughout. IOW: within the library, all characters
inside wxString are stored in the same format.

Each application uses one and only one wxString character
representation. IOW: within the entire application, all characters
inside wxString are stored in the same format.

Presently we force the wxWidgets character representation on the
application programmer. Since we don't know if the application still
expects the char* interface, we must support both. Hence the need for
implicit conversion to char* and for wxCStrData, and hence the
performance problems.

If we separate the string types, with one used by wxWidgets and the
other by the application, we can avoid these problems.

I propose that wxWidgets use wxUString, rather than wxString. All 
existing applications use wxString, so if wxWidgets starts using 
wxUString, we have effectively separated the two string domains.

For Unicode applications, wxString == wxUString and everything works
fine. For ANSI applications, wxString == wxCString. wxUString and
wxCString need conversions to and from each other.

> 
>  I'm rather sceptical because I don't see:
> 
> 1. Which of them is going to be called wxString
That depends on a macro that the user can set, such as UNICODE.

wxString will be used exclusively by the application. It will be a 
typedef to one of the three string classes.

#if defined(UNICODE)
typedef wxUString wxString;
#elif defined(ANSI)
typedef wxCString wxString;
#elif defined(UTF8)
typedef wxUtf8String wxString;
#else
// wxString is not used
#endif

> 2. Which of them will be accepted by wxWidgets functions
wxUString will be accepted by all wxWidgets functions.
Depending on the platform, wxUString will be a wchar_t or a UTF-8 string.

> 3. Which of them will be returned by wxWidgets functions
wxUString will be returned by wxWidgets functions.

>  And, as mentioned previously, I don't see what would the potential
> wxCharString and wxWCharString have that std::string and std::wstring have
> not.
It's not about wxCharString/wxWCharString but about giving the
application programmer back the choice of which string implementation
to use.

If my numbers are right, then the current string implementation hits
users with a very heavy performance penalty unless they upgrade their
application from char* to Unicode. There must be many applications out
there that are using wx 2.8 in an ANSI build. For all of these,
performance will go down when they upgrade to wx 2.9.

The present wxString tries to be all three strings at once. That way
you get Unicode within the wxWidgets library, but you force it on
application programmers as well.

If we could separate the string used by wxWidgets library and by the 
applications, the library could use Unicode while the application 
programmers could continue using ANSI. The performance hit would be a 
lot less since the character conversion would happen only at the seams 
between the library and the application. Currently the character 
conversion will happen inside the application and without user control.

I also expect better performance from a wxUString than from the current 
implementation, because a wxUString will probably be less complex.

A few examples:

wxWidgets interface:
long wxListCtrl::InsertColumn(long col, const wxUString &heading,
                              int format, int width);

The application:
void AddColumn(wxListCtrl &ctrl, int format, int width)
{
   wxString mydata;
   ctrl.InsertColumn(0, mydata, format, width);
}

wxUString is a Unicode string. It accepts a wxCString in its constructor.

If the user defines the UNICODE macro, wxString == wxUString and nothing
special happens. The library and the application use the same string
type.

If the user is using an ANSI build, then wxString == wxCString. The
compiler will create a temporary wxUString, construct it from the
wxCString (converting the characters) and pass a reference to the
temporary to the wxWidgets library. Conversion is done at the seam
between the wxWidgets library and the application.
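
A minimal sketch of that seam follows. CString, UString,
LibraryInsertColumn and g_conversions are hypothetical stand-ins
invented for this example, not the proposed classes themselves. The
conversion simply widens each char, which is an assumption that only
holds for 7-bit ASCII. A real implementation would have to honour the
current encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for wxCString.
struct CString : std::string {
    CString(const char* s) : std::string(s) {}
};

static int g_conversions = 0;   // counts narrow-to-wide conversions

// Hypothetical stand-in for wxUString.
struct UString : std::wstring {
    UString(const wchar_t* s) : std::wstring(s) {}
    // Converting constructor: this is the seam. Each char is widened
    // as is, which is only correct for ASCII (assumption).
    UString(const CString& s) : std::wstring(s.begin(), s.end()) {
        ++g_conversions;
    }
};

// Hypothetical library function that takes the library string type.
std::size_t LibraryInsertColumn(const UString& heading) {
    return heading.size();
}

// Application code in an "ANSI build": it only ever names CString.
std::size_t CallFromAnsiApplication() {
    CString heading("Name");
    // The compiler builds a temporary UString from heading here.
    return LibraryInsertColumn(heading);
}
```

The application never mentions UString, and exactly one conversion
happens per call into the library.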

wxUString wxListCtrl::GetItemText(long item);

void ReadItem(wxListCtrl &ctrl)
{
   wxString mydata = ctrl.GetItemText(10);
}

Same thing, only the other way round. GetItemText returns a wxUString.
wxString accepts a wxUString in its constructor. Nothing special
happens for UNICODE. For ANSI this will be wxCString(const wxUString &).
The text returned by GetItemText will be converted back to ANSI
characters and stored in mydata.

More difficult cases are virtual functions such as

virtual wxUString wxListCtrl::OnGetItemText(long item, long column);

If the application overrides this in UNICODE, then wxString==wxUString 
and nothing else is needed. If the application uses ANSI, then wxString 
== wxCString.

Application code:
wxString MyClass::OnGetItemText(long i, long c);

I see two solutions. The simple one is to document this case and do
nothing. A programmer using an ANSI build will get a compiler error,
because an overriding function must not differ from the base class
function in its return type alone. We document this case and say: "In
all cases where a virtual function returns a wxString and is overridden
by the application programmer, change wxString to wxUString".

The programmer needs to change this to
wxUString MyClass::OnGetItemText(long i, long c)
{
   wxString rc;
   rc.Printf("%ld", i);   // Printf() fills rc; Format() is static
   return rc;
}
Note that it is _not_ necessary to use wxUString inside the function.
All that has changed is the return type. Conversion between ANSI and
Unicode happens upon return.
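
The conversion on return can be sketched as follows. All names here
(CString, UString, ListCtrlBase, MyList, AskThroughBase) are
hypothetical stand-ins for this example only. The code uses C++11
`override`, and the conversion again assumes ASCII-only text.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for wxCString.
struct CString : std::string {
    CString(const char* s) : std::string(s) {}
};

// Hypothetical stand-in for wxUString with a converting constructor.
struct UString : std::wstring {
    UString() {}
    UString(const CString& s) : std::wstring(s.begin(), s.end()) {}
};

// Hypothetical library base class: the virtual returns UString.
struct ListCtrlBase {
    virtual ~ListCtrlBase() {}
    virtual UString OnGetItemText(long /*item*/, long /*column*/) const {
        return UString();
    }
};

// Application class in an "ANSI build". Only the return type names
// UString; the body keeps working with the narrow string type.
struct MyList : ListCtrlBase {
    UString OnGetItemText(long item, long /*column*/) const override {
        char buf[32];
        std::snprintf(buf, sizeof buf, "%ld", item);
        CString rc(buf);
        return rc;   // converted to UString on return
    }
};

// The library calls the override through the base class.
UString AskThroughBase(const ListCtrlBase& ctrl, long item) {
    return ctrl.OnGetItemText(item, 0);
}
```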

I think it is acceptable to ask programmers to do that. There is a good
explanation: wxWidgets uses Unicode strings now, which is clearly
visible with wxUString. It is a little like the c_str() situation,
except that it will happen a lot less often.

The only other solution I see would be to have an OnGetItemTextU and 
OnGetItemText, the one returning wxUString, the other wxCString. I won't 
go into detail as I don't really like it. But it could be made to work 
seamlessly for the application programmer using ifdefs.

The remaining cases are wxString* and wxString&.

wxString& is the harder case, but a search over all sources using 
"~(const):b+wxString:b+&" finds only 16 instances of a wxString&, 12 in 
xti, the other 4 in html.

I'd document these and do nothing.

wxString* could be replaced with a wxStringPtr class. This class accepts
wxCString* and wxUString*, remembers the type and uses conversions where
necessary. In most cases a wxString* will only be assigned to from
within the library.

Example:
bool wxCmdLineParser::Found(const wxString& name, wxString *value) const
{
[snip]
     wxCHECK_MSG( value, false,
                  _T("NULL pointer in wxCmdLineOption::Found") );
     *value = opt.GetStrVal();
     return true;
}

The wxStringPtr class overloads operator*() so that *value = wxUString 
will do the right thing even if value was created from a wxCString*.
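
A rough sketch of how such a pointer class could look. StringPtr,
Target, Narrow and Found are hypothetical names for this example, and
CString/UString are plain typedefs standing in for the proposed
classes. The wide-to-narrow conversion is assumed to be ASCII-only and
replaces anything else with '?'.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for wxCString and wxUString.
typedef std::string  CString;
typedef std::wstring UString;

// Lossy wide-to-narrow conversion, ASCII-only by assumption.
CString Narrow(const UString& w) {
    CString out;
    for (wchar_t ch : w)
        out += (ch < 128) ? static_cast<char>(ch) : '?';
    return out;
}

class StringPtr {
public:
    // Remember which kind of string the caller handed in.
    StringPtr(CString* p) : narrow_(p), wide_(nullptr) {}
    StringPtr(UString* p) : narrow_(nullptr), wide_(p) {}

    // Proxy returned by operator*(), so "*ptr = value" can convert.
    class Target {
    public:
        explicit Target(const StringPtr& p) : p_(p) {}
        void operator=(const UString& value) const {
            if (p_.wide_)        *p_.wide_ = value;
            else if (p_.narrow_) *p_.narrow_ = Narrow(value);
        }
    private:
        const StringPtr& p_;
    };

    Target operator*() const { return Target(*this); }
    explicit operator bool() const { return narrow_ || wide_; }

private:
    CString* narrow_;
    UString* wide_;
};

// Hypothetical library function written against StringPtr.
bool Found(StringPtr value) {
    if (!value) return false;
    *value = UString(L"out.txt");
    return true;
}

// Callers pass a plain pointer to whichever string type they use.
CString FoundIntoNarrow() { CString s; Found(&s); return s; }
UString FoundIntoWide()   { UString s; Found(&s); return s; }
```

The library body stays the same for both callers; only the proxy
decides whether a conversion is needed.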

Regarding performance:
Presently conversion is done inside the application every time the
application programmer uses a char*. The three-string-types approach
moves the conversion to the wxWidgets interface. Conversion will only be
done when a string crosses the boundary between the wxWidgets library
and the application. My assumption is that this happens less often, so I
expect that performance will generally increase. A further improvement
in performance should be possible because wxUString will be less complex
and require fewer checks.

Regards

Hajo


