[wx-dev] Improving wxString proposal (Re: #9672: wx 2.9
Ansi/Unicode Combi-wxString)
Hajo Kirchhoff
mailinglists at hajo-kirchhoff.de
Mon Jul 7 00:28:39 PDT 2008
Hi Vadim,
I propose using three different string classes rather than rolling all
character representations into one complex wxString class: wxCString,
wxUString and wxUtf8String. The wxWidgets library will use wxUString only.
I think we can achieve the following:
+ wxWidgets uses Unicode internally. wchar_t strings on Windows/Mac and
UTF8 on Linux. --> Reduces build complexity.
+ Application programmers can still choose their string representation.
+ Performance penalty for applications using ANSI should be a lot less.
+ c_str() returns char* for ANSI applications, wchar_t* for Unicode
applications, like it always did. No changes on the application side
necessary.
+ wxCString derives from std::string, wxUString can derive from
std::wstring on Windows/Mac.
+ The wxString class becomes a lot less complex, easier to maintain.
+ Template argument deduction will work again, because there is no need
for implicit conversion operators.
+ Template specializations for ctype<> and potentially others are no
longer necessary, because there is no need for wxUniChar.
IOW, we can have the present enhancements without breaking c_str(),
template argument deduction and performance.
The basis of the idea is the following assumption:
The wxWidgets library uses one and only one wxString character
representation in the entire library. IOW: Within the library, all
characters inside wxString are stored in the same format.
All applications use one and only one wxString character representation.
IOW: Within the entire application, all characters inside wxString are
stored in the same format.
Presently we force the wxWidgets character representation on the
application programmer. Since we don't know if the application still
expects the char* interface, we must support both. Hence the need for
implicit conversion to char*, wxCStrData and performance problems.
If we separate the string type, one used by wxWidgets, the other by the
application, we can avoid these problems.
I propose that wxWidgets use wxUString, rather than wxString. All
existing applications use wxString, so if wxWidgets starts using
wxUString, we have effectively separated the two string domains.
For Unicode applications, wxString == wxUString and everything works
fine. For Ansi applications, wxString == wxCString. wxUString and
wxCString need conversions for each other.
>
> I'm rather sceptical because I don't see:
>
> 1. Which of them is going to be called wxString
That depends on a macro that the user can set, such as UNICODE.
wxString will be used exclusively by the application. It will be a
typedef to one of the three string classes.
#if defined(UNICODE)
typedef wxUString wxString;
#elif defined(ANSI)
typedef wxCString wxString;
#elif defined(UTF8)
typedef wxUtf8String wxString;
#else
// wxString is not used
#endif
> 2. Which of them will be accepted by wxWidgets functions
wxUString will be accepted by all wxWidgets functions.
Depending on the platform, wxUString will be a wchar_t or utf8 string.
> 3. Which of them will be returned by wxWidgets functions
wxUString will be returned by wxWidgets functions.
> And, as mentioned previously, I don't see what would the potential
> wxCharString and wxWCharString have that std::string and std::wstring have
> not.
It's not about wxCharString/WCharString but about giving the application
programmer back the choice of which string implementation to use.
If my numbers are right, then the current string implementation hits
users with a very heavy performance penalty unless they upgrade their
application from char* to unicode. There must be many applications out
there that are using wx2.8 in ANSI build. For all these, performance
will go down when they upgrade to wx2.9.
The present wxString tries to be all three strings at once. That way you
get Unicode within the wxWidgets library, but you force it on application
programmers as well.
If we could separate the string used by wxWidgets library and by the
applications, the library could use Unicode while the application
programmers could continue using ANSI. The performance hit would be a
lot less since the character conversion would happen only at the seams
between the library and the application. Currently the character
conversion will happen inside the application and without user control.
I also expect better performance from a wxUString than from the current
implementation, because a wxUString will probably be less complex.
A few examples:
wxWidgets interface:
long wxListCtrl::InsertColumn(long col, const wxUString &heading, int
format, int width);
The application:
void foo()
{
wxString mydata;
wxListCtrl ctrl;
ctrl.InsertColumn(0, mydata, format, width);
}
wxUString is a Unicode string. It accepts a wxCString in its constructor.
If the user defines the UNICODE macro, wxString == wxUString and nothing
special happens. The library and the application use the same string type.
If the user is using an ANSI build, then wxString == wxCString. The
compiler will create a temporary wxUString, construct it from the
wxCString (converting the characters) and pass a reference to the
temporary to the wxWidgets library. Conversion is done at the seam
between the wxWidgets library and the application.
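A minimal sketch of that seam, not an implementation of the proposal. CString, UString and headingLength are hypothetical stand-ins (not wxWidgets types or functions), and the conversion is assumed to be a plain byte-to-wchar_t widening, which is only correct for 7-bit ASCII; a real wxUString would need a proper locale-aware conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the proposed wxCString (ANSI application string).
struct CString : std::string {
    using std::string::string;
};

// Hypothetical stand-in for the proposed wxUString (library string).
struct UString : std::wstring {
    using std::wstring::wstring;
    UString() = default;
    // Converting constructor: the only place where ANSI text becomes wide text.
    // Assumption: naive widening of each byte (7-bit ASCII only).
    UString(const CString& s) {
        for (unsigned char ch : s) push_back(static_cast<wchar_t>(ch));
    }
};

// Hypothetical "library" function: it only ever sees UString.
std::size_t headingLength(const UString& heading) { return heading.size(); }

// "Application" code in an ANSI build, where wxString would be CString.
std::size_t callFromAnsiApplication() {
    CString mydata("Name");
    // The compiler builds a temporary UString from mydata and binds it to
    // the const reference parameter; the conversion happens at this call.
    return headingLength(mydata);
}
```

In a Unicode build the same call would pass a UString directly and no temporary would be created.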
wxUString wxListCtrl::GetItemText(long item);
void foo()
{
wxString mydata=ctrl.GetItemText(10);
}
Same thing only the other way round. GetItemText returns a wxUString.
wxString accepts wxUString in its constructor. Nothing special for
UNICODE. For Ansi this will be wxCString(const wxUString &). The text
returned by GetItemText will be converted back to ANSI characters and
stored in mydata.
More difficult cases are
virtual wxUString wxListCtrl::OnGetItemText(long item, long column);
If the application overrides this in UNICODE, then wxString==wxUString
and nothing else is needed. If the application uses ANSI, then wxString
== wxCString.
Application code:
wxString MyClass::OnGetItemText(long i, long c);
I see two solutions. The simple one is to document this case and do
nothing. A programmer using an ANSI build will get a compiler error along
the lines of "overriding virtual function differs only in return type".
We document this case and say: "In all cases where a virtual function
returns a wxString and is overridden by the application programmer,
change wxString to wxUString".
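The programmer needs to change this to
wxUString MyClass::OnGetItemText(long i, long c)
{
wxString rc;
rc.Printf("%ld", i); // Printf() modifies rc; Format() is static and would not
return rc; // converted from wxString to wxUString here
}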
Note that it is _not_ necessary to use wxUString inside the function.
All that has changed is the return type. Conversion between ANSI and
Unicode happens upon return.
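The conversion on return can be sketched as follows. This is an illustration under assumptions, not the proposed implementation: CString, UString, ListCtrlBase, MyList and textForItem are hypothetical names, and the conversion is again a naive ASCII-only widening.

```cpp
#include <string>

// Hypothetical stand-ins for the proposed wxCString and wxUString.
struct CString : std::string {
    using std::string::string;
};

struct UString : std::wstring {
    using std::wstring::wstring;
    UString() = default;
    // Assumption: naive widening, correct for 7-bit ASCII only.
    UString(const CString& s) {
        for (unsigned char ch : s) push_back(static_cast<wchar_t>(ch));
    }
};

// Hypothetical library base class: the virtual returns the library string.
struct ListCtrlBase {
    virtual ~ListCtrlBase() = default;
    virtual UString OnGetItemText(long item, long column) = 0;
};

// Application class in an ANSI build. Only the return type mentions UString.
struct MyList : ListCtrlBase {
    UString OnGetItemText(long item, long /*column*/) override {
        CString rc;                       // the body keeps working in ANSI
        rc.append(std::to_string(item));
        return rc;                        // CString -> UString happens here
    }
};

// Hypothetical helper that calls through the base class, as the library would.
std::wstring textForItem(long item) {
    MyList list;
    ListCtrlBase& base = list;
    return base.OnGetItemText(item, 0);
}
```

The override compiles because its return type matches the base class exactly; the ANSI string never leaves the function unconverted.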
I think it is acceptable to ask programmers to do that. There is a good
explanation: wxWidgets uses Unicode strings now, which is clearly
visible from the wxUString name. It is a little like the c_str()
situation, except that it will come up a lot less often.
The only other solution I see would be to have an OnGetItemTextU and
OnGetItemText, the one returning wxUString, the other wxCString. I won't
go into detail as I don't really like it. But it could be made to work
seamlessly for the application programmer using ifdefs.
The remaining cases are wxString* and wxString&.
wxString& is the harder case, but a search over all sources using
"~(const):b+wxString:b+&" finds only 16 instances of a wxString&, 12 in
xti, the other 4 in html.
I'd document these and do nothing.
wxString* could be replaced with a wxStringPtr class. This class accepts
wxCString* and wxUString*, remembers the type and uses conversions where
necessary. In most cases wxString* will only be assigned from within
the library.
Example:
bool wxCmdLineParser::Found(const wxString& name, wxString *value) const
{
[snip]
wxCHECK_MSG( value, false, _T("NULL pointer in wxCmdLineOption::Found") );
*value = opt.GetStrVal();
return true;
}
The wxStringPtr class overloads operator*() so that *value = wxUString
will do the right thing even if value was created from a wxCString*.
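A rough sketch of how such a pointer wrapper could work. Everything here is an assumption made for illustration: StringPtr, its nested Ref proxy, narrow() and found() are hypothetical names, and narrowing simply replaces every non-ASCII character with '?'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the proposed wxCString and wxUString.
struct CString : std::string { using std::string::string; };
struct UString : std::wstring { using std::wstring::wstring; };

// Assumption: keep 7-bit ASCII, replace everything else with '?'.
inline CString narrow(const UString& u) {
    CString out;
    for (wchar_t ch : u)
        out.push_back(static_cast<unsigned long>(ch) < 128
                          ? static_cast<char>(ch) : '?');
    return out;
}

// Remembers which kind of string it points to.
class StringPtr {
public:
    StringPtr(std::nullptr_t) : c_(nullptr), u_(nullptr) {}
    StringPtr(CString* c) : c_(c), u_(nullptr) {}
    StringPtr(UString* u) : c_(nullptr), u_(u) {}

    // Proxy returned by operator*: assignment converts only when needed.
    class Ref {
    public:
        explicit Ref(const StringPtr& p) : p_(p) {}
        Ref& operator=(const UString& value) {
            if (p_.u_) *p_.u_ = value;                // same type, plain copy
            else if (p_.c_) *p_.c_ = narrow(value);   // convert at the seam
            return *this;
        }
    private:
        const StringPtr& p_;
    };

    Ref operator*() const { return Ref(*this); }
    explicit operator bool() const { return c_ != nullptr || u_ != nullptr; }

private:
    CString* c_;
    UString* u_;
};

// Hypothetical library function in the style of wxCmdLineParser::Found.
bool found(StringPtr value) {
    if (!value) return false;
    *value = UString(L"input.txt");
    return true;
}

// Application side: either string type can be passed by address.
std::string foundAsAnsi() { CString s; found(&s); return s; }
std::wstring foundAsUnicode() { UString s; found(&s); return s; }
```

The library code writes `*value = ...` exactly as before; only the parameter type changes.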
Regarding performance:
Presently, conversion is done inside the application every time the
application programmer uses a char*. The three-string-types approach
moves the conversion to the wxWidgets interface. Conversion will only be
done when a string crosses the boundary between the wxWidgets library and
the application. My assumption is that this happens less often, so I
expect that performance will generally increase. A further improvement in
performance should be possible because wxUString will be less complex
and require fewer checks.
Regards
Hajo