Anyone have/know of a "universal" string class for C++Builder? - c++builder

Has anyone built a "universal" string class for C++Builder that manages all of the conversions to/from ASCII and Unicode?
I had a vision of a class that would accept AnsiString, UnicodeString, WideString, char*, wchar_t*, std::string, and variant values, and would provide any of those back out. AND the copy constructor has to do a deep copy, not just provide a pointer to the same buffer space (as AnsiString and UnicodeString do).
I figure someone else besides me must have to pass strings to both old interfaces that use char* and new ones that use (wide) strings. If you have built, or know of, something you're willing to share, please let me know. Most of the time it's not too big a deal, until I have to pass a map<std::string, std::string>, then it starts getting ugly.
We do not, and will not, support any internationalization whatsoever, so I don't need to worry about encoding. I just want a class that will return my little ASCII strings in whatever format makes the compiler happy... sanely.
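Roughly, I picture something like the sketch below: a little value class (I'm calling it UniStr here; the name and everything about it is hypothetical and untested) that keeps one UnicodeString inside and converts on the way in and out. The deep-copy requirement and the ambiguities that a pile of implicit conversions can create would obviously still need work:

// Sketch only, untested - names are invented for illustration.
#include <vcl.h>       // UnicodeString, AnsiString, WideString
#include <string>

class UniStr
{
private:
    UnicodeString FData;   // single internal representation
public:
    UniStr() {}
    UniStr(const AnsiString &s)    : FData(s) {}
    UniStr(const UnicodeString &s) : FData(s) {}
    UniStr(const WideString &s)    : FData(s) {}
    UniStr(const char *s)          : FData(s) {}
    UniStr(const wchar_t *s)       : FData(s) {}
    UniStr(const std::string &s)   : FData(s.c_str()) {}

    // Hand the value back out in whatever shape the call site needs.
    operator UnicodeString() const { return FData; }
    operator AnsiString()    const { return AnsiString(FData); }
    operator WideString()    const { return WideString(FData); }
    operator std::string()   const { return std::string(AnsiString(FData).c_str()); }
};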
UPDATE: to address the comments:
So, std::map<std::string, std::string> is ugly, because you can't do:
parammap[AnsiString(widekey).c_str()] = AnsiString(widevalue).c_str();
Oh no no no. You have to do this:
AnsiString akey = widekey;
AnsiString aval = widevalue;
parammap[akey.c_str()] = aval.c_str();
The person who originally wrote this code tried to keep it as port-friendly as possible, so he standardized on char* for all of the function calls he wrote (circa 2000, that wasn't a bad assumption). More than once I found myself converting everything to char* before realizing that the function immediately turned around and converted it back to wide. There are multiple interface layers, and it took me a while to figure out how it all fit together.
Add in some creative compiler bugs, where it would get confused, especially when pulling string values out of Variants. In some places, I had to do:
String wstr = passedvariant.AsType(varString);
String astr = wstr;
std::string key = astr.c_str();
Then life happened, we ended up starting the port over (for the 3rd time. Don't ask), and I finally got smart and wrapped the low-level library in a layer that does all of the conversions, and retooled the middle layers to deal in Strings, so the application layer can just use String except for that map. For the map<string, string>, I created a function to do the converting, so it was one line in a bunch of places instead of six (the three line conversion above for both key and value).
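For what it's worth, the helper is nothing clever - roughly this shape (illustrative, not the actual project code):

// Illustrative only - collapses the three-line AnsiString dance into one call.
#include <vcl.h>
#include <map>
#include <string>

typedef std::map<std::string, std::string> ParamMap;

void AddParam(ParamMap &params, const UnicodeString &key, const UnicodeString &value)
{
    AnsiString akey = key;    // narrow through AnsiString temporaries
    AnsiString aval = value;
    params[akey.c_str()] = aval.c_str();
}

// so call sites become: AddParam(parammap, widekey, widevalue);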
Lastly, I wasn't actually asking for anyone to make suggestions on how to make my code better. I was asking if anyone had or knew of a universal string class. Our code is the way it is for reasons, and I'm not rewriting all of it to make it prettier. I just wanted not to have to touch so many lines... again. It would have been so much nicer to have the compiler keep track of which format is needed and convert it.

Related

How to use AnsiString to store binary data?

I have a simple question.
I want to use AnsiString as a container for binary data. I mostly load such data from TMemoryStream or TFileStream and I save it back from AnsiString after some processing. Works fine, haven't found a problem with that.
But from what I've seen, using it like that sparks debates about using Sysutils::TBytes instead. Why? Sysutils::TBytes has far fewer useful methods for manipulating the data stored inside than, for example, AnsiString does. It is clearly a half-finished container compared to AnsiString.
Is conversion to a regular string the only problem I should care about, or is there some other reason why I should really use the less-than-adequate TBytes instead? I do not convert the AnsiString to other string types - that is what is cited as a possible problem elsewhere.
An example of how I load data:
AnsiString data;
boost::scoped_ptr<TFileStream> fs(new TFileStream(FileName, fmOpenRead | fmShareDenyWrite));
data.SetLength(fs->Size);
fs->Read(data.c_str(), fs->Size);
An example how I save data:
// fs wants void * so I have to use data.data() instead of data.c_str() here
fs->Write(data.data(), data.Length());
So it should be safe to store binary data, correct?
I want to use AnsiString as a container for binary data.
One word - DON'T! It will bite you someday. Use a more appropriate container, such as TBytes, TMemoryStream, std::vector<byte>, etc.
Works fine, haven't found a problem with that.
Consider yourself lucky. From C++Builder 2009 onwards, AnsiString is codepage-aware, and it WILL cause data conversions if you are not VERY careful when passing AnsiString around. Sooner or later, you are likely to slip up and it will risk corrupting your binary data.
But from what I've seen, using it like that sparks debates about using Sysutils::TBytes instead. Why?
Because it is an actual raw binary container meant specifically for raw bytes.
Sysutils::TBytes has far fewer useful methods for manipulating the data stored inside than, for example, AnsiString does.
You should not be manipulating binary data as text to begin with. And since you are using things like Boost and STL, you should consider using their binary containers instead. They have more functions available.
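For instance, your load/save example maps almost line for line onto a std::vector. This is only a sketch of the idea, not code compiled against your project, and fmCreate in the save path is an assumption about how you open the file for writing:

// Sketch only - same load/save pattern, but with a raw byte container.
#include <vcl.h>
#include <vector>
#include <boost/scoped_ptr.hpp>

std::vector<unsigned char> LoadBytes(const UnicodeString &FileName)
{
    boost::scoped_ptr<TFileStream> fs(
        new TFileStream(FileName, fmOpenRead | fmShareDenyWrite));
    std::vector<unsigned char> data(static_cast<size_t>(fs->Size));
    if (!data.empty())
        fs->Read(&data[0], data.size());
    return data;
}

void SaveBytes(const UnicodeString &FileName, const std::vector<unsigned char> &data)
{
    boost::scoped_ptr<TFileStream> fs(new TFileStream(FileName, fmCreate));
    if (!data.empty())
        fs->Write(&data[0], data.size());
}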
That being said, XE7 does introduce some new functions for manipulating Delphi-style dynamic arrays (like TBytes) including inserts, deletes, and concatenations:
String-Like Operations Supported on Dynamic Arrays
It does not look like those new functions made it into C++Builder's DynamicArray class (which TBytes is a typedef of), though.
It is clearly a half-finished container compared to AnsiString.
AnsiString is a container of text characters. Period. Always has been, always will be. People ABUSE it by taking advantage of the fact that sizeof(char)==sizeof(byte). That worked up to a point, but it has become dangerous in recent years to continue abusing it.
Is conversion to a regular string the only problem I should care about, or is there some other reason why I should really use the less-than-adequate TBytes instead?
That, and the fact that Embarcadero has been phasing out AnsiString since 2009. 8-bit strings are disabled in the mobile compilers; it is only a matter of time before the desktop compilers follow suit.
Why are you wanting to manipulate raw bytes as strings to begin with? Can you provide an example of something you can do with AnsiString that you cannot do with TBytes?
So it should be safe to store binary data, correct?
In your specific example, yes (and yes, you can use c_str() instead of data() when calling fs->Write()).

Replace characters in C string

Given this C string:
unsigned char *temp = (unsigned char *)[@"Hey, I am some usual CString" UTF8String];
How can I replace "usual" with "other" to get: "Hey, I am some other CString".
I cannot use NSString functions (replaceCharactersInRange/replaceOccurrencesOfString, etc.) for performance reasons. I have to keep it all at a low level, since the strings I'll be dealing with can exceed 5MB, and therefore the replacements (there will be a lot of them to do) take about 10 minutes on an iOS device.
Objective-C is just a thin layer over C.
If you need to work with native C strings, just go ahead and do it.
This question - What is the function to replace string in C? - seems to address your problem fairly well.
The C string returned by UTF8String is const. You can't safely change it by casting it to a non-const string and mutating the bytes, so the only way to do this is by creating a copy.
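If you do go the plain-C route, the copy-based approach looks roughly like this. It is only a sketch: it assumes the search and replacement strings are the same length (true for "usual" -> "other"); unequal lengths need a grow/shrink pass, and nothing here is tuned for your 5MB case:

/* Sketch: replace every occurrence of "from" with "to" in a fresh copy of src.
   Assumes strlen(from) == strlen(to). Caller frees the returned buffer. */
#include <stdlib.h>
#include <string.h>

char *replace_all_same_length(const char *src, const char *from, const char *to)
{
    size_t n = strlen(from);
    char *copy = strdup(src);              /* work on a mutable copy */
    if (copy == NULL || n == 0 || n != strlen(to))
        return copy;

    for (char *p = strstr(copy, from); p != NULL; p = strstr(p + n, from))
        memcpy(p, to, n);                  /* overwrite in place, same length */

    return copy;
}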
If you really have a reason to use an NSString as the source, it might be much faster to do the transformation on the original string.
If you want a better answer that helps speed up your specific case, you should provide some more information: how you create the original string, the number and size of the search/replacement strings, and so on.

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement as in some cases I load very large TXT files (containing ONLY Ansi characters). The disadvantage might be that I have to do lots and lots of typecasts when I make those libraries to interact with normal (unicode) libraries.
Are there any general guidelines to show when it is good to convert to Unicode and when to stick with ANSI?
The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once - for example, if you're loading data from an old version of your program you know it will be ANSI, so read it into a Unicode string there, and that's it. Ever after, it will be Unicode.
You should not need to change the type of your string variables - string pre-D2009 is ANSI, and in D2009 and later it is Unicode. Instead, follow the compiler warnings and watch which string methods you use - some still take an AnsiString parameter, and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer bytes/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding - see the sketch after this list); and, frankly, miscellaneous stuff. Search here on SO; most things have been asked before.
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
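As a concrete illustration of the "convert once, explicitly" point for text files (shown in C++Builder syntax since that's what I use; it's only a sketch, and 1252 is a stand-in for whatever code page your legacy files actually use):

// Sketch: read a legacy ANSI text file into Unicode exactly once, at the boundary.
#include <vcl.h>
#include <memory>

UnicodeString LoadLegacyText(const UnicodeString &FileName)
{
    // 1252 is a placeholder code page - substitute the one your old files really use.
    std::auto_ptr<TEncoding> ansi(TEncoding::GetEncoding(1252)); // GetEncoding() returns an instance we own
    std::auto_ptr<TStringList> lines(new TStringList());
    lines->LoadFromFile(FileName, ansi.get());   // decoded here, once
    return lines->Text;                          // it's Unicode everywhere after this point
}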
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
If you convert to Unicode, sets of characters (the if Ch in ['A'..'U'] syntax) only cover #0..#255, so the compiler will warn and suggest CharInSet instead. From your description of characters A to U, I would guess you want to use this syntax.
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.
You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.
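A rough illustration of that shape (C++Builder syntax, names invented; the Delphi version is analogous):

// Sketch: the unit stores single-byte text internally; callers only ever see string.
#include <vcl.h>

class TAnsiTextTable
{
private:
    AnsiString FData;                                // internal, one byte per char
public:
    void SetText(const UnicodeString &Value) { FData = AnsiString(Value); }   // convert once, on the way in
    UnicodeString GetText() const            { return UnicodeString(FData); } // convert once, on the way out
};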
In general, only use AnsiString if it is important that the chars are single bytes. Otherwise, the use of string ensures future compatibility with Unicode.
You need to check all libraries anyway, because in Delphi XE all Windows API functions are replaced by their Unicode analogues, etc. If you will never use Unicode, you would need to stick with Delphi 7.
Use AnsiString explicitly everywhere in this unit and then you'll get compiler warnings (which you should never ignore) for String-to-AnsiString conversions if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.
Stick with Ansi strings ONLY if you do not have the time to convert the code properly. The use of Ansi strings is really only for backward compatibility - to my knowledge C# does not have an equivalent to Ansi strings. Otherwise use the standard Unicode strings. If you have a look at my web-site, I have a whole string routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Unicode) and XE (Unicode) with only "string" interfaces and covers almost all of the conversion issues you might face.

Why does Delphi warn when assigning ShortString to string?

I'm converting some legacy code to Delphi 2010.
There are a fair number of old ShortStrings, like string[25]
Why does the assignment below:
var
  S: String;
  ShortS: String[25];
...
S := ShortS;
cause the compiler to generate this warning:
W1057 Implicit string cast from 'ShortString' to 'string'.
There's no data loss occurring here. In what circumstances would this warning be helpful information to me?
Thanks!
Tomw
It's because your code is implicitly converting a single-byte character string to a UnicodeString. It's warning you in case you might have overlooked it, since that can cause problems if you do it by mistake.
To make it go away, use an explicit conversion:
S := string(ShortS);
The ShortString type has not changed. It continues to be, in effect, an array of AnsiChar.
By assigning it to a string type, you are taking what is a group of AnsiChars (one byte) and putting it into a group of WideChars (two bytes). The compiler can do that just fine, and is smart enough not to lose data, but the warning is there to let you know that such a conversion has taken place.
The warning is very important because you may lose data. The conversion is done using the current Windows 8-bit character set, and some character sets do not define all values between 0 and 255, or are multi-byte character sets, and thus cannot convert all byte values.
The data loss can occur on a standard computer in a country with specific standard character sets, or on a computer in USA that has been set up for a different locale, because the user communicates a lot with people in other languages.
For instance, if the local code page is 932, the byte values 129 and 130 will both convert to the same value in the Unicode string.
In addition to this, the conversion involves a Windows API call, which is an expensive operation. If you do a lot of these, it can slow down your application.
It's safe (as long as you're using the ShortString for its intended purpose: to hold a string of characters, and not a collection of bytes, some of which may be 0), but it may have performance implications if you do it a lot. As far as I know, Delphi has to allocate memory for the new Unicode string, extract the characters from the ShortString into a null-terminated string (that's why it's important that it's a properly-formed string) and then call something like the Windows API MultiByteToWideChar() function. Not rocket science, but not a trivial operation either.
ShortStrings don't have a code page associated with them, AnsiStrings do (since D2009).
The conversion from ShortString to UnicodeString can only be done on the assumption that ShortStrings are encoded in the default ANSI encoding which is not a safe assumption.
I don't really know Delphi, but if I remember correctly, ShortStrings are essentially a sequence of characters on the stack, whereas a regular string (AnsiString) is actually a reference to a location on the heap. This may have different implications.
Here's a good article on the different string types:
http://www.codexterity.com/delphistrings.htm
I think there might also be a difference in terms of encoding but I'm not 100% sure.

Delphi - Problem With Set String and PAnsiChar and Other Strings not Displaying

I was getting advice from Rob Kennedy, and one of his suggestions that greatly increased the speed of an app I was working on was to use SetString and then load the result into the VCL component that displays it.
I'm using Delphi 2009, so now that PChar is Unicode, this:
SetString(OutputString, PChar(Output), OutputLength.Value);
edtString.Text := edtString.Text + OutputString;
works, and I changed it to PChar myself. But since the data being moved isn't always Unicode (in fact, it's usually ShortString data), here is what he actually gave me to use:
SetString(OutputString, PAnsiChar(Output), OutputLength.Value);
edtString.Text := edtString.Text + OutputString;
Nothing shows up, but when I check in the debugger, the text that normally appeared with the old approach (building it one char at a time) is there in the variable.
Oddly enough, this is not the first time I ran into this tonight. Trying to come up with another way, I took part of his advice, and instead of building into a VCL TCaption I built the text into a string variable and then copied it, but when I send it over, nothing is displayed. Once again, in the debugger the variable that the data is built in... has the data.
for I := 0 to OutputLength.Value - 1 do
begin
OutputString := OutputString + Char(OutputData^[I]);
end;
edtString.Text := OutputString;
The above does not work but the old slow way of doing it worked just fine....
for I := 0 to OutputLength.Value - 1 do
begin
edtString.Text := edtString.Text + Char(OutputData^[I]);
end;
I tried making the variable a ShortString, a String, and a TCaption, and nothing is displayed. What I also find interesting is that building my hex data from the same array into a rich edit is very fast, while doing it in an edit for the text data is very, very slow. Which is why I haven't bothered trying to change the code for the rich edit; it works super fast as it is.
Edit to add - I think I sort of found the problem, but I have no solution. If I edit the value in the debugger to remove anything that can't be displayed (which, with the old method, used to just not display... not fail), then what I have left is displayed. So if it's just a matter of getting rid of bytes that were turned into garbage characters, how can I fix that?
I basically have incoming raw data from a SCSI device that's being displayed hex-editor style. My original slow style of adding one char at a time successfully displayed strings and Unicode strings that did not have Unicode-specific characters in them. The faster methods, even if working, won't display ShortStrings one way, and the other way won't display UnicodeStrings that use characters outside 0-255. I really like and could use the speed boost, but if it means sacrificing the ability to read the string... then what's the point of the app?
EDIT3 - Alright, now that I've figured out that 0-31 are control chars and 32 and up are valid, I think I'm going to make an attempt to filter the chars and replace those that aren't valid with a '.', which is something I was planning on doing later anyway to emulate hex-editor style.
If there are any other suggestions I'd be glad to hear about them but otherwise I think I can craft a solution that's faster than the original and does what I need it to at the same time.
Some comments:
Your question is very unclear. What exactly do you want to do?
Your question reads terribly; please check your text with a spell checker.
The question you are referring to is this one: Delphi accessing data from dynamic array that is populated from an untyped pointer
Please give a complete code sample of your function like you did in your previous question, I want to know if you implemented Rob Kennedy's suggestion or the code you gave yourself in a following answer (let's hope not :) )
As far a I understand your question: You're sending a query to your SCSI device and you get an array of bytes back which you store in the variable OutputData. After that you want to show your data to the user. So your real question is: How to show an array of bytes to the user?
Log in as the same user and don't create an account for every new question. That way we can track your question history and can find out what you mean by 'getting advice'.
Some assumptions and suggestions if I'm right about the true meaning of your question:
Showing your data as a hex string won't give any problems
Showing your data in a normal memo field is the problem: although a Delphi string can contain any character, including 0 bytes, displaying them will not work as you expect. A TMemo, for example, will only show your data up to the first 0 byte. What you have to do (and you gave the answer yourself) is replace non-viewable characters with a dummy. After that you can show your data in a TMemo. Actually, all hex viewers do the same; characters that cannot be printed are shown as a dot.
I used PAnsiChar in my example for a reason. It looked like OutputLength was being measured in bytes, not characters, so I made sure to use a type whose length is always measured in bytes. You'll also notice that I showed the declaration of OutputString as an AnsiString.
Since the edit control stores Unicode, though, there will be a conversion between AnsiString and UnicodeString. That will take the system's current code page into account, but that's probably not what you want. You might want to declare the variable as a RawByteString instead. That won't have any code page associated with it, so there won't be any unexpected conversions.
Don't use strings for storing binary data. If you're building what amounts to a hex editor, then you're working with binary data. It's important to remember that. Even if your binary data happens to consist mostly of bytes that can be interpreted as text, you can't treat the data as text or you'll run into exactly the problems you're seeing — characters that don't appear as expected. If you get a bunch of bytes from your SCSI device, then store them in an array of bytes, not characters.
In hex editors, you'll notice that they always show the hexadecimal values of the bytes. They might show those bytes interpreted as characters, but that's secondary, and they generally only show the bytes that can represent ASCII characters; they don't try to get too fancy with the basic display. The good hex editors will offer to display the data interpreted as wide characters, too. This aids in debugging because the user can look at the same data in multiple ways. But they're just views of the data. They're not actually changing the binary contents of the data.
When you filter out non-viewable characters, you'll probably need to decide what to do with a couple of them, like #9 (Tab), #10 (LF), #11 (Vertical Tab), #12 (FF, or New Page), and #13 (CR).
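A minimal sketch of that kind of filtering (C++Builder syntax, untested; 32..126 as the printable range and '.' as the placeholder are just the usual hex-editor convention, and this turns #9, #10, #13 and friends into dots as well):

// Sketch: build the text column of a hex-editor style display from raw bytes.
#include <vcl.h>
#include <vector>

UnicodeString BytesToDisplayText(const std::vector<unsigned char> &data)
{
    UnicodeString s;
    s.SetLength(static_cast<int>(data.size()));
    for (int i = 0; i < static_cast<int>(data.size()); ++i)
    {
        unsigned char b = data[i];
        // printable ASCII passes through, everything else becomes a dot
        s[i + 1] = (b >= 32 && b <= 126) ? static_cast<wchar_t>(b) : L'.';   // UnicodeString is 1-based
    }
    return s;   // assign to edtString.Text (or a memo) in one shot
}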
