I've done my homework, and specifically:
1) Read the whole FastReport 4 manual. It mentions neither UTF8 nor Unicode support
2) Looked for an answer here on SO
3) Googled it around
If I set a Text field and fill it with Thai characters, they are perfectly printed, so FastReport CAN handle Unicode characters, at least it can print them.
If I try to "pass" a value using the callbacks provided by the frxUserDataSet, what I see is garbled, non-Unicode text. In particular, if I pass e.g. a string made of the same 10 Thai characters, I see the same "set" of 3 or 4 garbled characters repeated ten times, so I am sure the data is passed correctly, but FastReport probably has no way to know that it should be handled as Unicode.
The callback requires the data passed back to be of Variant type, so I guess casting it to any particular type is useless, because a Variant will accept any of them.
I forgot to mention that I get the strings from a MySQL DB where the data is stored as UTF8, and I do not even copy the data into a local variable: what I get from the DB goes straight into the Variant.
Is there a way to force FastReport to print the data received as Unicode?
Thank you
Yes, FastReport 4 with Delphi 7 supports UTF8 via frxUserDataSet.
Just for future reference:
1) You MUST set your DB (MySQL in my case) to use UTF8
2) You MUST set the character set in the component you use to access the DB to utf8 ("DAC for MySQL" in my case, where the property is called ConnectionCharacterSet)
3) In all the frxUserDataSet callbacks, before setting the "Value" variable, you MUST CONVERT whatever you have using the Utf8Decode Delphi system routine, like this:
Value := Utf8Decode(fReports.q1.FieldValueByFieldName('yourDBfield'));
where fReports is the form name, and q1 the component used to access the DB.
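For future readers, here is a minimal sketch of how such a callback can look as an event handler. It is an outline built from the pieces above (fReports, q1, FieldValueByFieldName), not a verified drop-in: the exact OnGetValue signature may differ between FastReport 4 builds.

procedure TfReports.frxUserDataSet1GetValue(const VarName: string;
  var Value: Variant);
begin
  // Utf8Decode turns the raw UTF-8 bytes from the DB layer into a
  // WideString, so FastReport receives real Unicode text.
  if VarName = 'yourDBfield' then
    Value := Utf8Decode(q1.FieldValueByFieldName('yourDBfield'));
end;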
I keep reading that using D7 and Unicode is almost impossible but, as long as you use XP and up, from what I am seeing it is merely harder. Unfortunately, I must use XP and D7 and cannot upgrade. But, as I said, I am quickly getting used to solving these problems, so in the future I hope to be able to give back some help in the same way everybody here has always helped me :)
Related
I've published an app, and I find some of the comments to be like this: РекамедÑ
I have googled a lot but cannot decode it so that the comment is not shown this way. This is the way it is stored in the database; it may be Cyrillic, but I could not decode it either. Any clue on how to understand this kind of comment?
These appear to be doubly encoded HTML entities. So, for example, & was turned into &amp;, and that was then again turned into &amp;amp;.
When decoding the data twice using this online tool (there are many others), the result is:
РекамедÑ
That could be Unicode data, e.g. UTF-8 in a non-western character set like Cyrillic or Arabic, that either:
1) was misinterpreted as single-byte input, or
2) was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.
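If you only need to recover the readable text from such doubly encoded values, decoding the HTML entities twice is the direct route. Here is a hedged sketch in Delphi, assuming a newer RTL that ships the System.NetEncoding unit (two passes of PHP's html_entity_decode with UTF-8 specified would do the same on the server):

uses
  System.NetEncoding;

function DecodeTwice(const S: string): string;
begin
  // Each Decode pass unwraps one layer of HTML-entity encoding.
  Result := TNetEncoding.HTML.Decode(TNetEncoding.HTML.Decode(S));
end;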
The Spring4D library has cryptography classes; however, I cannot get them to work as expected. I'm probably using them incorrectly, but the lack of any examples makes it difficult.
For example on the website https://quickhash.com/hash-sha256-online, I can hash the word "test" to generate the following hash:
9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
Using the Spring4D library, the following code produces a different hash:
CreateSHA256.ComputeHash('test').ToString;
results in:
9EFEA1AEAC9EDA04A892885A65FDAE0E6D9BE8C9FC96DA76D31B929262E12B1D
Upper/lower case aside, it is a different hash altogether. I know I must be doing something wrong, but again there are no examples of use, so I'm stuck on how to do this.
Hashing algorithms operate on binary data, typically represented using byte arrays.
Unfortunately, both of the resources you have used offer the ability to hash text. In order to hash text, you first need to convert from text to binary. To do so requires a choice of encoding. And neither method makes it clear what that choice is.
When I use this Delphi code:
LowerCase(CreateSHA256.ComputeHash(TEncoding.UTF8.GetBytes('test')).ToString)
I get the same hash as appears in your question.
I urge you never to attempt to encrypt/hash text and instead regard these operations as operating on binary. Always use an explicit encoding and then encrypt/hash the array of bytes that the encoding produced.
I've picked the UTF-8 encoding here because it is a full Unicode encoding, and it tends to be efficient in terms of space. However, I don't think your online encoder uses UTF-8. In fact, I've no idea what encoding it uses; the site is unclear on the matter. This is of course the same old issue of text being different from binary.
In my opinion it is a design flaw of the Delphi library that you use that it allows you to hash text without an explicit choice of encoding. If this library must offer a function that hashes text, then it should require the caller to supply an extra TEncoding parameter.
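As a hedged sketch of what such an API could look like, here is a small wrapper around the call shown above; HashText is an invented name, not part of Spring4D, and the Spring.Cryptography unit name is an assumption:

uses
  SysUtils, Spring.Cryptography;

function HashText(const Text: string; Encoding: TEncoding): string;
begin
  // The caller must state the encoding explicitly; the hash itself
  // only ever sees bytes.
  Result := LowerCase(CreateSHA256.ComputeHash(Encoding.GetBytes(Text)).ToString);
end;

With this, HashText('test', TEncoding.UTF8) reproduces the hash from the question.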
There is no conversion going on internally, so it hashes the UnicodeString, which uses at least 2 bytes per character.
If you want the same result as on the page, you have to use UTF8Encode or pass the text directly as an AnsiString.
However, I tried some strings that contained various Unicode characters, and the page returned a different result, so I am not quite sure how the strings are treated there. I guess it's a codepage thing.
Edit: If you use this page http://www.xorbin.com/tools/sha256-hash-calculator it generates the same hash as TSHA256 with UTF8Encode.
Which type of string are you using: AnsiString or UnicodeString? Delphi 2009 and newer use UnicodeString by default.
Why is the string type important? All hashing algorithms operate on raw byte data, so it matters whether each character of your string is stored in one byte of memory (AnsiString) or in multiple bytes (UnicodeString).
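A small illustration of that difference in Delphi 2009+, showing why the same text can feed different bytes, and therefore different hashes, into the algorithm:

var
  A: AnsiString;
  W: string;  // UnicodeString in Delphi 2009 and newer
begin
  A := 'test';
  W := 'test';
  Writeln(Length(A) * SizeOf(AnsiChar));  // 4 bytes of character data
  Writeln(Length(W) * SizeOf(Char));      // 8 bytes: different hash input
end;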
I have some data which has been imported into Postgres, for use in a Rails application. However somehow the foreign accents have become strangely encoded:
ä appears as â§
á appears as â°
é appears as â©
ó appears as ââ¥
I'm pretty sure the problem is with the integrity of the data, rather than any problem with Rails. It doesn't seem to match any encoding I try:
# Replace "cp1252" with any other encoding, to no effect
"Trollâ§ttan".encode("cp1252").force_encoding("UTF-8") #-> junk
If anyone was able to identify what kind of encoding mixup I'm suffering from, that would be great.
As a last resort, I may have to manually replace each corrupted accent character, but if anyone can suggest a programmatic solution (or even a starting point for fixing this; I've found it very hard to debug), I'd be very grateful.
It's hardly possible with recent versions of PostgreSQL to have invalid UTF8 inside a UTF8 database. There are other plausible possibilities that may lead to that output, though.
In the typical case of é appearing as Ã©, either:
1) The contents of the database are valid, but some client-side layer is interpreting the bytes from the database as if they were iso-latin-something, whereas they are UTF8.
2) The contents are valid and the SQL client-side layer is valid, but the terminal/software/webpage with which you're looking at this is configured for iso-latin1 or a similar single-byte encoding (win1252, iso-latin9...).
3) The contents of the database consist of the wrong characters with a valid UTF8 encoding. This is what you end up with if you take iso-latin-something bytes, convert them to their UTF8 representation, then take the resulting byte stream as if it was still in iso-latin, reconvert it once again to UTF8, and insert that into the database.
Note that while the Ã© sequence is typical of UTF8-versus-iso-latin confusion, the presence of an additional â in all your sample strings is uncommon. It may be the result of another misinterpretation on top of the primary one. If you're in case #3, that may mean that an automated fix based on search-and-replace will be harder than the normal case, which is already tricky.
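To make case #3 concrete, here is a hedged Delphi sketch of how the double conversion manufactures valid-but-wrong UTF8 (28591 is the code page number for ISO-8859-1):

uses
  SysUtils;

var
  Latin1: TEncoding;
  Bytes: TBytes;
  Garbled: string;
begin
  Latin1 := TEncoding.GetEncoding(28591);  // ISO-8859-1
  try
    // 'é' encoded as UTF-8 is the two bytes C3 A9.
    Bytes := TEncoding.UTF8.GetBytes('é');
    // Misreading those bytes as ISO-8859-1 yields 'Ã©'.
    Garbled := Latin1.GetString(Bytes);
    // Re-encoding Garbled as UTF-8 and inserting it produces a database
    // full of perfectly valid UTF-8 that spells the wrong characters.
    Writeln(Garbled);
  finally
    Latin1.Free;
  end;
end;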
I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places.
I want to convert all the other libraries to Unicode, but for this specific library I think it will be better to stick with ANSI. The advantage is the memory requirement, as in some cases I load very large TXT files (containing ONLY ANSI characters). The disadvantage might be that I have to do lots and lots of typecasts when I make this library interact with the normal (Unicode) libraries.
Are there any general guidelines on when it is good to convert to Unicode and when to stick with ANSI?
The problem with general guidelines is that something like this can be very specific to a person's situation. Your example here is one of those.
However, for people Googling and arriving here, some general guidelines are:
Yes, convert to Unicode. Don't try to keep an old app fully using AnsiStrings. The reason is that the whole VCL is Unicode, and you shouldn't try to mix the two, because you will convert every time you assign a Unicode string to an ANSI string, and that is a lossy conversion. Trying to keep the old way because it's less work (or some similar reason) will cause you pain; just embrace the new string type, convert, and go with it.
Instead of randomly mixing the two, explicitly perform any conversions you need to, once. For example, if you're loading data from an old version of your program, you know it will be ANSI, so read it into a Unicode string there, and that's it; ever after, it will be Unicode (see the sketch after this list).
You should not need to change the type of your string variables: string pre-D2009 is ANSI, and in D2009 and later it is Unicode. Instead, follow compiler warnings and watch which string methods you use; some still take an AnsiString parameter, and I find it all confusing. The compiler will tell you.
If you use strings to hold bytes (in other words, using them as an array of bytes because a character was a byte) switch to TBytes.
You may encounter specific problems for things like encryption (strings are no longer byte/characters, so 'character' for 'character' you may get different output); reading text files (use the stream classes and TEncoding); and, frankly, miscellaneous stuff. Search here on SO, most things have been asked before.
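As referenced in the second point above, here is a minimal sketch of doing the conversion once at the boundary, assuming a legacy file written in the system ANSI code page:

uses
  Classes, SysUtils;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    // Interpret the file as ANSI exactly once, here at the boundary;
    // the in-memory strings are Unicode from this point on.
    Lines.LoadFromFile('legacy.txt', TEncoding.ANSI);
  finally
    Lines.Free;
  end;
end;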
Commenters, please add more suggestions... I mostly use C++Builder, not Delphi, and there are probably quite a few specific things for Delphi I don't know about.
Now for your specific question: should you convert this library?
If:
The values between A and U are truly only ever in this range, and
These values represent characters (A really is A, not byte value 65 - if so, use TBytes), and
You load large text files and memory is a problem
then not converting to Unicode, and instead switching your strings to AnsiStrings, makes sense.
Be aware that:
There is an overhead every time you convert from ANSI to Unicode
You could use UTF8String, which is a specific type of AnsiString that will not be lossy when converted, and will still store most text (Roman characters) in a single byte
Changing all the instances of string to AnsiString could be a bit of work, and you will need to check all the methods called with them to see if too many implicit conversions are being performed (for performance), etc
You may need to change the outer layer of your library to use Unicode so that conversion code or ANSI/Unicode compiler warnings are not visible to users of your library
If you convert to Unicode, sets of characters (the if S[I] in ['A'..'U'] syntax) only cover #0..#255 and will trigger compiler warnings; see the sketch just below. From your description of characters A to U, I would guess you'd like to use this syntax.
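A sketch of the set issue from the last point: in Delphi 2009 and later the in operator reduces a WideChar to a byte char (with a warning), and CharInSet from SysUtils is the usual replacement when the characters of interest fit in #0..#255:

uses
  SysUtils;

var
  S: string;
  I: Integer;
begin
  S := 'ABCXYZ';
  for I := 1 to Length(S) do
    // Pre-Unicode code would write: if S[I] in ['A'..'U'] then ...
    // CharInSet avoids the "WideChar reduced to byte char" warning and
    // is correct for characters in the #0..#255 range.
    if CharInSet(S[I], ['A'..'U']) then
      Writeln(S[I]);
end;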
My recommendation? Personally, the only reason I would do this from the information you've given is the memory use, and possibly performance depending on what you're doing with this huge amount of A..Us. If that truly is significant, it's both the driver and the constraint, and you should convert to ANSI.
You should be able to wrap up the conversion at the interface between this unit and its clients. Use AnsiString internally and string everywhere else and you should be fine.
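A hedged sketch of that wrapping, with an invented CountMarkers routine standing in for whatever the library actually does:

unit TextLib;

interface

// The public API uses the default (Unicode) string type.
function CountMarkers(const Text: string): Integer;

implementation

// Internally, AnsiString halves the memory footprint of the large files.
function CountMarkersAnsi(const Text: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Text) do
    if Text[I] in ['A'..'U'] then
      Inc(Result);
end;

function CountMarkers(const Text: string): Integer;
begin
  // One explicit conversion at the boundary; safe here because the data
  // is known to contain only the characters A..U.
  Result := CountMarkersAnsi(AnsiString(Text));
end;

end.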
In general, only use AnsiString if it is important that the chars are single bytes; otherwise, the use of string ensures future compatibility with Unicode.
You need to check all libraries anyway, because in Delphi XE all the Windows API functions have been replaced by their Unicode analogues, etc. If you never want to deal with Unicode at all, you would have to stay with Delphi 7.
Use AnsiString explicitly everywhere in this unit and then you'll get compiler warnings (which you should never ignore) about String-to-AnsiString conversions if you happen to access the routines incorrectly.
Alternately, perhaps preferably depending on your situation, simply convert everything to UTF8.
Stick with ANSI strings ONLY if you do not have the time to convert the code properly. The use of ANSI strings is really only for backward compatibility; to my knowledge, C# does not have an equivalent to ANSI strings. Otherwise use the standard Unicode strings. If you have a look on my web site, I have a whole string-routines unit (about 5,000 LOC) that works with both Delphi 2007 (non-Unicode) and XE (Unicode) with only "string" interfaces and covers almost all of the conversion issues you might face.
I was getting advice from Rob Kennedy, and one of his suggestions that greatly increased the speed of an app I was working on was to use SetString and then load the result into the VCL component that displayed it.
I'm using Delphi 2009, so now that PChar is Unicode, this:
SetString(OutputString, PChar(Output), OutputLength.Value);
edtString.Text := edtString.Text + OutputString;
works, and I changed it to PChar myself. But the data being moved isn't always Unicode; in fact it's usually ShortString data... so on to what he actually gave me to use:
SetString(OutputString, PAnsiChar(Output), OutputLength.Value);
edtString.Text := edtString.Text + OutputString;
Nothing shows up, but when I check in the debugger, the text that used to appear when I built it one char at a time is right there in the variable.
Oddly enough, this is not the first time I've run into this tonight. While trying to come up with another way, I took part of his advice: instead of building the text into a VCL TCaption, I built it into a string variable and then copied it, but when I send it over, nothing is displayed. Once again, in the debugger, the variable the data is built in... has the data.
for I := 0 to OutputLength.Value - 1 do
begin
OutputString := OutputString + Char(OutputData^[I]);
end;
edtString.Text := OutputString;
The above does not work but the old slow way of doing it worked just fine....
for I := 0 to OutputLength.Value - 1 do
begin
edtString.Text := edtString.Text + Char(OutputData^[I]);
end;
I tried making the variable a ShortString, a String, and a TCaption, and nothing is displayed. What I also find interesting is that building my hex data from the same array into a rich edit is very fast, while building the text data inside an edit is very, very slow. That's why I haven't bothered changing the code for the rich edit; it works super fast as it is.
Edit to add: I think I've sort of found the problem, but I have no solution. If I edit the value in the debugger to remove anything that can't be displayed (which the old method simply did not display, rather than failing on), then what I have left is displayed. So, if it's just a matter of getting rid of bytes that were turned into garbage characters, how can I fix that?
I basically have incoming raw data from a SCSI device that's being displayed hex-editor style. My original slow style of adding one char at a time successfully displayed strings and Unicode strings that did not have Unicode-specific characters in them. The faster methods, even when they work, won't display ShortStrings one way, and the other way won't display UnicodeStrings that use characters outside 0-255. I really like and could use the speed boost, but if it means sacrificing the ability to read the string, then what's the point of the app?
EDIT3: Alright, now that I've figured out that 0-31 are control characters and 32 and up are valid, I'm going to attempt to filter the chars and replace the invalid ones with a '.', which is something I was planning to do later anyway to emulate hex-editor style.
If there are any other suggestions I'd be glad to hear about them but otherwise I think I can craft a solution that's faster than the original and does what I need it to at the same time.
Some comments:
Your question is very unclear. What exactly do you want to do?
Your question reads terribly; please check your text with a spell checker.
The question you are referring to is this one: Delphi accessing data from dynamic array that is populated from an untyped pointer
Please give a complete code sample of your function like you did in your previous question. I want to know whether you implemented Rob Kennedy's suggestion or the code you gave yourself in a following answer (let's hope not :) )
As far as I understand your question: you're sending a query to your SCSI device and you get an array of bytes back, which you store in the variable OutputData. After that you want to show the data to the user. So your real question is: how do you show an array of bytes to the user?
Log in as the same user and don't create an account for every new question. That way we can track your question history and find out what you mean by 'getting advice'.
Some assumptions and suggestions if I'm right about the true meaning of your question:
Showing your data as a hex string won't give any problems
Showing your data in a normal memo field gives you problems. Although a Delphi string can contain any character, including #0 bytes, displaying them will give you trouble: a TMemo, for example, will show your data only up to the first #0 byte. What you have to do (and you gave the answer yourself) is replace non-viewable characters with a dummy. After that you can show your data in a TMemo. Actually, all hex viewers do the same: characters that cannot be printed are shown as a dot.
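A minimal sketch of that replacement, assuming the raw bytes end up in a TBytes array (the name ToDisplayable is invented for illustration):

uses
  SysUtils;

function ToDisplayable(const Data: TBytes): string;
var
  I: Integer;
begin
  SetLength(Result, Length(Data));
  for I := 0 to High(Data) do
    if Data[I] >= 32 then
      // Printable range: keep the character.
      Result[I + 1] := Char(Data[I])
    else
      // Control characters (#0..#31) become dots, hex-viewer style.
      Result[I + 1] := '.';
end;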
I used PAnsiChar in my example for a reason. It looked like OutputLength was being measured in bytes, not characters, so I made sure to use a type whose length is always measured in bytes. You'll also notice that I showed the declaration of OutputString as an AnsiString.
Since the edit control stores Unicode, though, there will be a conversion between AnsiString and UnicodeString. That will take the system's current code page into account, but that's probably not what you want. You might want to declare the variable as a RawByteString instead. That won't have any code page associated with it, so there won't be any unexpected conversions.
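A hedged sketch of that declaration, reusing the names from the snippets above:

var
  OutputString: RawByteString;
begin
  // The length is measured in bytes, and no code page is attached to
  // the result, so no implicit conversion mangles the data later.
  SetString(OutputString, PAnsiChar(Output), OutputLength.Value);
end;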
Don't use strings for storing binary data. If you're building what amounts to a hex editor, then you're working with binary data. It's important to remember that. Even if your binary data happens to consist mostly of bytes that can be interpreted as text, you can't treat the data as text or you'll run into exactly the problems you're seeing — characters that don't appear as expected. If you get a bunch of bytes from your SCSI device, then store them in an array of bytes, not characters.
In hex editors, you'll notice that they always show the hexadecimal values of the bytes. They might show those bytes interpreted as characters, but that's secondary, and they generally only show the bytes that can represent ASCII characters; they don't try to get too fancy with the basic display. The good hex editors will offer to display the data interpreted as wide characters, too. This aids in debugging because the user can look at the same data in multiple ways. But they're just views of the data. They're not actually changing the binary contents of the data.
When you filter out non-viewable characters, you'll probably need to decide what to do with a few of them, like #9 (Tab), #10 (LF), #11 (Vertical Tab), #12 (FF, new page), and #13 (CR).