When I use UTF8ToAnsi on this string, the result is empty. Any idea why that might be?
msgid "2. Broughton, PMG. ^iJournal of Automatic Chemistry.^n ^lVol 6. No 2. (April – June 1984) pp 94-95."
This demonstrates the problem:
procedure TForm1.FormShow(Sender: TObject);
begin
  Memo1.Lines.Text :=
    '<<' +
    UTF8ToANSI('msgid "2. Broughton, PMG. ^iJournal of Automatic Chemistry.^n^lVol 6. No 2. (April – June 1984) pp 94-95."') +
    '>>';
end;
which produces
"<<>>"
Your code fails because what you pass to this function is not UTF-8 encoded; it is actually ANSI encoded. When Utf8Decode receives that text, it attempts to decode it, and when it encounters bytes that are not valid UTF-8, it bails out and returns the empty string.
The problem character is the dash in April – June 1984, which is an en dash. In ANSI that is encoded as #150. When you attempt to interpret that as UTF-8, that #150 is not a single-byte encoding of a character, and it is also invalid as the first byte of a multi-byte sequence. Hence the failure.
To solve your actual problem, you'll need to work out why you have data that is not UTF-8 in a place where you expect UTF-8.
Utf8ToAnsi returns an empty string if the input isn't valid UTF-8 (such as having an incomplete multibyte character or a malformed trailing byte). You can debug your program to discover what your string really contains. You evidently have a problem in the way you obtain your input string. Perhaps you're misinterpreting UTF-8, or perhaps you never really had UTF-8 in the first place.
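One way to confirm that diagnosis is to check the raw bytes before calling Utf8ToAnsi. The following is only a minimal sketch: the function name LooksLikeUtf8 is made up for illustration, and it checks just the lead-byte/continuation-byte structure (it does not reject overlong forms or encoded surrogates).

```pascal
// Hypothetical helper, not part of the RTL: returns True when every byte of S
// fits the UTF-8 lead-byte / continuation-byte pattern.
function LooksLikeUtf8(const S: AnsiString): Boolean;
var
  I, Trail: Integer;
  B: Byte;
begin
  Result := False;
  I := 1;
  while I <= Length(S) do
  begin
    B := Ord(S[I]);
    if B < $80 then
      Trail := 0                      // plain ASCII byte
    else if (B and $E0) = $C0 then
      Trail := 1                      // lead byte of a 2-byte sequence
    else if (B and $F0) = $E0 then
      Trail := 2                      // lead byte of a 3-byte sequence
    else if (B and $F8) = $F0 then
      Trail := 3                      // lead byte of a 4-byte sequence
    else
      Exit;                           // stray continuation byte (such as #150) or invalid lead byte
    Inc(I);
    while Trail > 0 do
    begin
      // each continuation byte must look like 10xxxxxx
      if (I > Length(S)) or ((Ord(S[I]) and $C0) <> $80) then
        Exit;
      Inc(I);
      Dec(Trail);
    end;
  end;
  Result := True;
end;
```

With a check like this, a caller could decide to convert only when the input passes, for example: if LooksLikeUtf8(Raw) then Text := Utf8ToAnsi(Raw) else Text := Raw;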
The dash that you use between April – June is not valid UTF-8 in the string you pass, so it cannot be decoded correctly. This is not immediately visible, but the symbol that you used here is not a normal minus sign; it is a different character.
Related
I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABC&#291;&#311;&#299;"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 256 entries, each holding a character specific to that codepage, except the first 128, which are the same for all the codepages.
So - How do I get a value of a char, that is in 1-255 range?
I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin unicode.
Is there any other method for returning char value? Any help appreciated
I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
WideStrings have the advantage that they are 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.
In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABCģķī" or "ABCģķī" instead. Most other charsets do not support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x123;&#x137;&#x12B;".
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
If you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
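The HTML 5 branch above can be sketched without a separate UTF-32 conversion by combining surrogate pairs while looping over the WideString. This is an illustrative sketch only: the function name WideToHtmlRefs is hypothetical, it treats every non-ASCII codepoint as needing a reference, and it does not escape markup characters such as & or <.

```pascal
// uses SysUtils (for IntToStr)

// Hypothetical helper: emits ASCII as-is and everything else as a decimal
// numeric character reference holding the Unicode codepoint.
function WideToHtmlRefs(const S: WideString): AnsiString;
var
  I, CodePoint: Integer;
  Hi, Lo: Word;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    Hi := Ord(S[I]);
    CodePoint := Hi;
    // A high surrogate followed by a low surrogate encodes one codepoint >= $10000
    if (Hi >= $D800) and (Hi <= $DBFF) and (I < Length(S)) then
    begin
      Lo := Ord(S[I + 1]);
      if (Lo >= $DC00) and (Lo <= $DFFF) then
      begin
        CodePoint := $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00);
        Inc(I); // consume the low surrogate as well
      end;
    end;
    if CodePoint < 128 then
      Result := Result + AnsiChar(CodePoint)
    else
      Result := Result + '&#' + IntToStr(CodePoint) + ';';
    Inc(I);
  end;
end;
```

For hex notation, IntToHex(CodePoint, 1) with a '&#x' prefix could be used in place of IntToStr.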
In case I understood the OP correctly, I'll just leave this here.
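function Entitties(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    // Anything above the AnsiChar range (255) becomes a decimal numeric
    // character reference; a reference must start with '&#'
    if Word(S[I]) > Word(High(AnsiChar)) then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'
    else
      Result := Result + S[I];
  end;
end;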
I have a text file which can come in different encodings (ASCII, UTF-8, UTF-16,UTF-32). The best part is that it is filled only with numbers, for example:
192848292732
My question is: will a function like the one below be able to display all the data correctly? If not, why? (I have loaded the file as a string into the container string)
function output(container: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(container) do
    if (Ord(container[i]) <> 0) then
      Result := Result + container[i];
end;
My logic is that if the encoding is anything other than ASCII or UTF-8, the extra bytes are all 0?
It passes all the tests just fine.
The ASCII character set uses codes 0-127. In Unicode, these characters map to code points with the same numeric value. So the question comes down to how each of the encodings represent code points 0-127.
UTF-8 encodes code points 0-127 in a single byte containing the code point value. In other words, if the payload is ASCII, then there is no difference between ASCII and UTF-8 encoding.
UTF-16 encodes code points 0-127 in two bytes, one of which is 0, and the other of which is the ASCII code.
UTF-32 encodes code points 0-127 in four bytes, three of which are 0, and the remaining byte is the ASCII code.
Your proposed algorithm will not be able to detect ASCII code 0 (NUL). But you state that character is not present in the file.
The only other problem that I can see with your proposed code is that it will not recognise a byte order mark (BOM). These may be present at the beginning of the file and I guess you should detect them and skip them.
Having said all of this, your implementation seems odd to me. You seem to state that the file only contains numeric characters. In which case your test could equally well be:
if container[i] in ['0'..'9'] then
.........
If you used this code then you would also happen to skip over a BOM, were it present.
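Put together, the digit test suggested above might look like the sketch below. The function name ExtractDigits is made up for illustration; it assumes, as the question states, that the payload consists only of the characters '0' to '9'.

```pascal
// Hypothetical variant of the asker's function: keeps only ASCII digits, so the
// zero padding bytes of UTF-16/UTF-32 and any BOM bytes are dropped.
function ExtractDigits(const Container: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Container) do
    if Container[I] in ['0'..'9'] then
      Result := Result + Container[I];
end;
```

For example, the UTF-16LE bytes FF FE 31 00 39 00 (a BOM followed by "19") and the UTF-8 bytes EF BB BF 31 39 both reduce to '19'.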
I am trying to encode the 'subject' field, written in Hebrew, of an email into Base64 so that the subject can be read correctly in all browsers. At the moment, I am using the encoding Windows-1255 which works on some clients but not all, so I want to use utf-8, base64.
My reading on the subject (no pun intended) shows that the text has to be in the form
=?<charset>?<encoding>?<encoded text>?=
eg
=?windows-1255?Q?=E0=E1?=
I have taken encoded subject lines from letters which were sent to me in Hebrew with UTF-8B encoding and decoded them successfully on this website, www.webatic.com/run/convert/base64.php. I have also used this website to encode simple letters and have noticed that the return encoding is not the same as the result which I get from a Delphi algorithm.
So - I am looking for an algorithm which successfully encodes letters such as aleph (ord=224), bet (ord=225), etc. According to the website, the string composed of the two letters aleph and bet returns the code 15DXkq==, but the basic Delphi algorithm returns Ue4 and the TIdEncoderQuotedPrintable component returns =E0=E1 (which is the ISO-8859 encoding).
Edit (after several comments):
I asked a friend to send me an email from her Mac computer, which unsurprisingly uses UTF-8 encoding (as opposed to Windows-1255). The subject was one letter, aleph, ord 224. The encoded subject appeared in the email's header as follows
=?UTF-8?B?15A=?=
This can be separated into three parts: the 'prefix' (=?UTF-8?B?) which means that UTF-8 with base64 encoding is being used; the 'payload' (15A=), which the web site which I quoted translates this correctly as the letter aleph; and the suffix (?=).
I need an algorithm to translate an arbitrary string of letters, most of which will be in Hebrew (and thus with ord >= 224) into base64/utf-8; a correct solution is one that decodes correctly on the web site quoted.
I'm sorry to have wasted all your time. I spent several hours again on the subject today and discovered that the base64 code which I was using has a huge bug.
The steps necessary to send a base64 encoded UTF-8 subject line are:
1. Convert 'normal' text (ie local ANSI code page) to UTF-8 via the AnsiToUTF8 function
2. Encode this into base64
3. Create a string with the prefix '=?UTF-8?B?', the result from step 2 and the suffix '?='
4. Send!
Here is the complete code for creating and sending the email (obviously simplified)
with IdSMTP1 do
begin
  host:= ....;
  username:= ....;
  password:= ....;
end;
with email do
begin
  From.Address:= ....;
  Recipients.EMailAddresses:= ....;
  cclist.add.address:= ....;
  // Encoded-word form: =?charset?encoding?text?=
  // (encode64 is assumed to emit the standard '=' padding itself)
  email.subject:= '=?UTF-8?B?' + encode64 (AnsiToUTF8 (edit1.text)) + '?=';
  email.Body.text:= ....;
end;
try
  IdSMTP1.Connect (1000);
  IdSMTP1.Send (email);
finally
  if IdSMTP1.Connected
    then IdSMTP1.Disconnect;
end;
Using the code on this page which is the same as this page, the 'codes64' string begins with the digits, then capital letters, then lower case letters and then punctuation. But this page shows that the capital letters should come first, followed by the lower case letters, followed by the digits, followed by the punctuation.
Once I had made this correction, the strings began to be encoded 'correctly' - I could read them properly in my email client, which I am taking to be the definition of 'correct'.
It would be interesting to read whether anybody else has had problems with the base64 encoding code which I found.
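For comparison, here is a minimal base64 sketch that uses the alphabet order described above (capital letters, lower case letters, digits, then punctuation) and wraps the result as an encoded-word. The names Base64Alphabet, Base64Encode and EncodeSubjectUtf8B are hypothetical, the input is assumed to be UTF-8 already (for example the result of AnsiToUTF8), and the sketch makes no attempt to split long subjects into several encoded-words.

```pascal
const
  // Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'
  Base64Alphabet: AnsiString =
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Hypothetical helper: encodes Data as base64 with '=' padding.
function Base64Encode(const Data: AnsiString): AnsiString;
var
  I, Len: Integer;
  B0, B1, B2: Byte;
begin
  Result := '';
  Len := Length(Data);
  I := 1;
  while I <= Len do
  begin
    // Take up to three input bytes; missing bytes count as zero
    B0 := Ord(Data[I]);
    if I + 1 <= Len then B1 := Ord(Data[I + 1]) else B1 := 0;
    if I + 2 <= Len then B2 := Ord(Data[I + 2]) else B2 := 0;
    // Split the 24 bits into four 6-bit indexes (+1 for 1-based strings)
    Result := Result + Base64Alphabet[(B0 shr 2) + 1];
    Result := Result + Base64Alphabet[(((B0 and $03) shl 4) or (B1 shr 4)) + 1];
    if I + 1 <= Len then
      Result := Result + Base64Alphabet[(((B1 and $0F) shl 2) or (B2 shr 6)) + 1]
    else
      Result := Result + '=';
    if I + 2 <= Len then
      Result := Result + Base64Alphabet[(B2 and $3F) + 1]
    else
      Result := Result + '=';
    Inc(I, 3);
  end;
end;

// Hypothetical helper: wraps UTF-8 text as =?UTF-8?B?...?=
function EncodeSubjectUtf8B(const Utf8Text: AnsiString): AnsiString;
begin
  Result := '=?UTF-8?B?' + Base64Encode(Utf8Text) + '?=';
end;
```

The single letter aleph is the two UTF-8 bytes D7 90, so EncodeSubjectUtf8B(#$D7#$90) yields =?UTF-8?B?15A=?=, matching the header received from the Mac.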
You do not need to encode the Subject property manually at all. TIdMessage encodes it automatically for you. Simply assign the Edit1.Text value as-is to the Subject and let TIdMessage encode it as needed.
If you want to customize how TIdMessage encodes headers, use the TIdMessage.OnInitializeISO event to provide the desired charset and encoding values. In Delphi 2009+, it defaults to UTF-8 and Base64. In earlier versions, TIdMessage reads the RTL's current OS language and chooses some default values for known languages. However, Hebrew is not one of them, and so ISO-8859-1 and QuotedPrintable would end up being used. You can override those values, eg:
email.Subject := Edit1.Text;
.
procedure TForm1.emailInitializeISO(var VHeaderEncoding: Char; var VCharSet: string);
begin
VHeaderEncoding := 'B';
VCharSet := 'UTF-8';
end;
I would like to read a UTF-8 text file byte by byte and get the ascii value representation of each byte in the file. Can this be done? If so, what is the best method?
My goal is to then replace 2-byte combinations that I find with one byte (these are set conditions that I have prepared)
For example, if I find a 197 followed by a 158 (decimal representations), I will replace it with a single byte 17
I don't want to use the standard delphi IO operations
AssignFile
ReSet
ReWrite(OutFile);
ReadLn
WriteLn
CloseFile
Is there a better method? Can this be done using TStream (Reader & Writer)?
Here is an example test I am using. I know there is a character (350) (two bytes) starting in column 84. When viewed in a hex editor, the character consists of 197 + 158, so I am trying to find the 197 using my Delphi code and can't seem to find it
FS1:= TFileStream.Create(ParamStr1, fmOpenRead);
try
  FS1.Seek(0, soBeginning);
  FS1.Position:= FS1.Position + 84;
  FS1.Read(B, SizeOf(B));
  if ord(B) = 197 then showMessage('True') else ShowMessage('False');
finally
  FS1.Free;
end;
You can use TFileStream to read all data from the file into, for instance, an array of bytes, and later check for UTF-8 sequences.
Also please note that utf8 sequence can contain more than 2 bytes.
And, in Delphi there is a function Utf8ToUnicode, which will convert utf8 data to usable unicode string.
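The stream-based approach could be sketched roughly as follows. Both routine names (ReplaceBytePair and ConvertFile) are hypothetical, the whole file is assumed to fit in memory, and an AnsiString is used purely as a byte buffer rather than as text.

```pascal
// uses Classes, SysUtils

// Hypothetical helper: replaces every occurrence of the byte pair
// (First, Second) in Data with the single byte Replacement.
function ReplaceBytePair(const Data: AnsiString;
  First, Second, Replacement: Byte): AnsiString;
var
  I: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(Data) do
  begin
    if (Ord(Data[I]) = First) and (I < Length(Data)) and
       (Ord(Data[I + 1]) = Second) then
    begin
      Result := Result + AnsiChar(Replacement);
      Inc(I, 2); // skip both bytes of the pair
    end
    else
    begin
      Result := Result + Data[I];
      Inc(I);
    end;
  end;
end;

// Hypothetical driver: reads InName as raw bytes, applies the 197+158 -> 17
// replacement from the question, and writes the result to OutName.
procedure ConvertFile(const InName, OutName: string);
var
  Stream: TFileStream;
  Data: AnsiString;
begin
  Stream := TFileStream.Create(InName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Data, Stream.Size);
    if Length(Data) > 0 then
      Stream.ReadBuffer(Data[1], Length(Data));
  finally
    Stream.Free;
  end;
  Data := ReplaceBytePair(Data, 197, 158, 17);
  Stream := TFileStream.Create(OutName, fmCreate);
  try
    if Length(Data) > 0 then
      Stream.WriteBuffer(Data[1], Length(Data));
  finally
    Stream.Free;
  end;
end;
```

Building the result one character at a time is simple but slow for large files; writing into a preallocated buffer would be the obvious refinement.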
My understanding is that you want to convert a text file from UTF-8 to ASCII. That's quite simple:
StringList.LoadFromFile(UTF8FileName, TEncoding.UTF8);
StringList.SaveToFile(ASCIIFileName, TEncoding.ASCII);
The runtime library comes with all sorts of functionality to convert between different text encodings. Surely you don't want to attempt to replicate this functionality yourself?
I trust you realise that this conversion is liable to lose data. Characters with ordinal greater than 127 cannot be represented in ASCII. In fact every code point that requires more than 1 octet in UTF-8 cannot be represented in ASCII.
You asked the same question 5 hours later in another topic, the answer of which better addresses your specific question:
Replacing a unicode character in UTF-8 file using delphi 2010
I get an incorrect result when converting a file to a string in Delphi XE. There are several ' characters that make the result incorrect. I've used UnicodeFileToWideString and FileToString from http://www.delphidabbler.com/codesnip and my code:
function LoadFile(const FileName: TFileName): ansistring;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
  begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
      // ReadBuffer(Result[1], Size);
    except
      Result := '';
      Free;
    end;
    Free;
  end;
end;
The result between Delphi XE and Delphi 6 is different. The result from D6 is correct. I've compared with result of a hex editor program.
Your output is being produced in the style of the Delphi debugger, which displays string variables using Delphi's own string-literal format. Whatever function you're using to produce that output from your own program has actually been fixed for Delphi XE. It's really your Delphi 6 output that's incorrect.
Delphi string literals consist of a series of printable characters between apostrophes and a series of non-printable characters designated by number signs and the numeric values of each character. To represent an apostrophe, write two of them next to each other. The printable and non-printable series of characters can be written right next to each other; there's no need to concatenate them with the + operator.
Here's an excerpt from the output you say is correct:
#$12'O)=ù'dlû'#6't
There are four lone apostrophes in that string, so each one either opens or closes a series of printable characters. We don't necessarily know which is which when we start reading the string at the left because the #, $, 1, and 2 characters are all printable on their own. But if they represent printable characters, then the O, ), =, and ù characters are in the non-printable region, and that can't be. Therefore, the first apostrophe above opens a printable series, and the #$12 part represents the character at code 18 (12 in hexadecimal). After the ù is another apostrophe. Since the previous one opened a printable string, this one must close it. But the next character after that is d, which is not #, and therefore cannot be the start of a non-printable character code. Therefore, this string from your Delphi 6 code is malformed.
The correct version of that excerpt is this:
#$12'O)=ù''dlû'#6't
Now there are three lone apostrophes and one set of doubled apostrophes. The problematic apostrophe from the previous string has been doubled, indicating that it is a literal apostrophe instead of a printable-string-closing one. The printable series continues with dlû. Then it's closed to insert character No. 6, and then opened again for t. The apostrophe that opens the entire string, at the beginning of the file, is implicit.
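To make the quoting rule concrete, here is a small sketch of a formatter that follows it. The name ToDelphiLiteral is hypothetical, "non-printable" is assumed to mean codes below 32, and character codes are written in decimal, whereas the output discussed above uses hexadecimal (#$12).

```pascal
// uses SysUtils (for IntToStr)

// Hypothetical helper: renders S in Delphi string-literal style, doubling
// apostrophes and writing control characters as #n outside the quotes.
function ToDelphiLiteral(const S: AnsiString): AnsiString;
var
  I: Integer;
  InQuotes: Boolean;
begin
  Result := '';
  InQuotes := False;
  for I := 1 to Length(S) do
    if S[I] < ' ' then
    begin
      // Control character: close any open printable series, then emit #n
      if InQuotes then
      begin
        Result := Result + '''';
        InQuotes := False;
      end;
      Result := Result + '#' + IntToStr(Ord(S[I]));
    end
    else
    begin
      // Printable character: open a series if needed
      if not InQuotes then
      begin
        Result := Result + '''';
        InQuotes := True;
      end;
      if S[I] = '''' then
        Result := Result + ''''''   // a literal apostrophe is written twice
      else
        Result := Result + S[I];
    end;
  if InQuotes then
    Result := Result + '''';
end;
```

A formatter that skipped the doubling step would produce the ambiguous form seen in the Delphi 6 output.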
You haven't indicated what code you're using to produce the output you've shown, but that's where the problem was. It's not there anymore, and the code that loads the file is correct, so the only place that needs your debugging attention is any code that depended on the old, incorrect format. You'd still do well to replace your code with that of Robmil since it does better at handling (or not handling) exceptions and empty files.
Actually, looking at the real data, your problem is that the file stores binary data, not string data, so interpreting this as a string is not valid at all. The only reason it works at all in Delphi 6 is that non-Unicode Delphi allows you to treat binary data and strings the same way. You cannot do this in Unicode Delphi, nor should you.
The solution to get the actual text from within the file is to read the file as binary data, and then copy any values from this binary data, one byte at a time, to a string if it is a "valid" Ansi character (printable).
I suggest this code:
function LoadFile(const FileName: TFileName): AnsiString;
begin
  with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do
    try
      SetLength(Result, Size);
      // Result[1] is only valid when the file is not empty
      if Size > 0 then
        Read(Result[1], Size);
    finally
      Free;
    end;
end;