My program reads from a device via a serial port and returns this string. 'IC'#$0088#$0080'Ô'#$0080#$0080
I need to get the 5 hex values and convert to binary. #$0088 = 10001000, #$0080 = 10000000, Ô = 11010100.
I can convert the 80 and 88, but am having difficulty extracting them from the whole string. The Ô (xD4) I can neither extract nor convert. An extended character like the Ô could be at any or all locations.
The read methods in my serial component are:
function Read(var Buffer; Count: Integer): Integer;
function ReadStr(var Str: string; Count: Integer): Integer;
function ReadAsync(var Buffer; Count: Integer; var AsyncPtr: PAsync): Integer;
function ReadStrAsync(var Str: Ansistring; Count: Integer; var AsyncPtr: PAsync): Integer;
Can you give me an example of reading binary?
It looks like the real problem is that you are treating binary data as though it were UTF-16 encoded text.
Whatever is feeding you this data is not feeding you UTF-16 encoded text. What the device is really sending is a byte array. Treat it as such rather than as text. Then you can pick out the five values you are interested in by index.
So, declare an array of bytes:
var
Data: TArray<Byte>; // dynamic array
or
var
Data: TBytes; // shorthand for the same
or
var
Data: array [0..N-1] of Byte; // fixed length array
And then read into those arrays. To pick out values, use Data[i].
Note that I am using a significant amount of guesswork here, based on the question and your comments. Don't take my word for it. My guessing could be wrong. Consult the specification of the communication protocol for the device. And learn carefully the difference between text and binary.
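As a rough illustration of the byte-array approach, here is a minimal sketch. It assumes the message layout from the question ('I', 'C', then five data bytes) and simulates the device with a constant array; the `ComPort` name in the comment is hypothetical, and only the `Read(var Buffer; Count)` signature comes from the question. `ByteToBinStr` is a helper made up for this example.

```pascal
program BinaryBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Returns the 8-bit binary representation of a byte, most significant bit first.
function ByteToBinStr(B: Byte): string;
var
  i: Integer;
begin
  SetLength(Result, 8);
  for i := 0 to 7 do
    if ((B shr i) and 1) = 1 then
      Result[8 - i] := '1'
    else
      Result[8 - i] := '0';
end;

const
  // Stand-in for the bytes the device sent: 'I', 'C', then five data bytes.
  Sample: array[0..6] of Byte = ($49, $43, $88, $80, $D4, $80, $80);
var
  Data: TBytes;
  i: Integer;
begin
  // With the real component this would be something like (hypothetical name):
  //   SetLength(Data, 7);
  //   ComPort.Read(Data[0], Length(Data));
  SetLength(Data, Length(Sample));
  Move(Sample[0], Data[0], Length(Sample));

  // Indexes 2..6 hold the five values of interest.
  for i := 2 to High(Data) do
    Writeln(Format('Data[%d] = $%.2x = %s', [i, Data[i], ByteToBinStr(Data[i])]));
end.
```

Because the bytes never pass through a string type, no code page conversion can alter them, so $88 stays 10001000 and $D4 stays 11010100.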
As I wrote earlier in the comments, the problem with the message in your question is that it consists partly of non-ASCII characters. The ASCII range runs from $00 to $7F and maps to the same characters as Unicode U+0000 to U+007F, so no conversion is needed (apart from the added leading zero byte). Ansi characters ($80 to $FF), on the other hand, are converted according to the code page in use, in order to keep the same glyph in both encodings. For example, AnsiChar $80 (the Euro sign in CP1252) is converted to Unicode U+20AC, and AnsiChar $88 to U+02C6. The bit pattern of the lower byte no longer matches.
Ref: https://msdn.microsoft.com/en-us/library/cc195054.aspx
The following code shows the result of two tests, using Char vs. AnsiChar:
procedure TMainForm.Button2Click(Sender: TObject);
const
  Buffer: array[0..7] of AnsiChar = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
//  Buffer: array[0..7] of Char = ('I','C', #$88, #$80, #$D4, #$80, #$80, ';');
  BinChars: array[0..1] of Char = ('0','1');
var
  i, k: integer;
  c: AnsiChar;
//  c: Char;
  s: string;
begin
  for k := 2 to 6 do
  begin
    c := Buffer[k];
    SetLength(s, 8);
    for i := 0 to 7 do
      s[8-i] := BinChars[(ord(c) shr i) and 1];
    Memo1.Lines.Add(format('Character %d in binary format: %s',[k, s]));
  end;
end;
Using Char (UTF-16 WideChar)
AnsiChar #$88 is converted to U+02C6
AnsiChar #$80 is converted to U+20AC
AnsiChar #$D4 is converted to U+00D4 !
Lower byte gives
Character 2 in binary format: 11000110
Character 3 in binary format: 10101100
Character 4 in binary format: 11010100
Character 5 in binary format: 10101100
Character 6 in binary format: 10101100
Using AnsiChar
Character 2 in binary format: 10001000
Character 3 in binary format: 10000000
Character 4 in binary format: 11010100
Character 5 in binary format: 10000000
Character 6 in binary format: 10000000
Unfortunately a conversion from Unicode to Ansi (even if originally converted from Ansi to Unicode) is lossy and will fail.
I really don't see any easy solution with the information available.
I am trying to better understand surrogate pairs and Unicode implementation in Delphi.
If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back, 8.
This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.
If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that? I know I would need to do some sort of testing of the individual bytes. I ran some tests using the routine
function GetFirstCodepointSize(const S: UTF8String): Integer;
referenced in this SO Question.
but got some unusual results, eg, here are some length and sizes of some different codepoints. Below is a snippet of how I generated these tables.
...
UTFCRUDResultStrings.add('INPUT: '+#9#9+ DATA +#9#9+ 'GetFirstCodePointSize = ' +intToStr(GetFirstCodepointSize(DATA))
+#9#9+ 'Length =' + intToStr(length(DATA)));
...
First Set: This makes sense to me, each code point size is doubled, but these are one character each and Delphi gives me the length as just 1, perfect.
INPUT: ď GetFirstCodePointSize = 2 Length =1
INPUT: ơ GetFirstCodePointSize = 2 Length =1
INPUT: ǥ GetFirstCodePointSize = 2 Length =1
Second set: It initially looks to me like the lengths and code points are reversed? I am guessing the reason for this is that the characters + surrogates are being treated individually, hence the first codepoint size is for the 'H', which is 1, but the length is returning the lengths of 'H' plus '^'.
INPUT: Ĥ GetFirstCodePointSize = 1 Length =2
INPUT: à̲ GetFirstCodePointSize = 1 Length =3
INPUT: V̂ GetFirstCodePointSize = 1 Length =2
INPUT: e GetFirstCodePointSize = 1 Length =1
Some additional tests...
INPUT: ¼ GetFirstCodePointSize = 2 Length =1
INPUT: ₧ GetFirstCodePointSize = 3 Length =1
INPUT: 𤭢 GetFirstCodePointSize = 4 Length =2
INPUT: ß GetFirstCodePointSize = 2 Length =1
INPUT: 𨳒 GetFirstCodePointSize = 4 Length =2
Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?
I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.
I am trying to better understand surrogate pairs and Unicode implementation in Delphi.
Let's get some terminology out of the way.
Each "character" (known as a grapheme) that is defined by Unicode is assigned a unique codepoint.
In a Unicode Transformation Format (UTF) encoding - UTF-7, UTF-8, UTF-16, and UTF-32 - each codepoint is encoded as a sequence of codeunits. The size of each codeunit is determined by the encoding - 7 bits for UTF-7, 8 bits for UTF-8, 16 bits for UTF-16, and 32 bits for UTF-32 (hence their names).
In Delphi 2009 and later, String is an alias for UnicodeString, and Char is an alias for WideChar. WideChar is 16 bits. A UnicodeString holds a UTF-16 encoded string (in earlier versions of Delphi, the equivalent string type was WideString), and each WideChar is a UTF-16 codeunit.
In UTF-16, a codepoint can be encoded using either 1 or 2 codeunits. 1 codeunit can encode codepoint values in the Basic Multilingual Plane (BMP) range - $0000 to $FFFF, inclusive. Higher codepoints require 2 codeunits, which is also known as a surrogate pair.
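To make the surrogate-pair arithmetic concrete, here is a small sketch of the standard UTF-16 calculation. The function names are made up for this example and are not RTL routines.

```pascal
program SurrogatePairDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// High (leading) surrogate for a codepoint in the range $10000..$10FFFF.
function HighSurrogateOf(CP: Cardinal): Word;
begin
  // Subtract $10000, then take the top 10 of the remaining 20 bits.
  Result := Word($D800 or ((CP - $10000) shr 10));
end;

// Low (trailing) surrogate for a codepoint in the range $10000..$10FFFF.
function LowSurrogateOf(CP: Cardinal): Word;
begin
  // Subtract $10000, then take the bottom 10 bits.
  Result := Word($DC00 or ((CP - $10000) and $3FF));
end;

// Reverses the split: combines a surrogate pair back into one codepoint.
function SurrogatesToCodepoint(HighSur, LowSur: Word): Cardinal;
begin
  Result := (Cardinal(HighSur - $D800) shl 10) + Cardinal(LowSur - $DC00) + $10000;
end;

begin
  // U+13000 needs two codeunits.
  Writeln(IntToHex(HighSurrogateOf($13000), 4), ' ',
          IntToHex(LowSurrogateOf($13000), 4));                // prints D80C DC00
  Writeln(IntToHex(SurrogatesToCodepoint($D80C, $DC00), 5));   // prints 13000
end.
```

Codepoints up to $FFFF need no such split; each one is stored as a single WideChar.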
If I call length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I will get back, 8.
This is because the lengths of the individual characters [Ĥ],[à̲],[V̂], and [e] are 2, 3, 2, and 1 respectively.
This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates.
Yes, there are 8 WideChar elements (codeunits) in your UTF-16 UnicodeString. What you are calling "surrogates" are actually known as "combining marks". Each combining mark is its own unique codepoint, and thus its own codeunit sequence.
If I wanted to return the second element in the string including all surrogates, [à̲], how would I do that?
You have to start at the beginning of the UnicodeString and analyze each WideChar until you find one that is not a combining mark attached to a previous WideChar. On Windows, the easiest way to do that is to use the CharNextW() function, eg:
var
S: String;
P: PChar;
begin
S := 'Ĥà̲V̂e';
P := CharNext(PChar(S)); // returns a pointer to à̲
end;
The Delphi RTL does not have an equivalent function. You would have to write one manually, or use a third-party library. The RTL does have a StrNextChar() function, but it only handles UTF-16 surrogates, not combining marks (CharNext() handles both). So, you could use StrNextChar() to scan through each codepoint in the UnicodeString, but you have to look at each codepoint to know whether it is a combining mark or not, eg:
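uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  if (P <> nil) and (P^ <> #0) then
  begin
    Result := StrNextChar(P);
    // Skip all mark categories (Mc, Me, Mn); accents such as U+0302 are
    // non-spacing marks, so testing ucCombiningMark alone would miss them.
    while GetUnicodeCategory(Result^) in [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end else begin
    Result := nil;
  end;
end;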
var
S: String;
P: PChar;
begin
S := 'Ĥà̲V̂e';
P := MyCharNext(PChar(S)); // should return a pointer to à̲
end;
I know I would need to do some sort of testing of the individual bytes.
Not the bytes, but the codepoints that they represent when decoded.
I ran some tests using the routine
function GetFirstCodepointSize(const S: UTF8String): Integer
Look closely at that function signature. See the parameter type? It is a UTF-8 string, not a UTF-16 string. This was even stated in the answer you got that function from:
Here is an example how to parse UTF8 string
UTF-8 and UTF-16 are very different encodings, and thus have different semantics. You cannot use UTF-8 semantics to process a UTF-16 string, and vice versa.
Is there a reliable way in Delphi to determine where an element in a Unicode String starts and ends?
Not directly. You have to parse the string from the beginning, skipping elements as needed until you reach the desired element. Remember that each codepoint may be encoded as either 1 or 2 codeunit elements, and each logical glyph may be encoded using multiple codepoints (and thus multiple codeunit sequences).
I know my terminology using the word element may be off, but I don't think codepoint and character are right either, particularly given that one element may have a codepoint size of 3, but have a length of only one.
1 glyph is comprised of 1+ codepoints, and each codepoint is encoded as 1+ codeunits.
Could someone implement the following function?
function GetElementAtIndex(S: String; StrIdx : Integer): String;
Try something like this:
uses
  SysUtils, Character;

function MyCharNext(P: PChar): PChar;
begin
  Result := P;
  if Result <> nil then
  begin
    Result := StrNextChar(Result);
    // Skip all mark categories (Mc, Me, Mn); accents such as U+0302 are
    // non-spacing marks, so testing ucCombiningMark alone would miss them.
    while GetUnicodeCategory(Result^) in [ucCombiningMark, ucEnclosingMark, ucNonSpacingMark] do
      Result := StrNextChar(Result);
  end;
end;

function GetElementAtIndex(S: String; StrIdx : Integer): String;
var
  pStart, pEnd: PChar;
begin
  Result := '';
  if (S = '') or (StrIdx < 1) then Exit; // StrIdx is 1-based
  pStart := PChar(S);
  while StrIdx > 1 do
  begin
    pStart := MyCharNext(pStart);
    if pStart^ = #0 then Exit; // ran off the end of the string
    Dec(StrIdx);
  end;
  pEnd := MyCharNext(pStart);
  {$POINTERMATH ON}
  SetString(Result, pStart, pEnd-pStart);
end;
Looping through the graphemes of a string can be more complicated than you might think. In Unicode 13, some graphemes are up to 14 bytes long. I advise using a third-party library for this. One of the best for this is Skia4Delphi: https://github.com/skia4delphi/skia4delphi
The code is very simple:
var LUnicode: ISkUnicode := TSkUnicode.Create;
for var LGrapheme: string in LUnicode.GetBreaks('Text', TSkBreakType.Graphemes) do
Showmessage(LGrapheme);
The library demo itself also includes an example of a grapheme iterator.
I am using Delphi 6.
I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.
The original text is "ANÁLISE8". After using UTF8Decode(), the result is "ANALISE8". The symbol on top of the "A" disappears.
Here is the code:
var
f : textfile;
s : UTF8String;
w, test : WideString;
begin
while not eof(f) do
begin
readln(f,s);
w := UTF8Decode(s);
How can I decode the Portuguese UTF-8 string to WideString correctly?
Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support encoded 4-byte sequences, which are needed to handle Unicode codepoints above U+FFFF. This means UTF8Decode() can only decode Unicode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes it basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).
Try using the Win32 MultiByteToWideChar() function instead, eg:
uses
  ..., Windows;

function MyUTF8Decode(const s: UTF8String): WideString;
var
  Len: Integer;
begin
  // First call: ask how many WideChars the decoded text needs
  Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
  SetLength(Result, Len);
  if Len > 0 then
    // Second call: decode into the allocated buffer
    MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
end;
var
f : textfile;
s : UTF8String;
w, test : WideString;
begin
while not eof(f) do
begin
readln(f,s);
w := MyUTF8Decode(s);
That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either:
your UTF8String variable DOES NOT contain the UTF-8 encoded form of ANÁLISE8 to begin with (byte sequence 41 4E C3 81 4C 49 53 45 38), but instead contains the ASCII string ANALISE8 instead (byte sequence 41 4E 41 4C 49 53 45 38), which would decode as-is since ASCII is a subset of UTF-8. Double check your file, and the output of Readln().
your WideString contains ANÁLISE8 correctly as expected, but the way you are outputting/debugging it (which you did not show) is converting it to ANSI, losing the Á during the conversion.
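One way to tell these two cases apart is to dump the raw bytes of the string before decoding it. The following is a minimal sketch with a made-up helper name (`BytesToHex`); it takes an AnsiString, which in Delphi 6 involves no conversion, whereas in Delphi 2009 and later the parameter would need to be a RawByteString to avoid a code page conversion.

```pascal
program Utf8HexDump;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Returns the bytes of S as space-separated two-digit hex values.
function BytesToHex(const S: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(S[i]), 2);
  end;
end;

const
  // Stand-in for what Readln() should deliver: the UTF-8 form of the text.
  Raw: array[0..8] of Byte = ($41, $4E, $C3, $81, $4C, $49, $53, $45, $38);
var
  S: AnsiString;
begin
  // Build the string from raw bytes so no literal conversion interferes.
  SetString(S, PAnsiChar(@Raw[0]), Length(Raw));
  Writeln(BytesToHex(S)); // prints 41 4E C3 81 4C 49 53 45 38
end.
```

If the dump of the string read from the file shows 41 4E 41 4C ... instead, the accent was already lost before UTF8Decode() was called.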
I am testing migration from Delphi 5 to XE. Being unfamiliar with UnicodeString, before asking my question I would like to present its background.
Delphi XE string-oriented functions: Copy, Delete and Insert have a parameter Index telling where the operation should start. Index may have any integer value starting from 1 and finishing at the length of the string to which the function is applied.
Since the string can have multi-element characters, function operation can start at an element (surrogate) belonging to a multi-element series encoding a single unicode named code-point.
Then, having a sensible string and using one of the functions, we can obtain non sensible result.
The phenomenon can be illustrated with the below cases using the function Copy with respect to strings representing the same array of named codepoints (i.e. meaningful signs)
($61, $13000, $63)
It is the concatenation of 'a', EGYPTIAN_HIEROGLYPH_A001 and 'c'.
Case 1. Copy of AnsiString (element = byte)
We start with the above mentioned UnicodeString #$61#$13000#$63 and we convert it to UTF-8 encoded AnsiString s0.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 6 of them since s0 is 6 bytes long.
procedure Copy_Utf8Test;
type TAnsiStringUtf8 = type AnsiString (CP_UTF8);
var ss : string;
s0,s1 : TAnsiStringUtf8;
ii : integer;
begin
ss := #$61#$13000#$63; //mem dump of ss: $61 $00 $0C $D8 $00 $DC $63 $00
s0 := ss; //mem dump of s0: $61 $F0 $93 $80 $80 $63
ii := length(s0); //sets ii=6 (bytes)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$F0 F means "start of 4-byte series"; no corresponding named code-point
s1 := copy(s0,3,1); //#$93 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,4,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,5,1); //#$80 "trailing in multi-byte series"; no corresponding named code-point
s1 := copy(s0,6,1); //'c'
end;
The first and last results are sensible within UTF-8 codepage, while the other 4 are not.
Case 2. Copy of UnicodeString (element = word)
We start with the same UnicodeString s0 := #$61#$13000#$63.
Then we test the function
copy (s0, index, 1)
for all possible values of index; there are 4 of them since s0 is 4 words long.
procedure Copy_Utf16Test;
var s0,s1 : string;
ii : integer;
begin
s0 := #$61#$13000#$63; //mem dump of s0: $61 $00 $0C $D8 $00 $DC $63 $00
ii := length(s0); //sets ii=4 (words)
s1 := copy(s0,1,1); //'a'
s1 := copy(s0,2,1); //#$D80C surrogate pair member; no corresponding named code-point
s1 := copy(s0,3,1); //#$DC00 surrogate pair member; no corresponding named code-point
s1 := copy(s0,4,1); //'c'
end;
The first and last results are sensible within codepage CP_UNICODE (1200), while the other 2 are not.
Conclusion.
The string-oriented functions: Copy, Delete and Insert perfectly operate on string considered as a mere array of bytes or words. But they are not helpful if string is seen as that what it essentially is, i.e. representation of array of named code-points.
Both above two cases deal with strings which represent the same array of 3 named code-points. They are considered as representations (encodings) of the same text composed of 3 meaningful signs (to avoid abuse of the term "characters").
One may want to be able to extract (copy) any of those meaningful signs regardless whether a particular text representation (encoding) is mono- or multi-element one.
I've spent quite some time looking around for a satisfactory equivalent of the Copy that I was used to in Delphi 5.
Question.
Do such equivalents exist, or do I have to write them myself?
What you have described is how Copy(), Delete(), and Insert() have ALWAYS worked, even for AnsiString. The functions operate on elements (ie codeunits in Unicode terminology), and always have.
AnsiString is a string of 8bit AnsiChar elements, which can be encoded in any 8bit ANSI/MBCS format, including UTF-8.
UnicodeString (and WideString) is a string of 16bit WideChar elements, which are encoded in UTF-16.
The functions HAVE NEVER taken encoding into account. Not for MBCS AnsiString. Not for UTF-16 UnicodeString. Indexes are absolute element indexes from the beginning of the string.
If you need encoding-aware Copy/Delete/Insert functions that operate on logical codepoint boundaries, where each codepoint may be 1+ elements in the string, then you have to write your own functions, or find third-party functions that do what you need. There are no MBCS/UTF-aware manipulator functions in the RTL.
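For the UTF-16 case, such a hand-written function might look like the following sketch. All names here are hypothetical, it only respects surrogate pairs (not combining marks), and it treats an unpaired surrogate as a codepoint of its own.

```pascal
program CodepointCopyDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) and $FC00) = $D800;
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) and $FC00) = $DC00;
end;

// Returns the element index just past the codepoint that starts at element I.
function NextCodepointIndex(const S: UnicodeString; I: Integer): Integer;
begin
  if (I < Length(S)) and IsHighSurrogate(S[I]) and IsLowSurrogate(S[I + 1]) then
    Result := I + 2   // well-formed surrogate pair: two elements
  else
    Result := I + 1;  // BMP codepoint (or unpaired surrogate): one element
end;

// Like Copy(), but Index and Count are measured in codepoints, not elements.
function CopyCodepoints(const S: UnicodeString; Index, Count: Integer): UnicodeString;
var
  StartPos, EndPos: Integer;
begin
  Result := '';
  if (Index < 1) or (Count < 1) then Exit;
  StartPos := 1;
  while (Index > 1) and (StartPos <= Length(S)) do
  begin
    StartPos := NextCodepointIndex(S, StartPos);
    Dec(Index);
  end;
  EndPos := StartPos;
  while (Count > 0) and (EndPos <= Length(S)) do
  begin
    EndPos := NextCodepointIndex(S, EndPos);
    Dec(Count);
  end;
  Result := Copy(S, StartPos, EndPos - StartPos);
end;

var
  S, Second: UnicodeString;
begin
  S := #$0061#$D80C#$DC00#$0063;   // 'a', U+13000 as a surrogate pair, 'c'
  Second := CopyCodepoints(S, 2, 1);
  Writeln(Length(Second));          // prints 2: both halves of the pair
  Writeln(CopyCodepoints(S, 3, 1)); // prints c
end.
```

Every call scans from the start of the string, so the cost grows with Index; that is the price of indexing by codepoint in a variable-width encoding.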
You should parse the Unicode string yourself. Fortunately, the Unicode encodings are designed to make parsing easy. Here is an example how to parse UTF8 string:
program Project9;
{$APPTYPE CONSOLE}
uses
SysUtils;
function GetFirstCodepointSize(const S: UTF8String): Integer;
var
B: Byte;
begin
B:= Byte(S[1]);
if (B and $80 = 0 ) then
Result:= 1
else if (B and $E0 = $C0) then
Result:= 2
else if (B and $F0 = $E0) then
Result:= 3
else if (B and $F8 = $F0) then
Result:= 4
else
Result:= -1; // invalid code
end;
var
S: string;
begin
S:= #$61#$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$13000#$63;
Writeln(GetFirstCodepointSize(S));
S:= #$63;
Writeln(GetFirstCodepointSize(S));
Readln;
end.
I'm generating texture atlases for rendering Unicode texts in my app. Source texts are stored in ANSI codepages (1250, 1251, 1254, 1257, etc). I want to be able to generate all the symbols from each ANSI codepage.
Here is the outline of the code I would expect to have:
for I := 0 to 255 do
begin
anChar := AnsiChar(I); //obtain AnsiChar
//Apply codepage without converting the chars
//<<--- this part does not work, showing:
//"E2033 Types of actual and formal var parameters must be identical"
SetCodePage(anChar, aCodepages[K], False);
//Assign AnsiChar to UnicodeChar (automatic conversion)
uniChar := anChar;
//Here we get Unicode character index
uniCode := Ord(uniChar);
end;
The code above does not work (E2033), and I'm not sure it is a proper solution at all. Perhaps there's a much shorter version.
What is the proper way of converting AnsiChar into Unicode with specific codepage in mind?
I would do it like this:
function AnsiCharToWideChar(ac: AnsiChar; CodePage: UINT): WideChar;
begin
  // Convert exactly one byte into exactly one UTF-16 code unit
  if MultiByteToWideChar(CodePage, 0, @ac, 1, @Result, 1) <> 1 then
    RaiseLastOSError;
end;
I think you should avoid using strings for what is in essence a character operation. If you know up front which code pages you need to support then you can hard code the conversions into a lookup table expressed as an array constant.
Note that all the characters that are defined in the ANSI code pages map to Unicode characters from the Basic Multilingual Plane and so are represented by a single UTF-16 character. Hence the size assumptions of the code above.
However, the assumption that you are making, and that this answer persists, is that a single byte represents a character in an ANSI character set. That's a valid assumption for many character sets, for example the single byte western character sets like 1252. But there are character sets like 932 (Japanese), 949 (Korean) etc. that are double byte character sets. Your entire approach breaks down for those code pages. My guess is that you only wish to support single byte character sets.
If you are writing cross-platform code then you can replace MultiByteToWideChar with UnicodeFromLocaleChars.
You can also do it in one step for all characters. Here is an example for codepage 1250:
var
encoding: TEncoding;
bytes: TBytes;
unicode: TArray<Word>;
I: Integer;
S: string;
begin
SetLength(bytes, 256);
for I := 0 to 255 do
bytes[I] := I;
SetLength(unicode, 256);
encoding := TEncoding.GetEncoding(1250); // change codepage as needed
try
S := encoding.GetString(bytes);
for I := 0 to 255 do
unicode[I] := Word(S[I+1]); // as long as strings are 1-based
finally
encoding.Free;
end;
end;
Here is the code I have found to be working well:
var
I: Byte;
anChar: AnsiString;
Tmp: RawByteString;
uniChar: Char;
uniCode: Word;
begin
for I := 0 to 255 do
begin
anChar := AnsiChar(I);
Tmp := anChar;
SetCodePage(Tmp, aCodepages[K], False);
uniChar := UnicodeString(Tmp)[1];
uniCode := Word(uniChar);
<...snip...>
end;
Is there a function i can use to convert a floating point value to a hexadecimal value and back?
procedure ShowBinaryRepresentation;
var
  S, S2: Single;
  I: Integer;
  D, D2: Double;
  I64: Int64;
begin
  S := -1.841;
  I := PInteger(@S)^;   // reinterpret the 4 bytes of the Single as an Integer
  OutputWriteLine(Format('Single in binary representation: %.8X', [I]));
  S2 := PSingle(@I)^;   // and back again
  OutputWriteLine(Format('Converted back to single: %f', [S2]));

  D := -1.841E50;
  I64 := PInt64(@D)^;   // reinterpret the 8 bytes of the Double as an Int64
  OutputWriteLine(Format('Double in binary representation: %.16X', [I64]));
  D2 := PDouble(@I64)^; // and back again
  OutputWriteLine(Format('Converted back to double: %f', [D2]));
end;
Single in binary representation: BFEBA5E3
Converted back to single: -1,84
Double in binary representation: CA5F7DD860D57D4D
Converted back to double: -1,841E50
Not for floating point. There is IntToHex to convert integers (maybe Int64 too) to hex strings.
You probably need to pry apart the bits of the IEEE-format Double type and convert them to hex, which you then concatenate with a point in between.
If this is not the answer you are looking for, please specify what you mean by "hexadecimal value".
If you just want to convert it to an array of, e.g., bytes, define an array-of-Byte type of a suitable size (4 for Single, 8 for Double, and 10 for Extended) and typecast.
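The typecast route could look like this minimal sketch for Single. The type and function names are made up for the example, and the byte order shown assumes a little-endian target such as x86.

```pascal
program FloatBytesDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TSingleBytes = array[0..3] of Byte;   // same size as Single (4 bytes)

// Returns the bytes of a Single, in memory order, as a hex string.
function SingleToHexBytes(Value: Single): string;
var
  B: TSingleBytes;
  i: Integer;
begin
  B := TSingleBytes(Value);   // variable typecast: reinterprets, no conversion
  Result := '';
  for i := Low(B) to High(B) do
    Result := Result + IntToHex(B[i], 2);
end;

var
  S: Single;
begin
  S := -1.841;
  // Little-endian memory order, i.e. the bytes of $BFEBA5E3 reversed.
  Writeln(SingleToHexBytes(S)); // prints E3A5EBBF
end.
```

The same pattern applies to Double with an 8-byte array; only the array bounds change.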