Getting char value in Delphi 7 - delphi

I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABC&#291;&#311;&#299;"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 256 entries (0-255), each holding a character specific to that codepage, except the first 128 (0-127), which are the same for all the codepages.
So - How do I get a value of a char, that is in 1-255 range?
I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 and so on) until my string reaches non-Latin Unicode.
Is there any other method for returning char value? Any help appreciated

I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
WideStrings have the advantage that they're 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
@DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.

In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, it does not matter. As such, "ABC&#291;&#311;&#299;" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABC&#196;&#163;&#196;&#183;&#196;&#171;" or "ABC&#xC4;&#xA3;&#xC4;&#xB7;&#xC4;&#xAB;" instead. Most other charsets do not support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC&#291;&#311;&#299;" or "ABC&#x123;&#x137;&#x12B;".
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
If you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
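The HTML 5 steps above can be sketched as a single loop that combines surrogate pairs by hand instead of calling WideStringToUCS4String(). This is only a minimal sketch, not the answerer's code: the function name WideToHtml5Refs is hypothetical, and treating &, <, > and " plus everything above 127 as "reserved" is an assumption.

```pascal
// Needs SysUtils in the uses clause (for IntToStr).
// Hypothetical helper: emits decimal character references per code point.
function WideToHtml5Refs(const S: WideString): AnsiString;
var
  I, Code, Trail: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    Code := Ord(S[I]);
    // A high surrogate followed by a low surrogate forms one code point.
    if (Code >= $D800) and (Code <= $DBFF) and (I < Length(S)) then
    begin
      Trail := Ord(S[I + 1]);
      if (Trail >= $DC00) and (Trail <= $DFFF) then
      begin
        Code := $10000 + ((Code - $D800) shl 10) + (Trail - $DC00);
        Inc(I); // skip the low surrogate that was just consumed
      end;
    end;
    // Assumption: non-ASCII and the HTML special characters get a reference.
    if (Code > 127) or (Code = Ord('&')) or (Code = Ord('<')) or
       (Code = Ord('>')) or (Code = Ord('"')) then
      Result := Result + '&#' + IntToStr(Code) + ';'
    else
      Result := Result + AnsiChar(Code);
    Inc(I);
  end;
end;
```

With this sketch, a WideString holding "ABCģķī" should come out as ABC&#291;&#311;&#299;, since those three letters are U+0123, U+0137 and U+012B.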

In case I understood the OP correctly, I'll just leave this here.
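function Entities(const S: WideString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
  begin
    // Anything outside the 8-bit range becomes a decimal character reference
    if Word(S[I]) > Word(High(AnsiChar)) then
      Result := Result + '&#' + IntToStr(Word(S[I])) + ';'
    else
      Result := Result + S[I]; // the WideChar is narrowed to the ANSI string here
  end;
end;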

Related

Getting a unicode, hidden symbol, as data in Delphi

I'm writing a delimiter for some Excel spreadsheet data and I need to read the rightward arrow symbol and pilcrow symbol in a large string.
The pilcrow symbol, for row ends, was fairly simple, using the Chr function and the AnsiChar code 182.
The rightward arrow has been more tricky to figure out. There isn't an AnsiChar code for it. Its Unicode value is U+2192 (hexadecimal 2192). I can't, however, figure out how to make this into a string or char type for me to use in my function.
Any easy ways to do this?
You can't use the U+2192 character directly. But since a STRING variable can't contain this value either (and thus your TStringList can't either), that doesn't matter.
What character(s) are the 2192 character represented as in your StringList AFTER you have read it in? Probably by these three separate characters: 0xE2 0x86 0x92 (in UTF-8 format). The simple solution, therefore, is to start by replacing these three characters with a single, unique character that you can then assign to the Delimiter field of the TStringList.
Like this:
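...
<Read file into a STRING variable, say S>
// StringReplace (SysUtils) is used here because it is available in every Delphi version;
// rfReplaceAll swaps every occurrence of the three UTF-8 bytes for '|'
S := StringReplace(S, #$E2#$86#$92, '|', [rfReplaceAll]);
SL := TStringList.Create;
SL.Text := S;
SL.Delimiter := '|';
...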
You'll have to select a single-character representation of your 3-byte UTF-8 Unicode character that doesn't occur in your data elsewhere.
You need to represent that character as a UTF-16 character. The code point is hexadecimal (U+2192), so it needs the $ prefix. In Unicode Delphi you would do it like this:
Chr($2192)
which is of type WideChar.
However, you are using Delphi 7 which is a pre-Unicode Delphi. So you have to do it like this:
var
  wc: WideChar;
....
wc := WideChar($2192);
Now, this might all be to no avail for you since it sounds a little like your code is working with 8 bit ANSI text. In which case that character cannot be encoded in any 8 bit ANSI character set. If you really must use that character, you'll need to use Unicode text.
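If the text arrives as UTF-8 bytes, one way to combine the two answers is to decode it to a WideString first and then look for the arrow as a single WideChar. This is a minimal sketch that assumes the input really is UTF-8. FindRightArrow is a hypothetical name, while UTF8Decode is the standard RTL routine.

```pascal
// Hypothetical helper: returns the 1-based index of the first
// rightwards arrow (U+2192) in the decoded text, or 0 if there is none.
function FindRightArrow(const Utf8Text: AnsiString): Integer;
var
  Wide: WideString;
  I: Integer;
begin
  Result := 0;
  // The three bytes E2 86 92 decode to the single UTF-16 unit $2192.
  Wide := UTF8Decode(Utf8Text);
  for I := 1 to Length(Wide) do
    if Wide[I] = WideChar($2192) then
    begin
      Result := I;
      Exit;
    end;
end;
```

Note that the index refers to the decoded WideString, not to a byte offset in the original UTF-8 data.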

Convert unicode to ascii

I have a text file which can come in different encodings (ASCII, UTF-8, UTF-16,UTF-32). The best part is that it is filled only with numbers, for example:
192848292732
My question is: will a function like the one below be able to display all the data correctly? If not, why? (I have loaded the file as a string into the container string.)
function output(container: AnsiString): AnsiString;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(container) do
    if (Ord(container[i]) <> 0) then
      Result := Result + container[i];
end;
My logic is that if the encoding is anything other than ASCII or UTF-8, the extra bytes are all 0. Is that right?
It passes all the tests just fine.
The ASCII character set uses codes 0-127. In Unicode, these characters map to code points with the same numeric value. So the question comes down to how each of the encodings represent code points 0-127.
UTF-8 encodes code points 0-127 in a single byte containing the code point value. In other words, if the payload is ASCII, then there is no difference between ASCII and UTF-8 encoding.
UTF-16 encodes code points 0-127 in two bytes, one of which is 0, and the other of which is the ASCII code.
UTF-32 encodes code points 0-127 in four bytes, three of which are 0, and the remaining byte is the ASCII code.
Your proposed algorithm will not be able to detect ASCII code 0 (NUL). But you state that character is not present in the file.
The only other problem that I can see with your proposed code is that it will not recognise a byte order mark (BOM). These may be present at the beginning of the file and I guess you should detect them and skip them.
Having said all of this, your implementation seems odd to me. You seem to state that the file only contains numeric characters. In which case your test could equally well be:
if container[i] in ['0'..'9'] then
.........
If you used this code then you would also happen to skip over a BOM, were it present.
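Written out as a complete function, the digit test suggested above might look like the following. It is a minimal sketch under the question's assumption that the payload consists only of the characters 0-9; the name DigitsOnly is hypothetical.

```pascal
// Hypothetical helper: keeps only the ASCII digits from the raw file bytes.
// The zero padding bytes of UTF-16/UTF-32 and the bytes of any BOM
// (EF BB BF, FF FE, FE FF) are not digits, so they are dropped as well.
function DigitsOnly(const Raw: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Raw) do
    if Raw[I] in ['0'..'9'] then
      Result := Result + Raw[I];
end;
```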

Why do string constants use wide characters even when formed entirely from 8 bit characters?

I just posted a question about Unicode character constants, where $HIGHCHARUNICODE appeared to be the reason.
Now with the default $HIGHCHARUNICODE OFF (Delphi XE2), why is this:
const
AllLowByteValues =#$00#$01#$02#$03#$04#$05#$06#$07#$08#$09#$0a#$0b#$0c#$0d#$0e#$0f;
AllHighByteValues=#$D0#$D1#$D2#$D3#$D4#$D5#$D6#$D7#$D8#$D9#$Da#$Db#$Dc#$Dd#$De#$Df;
==> Sizeof(AllLowByteValues[1]) = 2
==> Sizeof(AllHighByteValues[1]) = 2
If "All hexadecimal #$xx 2-digit literals are parsed as AnsiChar" for #$80 ... #$FF, then why is AllHighByteValues a unicode String and not an ANSIString?
That's because string constants are PChar and so made up of UTF-16 elements.
From the documentation:
String constants are assignment-compatible with the PChar and PWideChar types, which represent pointers to null-terminated arrays of Char and WideChar values.
You are not taking into account that String and Character literals are context-sensitive in D2009+. If a literal is used in an Ansi context, it will be stored as Ansi. If a literal is used in a Unicode context, it will be stored as Unicode. HIGHCHARUNICODE only applies to 3-digit numeric Character literals between #128-#255 and 2-digit hex Character literals between #$80-#$FF. Those particular values are ambiguous between Ansi and Unicode, so HIGHCHARUNICODE is used to address the ambiguity. HIGHCHARUNICODE does not apply to other types of literals, including String literals. If you pass a String or Character literal to SizeOf(), there is no Ansi/Unicode context in the source code for the compiler to use, so it is going to use a Unicode context except in the specific case where HIGHCHARUNICODE applies, in which case an Ansi context is used if HIGHCHARUNICODE is OFF. That is what you are seeing happen.
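One way to see the effect of context is to give the same literal an explicit Ansi type and compare element sizes. This is a small sketch for Delphi 2009 or later; the program and constant names are made up for illustration.

```pascal
program LiteralContext;
{$APPTYPE CONSOLE}
const
  // Typed constant: the literal is compiled in an Ansi context.
  AnsiValues: AnsiString = #$D0#$D1#$D2;
  // Untyped constant: no context, as in the question.
  UntypedValues = #$D0#$D1#$D2;
begin
  // Elements of an AnsiString are AnsiChar, so this prints 1.
  Writeln(SizeOf(AnsiValues[1]));
  // The question reports 2 for this form on Delphi XE2.
  Writeln(SizeOf(UntypedValues[1]));
end.
```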

Problem with ord () and string

I have this problem. If I have:
mychr = ' ';
where the 'space' in mychr is #255 (typed manually with ALT+255), and I write:
myord = ord (mychr)
then myord is 160 and not 255. Of course, the same problem occurs with the character ALT+254, etc.
How can I solve this problem? I have tested this on Delphi XE in console mode.
Note: if I use:
mychar = #255;
then Ord() returns the correct value.
I think the problem is that the Windows Alt+Num shortcuts insert characters according to the local codepage, whereas modern Delphi uses Unicode characters, and these differ (unless the value is less than or equal to 127, I think). The solution is to enter the value #255 explicitly in code. In addition, it is a very bad habit to include 'invisible' special characters in code, because you cannot tell what character it is without copying it into an external tool! You would also have to trust the text encoding of the .pas file. It is much better to use constants like #255. Even better, do
const
MY_PRECIOUS_VALUE = #255;
and use this constant every time you need it.
Update
According to the English Wikipedia article on Alt code:
If the number typed has a leading 0 (zero), the character set used is the Windows code page that matches the current input locale. For most systems using the Latin alphabet, this is Windows-1252. For a complete list, see code page. If the number does not have a leading 0 (zero), DOS compatibility is invoked. The character set used is the DOS code page for the current input locale. For systems using English, this is code page 437. For most other systems using the Latin alphabet, this is code page 850. For a complete list, see code page.
So, if you really, really want to continue entering Alt keycodes, you'd better type Alt and 0255 with the leading zero.
If you type ALT+255, DOS codepage is used; for 437 and 850 DOS codepages (one of which you probably use) #255 is NBSP (non-breaking space). In Unicode, NBSP is $A0 (160). That explains why you obtain Ord 160.
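The mapping from the DOS code page to Unicode can be checked with the Win32 MultiByteToWideChar function. This is a minimal sketch that assumes code page 437 is installed on the machine; the function name Cp437ToUnicode is hypothetical.

```pascal
// Needs Windows in the uses clause (for MultiByteToWideChar).
// Hypothetical helper: converts one code page 437 byte to its
// Unicode code point, or returns -1 if the conversion fails.
function Cp437ToUnicode(Ch: AnsiChar): Integer;
var
  W: WideChar;
begin
  Result := -1;
  // 437 is the numeric code page identifier for the US DOS code page.
  if MultiByteToWideChar(437, 0, @Ch, 1, @W, 1) = 1 then
    Result := Ord(W);
end;
```

Calling Cp437ToUnicode(#255) should give 160 ($A0), the non-breaking space, which matches the Ord result described in the question.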
AFAIK console mode uses the OEM Ansi char set. And under Delphi XE, you're not in the Ansi world, but in the UTF-16 / Unicode world.
var
  MyChar: Char;
  MyWideChar: WideChar;
  MyAnsiChar: AnsiChar;
begin
  MyChar := #255;
  MyWideChar := #255;
  MyAnsiChar := #255;
The first two variables are the same, i.e. a character with Unicode code 255 = $00FF, since in Delphi XE, char = WideChar. For the first Unicode Page, see this article.
But MyAnsiChar is what will be displayed on the console, after conversion from the current code page into the OEM console code page.
In the Unicode chart, this $00FF is a minuscule y with trema:
U+00FF ÿ Latin Small Letter Y with diaeresis
Under the console, you'll use the OEM char set, i.e. Code Page 437. So in your case $FF is NOT a character, but a special code
FF NBSP Non Breaking SPace
which is converted into U+00A0 when converted back to Unicode:
U+00A0 NBSP Non Breaking SPace
It is very likely that you are in a Windows-1252 code page, so normally the Delphi XE AnsiString will map #255 into a minuscule y with trema:
FF ÿ Latin Small Letter Y with diaeresis
You can use low-level Windows functions such as CharToOemBuff to perform the conversion to or from OEM, or use an OEM AnsiString type:
type
  TOemString = type AnsiString(437);
In all cases, the console is not the best way of entering accented text under modern Windows and Unicode Delphi XE.
Using the InputQuery function, for example, should be safer, since it will return a Unicode string variable. ;)

Replace string that contain #0?

I use this function to read file to string
function LoadFile(const FileName: TFileName): string;
begin
  with TFileStream.Create(FileName,
      fmOpenRead or fmShareDenyWrite) do begin
    try
      SetLength(Result, Size);
      Read(Pointer(Result)^, Size);
    except
      Result := '';
      Free;
      raise;
    end;
    Free;
  end;
end;
Here's the text of file :
version
Here's the return value of LoadFile :
'ÿþv'#0'e'#0'r'#0's'#0'i'#0'o'#0'n'#0
I want to make a new file containing "verabc". The problem is that I can't replace "sion" with "abc". I am using D2007. If I remove all the #0 characters, the result becomes Chinese characters.
What you think is the text of the file isn't really the text of the file. What you've read into your string variable is accurate. You have a Unicode text file encoded as little-endian UTF-16. The first two bytes represent the byte-order mark, and each pair of bytes after that are another character of the string.
If you're reading a Unicode file, you should use a Unicode data type, such as WideString. You'll want to divide the file size by two when setting the length of the string, and you'll want to discard the first two bytes.
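The two adjustments just described (half the byte count, skip the first two bytes) could look roughly like this. It is a sketch that assumes the file is little-endian UTF-16, as in the question; LoadUtf16LeFile is a hypothetical name.

```pascal
// Needs Classes and SysUtils in the uses clause.
// Hypothetical helper: reads a little-endian UTF-16 file into a WideString.
function LoadUtf16LeFile(const FileName: string): WideString;
var
  Stream: TFileStream;
  Bom: Word;
  CharCount: Integer;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    if Stream.Size < SizeOf(Bom) then
      Exit;
    Stream.ReadBuffer(Bom, SizeOf(Bom));
    // The bytes FF FE read as a little-endian Word give $FEFF.
    if Bom <> $FEFF then
      Stream.Position := 0; // no BOM found: treat every byte as text
    // Two bytes per WideChar, so the length is half the remaining size.
    CharCount := Integer((Stream.Size - Stream.Position) div SizeOf(WideChar));
    SetLength(Result, CharCount);
    if CharCount > 0 then
      Stream.ReadBuffer(Result[1], CharCount * SizeOf(WideChar));
  finally
    Stream.Free;
  end;
end;
```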
If you don't know what kind of file you're reading, then you need to read the first two or three bytes first. If the first two bytes are $ff $fe, as above, then you might have a little-endian UTF-16 file; read the rest of the file into a WideString, or UnicodeString if you have that type. If they're $fe $ff, then it might be big-endian; read the remainder of the file into a WideString and then swap the order of each pair of bytes. If the first two bytes are $ef $bb, then check the third byte. If it's $bf, then they are probably the UTF-8 byte-order mark. Discard all three and read the rest of the file into an AnsiString or an array of bytes, and then use a function like UTF8Decode to convert it into a WideString.
Once you have your data in a WideString, the debugger will show that it contains version, and you should have no trouble using a Unicode-enabled version of StringReplace to do your replacement.
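The byte-order-mark checks described in this answer can be collected into one small function. This is a sketch rather than a complete encoding detector: the type and function names are hypothetical, and a file without a BOM simply yields bkNone.

```pascal
type
  // Hypothetical enumeration of the byte-order marks discussed above.
  TBomKind = (bkNone, bkUtf8, bkUtf16LE, bkUtf16BE);

// Hypothetical helper: inspects the first bytes of the raw file data.
function DetectBom(const Data: AnsiString): TBomKind;
begin
  Result := bkNone;
  if (Length(Data) >= 3) and (Data[1] = #$EF) and (Data[2] = #$BB) and
     (Data[3] = #$BF) then
    Result := bkUtf8       // EF BB BF: UTF-8
  else if (Length(Data) >= 2) and (Data[1] = #$FF) and (Data[2] = #$FE) then
    Result := bkUtf16LE    // FF FE: little-endian UTF-16
  else if (Length(Data) >= 2) and (Data[1] = #$FE) and (Data[2] = #$FF) then
    Result := bkUtf16BE;   // FE FF: big-endian UTF-16
end;
```

For the string shown in the question, which starts with 'ÿþ' (FF FE), this would return bkUtf16LE.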
It seems that you loaded a Unicode-encoded text file. The #0 bytes are the high bytes of the 16-bit code units, which are zero for Latin characters.
If you don't want to deal with Unicode text, choose ANSI encoding in your editor when you save the file.
If you need Unicode encoding, use WideCharToString to convert it to an ANSI string, or just remove the 0s yourself, though the latter isn't the best solution. Also remove the 2 leading characters, ÿþ.
The editor put those bytes there to mark the file as Unicode.
