Problem with ord () and string - delphi

i having this problem, if i have:
mychr = ' ';
where the 'space' in mychr equival to #255 (typed manually ALT+255), and i write:
myord = ord (mychr)
to myord return value 160 and not 255. Of course, same problem is too with charater ALT+254 etc.
As i can solve this problem? I have tested on delphi xe in console mode.
Note: if i use:
mychar = #255;
then function ord() return value correctly.

I think the problem is that the Windows Alt+Num shortcuts insert characters according to the local codepage, whereas a modern Delphi use Unicode characters, and these differ (unless the value is less than or equal to 127, I think). The solution is to enter the values #255 explicitly in code. In addition, it is a very bad habit to include 'invisible' special characters in code, because you cannot tell what character it is without copying in to an external tool! In addition, you will have to trust the text encoding of the .pas file. It is much better to use constants like #255. Even better, do
const
MY_PRECIOUS_VALUE = #255;
and use this constant every time you need it.
Update
According to the English Wikipedia article on Alt code:
If the number typed has a leading 0
(zero), the character set used is the
Windows code page that matches the
current input locale. For most systems
using the Latin alphabet, this is
Windows-1252. For a complete list, see
code page. If the number does not have
a leading 0 (zero), DOS compatibility
is invoked. The character set used is
the DOS code page for the current
input locale. For systems using
English, this is code page 437. For
most other systems using the Latin
alphabet, this is code page 850. For a
complete list, see code page.
So, if you really, really want to continue entering Alt keycodes, you'd better type Alt and 0255 with the leading zero.

If you type ALT+255, DOS codepage is used; for 437 and 850 DOS codepages (one of which you probably use) #255 is NBSP (non-breaking space). In Unicode, NBSP is $A0 (160). That explains why you obtain Ord 160.

AFAIK console mode use the OEM Ansi char set. And under Delphi XE, you're not in the Ansi world, but in the UCS-2 / Unicode world.
var MyChar: char;
MyWideChar: WideChar;
MyAnsiChar: AnsiChar;
begin
MyChar := #255;
MyWideChar := #255;
MyAnsiChar := #255;
The first two variables are the same, i.e. a character with Unicode code 255 = $00FF, since in Delphi XE, char = WideChar. For the first Unicode Page, see this article.
But MyAnsiChar is what will be displayed on the console, after conversion from the current code page into the OEM console code page.
In the Unicode chart, this $00FF is a minuscule y with trema:
U+00FF ÿ Latin Small Letter Y with diaeresis
Under the console, you'll use the OEM char set, i.e. Code Page 347. So in your case $FF is NOT a character, but a special code
FF NBSP Non Breaking SPace
which is converted into U+00A0 when converted back to Unicode:
U+00A0 NBSP Non Breaking SPace
It is very likely that you are in a Windows-1252 code page, so normally the Delphi XE AnsiString will map #255 into a minuscule y with trema:
FF ÿ Latin Small Letter Y with diaeresis
You can use low-level e.g. CharToOemBuff windows functions to perform the conversion to or from OEM, or use an OEM AnsiString type:
type
TOemString = AnsiString(437);
In all cases, the console is not the best way of entering accentuated text under modern Windows, and Unicode Delphi XE.
Using InputQuery function e.g. should be safer, since it will return an Unicode string variable. ;)

Related

Delphi - check if a Unicode character occurs in a set of characters?

This code works good with Delphi-7 (until Delphi had Unicode support):
Value := edit1.Text[1];
if Value in ['м', 'ж'] then ...
'м', 'ж' - cyrillic symbols
But this construction doesn't work with Unicode charachter.
I try a lot of things, but they are doesn't work.
I also tried changing the value types to "Char" and "AnsiChar".
Doesn't work:
const
MySet : set of WideChar = [WideChar('м'), WideChar('ж')];
begin
Value := edit1.Text[1];
if Value in MySet then ...
Doesn't work:
if AnsiChar(Value) in ['м', 'ж'] then ...
Doesn't work:
if CharInSet(Value, ['м', 'ж']) then ...
But this works good:
if (Value = 'м') or (Value = 'ж') then ...
Whether there is an opportunity to check up UNICODE character by use of a SET in the modern versions of Delphi?
Or should we check each character individually?
My Delphi version is 10.4 update 2 Community Edition
A Delphi set type can only handle a maximum of 256 values, so it cannot be used for handling Unicode characters. For handling Unicode, the System.Character unit provides various methods and helpers.
For this particular case, there is an IsInArray() character helper you can use. Instead of declaring a set of characters, you will need to declare an array of characters:
var
ch: Char;
a: array of Char;
s: string;
begin
a := ['м', 'ж'];
s := 'abcж';
for ch in s do
if ch.IsInArray(a) then ...
end;
Note: Delphi XE7 introduced additional language support for initializing and working with dynamic arrays, and square brackets can also be used for simpler array initialization. In the context of above example, ['м', 'ж'] is not a set, but an array of wide characters.
check if a Unicode character occurs in a set of characters?
Do you mean a Delphi set?
In general, it is impossible to have a set of X where the base type X has more than 256 possible distinct values. So set of Byte is fine, but set of Word isn't possible. Since there are 256 * 256 distinct wide character values, it is therefore impossible to have a set of wide characters. (If this were indeed possible, a variable of such a set type would be 8 kB in size. That would be an unusually large variable.)
Since there is no such thing as "Delphi set of Unicode characters", the question "How to see if a character belongs to a Delphi set of Unicode characters" doesn't make sense.
Or do you simply mean a mathematical set?
If so, of course this is possible, but you cannot use a Delphi set to represent the mathematical set of characters. Instead, you need to use some other data type. One possibility is a simple array, if you don't mind its O(n) characteristics.

Using the # symbol in Pascal to type characters past 127 range

I know that in pascal you can use the # character to display certain characters like tab(#9), carriage return line feed(#13#10), ect. When I try to do other characters out of the 127 range like #169 it replaces it with a question mark. How do I change the character encoding in lazarus so I can use this character?
According to Free Pascal:
The # indicates the control string
The control string can be used to specify characters which cannot be typed on a keyboard, such as #27 for the escape character.
See free pascal character strings.
The above page also mentions:
It is possible to use other character sets in strings: in that case the codepage of the source file must be specified with the {$CODEPAGE XXX} directive or with the -Fc command line option for the compiler. In that case the characters in a string will be interpreted as characters from the specified codepage.
For example, If you wanted to encode with UTF-8 you could use (see this page):
{$CODEPAGE UTF8}
{$Mode ObjFPC}{$H+}
(If you are on windows you can choose from these code pages.)
I tested the following example on my mac with the fpc (free pascal compiler):
program Example;
{$CODEPAGE utf8}
{$Mode ObjFPC}{$H+}
Begin
writeln('Hello World'#13#10);
writeln('carriage return line');
writeln('Example: '#$C3#$A4);
End.
This example prints out the ä character and is a modification of the examples on the dealing with UTF8 in free pascal page, which will help you a lot with this issue. You may have to adjust the solutions on that page to fit your problem.
Note, that I was able to access the ä character because I used it's hex value C3A4.

Getting a unicode, hidden symbol, as data in Delphi

I'm writing a delimiter for some Excel spreadsheet data and I need to read the rightward arrow symbol and pilcrow symbol in a large string.
The pilcrow symbol, for row ends, was fairly simply, using the Chr function and the AnsiChar code 182.
The rightward arrow has been more tricky to figure out. There isn't an AnsiChar code for it. The Unicode value for it is '2192'. I can't, however, figure out how to make this into a string or char type for me to use in my function.
Any easy ways to do this?
You can't use the 2192 character directly. But since a STRING variable can't contain this value either (as thus your TStringList can't either), that doesn't matter.
What character(s) are the 2192 character represented as in your StringList AFTER you have read it in? Probably by these three separate characters: 0xE2 0x86 0x92 (in UTF-8 format). The simple solution, therefore, is to start by replacing these three characters with a single, unique character that you can then assign to the Delimiter field of the TStringList.
Like this:
.
.
.
<Read file into a STRING variable, say S>
S := ReplaceStr(S,#$E2#$86#$92,'|');
SL := TStringList.Create;
SL.Text := S;
SL.Delimiter := '|';
.
.
.
You'll have to select a single-character representation of your 3-byte UTF-8 Unicode character that doesn't occur in your data elsewhere.
You need to represent that character as a UTF-16 character. In Unicode Delphi you would do it like this:
Chr(2192)
which is of type WideChar.
However, you are using Delphi 7 which is a pre-Unicode Delphi. So you have to do it like this:
var
wc: WideChar;
....
wc := WideChar(2192);
Now, this might all be to no avail for you since it sounds a little like your code is working with 8 bit ANSI text. In which case that character cannot be encoded in any 8 bit ANSI character set. If you really must use that character, you'll need to use Unicode text.

Getting char value in Delphi 7

I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABCģķī"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 255 entries, each holding a character, specific to that codepage, except first 127, that are same for all the codepages.
So - How do I get a value of a char, that is in 1-255 range?
I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 an so on) until my string reaches non-Latin unicode.
Is there any other method for returning char value? Any help appreciated
I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
Widestrings have the advantage that it's 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
#DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd only recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.
In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, it does not matter. As such, "ABCģķī" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABCģķī" or "ABCģķī" instead. Most other charsets do no support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC#291;ķī" or "ABCģķī".
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
if you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
In case I understood the OP correctly, I'll just leave this here.
function Entitties(const S: WideString): string;
var
I: Integer;
begin
Result := '';
for I := 1 to Length(S) do
begin
if Word(S[I]) > Word(High(AnsiChar)) then
Result := Result + '#' + IntToStr(Word(S[I])) + ';'
else
Result := Result + S[I];
end;
end;

Delphi 2010 Blockread seems to get different data than previous version

I had my old MP3 Id3 tag reader recompiled under D2010 and it seems it won't find the tags anymore.
code is farily simple, but it doesn't work.
The debugger shows a lots of zero and then chineese signs in the results!
var dat:file of char;
id3:array [0..TAGLEN] of Char; //is 0..127 for ID3 v1
begin
vValid:=True;
if FileExists(vFilename) then begin
assignfile(dat,vFilename);
If (FileGetAttr(vFilename)>32) or (FileGetAttr(vFilename)=1) then
Filemode:= 0
Else
Filemode:= 2;
reset(dat);
seek(dat,FileSize(dat)-128);
blockread(dat,id3,128);
closefile(dat);
vMP3tag:=copy(id3, 0, 3);
if vMP3Tag='TAG' then begin
vTitle:=strip(copy(id3, 4, 30),' ');
vArtist:=strip(copy(id3, 34, 30), ' ');
I heard something about Unicode, and PansiChar, but I still don't understand much what these do anyway :)
thanks for looking
Try this:
var dat:file of AnsiChar;
id3:array [0..TAGLEN] of AnsiChar; //is 0..127 for ID3 v1
That is of course if your file is ansi-based instead of unicode based. I have no idea what might be in an id3 tag of an mp3 file.
If you want to understand the difference, this white paper explained it all to me. Basically Unicode uses more memory space to store a single character (like 4 times the amount of an ansi character), but they allow characters like ie Chinese and Japanese, which ansi doesn't provide. Just read the white paper, then it'll all be clear.
In short, Ansichar and Ansistring is what used to be a string in Delphi before D2009. In those days your application wouldn't be unicode compatible (you couldn't type chinese characters by default).
As from D2009, the definition of a string changed from an ansistring to a widestring and ansichar to widechar. That means your application will be unicode by default. But old code, expecting strings to be ansicode, need to be adapted to reflect that change.
Your code said char, meaning ansichar to pre-D2009 compilers, but widechar to D2009+ compilers. In other words, the new compilers read your code differently.
I hope that explains it a bit.
Oh!
it seems like AnsiCHar instead of Char is the way to go in D2010.
Ansi-char-them-all!

Resources