What is the CCHAR type equivalent in Delphi? - delphi

The ShortNameLength member of FILE_BOTH_DIR_INFORMATION structure is declared as follows:
typedef struct FILE_BOTH_DIR_INFORMATION {
...
CCHAR ShortNameLength;
...
};
From the explanation of CCHAR type, CCHAR is a 8-bit Windows (ANSI) character. So, it is equivalent to AnsiChar in Delphi, right? However, the description of ShortNameLength member of FILE_BOTH_DIR_INFORMATION structure says,
“ShortNameLength specifies the length, in bytes, of the short file name string.”
The statement makes me think that the CCHAR equivalent is Byte in Delphi. Another example is the NumberOfProcessors member of SYSTEM_BASIC_INFORMATION which is declared in winternl.h as follows:
typedef struct _SYSTEM_BASIC_INFORMATION {
BYTE Reserved1[24];
PVOID Reserved2[4];
CCHAR NumberOfProcessors;
}
Once again, the CCHAR type seems to be used in a Byte context, rather than AnsiChar context.
Now, I confuse, whether to use AnsiChar or Byte as a CCHAR equivalent in Delphi.
Note
JwaWinType.pas of JEDI Windows API declares CCHAR as AnsiChar.

It's a byte, or at least, it is used as a 1 byte integer. In C, chars can be used for this purpose. In Delphi, you couldn't do that without typecasting. So you could use Char, but then you would need to give it the value 'A' or Chr(65) to indicate a string of 65 characters. Now, that would be silly. :-)
To be able to pass it to the API it must have the same size. Apart from that, the callee will not even know how it is declared, so declaring it as a Delphi byte is the most logical solution. A choice backed up by the other declaration you found.

I believe the explanation of CCHAR is wrong. The C prefix indicates that this is a count of characters so this is probably a simple copy-paste error done by Microsoft when writing the explanation.
It is stored in a byte and it is used to count the number of bytes of a string of characters. These characters may be wide characters but the CCHAR value still counts the number of bytes used to store the characters.
The natural translation for this type is Byte. If you marshal it to a character type like AnsiChar you will have to convert the character to an integer value (e.g. a byte) before using it.

Related

Getting a unicode, hidden symbol, as data in Delphi

I'm writing a delimiter for some Excel spreadsheet data and I need to read the rightward arrow symbol and pilcrow symbol in a large string.
The pilcrow symbol, for row ends, was fairly simply, using the Chr function and the AnsiChar code 182.
The rightward arrow has been more tricky to figure out. There isn't an AnsiChar code for it. The Unicode value for it is '2192'. I can't, however, figure out how to make this into a string or char type for me to use in my function.
Any easy ways to do this?
You can't use the 2192 character directly. But since a STRING variable can't contain this value either (as thus your TStringList can't either), that doesn't matter.
What character(s) are the 2192 character represented as in your StringList AFTER you have read it in? Probably by these three separate characters: 0xE2 0x86 0x92 (in UTF-8 format). The simple solution, therefore, is to start by replacing these three characters with a single, unique character that you can then assign to the Delimiter field of the TStringList.
Like this:
.
.
.
<Read file into a STRING variable, say S>
S := ReplaceStr(S,#$E2#$86#$92,'|');
SL := TStringList.Create;
SL.Text := S;
SL.Delimiter := '|';
.
.
.
You'll have to select a single-character representation of your 3-byte UTF-8 Unicode character that doesn't occur in your data elsewhere.
You need to represent that character as a UTF-16 character. In Unicode Delphi you would do it like this:
Chr(2192)
which is of type WideChar.
However, you are using Delphi 7 which is a pre-Unicode Delphi. So you have to do it like this:
var
wc: WideChar;
....
wc := WideChar(2192);
Now, this might all be to no avail for you since it sounds a little like your code is working with 8 bit ANSI text. In which case that character cannot be encoded in any 8 bit ANSI character set. If you really must use that character, you'll need to use Unicode text.

Getting char value in Delphi 7

I am making a program in Delphi 7, that is supposed to encode a unicode string into html entity string.
For example, "ABCģķī" would result in "ABCģķī"
Now 2 basic things:
Delphi 7 is non-Unicode, so I can't just write unicode chars directly in code to encode them.
Codepages consist of 255 entries, each holding a character, specific to that codepage, except first 127, that are same for all the codepages.
So - How do I get a value of a char, that is in 1-255 range?
I tried Ord(Integer), but it also returns values way past 255. Basically, everything is fine (A returns 65 an so on) until my string reaches non-Latin unicode.
Is there any other method for returning char value? Any help appreciated
I suggest you avoid codepages like the plague.
There are two approaches for Unicode that I'd consider: WideString, and UTF-8.
Widestrings have the advantage that it's 'native' to Windows, which helps if you need to use Windows API calls. Disadvantages are storage space, and that they (like UTF-8) can require multiple WideChars to encode the full Unicode space.
UTF-8 is generally preferable. Like WideStrings, this is a multi-byte encoding, so a particular unicode 'code point' may need several bytes in the string to encode it. This is only an issue if you're doing lots of character-by-character processing on your strings.
#DavidHeffernan comments (correctly) that WideStrings may be more compact in certain cases. However, I'd only recommend UTF-16 only if you are absolutely sure that your encoded text will really be more compact (don't forget markup!), and this compactness is highly important to you.
In HTML 4, numeric character references are relative to the charset used by the HTML. Whether that charset is specified in the HTML itself via a <meta> tag, or out-of-band via an HTTP/MIME Content-Type header or other means, it does not matter. As such, "ABCģķī" would be an accurate representation of "ABCģķī" only if the HTML were using UTF-16. If the HTML were using UTF-8, the correct representation would be either "ABCģķī" or "ABCģķī" instead. Most other charsets do no support those particular Unicode characters.
In HTML 5, numeric character references contain original Unicode codepoint values regardless of the charset used by the HTML. As such, "ABCģķī" would be represented as either "ABC#291;ķī" or "ABCģķī".
So, to answer your question, the first thing you have to do is decide whether you need to use HTML 4 or HTML 5 semantics for numeric character references. Then, you need to assign your Unicode data to a WideString (which is the only Unicode string type that Delphi 7 natively supports), which uses UTF-16, then:
if you need HTML 4:
A. if the HTML charset is not UTF-16, then use WideCharToMultiByte() (or equivalent) to convert the WideString to that charset, then loop through the resulting values outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. if the HTML charset is UTF-16, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
If you need HTML 5:
A. if the WideString does not contain any surrogate pairs, then simply loop through each WideChar in the WideString, outputting unreserved characters as-is and character references for reserved values, using IntToStr() for decimal notation or IntToHex() for hex notation.
B. otherwise, convert the WideString to UTF-32 using WideStringToUCS4String(), then loop through the resulting values outputting unreserved codepoints as-is and character references for reserved codepoints, using IntToStr() for decimal notation or IntToHex() for hex notation.
In case I understood the OP correctly, I'll just leave this here.
function Entitties(const S: WideString): string;
var
I: Integer;
begin
Result := '';
for I := 1 to Length(S) do
begin
if Word(S[I]) > Word(High(AnsiChar)) then
Result := Result + '#' + IntToStr(Word(S[I])) + ';'
else
Result := Result + S[I];
end;
end;

Why do string constants use wide characters even when formed entirely from 8 bit characters?

I just posted a question about Unicode character constants, where $HIGHCHARUNICODE appeared to be the reason.
Now with the default $HIGHCHARUNICODE OFF (Delphi XE2), why is this:
const
AllLowByteValues =#$00#$01#$02#$03#$04#$05#$06#$07#$08#$09#$0a#$0b#$0c#$0d#$0e#$0f;
AllHighByteValues=#$D0#$D1#$D2#$D3#$D4#$D5#$D6#$D7#$D8#$D9#$Da#$Db#$Dc#$Dd#$De#$Df;
==> Sizeof(AllLowByteValues[1]) = 2
==> Sizeof(AllHighByteValues[1]) = 2
If "All hexadecimal #$xx 2-digit literals are parsed as AnsiChar" for #$80 ... #$FF, then why is AllHighByteValues a unicode String and not an ANSIString?
That's because string constants are PChar and so made up of UTF-16 elements.
From the documentation:
String constants are assignment-compatible with the PChar and PWideChar types, which represent pointers to null-terminated arrays of Char and WideChar values.
You are not taking that into account that String and Character literals are context-sensitive in D2009+. If a literal is used in an Ansi context, it will be stored as Ansi. If a literal is used in a Unicode context, it will be stored as Unicode. HIGHCHARUNICODE only applies to 3-digit numeric Character literals between #128-#255 and 2-digit hex Character literals between #$80-#$FF. Those particular values are ambiquious between Ansi and Unicode, so HIGHCHARUNICODE is used to address the ambiquity. HIGHCHARUNICODE does not apply to other types of literals, including String literals. If you pass a String or Character literal to SizeOf(), there is no Ansi/Unicode context in the source code for the compiler to use, so it is going to use a Unicode context except in the specific case where HIGHCHARUNICODE applies, in which case an Ansi context is used if HICHCHARUNICODE is OFF. That is what you are seeing happen.

How can I make fixed-length Delphi strings use wide characters?

Under Delphi 2010 (and probably under D2009 also) the default string type is UnicodeString.
However if we declare...
const
s :string = 'Test';
ss :string[4] = 'Test';
... then the first string s if declared as UnicodeString, but the second one ss is declared as AnsiString!
We can check this: SizeOf(s[1]); will return size 2 and SizeOf(ss[1]); will return size 1.
If I declare...
var
s :string;
ss :string[4];
... than I want that ss is also UnicodeString type.
How can I tell to Delphi 2010 that both strings should be UnicodeString type?
How else can I declare that ss holds four WideChars? The compiler will not accept the type declarations WideString[4] or UnicodeString[4].
What is the purpose of two different compiler declarations for the same type name: string?
The answer to this lies in the fact that string[n], which is a ShortString, is now considered a legacy type. Embarcadero took the decision not to convert ShortString to have support for Unicode. Since the long string was introduced, if my memory serves correctly, in Delphi 2, that seems a reasonable decision to me.
If you really want fixed length arrays of WideChar then you can simply declare array [1..n] of char.
You can't, using string[4] as the type. Declaring it that way automatically makes it a ShortString.
Declare it as an array of Char instead, which will make it an array of 4 WideChars.
Because a string[4] makes it a string containing 4 characters. However, since WideChars can be more than one byte in size, this would be a) wrong, and b) confusing. ShortStrings are still around for backward compatibility, and are automatically AnsiStrings because they consist of [x] one byte chars.

Casting Delphi 2009/2010 string literals to PAnsiChar

So the question is whether or not string literals (or const strings) in Delphi 2009/2010 can be directly cast as PAnsiChar's or do they need an additional cast to AnsiString first for this to work?
The background is that I am calling functions in a legacy DLL with a C interface that has some functions that require C-style char pointers. In the past (before Delphi 2009) code like the following worked like a charm (where the param to the C DLL function is a LPCSTR):
either:
LegacyFunction(PChar('Fred'));
or
const
FRED = 'Fred';
...
LegacyFunction(PChar(FRED));
So in changing to Delphi 2009 (and now in 2010), I changed the call to this:
LegacyFunction(PAnsiChar('Fred'));
or
const
FRED = 'Fred';
...
LegacyFunction(PAnsiChar(FRED));
This seems to work and I get the correct results from the function call. However there is some definite instability in the app that seems to be occurring mostly the second or third time through the code that calls the legacy functions (that was not present before the move to the 2009 version of the IDE). In investigating this, I realized that the native string literal (and const string) in Delphi 2009/2010 is a Unicode string so my cast was possibly in error. Examples here and elsewhere seem to indicate this call should look more like this:
LegacyFunction(PAnsiChar(AnsiString('Fred')))
What confuses me is that with the code above in the second examples, casting the string literal directly to a PAnsiChar does not generate any compiler warnings. If instead of a string literal, I was casting a string var, I would get a suspicious cast warning (and the string would be mangled). This (and the fact that the string is usable in the DLL) leads me to believe the compiler is doing some magic to correctly interpret the string literal as the intended string type. Is this what is happening or is the double cast (first to AnsiString, then to PAnsiChar) really necessary and the lack of it in my code the reason for the hard to track down instability? And does the same answer hold true for const strings as well?
For type-inferred constants (only initializable from literals) the compiler changes the actual text at compile-time, rather than at runtime. That means it knows whether or not the conversion loses data, so it doesn't need to warn you if it doesn't.
To 'visualize' Barry Kelly and Mason Wheeler words:
const
FRED = 'Fred';
var
p: PAnsiChar;
w: PWideChar;
begin
w := PWideChar(Fred);
p := PAnsiChar(Fred);
In ASM:
Unit7.pas.32: w := PWideChar(Fred);
00462146 BFA4214600 mov edi,$004621a4
// no conversion, just a pointer to constant/"-1 RefCounted" UnicodeString
Unit7.pas.33: p := PAnsiChar(Fred);
0046214B BEB0214600 mov esi,$004621b0
// no conversion, just a pointer to constant/"-1 RefCounted" AnsiString
As you can see in both cases PWideChar/PChar(FRED) and PAnsiChar(FRED), there is no conversion and Delphi compiler make 2 constant strings, one AnsiString and one UnicodeString.
Constants, including string literals, are untyped by default, and the compiler will fit them into whatever format works in the context you're using them in. As long as there are no non-ANSI characters in your string literal, the compiler won't have any trouble generating the string as ANSI instead of Unicode in this situation.
As Mason Wheeler points out all is fine as long as you don't have non-ANSI characters in your string const. If you have things like:
const FRED = 'Frédérick';
I'm pretty sure Delphi 2009/2010 will either issue charset hints (and apply a string conversion automatically - thus the hint) or fail at comparing ('Frédérick' is different in ISO-8859-1 than UTF-16).
If you can have "special" characters in your consts you will need to call string conversion.
Here are some basic examples with TStringList:
TStringList.SaveToFile(DestFilename, TEncoding.GetEncoding(28591)); //ISO-8859-1 (Latin1)
TStringList.SaveToFile(DestFilename, TEncoding.UTF8);

Resources