How to convert a Unicode code point like U+1F4DB to a char? - delphi

I have a string containing a code point like U+1F4DB and I would like to convert it to a Unicode character (📛). How can I do this?

You can't convert that to a single char. That code point is outside the range that can be represented in UTF-16 as a single 16-bit element. Instead it is represented by a surrogate pair, two char elements. In Delphi it would be expressed as the string
#$D83D#$DCDB
as can be discerned from this page: http://www.fileformat.info/info/unicode/char/1f4db/index.htm
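For reference, the pair can be derived by hand: subtract $10000 from the code point, then the high 10 bits plus $D800 give the lead surrogate and the low 10 bits plus $DC00 give the trail surrogate. A quick sketch of the arithmetic (variable names are illustrative):
var
  CodePoint, Temp: Cardinal;
  Lead, Trail: Word;
begin
  CodePoint := $1F4DB;
  Temp := CodePoint - $10000;       // $0F4DB
  Lead := $D800 + (Temp shr 10);    // $D83D
  Trail := $DC00 + (Temp and $3FF); // $DCDB
end;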
In practice I think it would be simpler to paste the character into your source code, inside a string literal, and let the source code be stored encoded as UTF-8. The IDE will prompt you to do so automatically. That is, represent it as
'📛'
In comments you make it clear that you wish to parse arbitrary text of the form U+xxxx. Extract the numeric value and convert it to an integer. Then pass it through TCharHelper.ConvertFromUtf32.
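Here is a minimal sketch of that approach, assuming a modern Delphi (XE or later) where Char.ConvertFromUtf32 is available; the function name is illustrative:
uses
  System.SysUtils, System.Character;

function CodePointToString(const S: string): string;
var
  CodePoint: Integer;
begin
  // 'U+1F4DB' -> $1F4DB: drop the 'U+' prefix and parse the hex digits
  CodePoint := StrToInt('$' + Copy(S, 3, MaxInt));
  // Yields a single Char for BMP code points, a surrogate pair otherwise
  Result := Char.ConvertFromUtf32(CodePoint);
end;
Calling CodePointToString('U+1F4DB') returns the two-element string #$D83D#$DCDB, i.e. '📛'.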

Related

Delphi decoded base64 to something

I am stuck a bit in decoding. I got a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters it works.
StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversion function or something?
For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character (a sketch follows below).
But RTF is a very complex beast.
Check what the RTF 1.5 specification says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
Under Windows/VCL, perhaps the easiest approach is to use a hidden RichEdit control for decoding.
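Alternatively, here is a rough sketch of the manual \u scan described above, in Delphi (assuming a pre-Unicode Delphi where string is AnsiString; \'hh hex fallbacks and \ucN counts other than the default 1 are not interpreted, so treat it as a starting point rather than a full RTF parser):
function DecodeRtfUnicode(const Rtf: string): WideString;
var
  i, Value: Integer;
  Negative: Boolean;
begin
  Result := '';
  i := 1;
  while i <= Length(Rtf) do
  begin
    // Look for \u followed by a (possibly negative) decimal number
    if (Rtf[i] = '\') and (i + 2 <= Length(Rtf)) and (Rtf[i + 1] = 'u') and
       (Rtf[i + 2] in ['0'..'9', '-']) then
    begin
      Inc(i, 2);
      Negative := Rtf[i] = '-';
      if Negative then
        Inc(i);
      Value := 0;
      while (i <= Length(Rtf)) and (Rtf[i] in ['0'..'9']) do
      begin
        Value := Value * 10 + Ord(Rtf[i]) - Ord('0');
        Inc(i);
      end;
      // \uN is a signed 16-bit value: \u-N stands for 65536 - N
      if Negative then
        Value := 65536 - Value;
      Result := Result + WideChar(Value);
      Inc(i); // skip one ANSI fallback character (the default \uc1 count)
    end
    else
    begin
      Result := Result + WideChar(Rtf[i]); // naive copy; fine for ASCII text
      Inc(i);
    end;
  end;
end;
On the sample above, \u252 yields ü (U+00FC) and \u337 yields ő (U+0151), giving "Beküldő".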

Read special character bytes from PDF to unichar or NSString

First off this solution doesn't work for ligatures:
Convert or Print CGPDFStringRef string
I'm reading text from a PDF and trying to convert it to an NSString. I can get a byte array of text using Apple's CGPDFScanner in the form of a CGPDFString. The "fi" ligature character is giving me trouble. When I look at my byte array in the debugger, I see a '\f'.
So for simplicity's sake, let's say that I have this char:
unsigned char myLigatureFromPDF = '\f';
Ultimately I'd like to convert it to this (the Unicode value for the "fi" ligature):
unichar whatIWant = 0xFB01;
This is my failed attempt (I copied this from PDFKitten btw):
const char str[] = {myLigatureFromPDF, '\0'};
NSString* stringEncodedLigature = [NSString stringWithCString:str encoding:NSUTF8StringEncoding];
unichar encodedLigature = [stringEncodedLigature characterAtIndex:0];
If anyone can tell me how to do this, that would be great.
Also, as a side note: how does the debugger interpret the unencoded byte array? In other words, when I hover over the array, how does it know to show a '\f'?
Thanks!
Every PDF parser is limited in its capabilities by one single important point of the PDF specifications: characters in literal strings are encoded as bytes or words, but the encoding does not need to be included in the file.
For example, if a subset of a font is included where the code "1" corresponds to the image (character glyph) of an "h" and the code "2" maps to a glyph "a", the string (\1\2\1\2) will show "haha", as expected. But if the PDF contains no further information on how the glyphs in that font correspond to Unicode, there is no way for a string decoder to find out the correct character codes for "glyph #1" and "glyph #2".
It seems your test PDF does contain that information -- else, how could it infer the correct characters for "regular" characters? -- but in this case, the "regular" characters were simply not remapped to other binary codes, for convenience. Also, again for convenience, the glyph for the single character "fi" was remapped to "0x0C" in the original font (or in the subset that got included into your file). But, again, if the file does not contain a translation table between character codes and Unicode values, there is no way to retrieve the correct code.
The above is true for all PDFs and strings. If the font definition in the PDF contains an encoding, your string extraction method should use it; if the PDF contains a /ToUnicode table for the font, again, your method should use it. If it contains neither, you get the literal string contents (and, presumably, you are not informed which method was used and how reliable it is).
As a final footnote: in TeX and LaTeX fonts, ligatures are mapped to lower ASCII codes (as well as a smattering of other non-ASCII codes, such as the curly quotes). It seems you are reading a PDF that was created through TeX here -- but that can only be inferred from this particular encoding. Also, even if you know in advance that the PDF was generated through TeX, it's not guaranteed that it does use this particular encoding, as the decision to translate or not translate is at the discretion of the PDF generator, not TeX itself.
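If you know (or are willing to assume) that the PDF came through TeX with this legacy encoding, a small remapping table covers the handful of ligature slots. As for the side note: the '\f' the debugger shows is simply the C escape for byte $0C (form feed). A hedged sketch in Delphi (to match the rest of this page; the table itself is language-neutral, and the slot numbers are the classic Computer Modern text-font positions, which you should verify against the actual embedded font):
function RemapTexLigature(B: Byte): WideChar;
begin
  case B of
    $0B: Result := WideChar($FB00); // ff
    $0C: Result := WideChar($FB01); // fi  <- the '\f' from the question
    $0D: Result := WideChar($FB02); // fl
    $0E: Result := WideChar($FB03); // ffi
    $0F: Result := WideChar($FB04); // ffl
  else
    Result := WideChar(B); // pass every other byte through unchanged
  end;
end;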

Check range of Unicode Value of a character

In Objective-C...
If I have a character like "∆", how can I get the Unicode value and then determine whether it is in a certain range of values?
For example, if I want to know whether a certain character is in the Unicode range U+1F300 to U+1F6FF.
NSString uses UTF-16 to store codepoints internally, so those in the range you're looking for (U+1F300 to U+1F6FF) will be stored as a surrogate pair (four bytes). Despite its name, characterAtIndex: (and unichar) doesn't know about codepoints and will give you the two bytes that it sees at the index you give it (the 55357 you're seeing is the lead surrogate of the codepoint in UTF-16).
To examine the raw codepoints, you'll want to convert the string/characters into UTF-32 (which encodes them directly). To do this, you have a few options:
Get all UTF-16 code units that make up the codepoint, and use either this algorithm (sketched after this list) or CFStringGetLongCharacterForSurrogatePair to convert the surrogate pair to UTF-32.
Use either dataUsingEncoding: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to convert the NSString to UTF-32, and interpret the raw bytes as a uint32_t.
Use a library like ICU.
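The surrogate-pair arithmetic behind the first option is the same in any language; here is a minimal sketch in Delphi (used to match the rest of this page; the maths carries over to Objective-C directly):
function SurrogatePairToUtf32(Lead, Trail: Word): Cardinal;
begin
  // Assumes Lead is in $D800..$DBFF and Trail in $DC00..$DFFF
  Result := ((Lead - $D800) shl 10) + (Trail - $DC00) + $10000;
end;
For example, SurrogatePairToUtf32($D83D, $DCDB) yields $1F4DB, which falls inside the U+1F300..U+1F6FF range asked about.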

Translating memory contents into a string via ASCII encoding

I have to translate some memory contents into a string, using ASCII encoding. For example:
0x6A636162
But I am not sure how to break that up to translate it into ASCII. I think it has something to do with how many bits are in a char/digit, but I am not sure how to go about it (and of course, I would like to know the reasoning behind it, not just "how to do it").
ASCII uses 7 bits to encode a character (http://en.wikipedia.org/wiki/ASCII). However, it's common to encode characters using 8 bits instead (note that technically this isn't ASCII). Thus, you'd need to split your data into 8-bit chunks and match that to an ASCII table.
If you're using a specific programming language, it may have a way to translate an ASCII code to a character. For instance, Ruby has the .chr method, Python has the chr() built-in function, and in C you can printf("%c", number).
Note that each nibble (4 bits) can be represented as one hexadecimal digit, so for the sample string you show, each 8-bit "chunk" would be:
0x6A
0x63
0x61
0x62
the string reads "jcab" :)
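A minimal sketch of the same splitting in Delphi (to match the rest of this page), reading the chunks from the most significant byte down:
var
  Value: Cardinal;
  i: Integer;
  s: string;
begin
  Value := $6A636162;
  s := '';
  for i := 3 downto 0 do
    s := s + Chr((Value shr (i * 8)) and $FF); // take one 8-bit chunk
  // s now holds 'jcab'
end;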

How can I convert unicode characters to ascii codes in delphi 7?

Yes, we're talking about ASCII codes. My apologies, I'm not the Delphi dev here.
For Delphi 7, I'd get the free Unicode Library by Mike Lischke, who is the author of Virtual Treeview.
The library includes a lot of conversion functions to go to and from Unicode, so you can use the ones that make most sense in your application.
Or you can upgrade to Delphi 2009 which has built-in encoding routines, and its own library of conversion functions.
Let's get a few things straight. A character set (charset) and a character encoding are two related but different concepts. A character set is an abstract list of characters, each with an integer character code associated. A character encoding is an algorithm that describes how those characters are represented in bytes.
ASCII acts as both the character set and the encoding. It uses 7 bits to express 128 characters (95 of them printable, counting the space). Unicode, on the other hand, is a character set expressing 1,114,112 code points. There are several encodings to represent Unicode strings, the most notable being UTF-8, UTF-16 (little- or big-endian), and UTF-32. In other words, a single Unicode character can be represented in different ways depending on the encoding.
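To make this concrete: é (U+00E9) is the single byte $E9 in Latin-1, the two bytes $C3 $A9 in UTF-8, and the bytes $E9 $00 in UTF-16LE. A small sketch of the UTF-8 case in Delphi (UTF8Encode has been in the RTL since Delphi 6):
var
  ws: WideString;
  u8: AnsiString; // declared UTF8String in later Delphi versions
begin
  ws := WideChar($00E9); // é: one 16-bit code unit in a WideString
  u8 := UTF8Encode(ws);  // two bytes: $C3 $A9
end;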
How can I convert unicode characters to ascii codes in delphi 7?
I think the question could be interpreted in two ways.
I have a Unicode string in some encoding that only includes ASCII printable characters. How can I convert the string into a byte array of ASCII encoding?
I have a Unicode string in some encoding that also includes non-ASCII printable characters such as Chinese characters. How can I encode the string into an ASCII encoding without losing information, and later decode it back to the original Unicode string?
If you mean the first, you can load the Unicode string into WideString like Osman is saying and do
var
  original: WideString;
  s: AnsiString;
begin
  // The cast converts using the system default ANSI code page
  s := AnsiString(original);
end;
If you mean the second, you would need a generic encoding algorithm like Base64 encoding. You can use DCPBase64.pas included in David Barton's DCPcrypt v2 Beta 3.
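A sketch of that round trip, assuming the DCPBase64 unit exposes Base64EncodeStr and Base64DecodeStr (check the interface of your copy of DCPcrypt before use):
uses
  DCPBase64;

var
  original, restored: WideString;
  ascii: string;
begin
  // UTF8Encode preserves every code point; Base64 then makes the bytes 7-bit safe
  ascii := Base64EncodeStr(UTF8Encode(original));
  // ...store or transmit through an ASCII-only channel...
  restored := UTF8Decode(Base64DecodeStr(ascii));
end;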
It depends what your definition of conversion is. If you want to map the 128 lowest characters to their Unicode equivalents, you can use an explicit cast. But this creates garbage if the string contains higher characters.
If you want mappings like ë -> e and û -> u, you can write your own code. But be aware that there are always characters that can't be converted.
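A tiny sketch of such a hand-written mapping (only two sample ranges shown; extend the table to cover the characters your data actually contains):
function StripAccent(wc: WideChar): Char;
begin
  case Ord(wc) of
    $00E8..$00EB: Result := 'e'; // è é ê ë
    $00F9..$00FC: Result := 'u'; // ù ú û ü
  else
    if Ord(wc) < 128 then
      Result := Chr(Ord(wc)) // already plain ASCII
    else
      Result := '?'; // no sensible ASCII equivalent; information is lost
  end;
end;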
"ASCII" is the name of a specific mapping of characters to numbers, but some people say "ASCII code" when they don't really mean ASCII at all; they just want the numeric value of a character, whatever mapping is in effect at the time. Does that description apply to you?
If so, then you can use the Ord standard function to get the Unicode code-point value of whatever Unicode character you have.
var
  wc: WideChar;
  ws: WideString;
  x: Word;
begin
  x := Ord(wc);    // code-unit value of a single WideChar
  x := Ord(ws[1]); // code-unit value of the first character of a WideString
end;
If you really meant ASCII, though, then you'll have to be more specific about what sort of conversion you have in mind.
As an example, the letter A is represented in Unicode as U+0041 and in ANSI as just $41. So converting that would be pretty simple, but you must find out how the Unicode character is encoded. The most common encodings are UTF-16 and UTF-8. UTF-16 is basically two bytes per character, but even that is an oversimplification, as a character may take four bytes (a surrogate pair). UTF-8 sounds as if it means one byte per character, but a character can take two, three, or four bytes. To further complicate matters, UTF-16 can be little-endian or big-endian, so U+0041 is stored as the bytes 41 00 or 00 41.
Where your question makes no sense is if you wanted to, for example, convert the Arabic letter ain (U+0639) to ANSI on an English locale. You can't.
See related questions on converting from Unicode to ASCII:
How to convert UTF-8 to US-Ascii in Java
How to convert a Unicode character to its ASCII equivalent
How do I convert a file’s format from Unicode to ASCII using Python?
In general, a character set of hundreds of thousands of entries cannot be converted to a character set of 128 entries without some loss of information or an encoding scheme.
You can use the function at http://swissdelphicenter.ch/en/showcode.php?id=1692
It converts a Unicode string to an ANSI string using a specified code page. If you want to convert using the default system code page (defined in regional options as the non-Unicode code page), you can do it simply like the following:
var
  ws: WideString;
  s: string;
begin
  // In pre-2009 Delphi, string is AnsiString, so this uses the default code page
  s := string(ws);
end;
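For the code-page-specific case, the linked function boils down to a call to the Windows API WideCharToMultiByte. A minimal sketch, with error handling omitted:
uses
  Windows;

function WideStringToAnsi(const ws: WideString; CodePage: UINT): AnsiString;
var
  Len: Integer;
begin
  // First call computes the required buffer size, second performs the conversion
  Len := WideCharToMultiByte(CodePage, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  SetLength(Result, Len);
  if Len > 0 then
    WideCharToMultiByte(CodePage, 0, PWideChar(ws), Length(ws),
      PAnsiChar(Result), Len, nil, nil);
end;
For example, WideStringToAnsi(ws, 1252) converts to the Western European ANSI code page.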
