Delphi decoded base64 to something - delphi

I am stuck a bit in decoding. I got a base64-encoded .rtf file.
A little part of this looks like this: Bek\u252\''fcld\u337\''3f
Which represents: Beküldő
But my output data after decoding is: Bekuld?
If I manually replace the characters it works.
StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);
Does anyone know a general solution for this? Some conversation or something?

For instance, \u242 means Unicode character #242.
So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.
But RTF is a very complex beast.
Check what the RTF 1.5 specifications says about encoding:
\uN This keyword represents a single Unicode character which has no
equivalent ANSI representation based on the current ANSI code page. N
represents the Unicode character value expressed as a decimal number.
This keyword is followed immediately by equivalent character(s) in
ANSI representation. In this way, old readers will ignore the \uN
keyword and pick up the ANSI representation properly. When this
keyword is encountered, the reader should ignore the next N
characters, where N corresponds to the last \ucN value encountered.
Perhaps the easiest is to use a hidden RichEdit for decoding, under Windows/VCL.

Related

Japanese characters interpreted as control character

I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, this string contains some Japanese writings:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text-editors (Atom) default UTF-8 encoding this gives me garbage text which would not be an issue, however the 0x5B character is interpreted as [. Which causes the parser to fail because it assumes that this is signaling the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course I cannot encode every file to be Japanese because they may contain Chinese or other languages which will be written incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in shift-jis, then you should use a parser that can support it, or convert the file to UTF-8 before you parse it.
I believe that this character set also uses ASCII as it's base type but it uses 2 bytes to for certain characters, so if 0x5B it probably doesn't appear as the 'first byte' of a character. (note: this is conjecture based on how I think shift-jis works).
So yea, you need to modify your parser to understand shift-jis, or you need to convert the file to UTF-8 before parsing. I imagine that converting is the easiest.

Encode ASCII Files

this question will have a very simple answers which is yes or no I guess ?
If I encode from x64 bit unicode delphi app my stringlist like this
StringList.SaveToFile(FileName, TEncoding.ASCII);
is there any other limitation , difference in file layout while writing this file with the statement
StringList.SaveToFile(FileName);
or
StringList.SaveToFile(FileName, TEncoding.UTF8);
I'm afraid on line length and control char issues between both versions....Answer NO will make me happy.
UTF-8 and the Windows 'Ansi' codepages are all superset of ASCII. As such, if the string list only contains characters in the ASCII range, the three statements you listed will be equivalent if you prepend the last with this:
StringList.WriteBOM := False;
This is because by default, TStrings will write out a small marker (a BOM) to denote UTF-8 text.
The difference is simply in the encoding used. This in turn, of course, leads to differences in size. So ASCII files will be smaller than UTF-16 (what you get with TEncoding.Unicode. And UTF-8 files could be the same size as ASCII, or larger than UTF-16.
I guess you are asking if using ASCII or UTF-8 in any way damages the text that is written. Well, using ASCII will if the text contains non-ASCII characters. ASCII can only encode 127 characters.
On the other hand, UTF-8 is a full encoding of Unicode. Which means that
StringList.SaveToFile(FileName, TEncoding.UTF8);
StringList.LoadFromFile(FileName, TEncoding.UTF8);
results in the list having exactly the same content as it did before the save.
You ask if lines can be truncated by SaveToFile. They cannot.
Another point to make is that 32/64 bit is not relevant here. The code behaves in exactly the same way under 32 and 64 bit. The issues are always to do with encoding.
I would also note that the title of your question is somewhat mis-leading. When you encode with TEncoding.UTF8 you not do not have an ASCII file.

How to get a single Arabic letter in a string with its Unicode transformation value in DELPHI?

Considering this Arabic word(جبل) made of 3 letters .
-the first letter is جـ,
-name is (ǧīm),
-its Unicode value is FE9F when its in the beginning,
-its basic value is 062C and
-its isolated value is FE9D but the last two values return the same shape drawing ج .
Now, Whenever I try to get it as a single character -trying many different ways-, Delphi returns the basic Unicode value.
well,that makes sense,but what happens to the char with transformation? It is a single char too..Looks like it takes the transformed value only when it is within a string, but where? how to extract it?When and which process decides these values?
Again the MAIN QUESTION:
How can I get the Arabic letter or its Unicode value as it is within a string?
just for information: Unlike English which has tow cases for its letters(Capital and Small), Arabic has four cases(Isolated, Beginning,Middle And End) with different rules as well.
I'm not sure I understand the question. If you want to know how to write U+FE9F in Delphi source code, in a modern Unicode version of Delphi. Do that simply like so:
Char($FE9F)
If you want to read individual characters from جبل then do it like this:
const
MyWord = 'جبل';
var
c: Char;
....
c := MyWord[1];//this is U+062C
Note that the code above is fine for your particular word because each code point can be encoded with a single UTF-16 WideChar character element. If the code point required multiple elements, then it would be best to transform to UTF-32 for code point level processing.
Now, let's look at the string that you included in the question. I downloaded this question using wget and the file that came down the wires was UTF-8 encoded. I used Notepad++ to convert to UTF16-LE and then picked out the three UTF-16 characters of your string. They are:
U+062C
U+0628
U+0644
You stated:
The first letter is جـ, name is (ǧīm), its Unicode value is U+FE9F.
But that is simply incorrect. As can be seen from the above, the actual character you posted was U+062C. So the reason why your attempts to read the first character yield U+062C is that U+062C really is the first character of your string.
The bottom line is that nothing in your Delphi code is transforming your character. When you do:
S[1] := Char($FE9F);
the compiler performs a simple two byte copy. There is no context aware transformation that occurs. And likewise when reading S[1].
Let's look at how these characters are displayed, using this simple code on a VCL forms application that contains a memo control:
Memo1.Clear;
Memo1.Lines.Add(StringOfChar(Char($FE9F), 2));
Memo1.Lines.Add(StringOfChar(Char($062C), 2));
The output looks like this:
As you can see, the rendering layer knows what to do with a U+062C character that appears at the beginning of the string.
Shaping of Arabic characters for presentation in Windows is served by the Uniscribe services (USP10.dll).
UniScribe
You may find the following blog post useful:
Roozbeh's Programming Blog
I don't think you can do it using string/char related methods. But using pchar, maybe can you access the memory and read the Pword values directly
EDIT: After discussing with David, I think that you will always get the basic/isolated value of the letter. The fact that begin or end glyph is used, is probably just handled by the display framework of the OS

How to create UTF-16 animation in Twitter?

I use a UTF-16 character picker to create ASCII art in Texbox in HTML, and UTF-16 characters are supported and visible "as is". Now I need to process such ASCII art and save into an Array as UTF-16 characters, process with Javascript as Strings to build ASCII art animations for Twitter like this:
You don't have to be sorry.
Twitter accepts UTF-16 as ASCIIart
For UTF-16 definition go to Wikipedia
http://en.wikipedia.org/wiki/UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064[1] numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.
I already did 2-bytes Unicode picker (UTF-16) and can generate UTF-16 input into Twitter.
==
re:
Removed the link as it's pointing to a Twitter account which doesn't show the mentioned content anymore (w/o scrolling). May appear like spam then. – david Nov 20 at 4:09
That way it may take much longet time to get right answer.
UTF-16 is a character encoding. Twitter only accepts UTF-8 as input. You can convert UTF-16 to UTF-8 without any data loss, so just do that and then send it to Twitter.

How can I convert unicode characters to ascii codes in delphi 7?

Yes we're talking about ASCII codes. My appologies I'm not the Delphi dev here.
For Delphi 7, I'd get the free Unicode Library by Mike Lischke who is the author of Virtual Treeview.
The libary includes a lot of conversion functions to go to and from Unicode, so you can use the ones that make most sense in your application.
Or you can upgrade to Delphi 2009 which has built-in encoding routines, and its own library of conversion functions.
Let's get a few things straight. Character set (charset) and character encodings are two related but different concepts. A character set is an abstract list of characters with some sort of integer character code associated. Then there are character encodings, which is basically an algorithm that describes how the characters are represented in bytes.
ASCII acts as both the character set and encoding. It uses 7 bits to express 128 characters (94 printable). Unicode on the other hand is a character set, expressing 1,114,112 code points. There are several encodings to represent Unicode strings but most notable ones are UTF-8, UTF-16, UTF-16LE, and UTF-32. In other words, a single Unicode character can be represented in different ways depending on the encodings.
How can I convert unicode characters to ascii codes in delphi 7?
I think the question could be interpreted in two ways.
I have a Unicode string in some encoding that only includes ASCII printable characters. How can I convert the string into a byte array of ASCII encoding?
I have a Unicode string in some encoding that also includes non-ASCII printable characters such as Chinese characters. How can I encode the string into a ASCII encoding without losing information, and later decode it back to the original Unicode string?
If you mean the first, you can load the Unicode string into WideString like Osman is saying and do
var
original: WideString;
s: AnsiString;
begin
s := AnsiString(original);
If you mean the second, you would need a generic encoding algorithm like Base64 encoding. You can use DCPBase64.pas included in David Barton's DCPcrypt v2 Beta 3.
It depends what your definition of conversion is. If you want to map the 127 lowest characters to the Unicode equivalent, you can use an explicit cast. But this creates garbage if the string contains higher characters.
If you want mappings like ë -> e and û -> u, you can write your own code. But be aware that there are always characters that can't be converted.
"ASCII" is the name of a specific mapping of characters to numbers, but some people say "ASCII code" when they don't really mean ASCII at all; they just want the numeric value of a character, whatever mapping is in effect at the time. Does that description apply to you?
If so, then you can use the Ord standard function to get the Unicode code-point value of whatever Unicode character you have.
var
wc: WideChar;
ws: WideString;
x: Word;
x := Ord(wc);
x := Ord(ws[1]);
If you really meant ASCII, though, then you'll have to be more specific about what sort of conversion you have in mind.
As an example, the letter A is represented in unicode as U+0041 and in ansi as just 41. So converting that would be pretty simple, but you must find out how the unicode character is encoded. The most common are UTF-16 and UTF-8. UTF 16, is basically two bytes per character, but even that is an oversimplification, as a character may have more bytes. UTF-8 sounds as if it means 1 byte per character but can be 2 or 3. To further complicate matters, UTF-16 can be little endian or big endian. (U+0041 or U+4100).
Where your question makes no sense is if you wanted to for example convert the arabic letter ain U+0639 to ansi on an English locale. You can't.
See related questions on converting from Unicode to ASCII:
How to convert UTF-8 to US-Ascii in Java
How to convert a Unicode character to its ASCII equivalent
How do I convert a file’s format from Unicode to ASCII using Python?
In general, character set of hundreds thousands entries cannot be converted to character set of 127 entries without some loss of information or encoding scheme.
You can use the function in http://swissdelphicenter.ch/en/showcode.php?id=1692
It converts Unicode string to Ansi string using specified code page. If you want convert using default system codepage (defined in regional options as non-unicode codepage) you can do it simply like following:
var
ws: widestring;
s: string;
begin
s:=string(ws)

Resources