I have a text file that I need to convert to UTF-8. After converting, all the numbers in the file turn into question marks. For example, 1380 is converted to four question marks, like this: '????'.
I'm using Delphi 2009.
This is my code for converting:
RichEdit1.Lines.LoadFromFile(OpenDialog1.FileName,TEncoding.UTF8);
How can I correct this conversion?
You should use TEncoding.Unicode if your file is in UTF-16LE ("Unicode") format.
Or you should convert your file to UTF-8 before loading it into RichEdit.
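As a rough illustration in Python rather than Delphi, converting means decoding the bytes with the encoding the file really uses and only then encoding the text as UTF-8. The helper name to_utf8 and the idea of sniffing the byte order mark are assumptions made for this sketch, not part of the answer above:

```python
import codecs


def to_utf8(raw: bytes) -> bytes:
    """Decode raw according to its byte order mark, then re-encode as UTF-8.

    Hypothetical helper: a file without a BOM is assumed to be UTF-8 already.
    """
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith(codecs.BOM_UTF16_LE):
        text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    elif raw.startswith(codecs.BOM_UTF16_BE):
        text = raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    else:
        text = raw.decode("utf-8")  # assumption: no BOM means UTF-8
    return text.encode("utf-8")


# A UTF-16LE file ("Unicode" in Windows terms) holding the digits 1380.
sample = codecs.BOM_UTF16_LE + "1380".encode("utf-16-le")
print(to_utf8(sample))
```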
What scheme is used to encode Unicode characters in a Windows URL shortcut?
For example, a new shortcut for url "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: The encoding scheme is not UTF-8, UTF-16, or UCS-2, and it is not percent-encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (UTF-7) allowed me to successfully develop the URL conversion routine.
Let me summarize the steps:
To obtain the Unicode URL from the InternetShortcut.W entry found in a .url file:
. Pass ASCII characters through until CRLF, after making them internet safe.
. An unescaped + character starts a UTF-7 encoded Unicode sequence:
. Collect 6-bit groups from the base64-coded ASCII characters
. For every 16 bits collected, convert the 16-bit value to UTF-8 (1, 2, or 3 bytes)
. Pass the generated UTF-8 bytes as %hh
. Continue until a "-" character occurs
. Any bits left in the collector should be zero
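The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the routine the poster actually wrote: the function name utf7_url_to_percent is hypothetical, plain ASCII characters are passed through unchanged (the "internet safe" step is left out), and the 16-bit units are joined and decoded as UTF-16 before percent-encoding so that surrogate pairs would also be handled.

```python
from urllib.parse import quote

# Standard base64 alphabet; the index of a character is its 6-bit value.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def utf7_url_to_percent(url: str) -> str:
    """Turn the UTF-7 runs of a URL into percent-encoded UTF-8 (hypothetical helper)."""
    out = []
    i = 0
    while i < len(url):
        ch = url[i]
        if ch != "+":
            out.append(ch)            # plain ASCII: passed through unchanged
            i += 1
            continue
        i += 1                        # skip the '+' that opens the run
        if i < len(url) and url[i] == "-":
            out.append("+")           # "+-" is UTF-7 for a literal plus sign
            i += 1
            continue
        bits, nbits, units = 0, 0, []
        while i < len(url) and url[i] in B64:
            bits = (bits << 6) | B64.index(url[i])   # collect one 6-bit group
            nbits += 6
            if nbits >= 16:           # a full UTF-16 code unit is available
                nbits -= 16
                units.append((bits >> nbits) & 0xFFFF)
                bits &= (1 << nbits) - 1
            i += 1
        if bits != 0:
            raise ValueError("leftover bits in the collector are not zero")
        if i < len(url) and url[i] == "-":
            i += 1                    # the optional '-' closes the run
        text = b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")
        out.append(quote(text, safe=""))   # UTF-8 bytes written as %hh
    return "".join(out)


print(utf7_url_to_percent("http://+A6gDsSEVIScltg-/"))
```

Python also ships a "utf-7" codec, so "+A6gDsSEVIScltg-".encode("ascii").decode("utf-7") is a quick cross-check of the manual decoder.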
I want to convert a hex string to the character it represents (for the game ROBLOX).
Here's the page for the character:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that character.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the character by its Unicode code point with print"\u{25BA}" (Lua 5.3 or later).
Output the UTF-8 encoding using hexadecimal escapes with print"\xE2\x96\xBA" (Lua 5.2 or later).
Output the UTF-8 encoding using decimal escapes with print"\226\150\186".
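To see where the escape values come from, here is a small Python sketch (Python rather than Lua, and the helper name hex_to_char is made up for illustration). It turns the hex string from the page above into the character and lists the bytes of its UTF-8 encoding:

```python
def hex_to_char(hex_code: str) -> str:
    """Interpret a hex string such as "25ba" as a Unicode code point."""
    return chr(int(hex_code, 16))


arrow = hex_to_char("25ba")
utf8_bytes = arrow.encode("utf-8")

print(arrow)
print([hex(b) for b in utf8_bytes])   # values used in the \x escapes
print(list(utf8_bytes))               # values used in the decimal escapes
```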
I would like to read a UTF-8 text file byte by byte and get the numeric value of each byte in the file. Can this be done? If so, what is the best method?
My goal is to then replace certain 2-byte combinations that I find with one byte (according to a set of conditions that I have prepared).
For example, if I find a 197 followed by a 158 (decimal representations), I will replace it with a single byte 17.
I don't want to use the standard Delphi I/O operations:
AssignFile
ReSet
ReWrite
ReadLn
WriteLn
CloseFile
Is there a better method? Can this be done using TStream (Reader & Writer)?
Here is an example test I am using. I know there is a character (code point 350, two bytes) starting in column 84. When viewed in a hex editor, the character consists of 197 + 158, so I am trying to find the 197 using my Delphi code and can't seem to find it.
FS1 := TFileStream.Create(ParamStr1, fmOpenRead);
try
  FS1.Seek(0, soBeginning);
  FS1.Position := FS1.Position + 84;
  FS1.Read(B, SizeOf(B));
  if Ord(B) = 197 then ShowMessage('True') else ShowMessage('False');
finally
  FS1.Free;
end;
You can use TFileStream to read all the data from the file into, for instance, an array of bytes, and then check it for UTF-8 sequences.
Also, please note that a UTF-8 sequence can contain more than 2 bytes.
And in Delphi there is a function, Utf8ToUnicode, which will convert UTF-8 data to a usable Unicode string.
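The read-everything-then-scan approach can be sketched as follows. This is in Python rather than Delphi, and the function names replace_pair and convert_file are hypothetical; the byte values 197, 158 and 17 are the ones from the question:

```python
def replace_pair(data: bytes, pair=(197, 158), single=17) -> bytes:
    """Replace every occurrence of the two-byte sequence with one byte."""
    return data.replace(bytes(pair), bytes([single]))


def convert_file(src_path: str, dst_path: str) -> None:
    """Read the whole file as raw bytes, replace the pair, write the result."""
    with open(src_path, "rb") as src:      # binary mode: no text decoding
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(replace_pair(data))


# 197, 158 is the UTF-8 encoding of code point 350.
sample = "ab".encode("utf-8") + bytes([197, 158]) + "cd".encode("utf-8")
print(list(replace_pair(sample)))
```

Working on raw bytes keeps the search independent of column positions, which shift whenever an earlier character takes more than one byte.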
My understanding is that you want to convert a text file from UTF-8 to ASCII. That's quite simple:
StringList.LoadFromFile(UTF8FileName, TEncoding.UTF8);
StringList.SaveToFile(ASCIIFileName, TEncoding.ASCII);
The runtime library comes with all sorts of functionality to convert between different text encodings. Surely you don't want to attempt to replicate this functionality yourself?
I trust you realise that this conversion is liable to lose data. Characters with ordinal greater than 127 cannot be represented in ASCII. In fact every code point that requires more than 1 octet in UTF-8 cannot be represented in ASCII.
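The data loss is easy to demonstrate. The sketch below uses Python rather than Delphi and assumes the converter substitutes a question mark for anything ASCII cannot hold, which is how Python's "replace" error handler behaves:

```python
# Code point 350 needs two bytes in UTF-8 and has no ASCII equivalent.
text = chr(350) + "1380"

utf8_bytes = text.encode("utf-8")
ascii_bytes = utf8_bytes.decode("utf-8").encode("ascii", errors="replace")

print(list(utf8_bytes))   # the two-byte sequence followed by the digits
print(ascii_bytes)        # the non-ASCII character has been substituted
```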
You asked the same question 5 hours later in another topic, the answer of which better addresses your specific question:
Replacing a unicode character in UTF-8 file using delphi 2010
I'm trying to process a German word list and can't figure out what encoding the file is in. The Unix 'file' command says the file is "Non-ISO extended-ASCII text". Most of the words are in ASCII, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.
It looks like CP437 or CP852, assuming the \x82 sequences encode single characters and are not literally four characters. At least, every line except the last one fits; the last line is a bit of a puzzle.
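One way to test such a guess is to decode the suspicious words with each candidate code page and see whether the results read as real words. A minimal Python sketch, using the codec names "cp437" and "cp852" from Python's standard library and a few of the words from the question:

```python
words = [b"ANDR\x82", b"CH\x83TEAU", b"RH\x93NETAL", b"P\xf2ANGE"]

for codec in ("cp437", "cp852"):
    # Decode every word with the candidate code page and show the result.
    decoded = [w.decode(codec) for w in words]
    print(codec, decoded)
```

Both code pages map \x82, \x83 and \x93 to the same accented letters, which is why the sample cannot distinguish them; only the \xF2 byte decodes differently, and in neither case to a letter.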
I have an interesting problem with the social network http://www.odnoklassniki.ru/.
When I use advanced search, my Cyrillic characters are encoded as symbols I can't make sense of.
For example:
Иван Иванов is encoded as %25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2+%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2
Any ideas?
It's a double URL-encoded string. The %25 sequences represent the percent sign. Decoding once gives %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2+%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2.
Decoding again gives the UTF-8 string иванов иванов.
That's URL encoding, also called percent-encoding. A percent sign starts each escape and is followed by the two hex digits of one byte. The + stands for a space.
See: http://en.wikipedia.org/wiki/Percent-encoding
Well, it appears to be twice URL encoded. If we unwrap it once, we get
%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2 %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
and again, we get
иванов иванов
This appears to be UTF-8 with the bytes encoded separately.
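The two decoding passes can be reproduced with Python's standard urllib.parse module. This is only a sketch of the unwrapping described above, with variable names chosen for illustration:

```python
from urllib.parse import unquote, unquote_plus

encoded = ("%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2"
           "+%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2")

# First pass: every %25 becomes a literal percent sign.
once = unquote(encoded)

# Second pass: the remaining %hh escapes are UTF-8 bytes, and + is a space.
twice = unquote_plus(once)

print(once)
print(twice)
```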