I have a text written in Turkish, but some characters display incorrectly: for example, ý instead of ı, and Ý instead of İ. I tried converting the encoding to ISO-8859-9, but it didn't help.
If you're running a UNIX/Linux machine, try the following shell command:
you@somewhere:~$ file --mime yourfile.txt
It should output something like the snippet below, where iso-8859-1 is the actual character set your system assumes:
yourfile.txt: text/plain; charset=iso-8859-1
Now you can convert the file into some more flexible charset, like UTF-8:
you@somewhere:~$ iconv -f iso-8859-1 -t utf-8 yourfile.txt > converted.txt
The above command specifies both the charset to convert from (which should match the output of the file command) and the charset to convert to. The result of converting yourfile.txt is then stored in converted.txt, which you should be able to open with gedit.
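For the Turkish case specifically, ý is exactly what you get when ISO-8859-9 (Latin-5) bytes are displayed as ISO-8859-1, so the bytes themselves may be fine. A minimal Python sketch of that round trip, assuming this is indeed the mismatch:

mangled = "ý"                                             # what you currently see
print(mangled.encode("iso-8859-1").decode("iso-8859-9"))  # prints: ı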
If that doesn't work, please paste the output of the file command, along with a sample line from your file, in the comments...
What scheme is used to encode Unicode characters in a Windows URL shortcut?
For example, a new shortcut for the URL "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: the encoding scheme is not UTF-8, UTF-16, or UCS-2, and it is not %-encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
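For reference, this is easy to verify with Python's built-in utf-7 codec:

print(b"+A6gDsSEVIScltg-".decode("utf-7"))   # Ψαℕ℧▶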
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (UTF-7) allowed me to successfully develop the URL conversion routine.
Let me summarize the steps to obtain the Unicode URL from an [InternetShortcut.W] entry found in a .url file (a Python sketch implementing these steps follows the list):
- Pass ASCII characters through until CRLF, after making them URL-safe.
- A non-escaped "+" character starts a UTF-7 formatted Unicode sequence:
  - Collect 6-bit groups from the Base64-coded ASCII.
  - For each 16 bits collected, convert that code unit to UTF-8 (1, 2, or 3 bytes).
  - Emit the generated UTF-8 bytes as %hh escapes.
  - Continue until the occurrence of a "-" character.
  - Any bits left in the collector should be zero (padding).
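A minimal Python sketch of the steps above, assuming the run between "+" and "-" has already been extracted (surrogate pairs are ignored for brevity):

import string

BASE64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def utf7_run_to_percent(run):
    bits, nbits, units = 0, 0, []
    for ch in run:
        bits = (bits << 6) | BASE64.index(ch)   # collect 6-bit groups
        nbits += 6
        if nbits >= 16:                         # one full 16-bit code unit
            nbits -= 16
            units.append((bits >> nbits) & 0xFFFF)
            bits &= (1 << nbits) - 1
    assert bits == 0                            # leftover bits must be zero padding
    utf8 = "".join(map(chr, units)).encode("utf-8")
    return "".join("%%%02X" % b for b in utf8)

print(utf7_run_to_percent("A6gDsSEVIScltg"))
# %CE%A8%CE%B1%E2%84%95%E2%84%A7%E2%96%B6  -> Ψαℕ℧▶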
The file I'm dealing with contains text in which some characters don't display correctly when the file is opened.
I've been told the file is UTF-8 encoded, but when I open it in Sublime Text 3 (I even used the reopen-in-UTF-8 option) a number of characters show as ?.
For example, Jiří is shown as Ji?í: the ř isn't shown correctly, but the í (long i) is. Other characters, for example č, ň, and ř, are also not showing correctly.
After some investigation it looks like the file is in ASCII encoding (a subset of UTF-8).
file ~/my_location/my_file.txt
~/my_location/my_file.txt: ASCII text, with very long lines, with CRLF line terminators
I've checked an extended ASCII character table where ř appears, so I'm wondering whether these characters are already corrupted or whether the file-encoding check above simply isn't showing the correct encoding.
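One way to check this directly is to look at the raw bytes around one of the mangled names; a quick Python sketch (path shortened to the file name here):

data = open("my_file.txt", "rb").read()   # shortened path to the file above
i = data.find(b"Ji?")                     # locate the mangled name
print(data[i:i+6])                        # raw bytes around it
print(max(data) < 0x80)                   # True -> every byte is 7-bit ASCII

If the byte shown for the ? really is a literal 0x3F, the ř was replaced with ? before the file was ever written, and no iconv conversion will bring it back.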
I've tried a few conversions to UTF-8, but none of them fixed the characters.
iconv -f ISO-8859-1 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_ISO-8859-1.txt
iconv -f CP1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_CP1252.txt
iconv -f Windows-1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_Windows-1252.txt
I'd appreciate any thoughts on how I can proceed with the investigation...
I want to convert a hex string to the corresponding character (for the game ROBLOX).
Here's the page for the character:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that character.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all because of its chat filtering.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code point directly with print"\u{25BA}" (Lua 5.3 and later).
Output the UTF-8 encoding directly with hexadecimal escapes: print"\xE2\x96\xBA" (Lua 5.2 and later).
Output the UTF-8 encoding directly with decimal escapes: print"\226\150\186".
Say you have a file which contains both correctly encoded UTF-8 characters and UTF-8 characters that were at some point read by a program that thought they were ISO-8859-1. So you have things like "Ã©" instead of "é". How do you fix that?
I finally came up with a single sed command that did the job for me :
LANG='' sed -re 's/(\xc3)\x83\xc2([\x80-\xbf])/\1\2/g'
It does not handle code points U+00A0 to U+00BF, but it should be pretty easy to adapt for those.
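The same idea in Python, covering the U+00A0 to U+00BF range as well; because it is pattern-based, correctly encoded parts of the file are left alone (the file names are placeholders):

import re

raw = open("mixed.txt", "rb").read()                        # hypothetical input
# "é" double-encoded is C3 83 C2 A9; collapse it back to C3 A9 ("é").
raw = re.sub(rb"\xc3\x83\xc2([\x80-\xbf])", rb"\xc3\1", raw)
# Same pattern for U+00A0..U+00BF (e.g. "©"): C3 82 C2 xx -> C2 xx.
raw = re.sub(rb"\xc3\x82\xc2([\xa0-\xbf])", rb"\xc2\1", raw)
open("fixed.txt", "wb").write(raw)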
I'm using Sublime Text for LaTeX, so I need to use a specific encoding. However, in some cases, when I paste text copied from a different program (Word or a browser in most cases), I get the message:
"Not all characters are representable in XXX encoding, falling back to UTF-8"
My question is: Is there any way to see which parts of the text cannot be encoded, so I can delete them manually?
I had this problem. It is caused by corrupt characters in your document. Here is how I solved it.
1) Search your document for anything that is not a standard character. Make sure you enable regular expressions in your search, then paste this:
[^a-zA-Z0-9 -\.;<>/ ={}\[\]\^\?_\\\|:\r\n#]
You can add the normal accented characters of your language to that, such as é, à, and so on. Here is the set for French and German:
[^a-zA-Z0-9 -\.;<>/ ='{}\[\]\^\?_\\\|:\r\n~#éàèêîôâûçäöüÄÖÜß]
2) Search for that and keep pressing F3 until you see mangled characters, usually something like "è", which is a corrupt version of "à".
3) Delete those characters or replace them with what they should be.
Once you have cleared out all the corrupt characters, you will be able to convert the document to another encoding.
For Linux users, it's also possible to automatically remove broken characters with the iconv command:
iconv -f UTF-8 -t Windows-1251 -c < ~/temp/data.csv > ~/temp/data01.csv
-c Silently discard characters that cannot be converted instead of terminating when encountering such characters.
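If you would rather see the offending characters than discard them, a small Python sketch can report each one with its position (the file name and target encoding below are placeholders for your own):

target = "iso-8859-1"                 # the encoding Sublime is falling back from
with open("document.tex", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        for col, ch in enumerate(line, 1):
            try:
                ch.encode(target)
            except UnicodeEncodeError:
                print("line %d, col %d: %r (U+%04X)" % (lineno, col, ch, ord(ch)))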
Just adding to @Draken's response: here is the regex with Spanish characters added.
[^a-zA-Z0-9 -\.;<>/ =“”'{}\[\]\^\?_\\\|:\r\n~#àèêîôâûçäöüÄÖÜßáéíóúñÑ¿€]
In my case I hit Ctrl+H (search and replace) and used an empty replacement expression. So everything matched got cleared very fast, and I was able to save the file using ISO-8859-1.