I'm using Sublime Text for LaTeX, so I need to use a specific encoding. However, in some cases, when I paste text copied from a different program (Word or a browser, in most cases), I get the message:
"Not all characters are representable in XXX encoding, falling back to UTF-8"
My question is: Is there any way to see which parts of the text cannot be encoded, so I can delete them manually?
I had this problem. It is caused by corrupt characters in your document. Here is how I solved it.
1) Search your document for anything outside the standard character set. Make sure you enable regular expressions in your search, then paste this:
[^a-zA-Z0-9 -\.;<>/ ={}\[\]\^\?_\\\|:\r\n#]
You can add the normal accented characters of your language (é, à, and so on) to that set; here is a version covering French and German:
[^a-zA-Z0-9 -\.;<>/ ='{}\[\]\^\?_\\\|:\r\n~#éàèêîôâûçäöüÄÖÜß]
2) Run that search and keep pressing F3 until you see mangled characters, usually something like "Ã¨", which is a corrupt version of "è".
3) Delete those characters or replace them with what they should be.
You will be able to convert the document to another encoding when you have cleared all corrupt characters out.
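If you prefer to hunt the offending characters down from the command line, a quick scan for any non-ASCII byte also works; a minimal sketch, assuming GNU grep with PCRE support (the file name is a placeholder):

grep --color -nP '[^\x00-\x7F]' document.tex

Every line it prints contains at least one character outside plain ASCII, which is where the unconvertible characters usually hide.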
For Linux users, it's also possible to remove broken characters automatically with the iconv command:
iconv -f UTF-8 -t Windows-1251 -c < ~/temp/data.csv > ~/temp/data01.csv
-c Silently discard characters that cannot be converted instead of terminating when encountering such characters.
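If instead you want to see where the first problem is, leave -c out: iconv then stops at the first character it cannot convert and, at least with GNU iconv, reports its byte offset with an error along the lines of "iconv: illegal input sequence at position 1234":

iconv -f UTF-8 -t Windows-1251 < ~/temp/data.csv > /dev/null

Fix or delete the character at the reported position and rerun until the command succeeds.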
Just adding to @Draken's response: here is the regex with Spanish characters added.
[^a-zA-Z0-9 -\.;<>/ =“”'{}\[\]\^\?_\\\|:\r\n~#àèêîôâûçäöüÄÖÜßáéíóúñÑ¿€]
In my case I hit Ctrl+H (Replace) and used an empty replacement expression, so everything got cleared very quickly and I was able to save the file in ISO-8859-1.
I have a text file with unknown character formatting; below is a snapshot:
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Does anyone have an idea how I can convert it to normal text?
This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'
The file I'm dealing with contains text where some characters aren't displayed correctly when it is opened.
I've been told the file is UTF-8 encoded, but when I open it in Sublime Text 3 (I even used the "Reopen with Encoding" option with UTF-8) a number of characters show as ?.
For example, Jiří is shown as Ji?í: the ř isn't displayed correctly, but the í is. Other characters, for example č, ň and ř, are also not showing correctly.
After some investigation it looks like the file is in ASCII encoding (a subset of UTF-8).
file ~/my_location/my_file.txt
~/my_location/my_file.txt: ASCII text, with very long lines, with CRLF line terminators
I've checked an extended ASCII table and ř appears there, so I'm wondering whether these characters are already corrupted or whether the file-encoding check I used above simply isn't reporting the correct encoding.
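To double-check, I understand a byte-level scan should report nothing if the file really is pure ASCII (a sketch, assuming GNU grep with PCRE support):

grep -nP '[^\x00-\x7F]' ~/my_location/my_file.txt

If that prints nothing, every byte is below 0x80, which would mean the ? characters are literal 0x3F bytes and the original letters were lost before the file reached me.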
I've tried a few conversions to UTF-8, but none of them fix the characters:
iconv -f ISO-8859-1 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_ISO-8859-1.txt
iconv -f CP1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_CP1252.txt
iconv -f Windows-1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_Windows-1252.txt
I'd appreciate any thoughts on how I can proceed with the investigation.
I'm trying to process a German word list and can't figure out what encoding the file is in. The Unix 'file' command says the file is "Non-ISO extended-ASCII text". Most of the words are in ASCII, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.
It looks like CP437 or CP852, assuming the \x82 sequences encode single characters, and are not literally four characters. Well, at least everything else does, but the last line is a bit of a puzzle.
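If it is CP437, that's easy to test, assuming your iconv build knows the codeset (the file name is a placeholder):

iconv -f CP437 -t UTF-8 wordlist.txt | head

In CP437, \x82 is é, \x83 is â and \x93 is ô, so the first lines should come out as ANDRé, ATTACHé, CéZANNE, CHâTEAU and so on.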
I am having issues with the special CSV interpreter (I have no idea what it's called) in the iPad's mobile browser.
The iPad appears to treat the " character as reserved or special. When this character appears, the rest of the string is treated as a literal instead of being separated into CSV fields.
INPUT:
1111,64-1111-11,Some Tool 12", 112233
Given the input above, the CSV display in mobile Safari shows ([] represents a column):
[1111] [64-1111-11] [Some Tool 12, 112233]
Note that the " is missing. Also note that 112233 is not in its own column like it should be.
Question 2:
How can I get the CSV display tool in Safari to not treat a six-digit number as a phone number?
1234567
Shows up as a hyperlink and asks to "Add Contact" when I click it. I do not want the hyperlink.
UPDATE
The iPad is ignoring the escape character (or backslash is not the escape character) for double quotes in CSV files. Looking at the hex version of the file, I have
\" or 5C 22 (in hex with UTF-8 encoding).
Unfortunately, the iPad displays the backslash and still treats " as a special character, thereby corrupting my data formatting. Does anybody know how I can use " in a CSV on the iPad?
Regarding the quotes, have you tried escaping them in the output?
EDIT: conventional backslash escaping doesn't work for CSV files, my apologies. Most specifications state the following:
Fields that contain a special character (comma, newline, or double quote) must be enclosed in double quotes, and any double quote inside such a field is escaped by doubling it.
So, testing this on your CSV snippet, a file formatted like this:
1111,64-1111-11,"Some Tool 12""", 112233
or even like this:
1111,64-1111-11,Some Tool 12"""", 112233
… opens in Mobile Safari OK. How good or bad that looks in Excel you'd need to check.
Moving to the second issue, to prevent Mobile Safari from presenting numbers as phone numbers, add this to your page's head element:
<meta name="format-detection" content="telephone=no" />
I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious  character from a string...
e.g.
"myÂstring here Â$100"
..but it should be my string here $100
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the Â intact.
Any thoughts?
By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myÂstring here Â$100".gsub(/Â/,"")
If this does not work either, you might have an encoding error somewhere in your stack, probably in your HTML document. Try running rails console, extract that string somehow (if it comes from the model, try to perform a find on the containing class) and run the gsub. It won't solve your problem, but you'll get a clue about where exactly the problem lies.
Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence: 0xC2 (194 decimal) followed by a byte equal to the numeric value of the code point. For example, a non-breaking space, U+00A0, becomes 0xC2 0xA0. Was there another extra character in there that you already removed?
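You can check this from the command line; a quick sketch with a Ruby one-liner (requires Ruby 1.9+ for \u escapes):

ruby -e 'p "\u00A0".bytes'

This prints [194, 160], i.e. 0xC2 0xA0, and a lone 0xC2 rendered in a single-byte encoding such as ISO-8859-1 shows up as exactly the Â you are seeing.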
At any rate, gsub(/\194/,'') is wrong: \nnn is an octal escape, but 194 is the decimal value. Decimal 194 is octal 302, so the escape would have to be \302.
"myÂstring here Â$100".gsub("Â","") # "mystring here $100"
Is that what you meant?