I have a text file with unknown character formatting, below is a snapshot
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Anyone has an idea how can I convert it to normal text?
This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'
Related
Which Ansi escape sequence is the most portable and/or simply best and why?
1. "\u001B[32;1mThis is bright green\u001B[0m"
2. "\x1B[33;1mThis is bright yellow\x1B[0m"
3. "\e[35;4;1mThis is bright purple underlined\e[0m"
I have been using printf "\x1B[32;1mgreen\x1B[0m" (that's an example in unix bash script for example) out of habit, but I was wondering if there were any reasons to use one over the other. Is one more portable than the others? That would be my assumption.
Also, if you know of any other Ansi Escape sequence feel free to share it in the comments or at the end of your answer.
If you don't know what an Ansi Escape sequence is or want to become more familiar with it, then here you go: http://en.wikipedia.org/wiki/ANSI_escape_code
NOTE:
All of the escape sequences above have worked on all of the Unix systems I have been on, however one must still rely on the system itself to interpret the escape codes. Windows, for example, does not permit any sort of escape codes except four (BEL, L-F or linefeed, C-R or carriage return and, of course, BS or backspace), so Ansi escape sequences will not work.
Short answer: It depends on the host string parser.
Long answer:
It depends on the string parser; that is, the piece of code that actually takes in your string ("\x1b[1mSome string\x1b[0m") as a literal and parses the escape characters using the backslash ANSI escape sequence.
For parsers that support hexadecimal escapes (\x), then \x1b (character 0x1B) should work.
For parsers that support octal escapes (\ddd), then \033 (octal 33) should work.
For parsers that support unicode escapes (\u), then \u001B should work.
Quick elaboration: \x and \u are similar; \x usually refers to a single character, 0-255, in hexadecimal radix. \u means the same (as it is represented in hexadecimal), but supports two bytes (in most parsers) and generally refers to 16-bit unicode characters.
A lesser used/supported escape character, as you mentioned, is \e. This escape is most commonly used with parsers/languages that expect a lot of ANSI escaping to happen, such as bash (and most other shells).
For instance, Node.js does not support \e:
> console.log("\x1b[31mhello\x1b[0m")
hello
undefined
> console.log("\e[31mhello\e[0m")
e[31mhelloe[0m
undefined
Neither does Lua:
> print('\x1b[31mhello\x1b[0m')
hello
> print('\e[31mhello\e[0m')
stdin:1: invalid escape sequence near '\e'
Or even Python:
>>> print("\x1b[31mhello\x1b[0m")
hello
>>> print("\e[31mhello\e[0m")
\e[31mhello\e[0m
>>>
Though PHP does:
<?php
echo "\x1b[31mhello\x1b[0m\n"; // hello
echo "\e[31mhello\e[0m\n"; // hello
I'm using Sublime Text for Latex, so i need to use a specific encoding. However, in some cases, when I paste text copied from a different program (word/browser in most cases), I'm getting the message:
"Not all characters are representable in XXX encoding, falling back to UTF-8"
My question is: Is there any way to see which parts of the text cannot be encoded, so I can delete them manually?
I had this problem. It is caused by corrupt characters in your document. Here is how i solved it.
1) Make a search in your document for all standard characters. Make sure you enable regular expressions in your search, then paste this :
[^a-zA-Z0-9 -\.;<>/ ={}\[\]\^\?_\\\|:\r\n#]
You can add to that the normal accented characters of your language, here are the characters for French and German. Such as éà and so on :
[^a-zA-Z0-9 -\.;<>/ ='{}\[\]\^\?_\\\|:\r\n~#éàèêîôâûçäöüÄÖÜß]
2) Search for that, and Keep pressing F3 until you see mangled characters. Usually something like "è" which is a corrupt version of "à".
3) Delete those characters or replace them with what they should be.
You will be able to convert the document to another encoding when you have cleared all corrupt characters out.
For Linux users, it's also possible to automatically remove broken characters with command iconv:
iconv -f UTF-8 -t Windows-1251 -c < ~/temp/data.csv > ~/temp/data01.csv
-c Silently discard characters that cannot be converted instead of terminating when encountering such characters.
Just adding to #Draken response: here is the RegEx with spanish characters added.
[^a-zA-Z0-9 -\.;<>/ =“”'{}\[\]\^\?_\\\|:\r\n~#àèêîôâûçäöüÄÖÜßáéíóúñÑ¿€]
In my case I hitted Ctrl+H (for replacement) and as a replacement expression used nothing. So everything got cleared super fast and I was able to save it using ISO-8859-1.
I'm trying to process a German word list and can't figure out what encoding the file is in. The 'file' unix command says the file is "Non-ISO extended-ASCII text". Most of the words are in ascii, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.
It looks like CP437 or CP852, assuming the \x82 sequences encode single characters, and are not literally four characters. Well, at least everything else does, but the last line is a bit of a puzzle.
I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious  character from a string...
e.g.
"myÂstring here Â$100"
..but it should be my string here $100
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the  intact..
Any thoughts?
By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myÂstring here Â$100".gsub(/Â/,"")
If this does not work as well, you might be having an encoding error somewhere on your stack, probably on your HTML document. Try running rails console, extract somehow that string (if it comes from the Model, try to perform a find on the containing class) and run the gsub. It won't solve your problem, but you'll get a clue to where exactly the problem may lie.
Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence, 0xC2 (194 decimal) and the numeric value the code point. For example, a non-breaking space--U+00A0--becomes 0xC2 0xA0. Was there another extra character in there, that you already removed?
At any rate, gsub(/\194/,'') is wrong. \nnn is supposed to be an octal escape, but the number is in its decimal form. 194 in octal is \302.
"myÂstring here Â$100".gsub("Â","") # "mystring here $100"
Is that what you meant?
I am learning ETS. I did:
Sometab = ets:new(sometable, [bag]).
ets:insert(Sometab, {109, ash, 8}).
Then I typed:
ets:match(Sometab, {109, ash, '$1'}).
However instead of getting 8 - I am getting: ["\b"] as output!
You are getting the correct answer. However, the erlang shell prints [8] as "\b" since the ascii code for backspace is 8.
Erlang has no string type. Strings in erlang are represented simply as a list of integers and the Erlang shell prints this list as a string if the list contains integers withing the ascii range only.
This can indeed be confusing at times.