TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
in function re:run/3
called as re:run([1081,1094,1091,46,97,115,100],
"^(.*\\..*)$",
[{capture,none}])
How do I make this work? 'йцу' are obviously characters outside the Latin charset; is there a way to tell the re module, or the entire system, to use a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
In Chapter 16 there's an example about reading tags from MP3 files. It works, great. But there seems to be a bug in a provided module, lib_find, which has a function for searching a path for matching files. This call works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
in function re:run/3
called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
105,108,100,115,47,50,48,46,49,47,111|...],
"^(.*\\.mp3)$",
[{capture,none}])
in call from lib_find:find_files/6 (lib_find.erl, line 29)
in call from lib_find:find_files/6 (lib_find.erl, line 39)
in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang uses a more restrictive default charset, one that doesn't include hànzì. What are the options? Obviously I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as: where/how can I change the default charset? I'm a little surprised it's anything other than UTF-8 by default, so maybe I'm on the wrong track?
Thanks!
TL;DR:
UTF-8 regexes are enabled by putting the regex pattern into Unicode mode with the option unicode. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
in function re:run/2
called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
<<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexes are an inherently complex subject. You can find the options for compiled regexes in the docs for re:compile/2 and the options for running them in the docs for re:run/3.
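Applying that to lib_find from the original question: assuming the book's version matches each file name with a plain re:run/3 call (I'm sketching from memory, not quoting the book), the whole fix is to compile the pattern in unicode mode first. A minimal standalone sketch; the module and function names here are mine, not the book's:

-module(find_utf8).
-export([matches/2]).

%% Match a file name against a shell-style wildcard pattern,
%% compiling the regex in unicode mode so that non-Latin
%% file names do not crash re:run/3.
matches(FileName, ShellPattern) ->
    Awk = xmerl_regexp:sh_to_awk(ShellPattern),  % e.g. "*.mp3" -> "^(.*\\.mp3)$"
    {ok, Re} = re:compile(Awk, [unicode]),       % the crucial option
    case re:run(FileName, Re, [{capture, none}]) of
        match   -> true;
        nomatch -> false
    end.

With this, find_utf8:matches("йцу.asd", "*.*") returns true instead of raising badarg.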
Discussion
Erlang has settled on the idea that strings, though still lists of codepoints, are UTF-8 everywhere. Since I work in Japan and deal with this all the time, that has come as a big relief: I can stop using about half of the conversion libraries I needed in the past (yay!). It has complicated matters a bit for users of the string module, though, because many operations there now work under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, so long as those clusters sit at the first level of the list).
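A quick illustration of those grapheme-cluster assumptions in the OTP 20+ string module (the combining-accent string below is my own example):

20> S = "e\x{301}".   % 'e' plus U+0301, a combining acute accent
[101,769]
21> length(S).        % plain list length counts codepoints
2
22> string:length(S). % the string module counts grapheme clusters
1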
Unfortunately, encodings are just not easy things to deal with, and UTF-8 is anything but simple once you step outside the most common representations, so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep list, and iodata() forms, whether file names, file data, network data, or user input from wx or web forms, works as expected once you read the unicode, re, and string docs.
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything incoming from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binary types everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
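One related convenience, in case incoming data is not guaranteed to be UTF-8: both functions have variants that take an explicit input encoding (these calls are my own examples, using Latin-1 'Ø', codepoint 216):

10> unicode:characters_to_list(<<216>>, latin1).
[216]
11> unicode:characters_to_binary([216], latin1, utf8).
<<195,152>>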
Which ANSI escape sequence is the most portable and/or simply the best, and why?
1. "\u001B[32;1mThis is bright green\u001B[0m"
2. "\x1B[33;1mThis is bright yellow\x1B[0m"
3. "\e[35;4;1mThis is bright purple underlined\e[0m"
I have been using printf "\x1B[32;1mgreen\x1B[0m" (an example from a Unix bash script) out of habit, but I was wondering if there is any reason to use one over the others. Is one more portable than the rest? That would be my assumption.
Also, if you know of any other ANSI escape sequences, feel free to share them in the comments or at the end of your answer.
If you don't know what an ANSI escape sequence is, or want to become more familiar with them, then here you go: http://en.wikipedia.org/wiki/ANSI_escape_code
NOTE:
All of the escape sequences above have worked on all of the Unix systems I have been on; however, one must still rely on the system itself to interpret the escape codes. The classic Windows console, for example, does not interpret escape sequences at all and handles only four control characters (BEL, LF or line feed, CR or carriage return and, of course, BS or backspace), so ANSI escape sequences will not work there.
Short answer: It depends on the host string parser.
Long answer:
It depends on the string parser; that is, the piece of code that actually takes your string literal ("\x1b[1mSome string\x1b[0m") and interprets the backslash escape sequences in it.
For parsers that support hexadecimal escapes (\x), \x1b (character 0x1B) should work.
For parsers that support octal escapes (\ddd), \033 (octal 33) should work.
For parsers that support Unicode escapes (\u), \u001B should work.
Quick elaboration: \x and \u are similar; \x usually denotes a single character, 0-255, in hexadecimal radix. \u means much the same (it is also written in hexadecimal), but supports two bytes (in most parsers) and generally denotes a 16-bit Unicode character.
A less widely used and supported escape character, as you mentioned, is \e. This escape is most common in parsers/languages that expect a lot of ANSI escaping to happen, such as bash (and most other shells).
For instance, Node.js does not support \e:
> console.log("\x1b[31mhello\x1b[0m")
hello
undefined
> console.log("\e[31mhello\e[0m")
e[31mhelloe[0m
undefined
Neither does Lua:
> print('\x1b[31mhello\x1b[0m')
hello
> print('\e[31mhello\e[0m')
stdin:1: invalid escape sequence near '\e'
Or even Python:
>>> print("\x1b[31mhello\x1b[0m")
hello
>>> print("\e[31mhello\e[0m")
\e[31mhello\e[0m
>>>
Though PHP does:
<?php
echo "\x1b[31mhello\x1b[0m\n"; // hello
echo "\e[31mhello\e[0m\n"; // hello