Erlang regexp matching on Chinese characters - character-encoding

TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
in function re:run/3
called as re:run([1081,1094,1091,46,97,115,100],
"^(.*\\..*)$",
[{capture,none}])
How can I make this work? 'йцу' are characters that obviously don't belong to a Latin charset; is there a way to tell the re module, or the entire system, to use a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
In Chapter 16 there's an example about reading tags from mp3 files. It works, great. But there seems to be a bug in the provided module, lib_find, which has a function for searching a path for matching files. This is the call that works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
in function re:run/3
called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
105,108,100,115,47,50,48,46,49,47,111|...],
"^(.*\\.mp3)$",
[{capture,none}])
in call from lib_find:find_files/6 (lib_find.erl, line 29)
in call from lib_find:find_files/6 (lib_find.erl, line 39)
in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang is using a more restrictive default charset, which doesn't include hànzì. What are the options? Obviously, I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as: where/how can I fix the default charset? I'm a little surprised it's something other than UTF-8 by default, so maybe I'm on the wrong track?
Thanks!

TL;DR:
UTF-8 regexes work once you put the regex pattern into Unicode mode with the unicode option. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
in function re:run/2
called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
<<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexes are an inherently complex subject. You can find the options for compiled regexes in the docs for re:compile/2 and the options for run in the docs for re:run/3.
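Applied back to the lib_find example from the original question, the same option is all that's needed. The function below is only a sketch of the idea (it is not the book's lib_find module): it matches a single file name against a shell-style pattern, compiling the pattern in unicode mode so names containing non-Latin-1 characters don't crash re:run/3.
%% Sketch: does FileName match the shell-style pattern ShellRegExp?
%% Compiling with the unicode option lets re:run/3 accept names whose
%% codepoints fall outside Latin-1 (e.g. "高兴" or "йцу").
matches(FileName, ShellRegExp) ->
    AwkRegExp = xmerl_regexp:sh_to_awk(ShellRegExp),
    {ok, Pattern} = re:compile(AwkRegExp, [unicode]),
    re:run(FileName, Pattern, [{capture, none}]) =:= match.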
Discussion
Erlang has settled on the idea that strings, though still lists of codepoints, are UTF-8 everywhere. As I work in Japan and deal with this all the time, this has come as a big relief to me: I can stop using about half of the conversion libraries I needed in the past (yay!). It has, however, complicated matters a bit for users of the string module, because many operations there now work under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, so long as those clusters exist on the first level of the list).
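A quick illustration of what that means in practice (a shell session of my own, assuming OTP 20 or later, where the new string module counts grapheme clusters rather than list elements):
1> string:length([$e, 769]).            %% 'e' plus a combining acute accent: one cluster, "é"
1
2> length([$e, 769]).                   %% plain length/1 still counts codepoints
2
3> string:length([$a, [$e, 769], $b]).  %% a cluster nested at the first level is still "flat"
3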
Unfortunately, encodings are just not very easy things to deal with, and UTF-8 is anything but simple once you step out of the most common representations, so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep list, and iodata() forms, whether file names, file data, network data, or user input from WX or web forms, works as expected once you read the unicode, re and string docs.
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything incoming from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binary types everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>

Related

What is perl experimental feature, postderef?

I see use experimental 'postderef' being used in Moxie here on line 8. I'm just confused about what it does. The man pages for experimental are pretty vague too:
allow the use of postfix dereferencing expressions, including in interpolating strings
Can anyone show what you would have to do without the pragma, and what the pragma makes easier or possible?
What is it
It's simple. It's syntactic sugar with ups and downs. The pragma is no longer needed, as the feature is core in 5.24. But for the feature to be supported between 5.20 and 5.24, it had to be enabled with use experimental 'postderef'. In the provided example, in Moxie, it's used in one line, which has $meta->mro->@*; without it you'd have to write @{$meta->mro}.
Synopsis
These are straight from D Foy's blog, along with idiomatic Perl that I've written for comparison.
D Foy example                      Idiomatic Perl
$gimme_a_ref->()->@[0]->%*         %{ $gimme_a_ref->()[0] }
$array_ref->@*                     @{ $array_ref }
get_hashref()->@{ qw(cat dog) }    @{ get_hashref() }{ qw(cat dog) }
These examples were totally provided by D Foy:
D Foy example                      Idiomatic Perl
$array_ref->[0][0]->@*             @{ $array_ref->[0][0] }
$sub->&*                           &some_sub
Arguments-for
postderef allows chaining.
postderef_qq makes complex interpolation into scalar strings easier.
Arguments-against
not at all provided by D Foy
Loses sigil significance. Before, you knew what the "type" was by looking at the sigil on the left-most side; now you don't know until you read the whole chain. This seems to undermine any argument for the sigil, by forcing you to read the whole chain before you know what is expected. Perhaps the days of arguing that sigils are a good design decision are over? But then again, Perl 6 is still all about them. There's a lack of consistency here.
Overloads -> to mean "as type". So now you have $type->[0][1]->@*, meaning both "dereference as" and "coerce to" that type.
Slices do not have a similar syntax on primitives.
my @foo = qw/foo bar baz quz quuz quuuz/;
my $bar = \@foo;
# Idiomatic perl array slices with inclusive-range slicing
say @$bar[2..4];  # From reference; returns bazquzquuz
say @foo[2..4];   # From primitive; returns bazquzquuz
# Whizbang postfix slice, which takes a list of indices rather than a range
say $bar->@[2,4]; # From reference; returns bazquuz (elements 2 and 4)
# Nothing equivalent exists for the primitive @foo.
Sources
Brian D Foy in 2014.
Brian D Foy in 2016.

how to convert String to bytes in Erlang

I am trying to implement the SMPP protocol using Erlang, and I have hit a dead end trying to convert strings (e.g. username and password) to bytes in order to come up with a PDU to send over to the SMSC. Searching and reading various materials has not helped.
Kindly advise on a way I can achieve this.
Probably the first thing to note is that there's no special type for strings in Erlang. Strings in Erlang are actually represented as lists of integers:
1> [116, 101, 115, 116].
"test"
So keeping that in mind, your question actually becomes "how to convert a list of integers to bytes". And that is pretty straightforward with the list_to_binary/1 function, for strings with 8-bit characters:
1> list_to_binary("test").
<<"test">>
2> list_to_binary([0, 255]).
<<0,255>>
However, if you have a Unicode string, list_to_binary/1 will raise a badarg error (note also how the original string is represented in the error message):
1> list_to_binary("тест").
** exception error: bad argument
in function list_to_binary/1
called as list_to_binary([1090,1077,1089,1090])
In this case, functions from the unicode module can be used. For example, to convert a Unicode string to a UTF-8 binary, unicode:characters_to_binary/3 can be used (there are also unicode:characters_to_binary/1 and unicode:characters_to_binary/2):
1> unicode:characters_to_binary("тест", unicode, utf8).
<<209,130,208,181,209,129,209,130>>
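Closer to the original SMPP use case, each text field usually ends up as a null-terminated chunk of bytes inside the PDU binary. The helper below is purely illustrative (pack_c_octet_string is not part of any SMPP library); it just shows the string-to-bytes step followed by binary construction:
%% Hypothetical helper: encode a string as a null-terminated octet
%% string, the general shape of fields such as system_id and password
%% in an SMPP PDU. Plain ASCII strings pass through unchanged.
pack_c_octet_string(Str) ->
    Bin = unicode:characters_to_binary(Str, unicode, utf8),
    <<Bin/binary, 0>>.
Calling pack_c_octet_string("user01") then yields <<117,115,101,114,48,49,0>>.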

why is this error with a bad utf8 character caused while creating a document in couchdb?

Creating a document in couchdb is generating the following error:
12> ADoc.
[{<<"Adress">>,<<"Hjalmar Brantingsgatan 7 C">>},
{<<"District">>,<<"Brämaregården">>},
{<<"Rent">>,3964},
{<<"Rooms">>,2},
{<<"Area">>,0}]
13> IDoc.
[{<<"Adress">>,<<"Segeparksgatan 2A">>},
{<<"District">>,<<"Kirseberg">>},
{<<"Rent">>,9701},
{<<"Rooms">>,3},
{<<"Area">>,83}]
14> erlang_couchdb:create_document({"127.0.0.1", 5984}, "proto_v1", IDoc).
{json,{struct,[{<<"ok">>,true},
{<<"id">>,<<"c6d96b5f923f50bfb9263638d4167b1e">>},
{<<"rev">>,<<"1-0d17a3416d50129328f632fd5cfa1d90">>}]}}
15> erlang_couchdb:create_document({"127.0.0.1", 5984}, "proto_v1", ADoc).
** exception exit: {ucs,{bad_utf8_character_code}}
in function xmerl_ucs:from_utf8/1 (xmerl_ucs.erl, line 185)
in call from mochijson2:json_encode_string/2 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 200)
in call from mochijson2:'-json_encode_proplist/2-fun-0-'/3 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 181)
in call from lists:foldl/3 (lists.erl, line 1197)
in call from mochijson2:json_encode_proplist/2 (/Users/admin/AlphaGroup/src/mochijson2.erl, line 184)
in call from erlang_couchdb:create_document/3 (/Users/admin/AlphaGroup/src/erlang_couchdb.erl, line 256)
Of the two documents above, one (IDoc) can be created in couchdb with no problem.
Can anyone help me figure out why this error is caused?
I think the problem is in <<"Brämaregården">>. It is necessary to convert the Unicode to a binary first. Examples are in the following links:
unicode discussion. The core functions are in the unicode module.
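For example, converting the offending value to a UTF-8 binary before building the document (a sketch based on the ADoc shown above; how the shell renders the resulting bytes depends on your shell's encoding settings):
%% Assuming the shell reads the literal as a list of Unicode codepoints,
%% turn it into UTF-8 bytes before handing it to the JSON encoder:
District = unicode:characters_to_binary("Brämaregården"),
ADoc = [{<<"Adress">>,   <<"Hjalmar Brantingsgatan 7 C">>},
        {<<"District">>, District},
        {<<"Rent">>,     3964},
        {<<"Rooms">>,    2},
        {<<"Area">>,     0}].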
Entering non-ASCII characters in Erlang code is fiddly, not least because it works differently in the shell than in compiled Erlang code.
Try inputting the binary explicitly as UTF-8:
<<"Br", 16#c3, 16#a4, "mareg", 16#c3, 16#a5, "rden">>
That is, "ä" is represented by the bytes C3 A4 in UTF-8, and "å" by C3 A5. There are many ways to find those codes; a quick search turned up this table.
Normally you'd get the input from somewhere outside your code, e.g. reading from a file, typed into a web form etc, and then you wouldn't have this problem.
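For instance, pulling the names out of a UTF-8 text file sidesteps the shell-input issue entirely. A sketch (the path argument and the one-name-per-line format are made up for illustration):
%% Read raw UTF-8 bytes from disk and split them into one binary per
%% line; those binaries can go straight into the document proplist.
read_districts(Path) ->
    {ok, Bin} = file:read_file(Path),
    binary:split(Bin, <<"\n">>, [global, trim_all]).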

What is the difference between \" and "" in Erlang

In Erlang, \" is an escape character that means double quote.
My question is, what is the difference between "\"test\"" and ""test""? The reason I ask is that I'm trying to handle a list_to_atom error:
> list_to_atom("\"test\"").
'"test"'
> list_to_atom(""test"").
* 1: syntax error before: test
"" is a string/list of length 0
\" is just an escaped double-quote when used in the context of a string. If you wanted to have a string that consists of just a double-quote (ie \"), then you could do: "\"".
""test"" is a syntax error and is no difference than "" test "" which is syntactically <list><atom><list>. What are you trying to accomplish?
It's not advised to dynamically generate atoms as they're never garbage collected.
You'd better use list_to_existing_atom/1 when reading user input. Otherwise, you might end up running out of memory (in a system that runs long enough; but hey, long-running systems are what Erlang is for, aren't they?) and crashing the whole virtual machine.
list_to_existing_atom/1 will throw an error in case the atom doesn't exist and return the atom if it exists. A construct like catch list_to_existing_atom(some_atom) might prove useful coupled with a case .. of or a try ... catch block. Try it in the shell and see what you like the best.
If this answer seems irrelevant to the question, then please note I'm not allowed to post comments yet, so this answers the question in the comment to chops' answer, namely:
I have to write a function that reads from keyboard until an atom is typed.
I have to do this with get_line and list_to_atom. – otisonoza
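A sketch of that exercise using list_to_existing_atom/1 instead of list_to_atom/1 (the function name and prompt are mine, not from the book): keep reading lines until the input names an atom that already exists.
%% Reads lines from the keyboard until the trimmed input matches an
%% existing atom, then returns that atom.
read_atom() ->
    Line = string:trim(io:get_line("enter an atom> ")),
    try
        list_to_existing_atom(Line)
    catch
        error:badarg -> read_atom()
    end.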

Erlang - Eccentricity with accented characters and string literal

I am trying to implement a function to differentiate between French vowels and consonants. It should be trivial; let's see what I wrote down:
-define(vowels,"aeiouyàâéèêëôù").
is_vowel(Char) ->
    C = string:to_lower(Char),
    lists:member(C, ?vowels).
It's pretty simple, but it behaves incorrectly:
2> char:is_vowel($â).
false
While the interpreted version works well :
3> C = string:to_lower($â), lists:member(C,"aeiouyàâéèêëôù").
true
What's going on?
The most likely thing here is a conflict of encodings. Your vowels list in the compiled code is using different character values for the accented characters. You should be able to see this by defining acirc() -> $â. in your compiled code and looking at the number output by calling char:acirc(). versus $â. in the interpreter. I think that the compiler assumes that source files are in ISO-Latin-1 encoding, but the interpreter will consult your locale settings and use that encoding, probably UTF-8 if you're on a modern linux distro. See Using Unicode in Erlang for more information.
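One way to make the compiled module agree with the shell (a sketch, assuming an older toolchain where source files default to ISO-Latin-1) is to avoid raw accented literals in the source altogether and spell out the codepoints, or to put a %% coding: utf-8 comment on the first line of the file:
%% char.erl -- sketch with the codepoints spelled out so the source
%% encoding no longer matters.
%% 16#E0 = à, 16#E2 = â, 16#E9 = é, 16#E8 = è,
%% 16#EA = ê, 16#EB = ë, 16#F4 = ô, 16#F9 = ù
-module(char).
-export([is_vowel/1]).

-define(vowels, [$a,$e,$i,$o,$u,$y,16#E0,16#E2,16#E9,16#E8,16#EA,16#EB,16#F4,16#F9]).

is_vowel(Char) ->
    C = string:to_lower(Char),
    lists:member(C, ?vowels).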
