Erlang - Eccentricity with accented characters and string literals

I am trying to implement a function to differentiate between French vowels and consonants. It should be trivial; here is what I wrote down:
-define(vowels, "aeiouyàâéèêëôù").

is_vowel(Char) ->
    C = string:to_lower(Char),
    lists:member(C, ?vowels).
It's pretty simple, but it behaves incorrectly:
2> char:is_vowel($â).
false
While the interpreted version works well:
3> C = string:to_lower($â), lists:member(C,"aeiouyàâéèêëôù").
true
What's going on?

The most likely thing here is a conflict of encodings: the vowels list in your compiled code is using different character values for the accented characters. You should be able to see this by defining acirc() -> $â. in your compiled code and comparing the number printed by char:acirc(). with the one printed by $â. in the interpreter. I think the compiler assumes that source files are in ISO Latin-1 encoding, but the interpreter consults your locale settings and uses that encoding, probably UTF-8 if you're on a modern Linux distro. See Using Unicode in Erlang for more information.
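A minimal sketch of that check, assuming the module from the question is named char (as the calls above suggest):
%% In char.erl, alongside is_vowel/1:
acirc() -> $â.
%% Then compare in the shell; if the two integers differ, the compiled file
%% and the shell are using different encodings:
1> char:acirc().
2> $â.
%% Re-saving char.erl in the encoding the compiler expects (or, on R16B and
%% later, declaring one with a %% coding: comment at the top of the file)
%% makes the two agree.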

Related

Character Encoding not resolved

I have a text file with unknown character formatting; below is a snapshot:
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Does anyone have an idea how I can convert it to normal text?
This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
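For readers following along in Erlang (the theme of this digest), the same decoding can be sketched there too; here the \nnn escapes of the first word have been rewritten by hand as a UTF-8 binary:
1> io:format("~ts~n", [unicode:characters_to_list(<<216,175,217,133,217,136,216,185>>)]).
دموع
ok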
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'

Erlang regexp matching on Chinese characters

TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
     in function  re:run/3
        called as re:run([1081,1094,1091,46,97,115,100],
                         "^(.*\\..*)$",
                         [{capture,none}])
How do I make this work? 'йцу' are characters that don't belong to a Latin charset, obviously; is there a way to tell the re module, or the entire system, to run with a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
in Chapter 16 there's an example about reading tags from the mp3 files. It works, great. But, there seems to be some bug in a provided module, lib_find, which has a function for searching a path for matching files. This is the call that works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
     in function  re:run/3
        called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
                          105,108,100,115,47,50,48,46,49,47,111|...],
                         "^(.*\\.mp3)$",
                         [{capture,none}])
     in call from lib_find:find_files/6 (lib_find.erl, line 29)
     in call from lib_find:find_files/6 (lib_find.erl, line 39)
     in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang is using a more restrictive default charset, which doesn't include hànzì. What are the options? Obviously, I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as: where/how can I fix the default charset? I'm a little surprised it's something other than UTF-8 by default, so maybe I'm on the wrong track?
Thanks!
TL;DR:
UTF-8 regexes are available by putting the regex pattern into Unicode mode with the unicode option. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
     in function  re:run/2
        called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
                <<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
                  255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexes are an inherently complex subject. You can find the options for compiled regexes in the docs for re:compile/2 and the options for run in the docs for re:run/3.
Discussion
Erlang has settled on the idea that strings, though still lists of codepoints, are UTF-8 everywhere. As I work in Japan and deal with this all the time, this has come as a big relief: I can stop using about half of the conversion libraries I needed in the past (yay!). But it has complicated matters a bit for users of the string module, because many operations there now work under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, as long as those clusters sit at the first level of the list).
Unfortunately, encodings are just not very easy things to deal with, and UTF-8 is anything but simple once you step outside the most common representations, so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep-list, and io_data() forms, whether file names, file data, network data, or user input from WX or web forms, works as expected once you read the unicode, re, and string docs.
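As a quick illustration of those slightly different assumptions (a small sketch, assuming OTP 20 or later): length/1 counts list elements, while string:length/1 counts grapheme clusters:
1> length([101,769]).          %% "é" as 'e' plus U+0301 (combining acute): two codepoints
2
2> string:length([101,769]).   %% but a single grapheme cluster
1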
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything coming in from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binaries everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
  175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
 12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
  175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
  175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>

Compile error in Erlang: "cannot parse file, giving up"

I'm just starting to look at Erlang and trying to compile my first program. Everything looks right compared to the tutorial but I can't seem to get away from this error.
Here is my code, saved under 'useless.erl':
-module(useless).
-export([add/2, hello/0, greet_and_add_two/1]).

add(A, B) ->
    A + B.

%% Shows greetings.
%% io:format/1 is the standard function used to output text.
hello() ->
    io:format("Hello, world!~n").

greet_and_add_two(X) ->
    hello(),
    add(X, 2).
I changed the directory to that of useless.erl:
cd("C:/Users/CP/Documents/Erlang").
However, when I run
c(useless).
I get
useless.erl:1: cannot parse file, giving up
useless.erl:1: no module definition
error
If your file looks OK but the Erlang compiler complains that it cannot parse it, your source file probably has an invalid encoding.
According to the documentation, Erlang only accepts source code in UTF-8 or Latin-1 encoding.
The valid encodings are Latin-1 and UTF-8, where the case of the characters can be chosen freely
Note that you are still only able to use the extended character set inside strings and comments:
In Erlang/OTP R16B the syntax of Erlang tokens was extended to handle Unicode. The support is limited to string literals and comments.
In case of parse errors or illegal-character errors, you should:
Make sure your file is saved in UTF-8 or Latin-1 encoding
Make sure you do not use Unicode literals outside of comments and strings
With older Erlang versions (before Erlang 17) you can try specifying the encoding explicitly in one of the first two lines of the file:
%% coding: utf-8
%%% or
%% For this file we have chosen encoding = Latin-1
%%% or
%% -*- coding: latin-1 -*-
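To check what the compiler will actually assume for a given file, epp:read_encoding/1 reports the encoding declared by such a comment, returning none when no coding comment is present (in which case the version-dependent default applies: Latin-1 before OTP 17, UTF-8 from OTP 17 on). A small sketch, with hypothetical file names:
1> epp:read_encoding("useless.erl").        %% no coding comment in the file
none
2> epp:read_encoding("useless_latin1.erl"). %% first line: %% -*- coding: latin-1 -*-
latin1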

Best Ansi Escape beginning

Which ANSI escape sequence is the most portable and/or simply the best, and why?
1. "\u001B[32;1mThis is bright green\u001B[0m"
2. "\x1B[33;1mThis is bright yellow\x1B[0m"
3. "\e[35;4;1mThis is bright purple underlined\e[0m"
I have been using printf "\x1B[32;1mgreen\x1B[0m" (an example from a Unix bash script) out of habit, but I was wondering if there are any reasons to use one over the others. Is one more portable than the others? That would be my assumption.
Also, if you know of any other ANSI escape sequences, feel free to share them in the comments or at the end of your answer.
If you don't know what an ANSI escape sequence is or want to become more familiar with them, then here you go: http://en.wikipedia.org/wiki/ANSI_escape_code
NOTE:
All of the escape sequences above have worked on all of the Unix systems I have been on; however, one must still rely on the system itself to interpret the escape codes. Windows, for example, does not permit any sort of escape codes except four (BEL, LF or line feed, CR or carriage return and, of course, BS or backspace), so ANSI escape sequences will not work.
Short answer: It depends on the host string parser.
Long answer:
It depends on the string parser; that is, the piece of code that actually takes in your string ("\x1b[1mSome string\x1b[0m") as a literal and parses its backslash escape sequences.
For parsers that support hexadecimal escapes (\x), then \x1b (character 0x1B) should work.
For parsers that support octal escapes (\ddd), then \033 (octal 33) should work.
For parsers that support unicode escapes (\u), then \u001B should work.
Quick elaboration: \x and \u are similar; \x usually refers to a single character, 0-255, in hexadecimal radix. \u means the same (as it is also represented in hexadecimal), but supports two bytes (in most parsers) and generally refers to 16-bit Unicode characters.
A lesser-used/supported escape character, as you mentioned, is \e. This escape is most commonly used with parsers/languages that expect a lot of ANSI escaping to happen, such as bash (and most other shells).
For instance, Node.js does not support \e:
> console.log("\x1b[31mhello\x1b[0m")
hello
undefined
> console.log("\e[31mhello\e[0m")
e[31mhelloe[0m
undefined
Neither does Lua:
> print('\x1b[31mhello\x1b[0m')
hello
> print('\e[31mhello\e[0m')
stdin:1: invalid escape sequence near '\e'
Or even Python:
>>> print("\x1b[31mhello\x1b[0m")
hello
>>> print("\e[31mhello\e[0m")
\e[31mhello\e[0m
>>>
Though PHP does:
<?php
echo "\x1b[31mhello\x1b[0m\n"; // hello
echo "\e[31mhello\e[0m\n"; // hello
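For completeness, back in this digest's home language: Erlang's tokenizer supports \e as well as two-digit \xHH hex escapes in string literals, so (assuming an ANSI-capable terminal) both of these print a red "hello":
1> io:format("\e[31mhello\e[0m~n").
hello
ok
2> io:format("\x1b[31mhello\x1b[0m~n").
hello
ok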

URL Escape in Uppercase

I have a requirement to escape a string containing URL information as well as some special characters, such as '<'.
Using cl_http_utility=>escape_url, this translates to '%3c'. However, our backend web server is unable to recognize this as a special character and takes the value literally. What it does recognize as a special character is '%3C' (with an uppercase C). Also, if one checks http://www.w3schools.com/tags/ref_urlencode.asp, it shows the all-caps value as the proper encoding.
I guess my question is: is there an alternative to cl_http_utility=>escape_url that does essentially the same thing but outputs the value in uppercase?
Thanks.
Use the string function escape:
l_escaped = escape( val    = l_unescaped
                    format = cl_abap_format=>e_url ).
Other possible formats are e_url_full, e_uri, e_uri_full, and a bunch of XML/JSON options too. The string function escape is documented pretty well, demo programs and all.
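If you ever need the same normalization outside ABAP, it is easy to do by hand; a minimal Erlang sketch (in keeping with the rest of this digest; upper_escapes/1 is a hypothetical helper, not a library function) that uppercases the hex digits of existing %xx escapes:
%% Walk the string; whenever a '%' introduces an escape, uppercase
%% the two hex digits that follow it.
upper_escapes([$%, A, B | T]) ->
    [$%, string:to_upper(A), string:to_upper(B) | upper_escapes(T)];
upper_escapes([H | T]) ->
    [H | upper_escapes(T)];
upper_escapes([]) ->
    [].
%% upper_escapes("foo%3cbar") returns "foo%3Cbar".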
