How to convert a string to bytes in Erlang

I am trying to implement the SMPP protocol in Erlang, and I have hit a dead end trying to convert strings, e.g. a username and password, to bytes in order to build a PDU to send to the SMSC. Searching and reading various materials has not helped.
Kindly advise on a way I can achieve this.

Probably the first thing to note is that there is no special type for strings in Erlang. Strings are actually represented as lists of integers:
1> [116, 101, 115, 116].
"test"
Keeping that in mind, your question becomes "how to convert a list of integers to bytes". For strings with 8-bit characters this is straightforward with the list_to_binary/1 function:
1> list_to_binary("test").
<<"test">>
2> list_to_binary([0, 255]).
<<0,255>>
However, if you have a Unicode string containing code points above 255, list_to_binary/1 will raise a badarg error (note also how the original string is represented in the error message):
1> list_to_binary("тест").
** exception error: bad argument
in function list_to_binary/1
called as list_to_binary([1090,1077,1089,1090])
In this case, functions from the unicode module can be used. For example, to convert a Unicode string to a UTF-8 binary, use unicode:characters_to_binary/3 (there are also unicode:characters_to_binary/1 and unicode:characters_to_binary/2):
1> unicode:characters_to_binary("тест", unicode, utf8).
<<209,130,208,181,209,129,209,130>>
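For a PDU field, the conversion above can be wrapped in a small helper. This is only a sketch: the module and function names are hypothetical, and it assumes the field is a NUL-terminated octet string (as SMPP's C-Octet String fields are) and that UTF-8 bytes are acceptable to the receiving SMSC.

```erlang
-module(cstring_sketch).
-export([to_c_octet_string/1]).

%% Hypothetical helper: encode a string (or any chardata) as UTF-8 bytes
%% followed by a single NUL terminator byte.
%% Assumption: the peer accepts UTF-8; for pure ASCII input the result is
%% identical to what list_to_binary/1 would produce, plus the terminator.
to_c_octet_string(Chardata) ->
    case unicode:characters_to_binary(Chardata) of
        Bin when is_binary(Bin) ->
            <<Bin/binary, 0>>;
        {error, _Encoded, _Rest} ->
            erlang:error(badarg, [Chardata]);
        {incomplete, _Encoded, _Rest} ->
            erlang:error(badarg, [Chardata])
    end.
```

For example, cstring_sketch:to_c_octet_string("user") yields the bytes of "user" followed by 0.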

Related

Erlang equivalent of javascript codePointAt?

Is there an erlang equivalent of codePointAt from js? One that gets the code point starting at a byte offset, without modifying the underlying string/binary?
You can use bit syntax pattern matching to skip the first N bytes and decode the first character from the remaining bytes as UTF-8:
1> CodePointAt = fun(Binary, Offset) ->
<<_:Offset/binary, Char/utf8, _/binary>> = Binary,
Char
end.
Test:
2> CodePointAt(<<"πr²"/utf8>>, 0).
960
3> CodePointAt(<<"πr²"/utf8>>, 1).
** exception error: no match of right hand side value <<207,128,114,194,178>>
4> CodePointAt(<<"πr²"/utf8>>, 2).
114
5> CodePointAt(<<"πr²"/utf8>>, 3).
178
6> CodePointAt(<<"πr²"/utf8>>, 4).
** exception error: no match of right hand side value <<207,128,114,194,178>>
7> CodePointAt(<<"πr²"/utf8>>, 5).
** exception error: no match of right hand side value <<207,128,114,194,178>>
As you can see, if the offset is not on a valid UTF-8 character boundary (or is at or past the end of the binary), the match fails and the function raises a badmatch error. You can handle that differently using a case expression if needed.
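A possible case-based variant is sketched below. The module name, function name and the {ok, Char} | error return shape are choices made for this illustration, not part of any standard API.

```erlang
-module(cp_sketch).
-export([code_point_at/2]).

%% Returns {ok, CodePoint} if a valid UTF-8 character starts at byte
%% Offset, and the atom error otherwise (bad boundary, or offset at or
%% past the end of the binary).
code_point_at(Binary, Offset)
  when is_binary(Binary), is_integer(Offset), Offset >= 0 ->
    case Binary of
        <<_:Offset/binary, Char/utf8, _/binary>> -> {ok, Char};
        _ -> error
    end.
```

With the binary from the test session, offset 0 gives {ok,960} and offset 1 gives error instead of an exception.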
First, remember that only binary strings are using UTF-8 in Erlang. Plain double-quote strings are already just lists of code points (much like UTF-32). The unicode:chardata() type represents both of these kinds of strings, including mixed lists like ["Hello", $\s, [<<"Filip"/utf8>>, $!]]. You can use unicode:characters_to_list(Chardata) or unicode:characters_to_binary(Chardata) to get a flattened version to work with if needed.
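As a small illustration of flattening the mixed chardata from the paragraph above (the module name is made up for the example):

```erlang
-module(chardata_sketch).
-export([demo/0]).

%% Flatten a mixed chardata term both ways.
demo() ->
    Chardata = ["Hello", $\s, [<<"Filip"/utf8>>, $!]],
    %% A flat list of code points: "Hello Filip!"
    List = unicode:characters_to_list(Chardata),
    %% A UTF-8 encoded binary: <<"Hello Filip!">>
    Bin = unicode:characters_to_binary(Chardata),
    {List, Bin}.
```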
Meanwhile, the JS codePointAt function works on UTF-16 encoded strings, which is what JavaScript uses. Note that the index in this case is not a byte position, but the index of the 16-bit units of the encoding. And UTF-16 is also a variable length encoding: code points that need more than 16 bits use a kind of escape sequence called "surrogate pairs" - for example emojis like 👍 - so if such characters can occur, the index is misleading: in "a👍z" (in JavaScript), the a is at 0, but the z is not at 2 but at 3.
What you want is probably what's called the "grapheme clusters" - those that look like a single thing when printed (see the docs for Erlang's string module: https://www.erlang.org/doc/man/string.html). And you can't really use numerical indexes to dig the grapheme clusters out from a string - you need to iterate over the string from the start, getting them out one at a time. This can be done with string:next_grapheme(Chardata) (see https://www.erlang.org/doc/man/string.html#next_grapheme-1) or if you for some reason really need to index them numerically, you could insert the individual cluster substrings in an array (see https://www.erlang.org/doc/man/array.html). For example: array:from_list(string:to_graphemes(Chardata)).
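The array approach could look roughly like this. It is a sketch that assumes OTP 20 or later (for string:to_graphemes/1); the module and function names are hypothetical.

```erlang
-module(grapheme_sketch).
-export([grapheme_at/2]).

%% Returns {ok, Grapheme} for the grapheme cluster at a zero-based
%% Index, or error if the index is out of range. A cluster is either a
%% single code point or a list of code points.
%% Note: this rebuilds the array on every call; keep the array around
%% if you need many lookups in the same string.
grapheme_at(Chardata, Index) when is_integer(Index) ->
    Graphemes = array:from_list(string:to_graphemes(Chardata)),
    case Index >= 0 andalso Index < array:size(Graphemes) of
        true -> {ok, array:get(Index, Graphemes)};
        false -> error
    end.
```

Unlike the JavaScript case, the z in [$a, 16#1F44D, $z] (that is, "a👍z") is found at index 2, because the string is a list of code points rather than UTF-16 units.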

Erlang regexp matching on Chinese characters

TL;DR:
25> re:run("йцу.asd", xmerl_regexp:sh_to_awk("*.*"), [{capture, none}]).
** exception error: bad argument
in function re:run/3
called as re:run([1081,1094,1091,46,97,115,100],
"^(.*\\..*)$",
[{capture,none}])
How to make this work? 'йцу' are characters that don't belong in a latin charset, obviously; is there a way to tell the re module or entire system to run with a different charset for "strings"?
ORIGINAL QUESTION (for the record):
Another "Programming Erlang" question )
In Chapter 16 there's an example about reading tags from MP3 files. It works, great. But there seems to be a bug in a provided module, lib_find, which has a function for searching a path for matching files. This is the call that works:
61> lib_find:files("../..", "*.mp3", true).
["../../early/files/Veronique.mp3"]
and this call fails:
62> lib_find:files("../../..", "*.mp3", true).
** exception error: bad argument
in function re:run/3
called as re:run([46,46,47,46,46,47,46,46,47,46,107,101,114,108,47,98,117,
105,108,100,115,47,50,48,46,49,47,111|...],
"^(.*\\.mp3)$",
[{capture,none}])
in call from lib_find:find_files/6 (lib_find.erl, line 29)
in call from lib_find:find_files/6 (lib_find.erl, line 39)
in call from lib_find:files/3 (lib_find.erl, line 17)
Ironically, the investigation led to finding the culprit in Erlang's own installation:
.kerl/builds/20.1/otp_src_20.1/lib/ssh/test/ssh_sftp_SUITE_data/sftp_tar_test_data_高兴
OK, this seems to mean Erlang is using a more restrictive default charset, which doesn't include hànzì. What are the options? Obviously, I can just ignore this and move on with my study, but I feel I can learn more from this one =) Such as - where/how can I fix the default charset? I'm a little surprised it's something other than UTF8 by default - so maybe I'm on a wrong track?
Thanks!
TL;DR:
Unicode-aware regexes are available by putting the regex pattern into Unicode mode with the unicode option. (Note below that the string "^(.*\\..*)$" is the result of your call to xmerl_regexp:sh_to_awk/1.)
1> re:run("なにこれ.txt", "^(.*\\..*)$").
** exception error: bad argument
in function re:run/2
called as re:run([12394,12395,12371,12428,46,116,120,116],"^(.*\\..*)$")
2> re:run("なにこれ.txt", "^(.*\\..*)$", [unicode]).
{match,[{0,16},{0,16}]}
And from your exact example:
11> re:run("йцу.asd", "^(.*\\..*)$", [unicode, {capture, none}]).
match
Or
12> {ok, Pattern} = re:compile("^(.*\\..*)$", [unicode]).
{ok,{re_pattern,1,1,0,
<<69,82,67,80,87,0,0,0,16,8,0,0,65,0,0,0,255,255,255,
255,255,255,...>>}}
13> re:run("йцу.asd", Pattern, [{capture, none}]).
match
The docs for re are pretty long and extensive, but that's because regexes are an inherently complex subject. You can find the options for compiled regexes in the docs for re:compile/2 and the options for running them in the docs for re:run/3.
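Put together, a file-name matcher in the spirit of the question might be sketched as follows. The module and function names are invented for the example; only xmerl_regexp:sh_to_awk/1 and re:run/3 come from the text above.

```erlang
-module(glob_sketch).
-export([matches/2]).

%% True if FileName matches the shell-style pattern (e.g. "*.mp3").
%% The unicode option lets re:run/3 accept code points above 255.
matches(FileName, ShellPattern) ->
    Regex = xmerl_regexp:sh_to_awk(ShellPattern),
    re:run(FileName, Regex, [unicode, {capture, none}]) =:= match.
```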
Discussion
Erlang has settled on the idea that strings, though still a list of codepoints, are all UTF-8 everywhere. As I work in Japan and deal with this all the time, this has come as a big relief to me because I can stop using about half of the conversion libraries I had needed in the past (yay!), but has complicated matters a bit for users of the string module because many operations there now perform under slightly different assumptions (a string is still considered "flat" even if it is a deep list of grapheme clusters, so long as those clusters exist on the first level of the list).
Unfortunately, encodings are just not very easy things to deal with and UTF-8 is anything but simple once you step out of the most common representations -- so much of this is a work in progress. I can tell you with confidence, though, that dealing with UTF-8 data in binary, string, deep list, and io_data() forms, whether file names, file data, network data, or user input from WX or web forms works as expected once you read the unicode, re and string docs.
But that is, of course, a lot of stuff to get familiar with. 99% of the time things will work as expected if you decode everything incoming from outside as UTF-8 using unicode:characters_to_list/1 and unicode:characters_to_binary/1, and specify binary strings as utf8 binary types everywhere:
3> UnicodeBin = <<"この文書はUTF-8です。"/utf8>>.
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
4> UnicodeString = unicode:characters_to_list(UnicodeBin).
[12371,12398,25991,26360,12399,85,84,70,45,56,12391,12377,
12290]
5> io:format("~ts~n", [UnicodeString]).
この文書はUTF-8です。
ok
6> re:run(UnicodeString, "UTF-8", [unicode]).
{match,[{15,5}]}
7> re:run(UnicodeBin, "UTF-8", [unicode]).
{match,[{15,5}]}
8> unicode:characters_to_binary(UnicodeString).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
9> unicode:characters_to_binary(UnicodeBin).
<<227,129,147,227,129,174,230,150,135,230,155,184,227,129,
175,85,84,70,45,56,227,129,167,227,129,153,227,128,130>>
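When decoding bytes from outside, it may also be worth handling the non-success results of unicode:characters_to_list/2, which returns an error or incomplete tuple rather than raising. A minimal sketch, with a hypothetical module name and made-up error atoms:

```erlang
-module(decode_sketch).
-export([decode/1]).

%% Decode an externally supplied binary as UTF-8.
%% The atoms invalid_utf8 and truncated_utf8 are illustrative only.
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        List when is_list(List) -> {ok, List};
        {error, _Decoded, _Rest} -> {error, invalid_utf8};
        {incomplete, _Decoded, _Rest} -> {error, truncated_utf8}
    end.
```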

How can I parse the standard input with the Erlang API?

I'm developing a game in Erlang, and now I need to read from standard input. I tried the following calls:
io:fread()
io:read()
The problem is that I can't read a whole string when it contains white space. So I have the following questions:
How can I read the string typed by the user when they press the Enter key? (Remember that the string contains white space.)
How can I convert a string like "56" into the number 56?
Read line
You can use io:get_line/1 to read a line, terminated by a line feed, from the console.
3> io:get_line("Prompt> ").
Prompt> hello world how are you?
"hello world how are you?\n"
io:read reads an Erlang term, so you can't use it to read a string unless you want to make your users wrap the string in quotes.
The patterns in io:fread do not seem to let you read an arbitrary-length string containing spaces.
Parse integer
You can convert "56" to 56 using erlang:list_to_integer/1.
5> erlang:list_to_integer("56").
56
or using string:to_integer/1, which also returns the rest of the string:
10> string:to_integer("56hello").
{56,"hello"}
11> string:to_integer("56").
{56,[]}
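The two steps can be combined into a helper that reads a line and parses it as an integer. This is a sketch assuming OTP 20 or later (for string:trim/1) and the default list mode for standard input; the module and function names are hypothetical.

```erlang
-module(input_sketch).
-export([read_int/1, parse_int/1]).

%% Prompt the user, read one line, and try to parse it as an integer.
read_int(Prompt) ->
    case io:get_line(Prompt) of
        Line when is_list(Line) -> parse_int(Line);
        _EofOrError -> error
    end.

%% Accept only lines that consist of an integer and surrounding
%% white space (including the trailing "\n" from io:get_line/1).
parse_int(Line) ->
    case string:to_integer(string:trim(Line)) of
        {Int, []} when is_integer(Int) -> {ok, Int};
        _ -> error
    end.
```

Here parse_int("56\n") gives {ok,56}, while "56hello\n" is rejected because of the trailing characters.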
The Erlang documentation for io:fread/2 should help you out.
You can use a field width to read a fixed number of characters (including white space):
io:fread("Prompt> ","~20c").
Prompt> This is a sentence!!
{ok,["This is a sentence!!"]}
As for converting a string (a list of characters) to an integer, erlang:list_to_integer/1 does the job:
7> erlang:list_to_integer("645").
645
Edit: try experimenting with io:fread/2, the format sequence can ease the parsing of data by applying some form of pattern matching:
9> io:fread("Prompt> ","~s ~s").
Prompt> John Doe
{ok,["John","Doe"]}
The console is not really a good place for this, because you need to know the format of the answer in advance. Since you allow spaces, you would need to know how many words will be entered before reading the answer. Alternatively, you can have the user enter a quoted string, read it as a term, and parse it later:
1> io:read("Enter a text > ").
Enter a text > "hello guy, this is my answer :o)".
{ok,"hello guy, this is my answer :o)"}
2>
The bad news is that the user must enter the quotes and a final dot, which is not user friendly.

In Erlang how do I convert a String to a binary value?

In Erlang how do I convert a string to a binary value?
String = "Hello"
%% should be
Binary = <<"Hello">>
In Erlang, strings are represented as lists of integers. You can therefore use list_to_binary, a built-in function (BIF). Here is an example I ran in the Erlang console (started with erl):
1> list_to_binary("hello world").
<<"hello world">>
Unicode encodings (UTF-8/16/32) need more than one byte to express characters whose code points do not fit in a single byte.
This is why list_to_binary fails for any character value > 255 (the largest value a byte can hold, which is sufficient for ISO-8859/ASCII/Latin1).
To correctly handle Unicode characters you'd need to use
unicode:characters_to_binary() R1[(N>3)]
instead, which can handle both Latin1 and Unicode encodings.
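The difference already shows up for characters in the 128 to 255 range, where both functions succeed but produce different bytes. A small sketch (the module name is made up):

```erlang
-module(latin1_sketch).
-export([demo/0]).

demo() ->
    Str = [228],                                   % "ä", code point U+00E4
    %% One byte per character, i.e. Latin1: <<228>>
    Latin1 = list_to_binary(Str),
    %% UTF-8 encoding of the same code point: <<195,164>>
    Utf8 = unicode:characters_to_binary(Str),
    %% Re-encode an existing Latin1 binary as UTF-8: <<195,164>>
    Utf8FromLatin1 = unicode:characters_to_binary(Latin1, latin1, utf8),
    {Latin1, Utf8, Utf8FromLatin1}.
```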
HTH ...

ETS matching issue

I am learning ETS. I did:
Sometab = ets:new(sometable, [bag]).
ets:insert(Sometab, {109, ash, 8}).
Then I typed:
ets:match(Sometab, {109, ash, '$1'}).
However, instead of getting 8, I am getting ["\b"] as output!
You are getting the correct answer. However, the Erlang shell prints [8] as "\b" since 8 is the ASCII code for backspace.
Erlang has no string type. Strings in Erlang are represented simply as lists of integers, and the Erlang shell prints such a list as a string if it contains only integers in the printable character range (which includes a few control characters such as backspace).
This can indeed be confusing at times.
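To see the raw value, you can pattern match on the result or print it with the ~w format directive, which never renders lists as strings. A minimal sketch with a hypothetical module name:

```erlang
-module(ets_sketch).
-export([demo/0]).

demo() ->
    Tab = ets:new(sometable, [bag]),
    true = ets:insert(Tab, {109, ash, 8}),
    %% ets:match/2 returns a list of bindings, here [[8]].
    [[Value]] = ets:match(Tab, {109, ash, '$1'}),
    io:format("~p~n", [[Value]]),   % pretty-printed as a string: "\b"
    io:format("~w~n", [[Value]]),   % raw term: [8]
    true = ets:delete(Tab),
    Value.
```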
