The following output is as I expected:
125> [97, 98, 99].
"abc"
126> [97, 98, 0].
[97,98,0]
But the output using ~s is not what I expected:
127> io:format("~s~n", [[97, 98, 0]]).
ab^@
ok
How do I interpret that output?
The ~s control sequence expects a string, a binary, or an atom, and prints it "with string syntax". Because Erlang strings are just lists of integers, it prints [97, 98, 0] as a string too, NUL character included. The shell, on the other hand, tries to guess whether a list of integers is meant to be a string, and the 0 in this list makes it decide that it is not.
^@ represents the NUL character. You might be familiar with caret notation, where ^A represents byte 1, since A is the first letter of the alphabet. In other words, ^X represents the byte whose value is 64 less than the ASCII value of X, since A is 65 in ASCII. Extrapolate that to byte 0 and you get @, which is 64 in ASCII.
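The caret-notation rule can be sketched in a few lines. This is only an illustration: the module name caret and the function notation/1 are made up for this example, and it assumes the usual convention that a control byte is shown as ^ followed by the character 64 positions higher (equivalently, the byte with bit 6 flipped).

```erlang
-module(caret).
-export([notation/1]).

%% Caret notation for an ASCII control byte (0..31) or DEL (127):
%% flip bit 6 of the byte (XOR with 64) and prefix the result with ^.
%% Hypothetical helper, not part of OTP.
notation(Byte) when Byte >= 0, Byte < 32; Byte =:= 127 ->
    [$^, Byte bxor 64].
```

With this sketch, caret:notation(0) gives "^@" and caret:notation(1) gives "^A".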
Related
I was trying to write a pattern matched function in erlang like:
to_end("A") -> "Z".
The whole idea is to transform a string such as "ABC" into something different such as "ZYX" using pattern matched functions. It looks like a string is represented as a list under the hood...
I was depending on the fact that pattern matching on a "string" in erlang would result in individual string characters. But I find this:
21> F="ABC".
22> F.
"ABC"
23> [H | T]=F.
"ABC"
24> H.
65
25> T.
"BC"
Why does the head of this type of pattern matching on list always result in an ASCII value and the tail result in letters? Is there a better way to pattern match against a "list of string"?
In Erlang, strings are just lists of character codes (ASCII values, for the characters used here). The shell also displays any list of integers in which every integer is a printable character code as a string. So [48, 49] prints as "01", since 48 is the code for the character 0 and 49 the code for 1. The string "ABC" is the same as [65 | [66 | [67]]], so its head is 65 and its tail is [66, 67], which displays as "BC".
If you want to write a function to pattern match on characters, you should use the character literal syntax, which is $ followed by the character. So you would write
to_end($A) -> $Z;
to_end($B) -> $Y;
to_end($C) -> $X;
...
to_end($Z) -> $A.
instead of to_end("A") -> "Z" which is the same as to_end([65]) -> [90].
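If writing out 26 clauses feels tedious, the same mapping can be computed from the character codes. The following is a minimal sketch, not the approach from the answer above: the module name mirror and the function to_end_all/1 are hypothetical, and it assumes only the uppercase letters A to Z should be changed.

```erlang
-module(mirror).
-export([to_end/1, to_end_all/1]).

%% Map an uppercase letter to its mirror image in the alphabet
%% ($A <-> $Z, $B <-> $Y, ...). Any other value is returned unchanged.
to_end(C) when C >= $A, C =< $Z -> $Z - (C - $A);
to_end(C) -> C.

%% Apply to_end/1 to every character code of a string.
to_end_all(String) -> [to_end(C) || C <- String].
```

Here mirror:to_end_all("ABC") returns "ZYX", because the string is just the list [65,66,67] and each element is mapped separately.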
Why does the head of this type of pattern matching on list always
result in an ASCII value and the tail result in letters?
In erlang, the string "ABC" is a shorthand notation for the list [65,66,67]. The head of that list is 65, and the tail of that list is the list [66,67], which the shell happens to display as "BC". Whaa??!
The shell pretty much sucks when displaying strings/lists: sometimes the shell displays a list and sometimes the shell displays a double quoted string:
2> [0, 65, 66, 67].
[0,65,66,67]
3> [65, 66, 67].
"ABC"
4>
...which is just plain dumb. Every beginning and intermediate erlang programmer gets confused by that at some point.
Just remember: when the shell displays a double quoted string, it should really be displaying a list whose elements are the character codes of each character in the double quoted string. The fact that the shell displays a double quoted string is a TERRIBLE ??feature?? of erlang, and it makes it hard to decipher what is going on in a lot of situations. You have to mentally say to yourself, "That string I'm seeing in the shell is really the list ..."
The fact that the shell displays double quoted strings for some lists really sucks when you want to display, say, a list of a person's test scores: [88, 97, 92, 70] and the shell outputs: "Xa\\F". You can use the io:format() function to get around that:
6> io:format("~w~n", [[88,97,92,70]]).
[88,97,92,70]
ok
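To compare the control sequences side by side without printing anything, the output can be captured with io_lib:format/2. A small sketch, assuming a made-up module fmt_demo with a helper render/2:

```erlang
-module(fmt_demo).
-export([render/2]).

%% Render Term with a single control sequence ("~s", "~w" or "~p")
%% and return the result as a flat string instead of printing it.
%% Hypothetical helper for illustration only.
render(Directive, Term) ->
    lists:flatten(io_lib:format(Directive, [Term])).
```

fmt_demo:render("~w", [88,97,92,70]) returns "[88,97,92,70]": ~w always writes the raw integers. ~s treats the same list as characters, and ~p applies the same kind of "does this look like a string?" guess that the shell does.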
But, if you just want to momentarily see the actual list of integers that the shell is displaying as a string, a quick and dirty method is to add the integer 0 to the head of the list:
7> Scores = [88,97,92,70].
"Xa\\F"
Huh?!!
8> [0|Scores].
[0,88,97,92,70]
Oh, okay.
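You can also ask whether a list would pass the "printable" test. The sketch below wraps io_lib:printable_list/1; the module name show and the function looks_like_string/1 are assumptions made for this example, and the exact set of printable characters can depend on how the runtime was started, so treat it as an approximation of what the shell decides.

```erlang
-module(show).
-export([looks_like_string/1]).

%% true if every element of List is a printable character code,
%% which is roughly the condition under which a list is shown
%% as a double quoted string. Hypothetical helper.
looks_like_string(List) ->
    io_lib:printable_list(List).
```

For the test scores, show:looks_like_string([88,97,92,70]) is true, while show:looks_like_string([0,88,97,92,70]) is false, which is why prepending 0 forces the list form.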
The whole idea is to transform a string such as "ABC" into something
different such as "ZYX" using pattern matched functions.
Because a string is shorthand for a list of integers, you can change those integers by using addition:
-module(my).
-compile(export_all).

cipher([]) -> [];
cipher([H|T]) ->
    [H+10|cipher(T)].  %% Add 10 to each character code.
In the shell:
~/erlang_programs$ erl
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Eshell V9.3 (abort with ^G)
1> c(my).
my.erl:2: Warning: export_all flag enabled - all functions will be exported
{ok,my}
2> my:cipher("ABC").
"KLM"
3>
By the way, all functions are "pattern matched", so saying "a pattern matched function" is redundant. You can just say "a function".
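The same shift can also be written with a list comprehension, which makes the inverse easy to add. This is just an alternative sketch: the module name shift and the function decipher/1 are not part of the answer above.

```erlang
-module(shift).
-export([cipher/1, decipher/1]).

%% Add 10 to every character code in the string.
cipher(String) -> [C + 10 || C <- String].

%% Undo cipher/1 by subtracting 10 from every character code.
decipher(String) -> [C - 10 || C <- String].
```

shift:cipher("ABC") returns "KLM", and shift:decipher/1 turns it back into "ABC".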
I am trying to write a program that reads in a number and outputs its digits as a list. Everything looks fine until I try numbers made up of 8s and 9s; then the program outputs \b and \t instead.
If the input contains an 8 or 9 along with other digits, for example 283, it prints normally. But if it contains only 8s or 9s, such as 8 or 99, I get what looks like the binary representation of 8 and 9 (if I remember correctly).
My program is as below:
digitize(0)-> 0;
digitize(N) when N < 10 -> [N];
digitize(N) when N >= 10 -> digitize(N div 10)++[N rem 10].
The function returns the expected list, but the shell displays lists of numbers that are all character codes of printable characters as strings (because that's just what strings are in Erlang; there's no special string type). You can see this by entering, for example, [8, 8] at the prompt. You can disable the behavior by calling shell:strings(false) (and restore it with shell:strings(true) when you need the normal behavior again).
Strings in Erlang are not a separate type but lists of numbers. List printing uses a heuristic to detect when a list might be a string; if it thinks so, the list is printed as a string. \b is the backspace character and \t is the tab character, which have ASCII codes 8 and 9.
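To convince yourself that the result really is a list of digits, you can compare it against a list literal or format it with ~w. The sketch below is a variation on the question's code rather than a copy: it drops the digitize(0) -> 0 clause (so that 0 also yields a list), and the module name digits and the function show/1 are made up for this example.

```erlang
-module(digits).
-export([digitize/1, show/1]).

%% Split a non-negative integer into a list of its decimal digits.
digitize(N) when is_integer(N), N >= 0, N < 10 -> [N];
digitize(N) when is_integer(N), N >= 10 -> digitize(N div 10) ++ [N rem 10].

%% Format the digit list with ~w so it is always rendered as integers,
%% even when the shell would display it as "\t\t" or similar.
show(N) -> lists:flatten(io_lib:format("~w", [digitize(N)])).
```

digits:digitize(99) returns [9,9], which the shell displays as "\t\t", while digits:show(99) returns the text "[9,9]".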
See also:
Description what a string means
Erlang escape sequences
Explanation of this in LYSE
I am working through some Erlang tutorials and noticed when I enter
[8].
the VM returns "\b"
or if I enter
[9].
the VM returns "\t"
I am confused on why this is happening. Other numbers are returned as a list of that number:
[3].
is returned as [3]
[4].
is returned as [4], etc.
I guess the question is: why does the Erlang VM return it this way? Perhaps an explanation of the difference between the list [65] and the list "A" would help.
Another related item is confusing as well:
Type conversion, converting a list to an integer is done as:
list_to_integer("3").
Not
list_to_integer([3]).
Which returns an error
In Erlang there are no real strings. Strings are lists of integers. So if you give a list of integers that all represent characters, it will be displayed as a string.
1> [72, 101, 108, 108, 111].
"Hello"
If the list contains at least one element that does not have a character counterpart, it will be displayed as a list.
2> [72, 101, 108, 108, 111, 1].
[72,101,108,108,111,1]
In Erlang strings are lists and the notation is exactly the same.
[97,98,99].
returns "abc"
The following excerpt is taken directly from "Learn You Some Erlang for Great Good!", Fred Hébert, (C)2013, No Starch Press. p. 18.
This is one of the most disliked things in Erlang: strings. Strings are lists, and the notation is exactly the same. Why do people dislike it?
Because of this:
3> [97,98,99,4,5,6].
[97,98,99,4,5,6]
4> [233].
"é"
Erlang will print lists of numbers as numbers only when at least one of them could not also represent a letter. There is no such thing as a real string in Erlang!
"Learn You Some Erlang for Great Good!" is also available online at: http://learnyousomeerlang.com/
kadaj answered your first question. Regarding the second one, about list_to_integer: if you look at the documentation, most list_to_XXX functions (all except list_to_binary, list_to_bitstring, and list_to_tuple) treat their argument as a string. Calling them string_to_XXX might be clearer, but changing the names would break a lot of code.
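A short check makes the point concrete. In this sketch the module name conv and the function demo/0 are hypothetical; every line is a match that would crash if the claim in the comment were wrong.

```erlang
-module(conv).
-export([demo/0]).

demo() ->
    %% "3" is the list [51], so both calls parse the same string.
    3 = list_to_integer("3"),
    3 = list_to_integer([51]),
    %% [3] is the control character with code 3, not the digit 3,
    %% so it cannot be parsed as a number.
    {'EXIT', {badarg, _}} = (catch list_to_integer([3])),
    %% The reverse conversion produces a string, i.e. a list of codes.
    "42" = integer_to_list(42),
    [52, 50] = integer_to_list(42),
    ok.
```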
In Erlang how do I convert a string to a binary value?
String = "Hello"
%% should be
Binary = <<"Hello">>
In Erlang, strings are represented as lists of integers. You can therefore use list_to_binary, a built-in function (BIF). Here is an example I ran in the Erlang shell (started with erl):
1> list_to_binary("hello world").
<<"hello world">>
list_to_binary only accepts integers in the range 0 to 255, one byte per element, which is enough for ASCII and ISO-8859-1 (Latin-1). Unicode characters above 255 need more than one byte in UTF-8/16/32, so list_to_binary fails with badarg for any list element greater than 255.
To handle Unicode characters correctly, use
unicode:characters_to_binary() (R14 and later)
instead, which can handle both Latin-1 and Unicode input.
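The difference shows up as soon as a code point above 255 is involved. A minimal sketch, assuming a made-up module str2bin and using the euro sign (code point 8364) as the example character:

```erlang
-module(str2bin).
-export([demo/0]).

demo() ->
    %% Plain ASCII: both functions give the same bytes.
    <<"Hello">> = list_to_binary("Hello"),
    <<"Hello">> = unicode:characters_to_binary("Hello"),
    %% 8364 does not fit in one byte, so list_to_binary/1 raises badarg.
    {'EXIT', {badarg, _}} = (catch list_to_binary([8364])),
    %% unicode:characters_to_binary/1 encodes it as three UTF-8 bytes.
    <<226,130,172>> = unicode:characters_to_binary([8364]),
    ok.
```

Note that for values between 128 and 255 the two functions both succeed but disagree: list_to_binary/1 emits the single byte, while unicode:characters_to_binary/1 emits the two-byte UTF-8 encoding.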
HTH ...
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en-dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out as hex E28093 in UTF-8 encoding.
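You can verify the byte sequence from Erlang itself by converting between UTF-8 bytes and code points with the unicode module. This is only a sketch; the module name dash and the function demo/0 are invented for the example.

```erlang
-module(dash).
-export([demo/0]).

demo() ->
    %% The byte list from the question: "0", space, three bytes, space, "1".
    Bytes = [48,32,226,128,147,32,49],
    %% Decoding the bytes as UTF-8 collapses 226,128,147 into the
    %% single code point 16#2013 (the en-dash).
    [48,32,16#2013,32,49] =
        unicode:characters_to_list(list_to_binary(Bytes), utf8),
    %% Encoding the code point again gives back E2 80 93.
    <<16#E2,16#80,16#93>> = unicode:characters_to_binary([16#2013]),
    ok.
```

Working with the decoded list means the dash is one element instead of three, which can make later splitting or matching simpler.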
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.