When converting from infix to postfix, how do you distinguish between a unary and a binary +/-?

Under this grammar:
^ + - * / < > = <= >= and or not
I'm using a function (the shunting-yard algorithm) to convert from infix to postfix, and it works, except that it doesn't handle the unary - (negation) or the unary +, which doesn't really do anything.
Once converted to postfix, a unary + will be a p and a unary - will be an m. For example:
3 + 3 -> 3 3 +
+3 + 3 -> 3 p 3 +
-(3-3) -> 3 3 - m
So, when reading an infix expression, how do I distinguish between a unary and a binary plus or minus?

It seems to me that the following rule would apply.
The first + or - following a non-operator is a binary operator. Subsequent occurrences (or an occurrence at the start of an expression) are unary.
So, in your (and a couple of extra) examples:
3 + 3 --> 3 binary+ 3
+ 3 + 3 --> unary+ 3 binary+ 3
- ( 3 - 3) --> unary- ( 3 binary- 3)
-9--4 --> unary- 9 binary- unary- 4
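The rule can be sketched as a small pre-pass over the tokens before they reach the shunting-yard loop. This is a minimal Python sketch, not the asker's code: the name mark_unary, the digit-only operands and the tokenizing regex are assumptions. It also treats a + or - directly after an opening parenthesis as unary and one directly after a closing parenthesis as binary, which the rule above does not spell out.

```python
import re

def mark_unary(expr):
    """Rewrite unary + and - as 'p' and 'm'; leave binary ones untouched."""
    # Assumption: operands are unsigned integers, operators are single characters.
    tokens = re.findall(r"\d+|[-+*/^()]", expr)
    out = []
    prev = None  # previous token; None at the start of the expression
    for tok in tokens:
        if tok in ("+", "-"):
            # Binary only when the previous token ends an operand.
            ends_operand = prev is not None and (prev.isdigit() or prev == ")")
            if ends_operand:
                out.append(tok)
            else:
                out.append("p" if tok == "+" else "m")
        else:
            out.append(tok)
        prev = tok
    return out

print(mark_unary("-9--4"))
```

The marked token list can then be fed to the existing conversion, with p and m given unary precedence.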

Debugging APL code: how to use `@` (index) and `⊢` (right tack) together?

I am attempting to read Aaron Hsu's thesis, A data parallel compiler hosted on the GPU, where I have landed on some APL code that I am unable to fix. I've attached a screenshot of the offending page (page 74, per the thesis numbering at the bottom):
The transcribed code is as follows:
d ← 0 1 2 3 1 2 3 3 4 1 2 3 4 5 6 5 5 6 3 4 5 6 5 5 6 3 4
This makes sense: create an array named d.
⍳≢d
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
This too makes sense. Count the number of elements in d and create a sequence of
that length.
⍉↑d,¨⍳≢d
0 1 2 3 1 2 3 3 4 1 2 3 4 5 6 5 5 6 3 4 5 6 5 5 6 3 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
This is slightly challenging, but let me break it down:
zip the sequence ⍳≢d = 1..27 with the d array using the ,¨ idiom, which zips the two arrays using a catenation.
Then, split into two rows using ↑ and transpose to get columns using ⍉
Now the biggie:
(⍳≢d)@(d,¨⍳≢d)⊢7 27⍴' '
INDEX ERROR
(⍳≢d)@(d,¨⍳≢d)⊢7 27⍴' '
Attempting to break it down:
⍳≢d counts number of elements in d
(d,¨⍳≢d) creates an array of pairs (d, index of d)
7 27⍴' ' creates a 7 x 27 grid: presumably 7 because that's the max value of d + 1, for indexing reasons.
Now I'm flummoxed about how the use of ⊢ works: as far as I know, it just ignores everything to the left! So I'm missing something about the parsing of this expression.
I presume it is parsed as:
(⍳≢d)@((d,¨⍳≢d)⊢(7 27⍴' '))
which according to me should be evaluated as:
(⍳≢d)@((d,¨⍳≢d)⊢(7 27⍴' '))
= (⍳≢d)@((7 27⍴' ')) [using a⊢b = b]
= not the right thing
As I was writing this down, I managed to fix the bug by sheer luck: if we increment d to be d + 1 so we are 1-indexed, the bug no longer manifests:
d ← d + 1
d
1 2 3 4 2 3 4 4 5 2 3 4 5 6 7 6 6 7 4 5 6 7 6 6 7 4 5
then:
(⍳≢d)@(d,¨⍳≢d)⊢7 27⍴' '
1
2 5 10
3 6 11
4 7 8 12 19 26
9 13 20 27
14 16 17 21 23 24
15 18 22 25
However, I still don't understand how this works! I presume the context will be useful for others attempting to read the thesis, so I'm going to leave the rest of it up.
Please explain what (⍳≢d)@(d,¨⍳≢d)⊢7 27⍴' ' does!
I've attached the raw screenshot to make sure I didn't miss something:
I'm happy to see that you found the off-by-one error. It stems from Aaron Hsu working with index origin 0. If you set ⎕IO←0 then his code will work.
Some dyadic operators can take an array operand, giving the sequence OPERATOR operand argument, e.g. in -@(1 2 3)(4 5 6 7). This poses a problem because both the operand and the argument are arrays, and juxtaposition of arrays forms a new array with those arrays as elements, by a process known as stranding. Compare:
(1 2 3)(4 5 6 7)
┌─────┬───────┐
│1 2 3│4 5 6 7│
└─────┴───────┘
However, in the case of the operator with its array operand, we want to "break" this strand so the left part can act as operand while the right part acts as argument. One way to break the stranding up is by applying a function to the argument, giving the sequence OPERATOR operand Function argument. Now, we don't actually need any transformation of the argument, so an identity function will do: -@(1 2 3)⊢(4 5 6 7).
As for what (⍳≢d)@(d,¨⍳≢d)⊢7 27⍴' ' actually does:
7 27⍴' ' creates a blank matrix.
(⍳≢d) are indices to insert into specified slots in the matrix.
@(d,¨⍳≢d) indicates at which locations in the matrix the above should replace the existing values.
⊢ serves solely to separate (d,¨⍳≢d) from 7 27⍴' '. The code could also have been written as ((⍳≢d)@(d,¨⍳≢d))7 27⍴' ' with parentheses serving to "bind" the operand to the operator.
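For readers less used to APL, the effect of the whole expression can be sketched in Python with 0-based indexing (matching ⎕IO←0). This is only an illustration of the behaviour described above, not a translation of the thesis code; the name place_indices is hypothetical.

```python
d = [0, 1, 2, 3, 1, 2, 3, 3, 4, 1, 2, 3, 4, 5, 6, 5, 5, 6, 3, 4, 5, 6, 5, 5, 6, 3, 4]

def place_indices(depths):
    """Put each index i into a blank grid at row depths[i], column i."""
    rows = max(depths) + 1                     # 7 rows for depths 0..6
    grid = [[" "] * len(depths) for _ in range(rows)]
    for i, depth in enumerate(depths):
        grid[depth][i] = i                     # the pair (d[i], i) selects the cell
    return grid

grid = place_indices(d)
for row in grid:
    # Print only the filled cells of each row; column positions are not preserved.
    print(" ".join(str(cell) for cell in row if cell != " "))
```

Each row of the grid ends up holding the indices of the nodes at that depth, which is what the 1-indexed output in the question shows.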

Minimum number of states in DFA

What is the minimum number of states in a DFA accepting strings (in base 3, i.e., ternary form) congruent to 5 modulo 6?
I have tried but couldn't do it.
At first sight, it seems to need 6 states, but it can be minimised further.
Let's first see the state transition table:
Here, the states q0, q1, q2, ..., q5 correspond to the remainders 0, 1, 2, ..., 5 modulo 6. q0 is our initial state, and since we want remainder 5, our final state is q5.
A few observations from the state transition table:
states q0, q2 and q4 have identical transitions
states q1, q3 and q5 have identical transitions
States that make transitions to the same states on the same inputs can be merged into a single state.
Note: Final and non-final states can never be merged.
Therefore, we can merge q0, q2 and q4 together and q1 and q3 together, leaving the final state q5 on its own.
The final Minimal DFA has 3 states as shown below:
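The transition table and the merge can be reproduced mechanically. Appending a base-3 digit a to a number with remainder r gives remainder (3r + a) mod 6, so that is the transition out of q_r on input a. The sketch below builds that table and then runs a simple partition refinement; the function names mod_dfa and minimize are hypothetical, and like the table above it ignores the question of leading zeros.

```python
def mod_dfa(base, modulus):
    """State r on digit a goes to (base * r + a) % modulus."""
    return {r: tuple((base * r + a) % modulus for a in range(base))
            for r in range(modulus)}

def minimize(delta, accepting):
    """Partition refinement; returns the groups of equivalent states."""
    # Start from the accepting / non-accepting split.
    block = {s: (s in accepting) for s in delta}
    while True:
        # A state's signature is its own block plus the blocks it reaches.
        sig = {s: (block[s], tuple(block[t] for t in delta[s])) for s in delta}
        if len(set(sig.values())) == len(set(block.values())):
            break                              # no block was split: stable
        block = sig
    groups = {}
    for s in delta:
        groups.setdefault(sig[s], []).append(s)
    return sorted(groups.values())

delta = mod_dfa(3, 6)
for r, row in delta.items():
    print(f"q{r}: {row}")
print(minimize(delta, accepting={5}))
```

The groups it reports are the merged states of the minimal DFA.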
Let's look at a few strings in the language:
12 = 1*3 + 2 = 5 ~ 5 (mod 6)
102 = 1*9 + 0*3 + 2 = 11 ~ 5 (mod 6)
122 = 1*9 + 2*3 + 2 = 17 ~ 5 (mod 6)
212 = 2*9 + 1*3 + 2 = 23 ~ 5 (mod 6)
1002 = 1*27 + 0*9 + 0*3 + 2 = 29 ~ 5 (mod 6)
We notice that all the strings end in 2. This makes sense since 6 is a multiple of 3 and the only way to get 5 from a multiple of 3 is to add 2. Based on this, we can try to solve the problem of strings congruent to 3 modulo 6:
10 = 3
100 = 9
120 = 15
210 = 21
1000 = 27
There's not a real pattern emerging, but consider this: every base-3 number ending in 0 is definitely divisible by 3. The ones that are even are also divisible by 6; so the odd numbers whose base-3 representation ends in 0 must be congruent to 3 mod 6. Because all the powers of 3 are odd, we know we have an odd number if the number of 1s in the string is odd.
So, our conditions are:
the string does not begin with a 0 (no leading zeros);
the string has an odd number of 1s;
the string ends with 2;
the string can contain any number of 2s and 0s.
To get the minimum number of states in such a DFA, we can use the Myhill-Nerode theorem beginning with the empty string:
the empty string can be followed by any string in the language. Call its equivalence class [e]
the string 0 cannot be followed by anything since valid base-3 representations don't have leading 0s. Call its equivalence class [0].
the string 1 must be followed with stuff that has an even number of 1s in it ending with a 2. Call its equivalence class [1].
the string 2 can be followed by anything in the language. Indeed, you can verify that putting a 2 at the front of any string in the language gives another string in the language. However, it can also be followed by strings beginning with 0. Therefore, its class is new: [2].
the string 00 can't be followed by anything to fix it; its class is the same as its prefix 0, [0]. same for the string 01.
the string 10 can be followed by any string with an even number of 1s that ends in a 2; it is therefore equivalent to the class [1].
the string 11 can be followed by any string in the language whatever; indeed, you can verify prepending 11 in front of any string in the language gives another solution. However, it can also be followed by strings beginning with 0. Therefore, its class is the same as [2].
12 can be followed by a string with an even number of 1s ending in 2, as well as by the empty string (since 12 is in fact in the language). This is a new class, [12].
21 is equivalent to 1; class [1]
22 is equivalent to 2; class [2]
20 is equivalent to 2; class [2]
120 is indistinguishable from 1; its class is [1].
121 is indistinguishable from [2].
122 is indistinguishable from [12].
We have seen no new equivalence classes on new strings of length 3; so, we know we have seen all the equivalence classes. They are the following:
[e]: any string in the language can follow this
[0]: no string can follow this
[1]: a string with an even number of 1s ending in 2 can follow this
[2]: same as [e] but also strings beginning with 0
[12]: same as [1] but also the empty string
This means that a minimal DFA for our language has five states. Here is the DFA:
        [0]
         ^
         |
         0
         |
  ----->[e]--2-->[2]<-\
         |        ^   |
         |        |   |
         1   __1__/   /
         |  /        /
         |  |       1
         V  V       |
        [1]--2-->[12]
         ^        |
         |        |
         \___0____/
(Transitions not pictured are self-loops on the respective states.)
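As a sanity check, the five classes can be written out as a transition table and compared against a direct test on short strings. This is a sketch: the table below is read off the equivalence classes listed above (for example, 10 is in [1], 11 is in [2], 12 is in [12]), and the names DELTA, accepts and in_language are hypothetical. It assumes, as the answer does, that strings with a leading 0 are rejected.

```python
from itertools import product

# State names follow the equivalence classes: e, 0 (dead), 1, 2, 12 (accepting).
DELTA = {
    "e":  {"0": "0", "1": "1", "2": "2"},
    "0":  {"0": "0", "1": "0", "2": "0"},
    "1":  {"0": "1", "1": "2", "2": "12"},
    "2":  {"0": "2", "1": "1", "2": "2"},
    "12": {"0": "1", "1": "2", "2": "12"},
}

def accepts(s):
    """Run the five-state DFA on a base-3 string."""
    state = "e"
    for ch in s:
        state = DELTA[state][ch]
    return state == "12"

def in_language(s):
    """Direct definition: no leading zero and value congruent to 5 mod 6."""
    return s != "" and s[0] != "0" and int(s, 3) % 6 == 5

# Compare the two on every string of length 0..7.
mismatches = [
    "".join(t)
    for n in range(8)
    for t in product("012", repeat=n)
    if accepts("".join(t)) != in_language("".join(t))
]
print(mismatches)
```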
Note: I expected this DFA to have 6 states, as Welbog pointed out in the other answer, so I might have missed an equivalence class. However, the DFA seems right after checking a few examples and thinking about what it's doing: you can only get to accepting state [12] by seeing a 2 as the last symbol (definitely necessary) and you can only get to state [12] from state [1] and you must have seen an odd number of 1s to get to [1]…
The minimum number of states for almost all modulus problems is the modulus itself. The general strategy is one state per remainder, since the transition out of a remainder does not depend on the digits that came before. For example, if you're in state r4 (representing x = 4 (mod 6)) and you read a 1 as your next base-3 digit, the new value is 3x + 1, so the new remainder is 4*3 + 1 = 13 = 1 (mod 6), and the transition from r4 on input 1 goes to r1. You'll find that the start state and r0 can be merged, for a total of 6 states.

When re-inserting into queue - Huffman Code

Example
3 2 5 5
a b c d
Joining the first two:
  5   | 5 5
 3 2  | c d
 a b  |
I have to put the new tree of weight five into the queue.
Am I obligated to put it at the end, like this:
5 5   5
c d  / \
    3   2
    a   b
Or can I put it at the beginning:
  5    5 5
 3 2   c d
 a b
Or even in the middle, between 'c' and 'd'?
Is it my choice or is there a rule?
It's not your choice: the queue needs to be sorted at all times (by its number of occurrences and, in case of an equal number of occurrences, by the depth of the tree). So the new tree needs to be inserted where it belongs in that order.
This is needed so that the sub-trees with the fewest occurrences (and, if there is a choice, the shallowest of them) can be picked by simply popping them.
If you simply re-sort after every insertion (this is inefficient and should not be done), the position obviously doesn't matter.
Yes, it's your choice. Whichever way you will get an optimal Huffman code, even though two resulting codes can be manifestly different.
You can get:
a - 00
b - 01
c - 10
d - 11
or you can get:
a - 111
b - 110
c - 10
d - 0
Now if I multiply the number of bits in each symbol times the number of occurrences, I get for the first code: 2*3 + 2*2 + 2*5 + 2*5 = 30 bits. For the second code: 3*3 + 3*2 + 2*5 + 1*5 = 30 bits. So both codes will code the original message to exactly 30 bits.
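The same comparison can be run mechanically. The sketch below builds the tree with a priority queue and lets the caller choose whether a newly merged tree goes ahead of or behind existing entries of equal weight; the function name huffman_lengths and the new_tree_first flag are hypothetical, and only code lengths (not the actual bit patterns) are tracked.

```python
import heapq
from itertools import count

def huffman_lengths(freqs, new_tree_first):
    """Return {symbol: code length} for the given tie-breaking choice."""
    order = count()                       # 0, 1, 2, ... for the initial leaves
    heap = [(w, next(order), {s: 0}) for s, w in freqs.items()]
    heapq.heapify(heap)
    # Merged trees get negative tie-breakers to sort before equal weights,
    # or the next positive ones to sort after them.
    tiebreak = count(-1, -1) if new_tree_first else order
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Every symbol in the two subtrees moves one level deeper.
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"a": 3, "b": 2, "c": 5, "d": 5}
for first in (False, True):
    lengths = huffman_lengths(freqs, new_tree_first=first)
    total = sum(freqs[s] * lengths[s] for s in freqs)
    print(first, lengths, total)
```

With these frequencies, putting the new tree last yields the balanced code and putting it first yields the skewed one, and both totals come to 30 bits.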

Solving shift/reduce conflicts

I'm using PLY to parse this grammar. I implemented a metagrammar for EBNF used in the linked spec, but PLY reports multiple shift/reduce conflicts.
Grammar:
Rule 0 S' -> grammar
Rule 1 grammar -> prod_list
Rule 2 grammar -> empty
Rule 3 prod_list -> prod
Rule 4 prod_list -> prod prod_list
Rule 5 prod -> id : : = rule_list
Rule 6 rule_list -> rule
Rule 7 rule_list -> rule rule_list
Rule 8 rule -> rule_simple
Rule 9 rule -> rule_group
Rule 10 rule -> rule_opt
Rule 11 rule -> rule_rep0
Rule 12 rule -> rule_rep1
Rule 13 rule -> rule_alt
Rule 14 rule -> rule_except
Rule 15 rule_simple -> terminal
Rule 16 rule_simple -> id
Rule 17 rule_simple -> char_range
Rule 18 rule_group -> ( rule_list )
Rule 19 rule_opt -> rule_simple ?
Rule 20 rule_opt -> rule_group ?
Rule 21 rule_rep0 -> rule_simple *
Rule 22 rule_rep0 -> rule_group *
Rule 23 rule_rep1 -> rule_simple +
Rule 24 rule_rep1 -> rule_group +
Rule 25 rule_alt -> rule | rule
Rule 26 rule_except -> rule - rule_simple
Rule 27 rule_except -> rule - rule_group
Rule 28 terminal -> SQ string_no_sq SQ
Rule 29 terminal -> DQ string_no_dq DQ
Rule 30 string_no_sq -> LETTER string_no_sq
Rule 31 string_no_sq -> DIGIT string_no_sq
Rule 32 string_no_sq -> SYMBOL string_no_sq
Rule 33 string_no_sq -> DQ string_no_sq
Rule 34 string_no_sq -> + string_no_sq
Rule 35 string_no_sq -> * string_no_sq
Rule 36 string_no_sq -> ( string_no_sq
Rule 37 string_no_sq -> ) string_no_sq
Rule 38 string_no_sq -> ? string_no_sq
Rule 39 string_no_sq -> | string_no_sq
Rule 40 string_no_sq -> [ string_no_sq
Rule 41 string_no_sq -> ] string_no_sq
Rule 42 string_no_sq -> - string_no_sq
Rule 43 string_no_sq -> : string_no_sq
Rule 44 string_no_sq -> = string_no_sq
Rule 45 string_no_sq -> empty
Rule 46 string_no_dq -> LETTER string_no_dq
Rule 47 string_no_dq -> DIGIT string_no_dq
Rule 48 string_no_dq -> SYMBOL string_no_dq
Rule 49 string_no_dq -> SQ string_no_dq
Rule 50 string_no_dq -> + string_no_dq
Rule 51 string_no_dq -> * string_no_dq
Rule 52 string_no_dq -> ( string_no_dq
Rule 53 string_no_dq -> ) string_no_dq
Rule 54 string_no_dq -> ? string_no_dq
Rule 55 string_no_dq -> | string_no_dq
Rule 56 string_no_dq -> [ string_no_dq
Rule 57 string_no_dq -> ] string_no_dq
Rule 58 string_no_dq -> - string_no_dq
Rule 59 string_no_dq -> : string_no_dq
Rule 60 string_no_dq -> = string_no_dq
Rule 61 string_no_dq -> empty
Rule 62 id -> LETTER LETTER id
Rule 63 id -> LETTER DIGIT id
Rule 64 id -> LETTER
Rule 65 id -> DIGIT
Rule 66 rest_of_id -> LETTER rest_of_id
Rule 67 rest_of_id -> DIGIT rest_of_id
Rule 68 rest_of_id -> empty
Rule 69 char_range -> [ UNI_CH - UNI_CH ]
Rule 70 empty -> <empty>
Conflicts:
id : LETTER LETTER id
| LETTER DIGIT id
| LETTER
| DIGIT
.
state 4
(62) id -> LETTER . LETTER id
(63) id -> LETTER . DIGIT id
(64) id -> LETTER .
! shift/reduce conflict for LETTER resolved as shift
! shift/reduce conflict for DIGIT resolved as shift
LETTER shift and go to state 10
DIGIT shift and go to state 9
| reduce using rule 64 (id -> LETTER .)
- reduce using rule 64 (id -> LETTER .)
( reduce using rule 64 (id -> LETTER .)
SQ reduce using rule 64 (id -> LETTER .)
DQ reduce using rule 64 (id -> LETTER .)
[ reduce using rule 64 (id -> LETTER .)
$end reduce using rule 64 (id -> LETTER .)
) reduce using rule 64 (id -> LETTER .)
: reduce using rule 64 (id -> LETTER .)
? reduce using rule 64 (id -> LETTER .)
* reduce using rule 64 (id -> LETTER .)
+ reduce using rule 64 (id -> LETTER .)
! LETTER [ reduce using rule 64 (id -> LETTER .) ]
! DIGIT [ reduce using rule 64 (id -> LETTER .) ]
The id rule is supposed to guarantee that productions' ids start with a letter.
Next conflict:
rule_alt : rule '|' rule
.
state 113
(25) rule_alt -> rule | rule .
(25) rule_alt -> rule . | rule
(26) rule_except -> rule . - rule_simple
(27) rule_except -> rule . - rule_group
! shift/reduce conflict for | resolved as shift
! shift/reduce conflict for - resolved as shift
( reduce using rule 25 (rule_alt -> rule | rule .)
SQ reduce using rule 25 (rule_alt -> rule | rule .)
DQ reduce using rule 25 (rule_alt -> rule | rule .)
LETTER reduce using rule 25 (rule_alt -> rule | rule .)
DIGIT reduce using rule 25 (rule_alt -> rule | rule .)
[ reduce using rule 25 (rule_alt -> rule | rule .)
) reduce using rule 25 (rule_alt -> rule | rule .)
$end reduce using rule 25 (rule_alt -> rule | rule .)
| shift and go to state 76
- shift and go to state 74
! | [ reduce using rule 25 (rule_alt -> rule | rule .) ]
! - [ reduce using rule 25 (rule_alt -> rule | rule .) ]
Connected to a similar one:
rule_except : rule '-' rule_simple
| rule '-' rule_group
How do I fix these?
You really should think seriously about using the usual scanner/parser architecture. Otherwise, you will have to find a way to deal with whitespace.
As it is, you seem to be ignoring whitespace altogether. That means that the parser cannot see the whitespace between three consecutive identifiers. It will see them run together as asoupofundifferentiatedletters, and it has no way to know what the original intent was. This makes your grammar deeply ambiguous, because in the grammar two identifiers can follow each other on the assumption that something will cause them to be differentiated from each other. And ambiguous grammars always result in LR conflicts.
Having the identifiers (and other multi-character tokens) recognized by the lexer is much easier. Otherwise, you will have to rewrite your grammar to identify all the places where whitespace is allowed (such as around the punctuation in (identifier1|identifier2)) or required (such as between two identifiers).
Identifying identifiers in the scanner using regular expressions will also remove the other problems with your grammar and identifiers:
id -> LETTER LETTER id
id -> LETTER DIGIT id
id -> LETTER
These rules require id to be an odd number of characters, where the digits only appear in even positions. So a1b would be an id, but not ab1 or ab or a1. I'm sure that's not what you meant.
You seem to be trying to avoid left-recursion. Instead, you should embrace left-recursion. Bottom-up parsers, like PLY, love left-recursion. (They handle right-recursion, but at the cost of excessive parser stack usage.) So what you really want is:
id: LETTER | id LETTER | id DIGIT
There are other places in the grammar where similar changes are necessary.
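To make the scanner idea concrete, here is a minimal regular-expression tokenizer using only Python's re module. It is a sketch under assumptions, not PLY's lexer API: the token names (ID, DEFINE, STRING, PUNCT), the identifier pattern and the set of punctuation are all chosen for illustration and would need adjusting to the real spec.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<ID>[A-Za-z][A-Za-z0-9]*)      # a letter, then any letters or digits
  | (?P<DEFINE>::=)
  | (?P<STRING>"[^"]*"|'[^']*')       # quoted terminal, either quote style
  | (?P<PUNCT>[()\[\]|?*+-])
  | (?P<WS>\s+)                       # whitespace separates tokens, then is dropped
""", re.VERBOSE)

def tokenize(text):
    """Split text into (kind, lexeme) pairs, discarding whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("ab1 ::= a1 | 'x y'"))
```

Because whitespace is consumed here, the grammar sees ab1 and a1 as two separate ID tokens and never has to reason about individual letters.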
The other conflict is caused by your unorthodox handling of operator precedence, which might also be a result of your attempt to avoid left-recursion. The EBNF operators can be parsed with a simple precedence scheme, as with algebraic operators. However, the use of precedence declarations (%left and friends) will be complicated because of the "invisible" concatenation operator. Generally, you'll find it easier to use explicit precedence as in the standard expr/factor/term algebraic grammar. In your case, the equivalent would be something like:
item: id
| terminal
| '(' rule ')'
term: item
| item '*'
| item '+'
| item '?'
seq : term
| seq term
alt : seq
| alt '|' seq
except: term '-' term
rule: alt
| except
The handling of except in the above corresponds to the lack of information about the precedence of the - operator. That's expressed by effectively disallowing any mix of - and | operators without explicit parentheses.
You will also find that you have a shift/reduce conflict here:
# The following will create a problem
prod: id "::=" rule
prod_list
: prod
| prod_list prod
(NOTE: the fact that I wrote that with left-recursion does not create the problem.)
That is not ambiguous, but it is not left-to-right parseable with a single lookahead token. It requires two tokens, because you cannot know whether or not the id is part of the currently-being-parsed sequence, or the beginning of a new production until you see the token after the id: if it is ::=, then the id was the start of a new production and should not be shifted into the current rule. The usual solution to that problem is a hack in the lexer: the lexer is wrapped by a function which keeps one extra token of lookahead, so that it can emit id ::= as a single token of type definition. There are a number of examples of this hack for various LR parsers in other SO questions.
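The lookahead hack can be sketched as a filter over the token stream that fuses an identifier followed by ::= into one token. This is an illustration only: it works on a list of (kind, lexeme) pairs rather than on a real PLY lexer object, and the token kinds ID, DEFINE and DEFINITION as well as the name fuse_definitions are assumptions.

```python
def fuse_definitions(tokens):
    """Replace each ID followed by DEFINE with a single DEFINITION token."""
    out = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        next_is_define = i + 1 < len(tokens) and tokens[i + 1][0] == "DEFINE"
        if kind == "ID" and next_is_define:
            out.append(("DEFINITION", text))   # the id starts a new production
            i += 2                             # skip the ::= as well
        else:
            out.append((kind, text))
            i += 1
    return out

stream = [("ID", "a"), ("DEFINE", "::="), ("ID", "b"), ("ID", "c"),
          ("ID", "d"), ("DEFINE", "::="), ("ID", "e")]
print(fuse_definitions(stream))
```

With the fused token, the parser can tell with one token of lookahead that d begins a new production instead of continuing the rule for a.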
Having said all of that, I really don't understand why you want to build a parser for EBNF in order to parse XML. Building a working parser from EBNF is basically what PLY does, except that it doesn't implement the "E" part, so you have to rewrite rules which use the ?, *, + and - operators. This can be handled automatically, although the - operator is non-trivial in general, but it is not going to be simple. It would be easier, IMHO, to rewrite the few EBNF rules into BNF and then just use PLY. But if you are looking for a challenge, go for it.
First of all, you have apparently slavishly translated the grammar. You need to tokenize the input stream.
Normally, something like id would be a terminal to be discerned by the lexical analyzer, rather than parsed as part of the grammar:
id : LETTER LETTER id
| LETTER DIGIT id
| LETTER
| DIGIT
It looks like everything you have under terminal should not be part of the grammar.
Second, you use right recursion in your grammar. While LALR works with both left and right recursion, you get smaller tables with left recursion.
Suppose you have the input string AA
If you were to insist on parsing identifiers, you'd want something more like
id : id LETTER
| id DIGIT
| LETTER
Finally, shift-reduce conflicts are not necessarily bad. They frequently occur in numeric expressions, where they are resolved by operator precedence.
Reduce-Reduce conflicts are always bad.

Unexpected behavior of io:fread in Erlang

This is an Erlang question.
I have run into some unexpected behavior by io:fread.
I was wondering if someone could check whether there is something wrong with the way I use io:fread or whether there is a bug in io:fread.
I have a text file which contains a "triangle of numbers" as follows:
59
73 41
52 40 09
26 53 06 34
10 51 87 86 81
61 95 66 57 25 68
90 81 80 38 92 67 73
30 28 51 76 81 18 75 44
...
There is a single space between each pair of numbers and each line ends with a carriage-return new-line pair.
I use the following Erlang program to read this file into a list.
-module(euler67).
-author('Cayle Spandon').
-export([solve/0]).
solve() ->
    {ok, File} = file:open("triangle.txt", [read]),
    Data = read_file(File),
    ok = file:close(File),
    Data.
read_file(File) ->
    read_file(File, []).
read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            read_file(File, [N | Data]);
        eof ->
            lists:reverse(Data)
    end.
The output of this program is:
(erlide@cayle-spandons-computer.local)30> euler67:solve().
[59,73,41,52,40,9,26,53,6,3410,51,87,86,8161,95,66,57,25,
6890,81,80,38,92,67,7330,28,51,76,81|...]
Note how the last number of the fourth line (34) and the first number of the fifth line (10) have been merged into a single number 3410.
When I dump the text file using "od" there is nothing special about those lines; they end with cr-nl just like any other line:
> od -t a triangle.txt
0000000 5 9 cr nl 7 3 sp 4 1 cr nl 5 2 sp 4 0
0000020 sp 0 9 cr nl 2 6 sp 5 3 sp 0 6 sp 3 4
0000040 cr nl 1 0 sp 5 1 sp 8 7 sp 8 6 sp 8 1
0000060 cr nl 6 1 sp 9 5 sp 6 6 sp 5 7 sp 2 5
0000100 sp 6 8 cr nl 9 0 sp 8 1 sp 8 0 sp 3 8
0000120 sp 9 2 sp 6 7 sp 7 3 cr nl 3 0 sp 2 8
0000140 sp 5 1 sp 7 6 sp 8 1 sp 1 8 sp 7 5 sp
0000160 4 4 cr nl 8 4 sp 1 4 sp 9 5 sp 8 7 sp
One interesting observation is that some of the numbers for which the problem occurs happen to be on a 16-byte boundary in the text file (but not all, for example 6890).
I'm going to go with it being a bug in Erlang, too, and a weird one. Changing the format string to "~2s" gives equally weird results:
["59","73","4","15","2","40","0","92","6","53","0","6","34",
"10","5","1","87","8","6","81","61","9","5","66","5","7",
"25","6",
[...]|...]
So it appears that it's counting a newline character as a regular character for the purposes of counting, but not when it comes to producing the output. Loopy as all hell.
A week of Erlang programming, and I'm already delving into the source. That might be a new record for me...
EDIT
A bit more investigation has confirmed for me that this is a bug. Calling one of the internal methods that's used in fread:
> io_lib_fread:fread([], "12 13\n14 15 16\n17 18 19 20\n", "~d").
{done,{ok,"\f"}," 1314 15 16\n17 18 19 20\n"}
Basically, if there's multiple values to be read, then a newline, the first newline gets eaten in the "still to be read" part of the string. Other testing suggests that if you prepend a space it's OK, and if you lead the string with a newline it asks for more.
I'm going to get to the bottom of this, gosh-darn-it... (grin) There's not that much code to go through, and not much of it deals specifically with newlines, so it shouldn't take too long to narrow it down and fix it.
EDIT^2
HA HA! Got the little blighter.
Here's the patch to the stdlib that you want (remember to recompile and drop the new beam file over the top of the old one):
--- ../erlang/erlang-12.b.3-dfsg/lib/stdlib/src/io_lib_fread.erl
+++ ./io_lib_fread.erl
@@ -35,9 +35,9 @@
fread_collect(MoreChars, [], Rest, RestFormat, N, Inputs).
fread_collect([$\r|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\r|More]);
fread_collect([$\n|More], Stack, Rest, RestFormat, N, Inputs) ->
- fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, More);
+ fread(RestFormat, Rest ++ reverse(Stack), N, Inputs, [$\n|More]);
fread_collect([C|More], Stack, Rest, RestFormat, N, Inputs) ->
fread_collect(More, [C|Stack], Rest, RestFormat, N, Inputs);
fread_collect([], Stack, Rest, RestFormat, N, Inputs) ->
@@ -55,8 +55,8 @@
eof ->
fread(RestFormat,eof,N,Inputs,eof);
_ ->
- %% Don't forget to count the newline.
- {more,{More,RestFormat,N+1,Inputs}}
+ %% Don't forget to strip and count the newline.
+ {more,{tl(More),RestFormat,N+1,Inputs}}
end;
Other -> %An error has occurred
{done,Other,More}
Now to submit my patch to erlang-patches, and reap the resulting fame and glory...
Besides the fact that it seems to be a bug in one of the erlang libs I think you could (very) easily circumvent the problem.
Given the fact your file is line-oriented I think best practice is that you process it line-by-line as well.
Consider the following construction. It works nicely on an unpatched erlang and because it uses lazy evaluation it can handle files of arbitrary length without having to read all of it into memory first. The module contains an example of a function to apply to each line - turning a line of text-representations of integers into a list of integers.
-module(liner).
-author("Harro Verkouter").
-export([liner/2, integerize/0, lazyfile/1]).
% Applies a function to all lines of the file
% before reducing (foldl).
liner(File, Fun) ->
    lists:foldl(fun(X, Acc) -> Acc++Fun(X) end, [], lazyfile(File)).
% Reads the lines of a file in a lazy fashion
lazyfile(File) ->
    {ok, Fd} = file:open(File, [read]),
    lazylines(Fd).
% Actually, this one does the lazy read ;)
lazylines(Fd) ->
    case io:get_line(Fd, "") of
        eof -> file:close(Fd), [];
        {error, Reason} ->
            file:close(Fd), exit(Reason);
        L ->
            [L|lazylines(Fd)]
    end.
% Take a line of space separated integers (string) and transform
% them into a list of integers. "\r" is included in the separators
% because the file in the question ends its lines with cr-nl.
integerize() ->
    fun(X) ->
        lists:map(fun(Y) -> list_to_integer(Y) end,
                  string:tokens(X, " \r\n")) end.
Example usage:
Eshell V5.6.5 (abort with ^G)
1> c(liner).
{ok,liner}
2> liner:liner("triangle.txt", liner:integerize()).
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
And as a bonus, you can easily fold over the lines of any (line-oriented) file without running out of memory :)
6> lists:foldl( fun(X, Acc) ->
6> io:format("~.2w: ~s", [Acc,X]), Acc+1
6> end,
6> 1,
6> liner:lazyfile("triangle.txt")).
1: 59
2: 73 41
3: 52 40 09
4: 26 53 06 34
5: 10 51 87 86 81
6: 61 95 66 57 25 68
7: 90 81 80 38 92 67 73
8: 30 28 51 76 81 18 75 44
Cheers,
h.
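The line-by-line idea is independent of Erlang. As a language-neutral illustration, the same parse can be sketched in Python: split the text into lines, then split each line on whitespace. The function name read_triangle is hypothetical, and the sample text uses cr-nl line endings as in the question.

```python
def read_triangle(text):
    """Parse lines of space-separated integers into a list of rows."""
    rows = []
    for line in text.splitlines():        # handles both \n and \r\n endings
        fields = line.split()             # split on any run of whitespace
        if fields:
            rows.append([int(f) for f in fields])
    return rows

sample = "59\r\n73 41\r\n52 40 09\r\n26 53 06 34\r\n10 51 87 86 81\r\n"
rows = read_triangle(sample)
print(rows)
# Flatten to a single list, like the output of euler67:solve().
print([n for row in rows for n in row])
```

Because each line is tokenized on its own, a number at the end of one line can never be joined to the first number of the next.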
I noticed that there are multiple instances where two numbers are merged, and it appears to be at the line boundaries on every line starting at the fourth line and beyond.
I found that if you add a whitespace character to the beginning of every line starting at the fifth, that is:
59
73 41
52 40 09
26 53 06 34
 10 51 87 86 81
 61 95 66 57 25 68
 90 81 80 38 92 67 73
 30 28 51 76 81 18 75 44
...
The numbers get parsed properly:
39> euler67:solve().
[59,73,41,52,40,9,26,53,6,34,10,51,87,86,81,61,95,66,57,25,
68,90,81,80,38,92,67,73,30|...]
It also works if you add the whitespace to the beginning of the first four lines, as well.
It's more of a workaround than an actual solution, but it works. I'd like to figure out how to set up the format string for io:fread such that we wouldn't have to do this.
UPDATE
Here's a workaround that won't force you to change the file. It assumes that every number is written with exactly two digits (so every value is < 100):
read_file(File, Data) ->
    case io:fread(File, "", "~d") of
        {ok, [N]} ->
            if
                N >= 100 ->
                    %% Two two-digit numbers were merged across a newline.
                    First = N div 100,
                    Second = N rem 100,
                    %% Data is built in reverse, so Second goes in front.
                    read_file(File, [Second, First | Data]);
                true ->
                    read_file(File, [N | Data])
            end;
        eof ->
            lists:reverse(Data)
    end.
Basically, the code catches any of the numbers which are the concatenation of two across a newline and splits them into two.
Again, it's a kludge that implies a possible bug in io:fread, but that should do it.
UPDATE AGAIN The above will only work for two-digit inputs, but since the example pads every number (even those < 10) to two digits, it works for this example.
