Is it possible and if so, how could I use binary comprehension on Elixir? I can do it on Erlang like so:
[One || <<One, _rest:3/binary>> <= <<1,2,3,4>>].
What in Erlang is:
1> [Red || <<Red:2/binary, _Blue:2/binary>> <= <<1, 2, 3, 4, 5, 6, 7, 8>> ].
[<<1,2>>,<<5,6>>]
In Elixir is:
iex(1)> for <<red::8, green::8, blue::16 <- <<1, 2, 3, 4, 5, 6, 7, 8>> >>, do: <<red, green>>
[<<1, 2>>, <<5, 6>>]
Note that the Elixir above is explicitly declaring sizes in bits, whereas the Erlang is using a type for make the calculation chop off a size in bytes. There is probably a cleaner way to do that in Elixir (at least I hope there is) and I might even hunt around for it -- but most of the time when I want to do this stuff extensively I stick to Erlang just for readability/universality.
Addendum
#aronisstav asked an interesting question: "Shouldn't there be a part matching the green pattern in the Erlang code?"
The answer is that there would be a Green variable in Erlang were that code to deal with bitstrings instead of binaries. Erlang's bit syntax provides ways to indicate a few arbitrary binary types which correspond to default sizes. Above I matched Red:2/binary which means I want to match a sequence of 2 bytes, and this is how we get the result [<<1,2>><<5,6>>]: two sequences of two bytes.
An Erlang example that is exactly equivalent to the Elixir code above would be:
[<<Red/bitstring, Green/bitstring>>
|| <<Red:8/bitstring, Green:8/bitstring, _Blue:2/binary>>
<= <<1, 2, 3, 4, 5, 6, 7, 8>> ].
But that is just silly to do, as Erlang's syntax for bytes is much more concise.
I found the solution in the documentation:
for << one, _rest :: binary - size(3) <- <<1,2,3,4>> >>, do: one
Related
I am writing a parser for a query engine. My parser DCG query is not deterministic.
I will be using the parser in a relational manner, to both check and synthesize queries.
Is it appropriate for a parser DCG to not be deterministic?
In code:
If I want to be able to use query/2 both ways, does it require that
?- phrase(query, [q,u,e,r,y]).
true;
false.
or should I be able to obtain
?- phrase(query, [q,u,e,r,y]).
true.
nevertheless, given that the first snippet would require me to use it as such
?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.
when using it to check a formula?
The first question to ask yourself, is your grammar deterministic, or in the terminology of grammars, unambiguous. This is not asking if your DCG is deterministic, but if the grammar is unambiguous. That can be answered with basic parsing concepts, no use of DCG is needed to answer that question. In other words, is there only one way to parse a valid input. The standard book for this is "Compilers : principles, techniques, & tools" (WorldCat)
Now you are actually asking about three different uses for parsing.
A recognizer.
A parser.
A generator.
If your grammar is unambiguous then
For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and an recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
For the generator, it should be semi-deterministic so that it can generate multiple results.
Can all of this be done with one, DCG, yes. The three different ways are dependent upon how you use the input and output of the DCG.
Here is an example with a very simple grammar.
The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).
expr(expr(Operand_1,Operator,Operand_2)) -->
operand(Operand_1),
operator(Operator),
operand(Operand_2).
operand(operand(1)) --> "1".
operand(operand(2)) --> "2".
operator(operator(+)) --> "+".
recognizer(Input) :-
string_codes(Input,Codes),
DCG = expr(_),
phrase(DCG,Codes,[]).
parser(Input,Ast) :-
string_codes(Input,Codes),
DCG = expr(Ast),
phrase(DCG,Codes,[]).
generator(Generated) :-
DCG = expr(_),
phrase(DCG,Codes,[]),
string_codes(Generated,Codes).
:- begin_tests(expr).
recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").
test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
recognizer(Input).
recognizer_test_case_fail("2+3").
test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
recognizer(Input).
parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).
test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
parser(Input,Ast),
assertion( Ast == Expected_ast).
parser_test_case_fail("2+3").
test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
parser(Input,_).
test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
generator(Generated).
:- end_tests(expr).
The grammar is unambiguous and has only 4 valid strings which are all unique.
The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is semi-deterministic and returns all 4 valid unique strings.
Example run of the test cases.
?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.
To expand a little on the comment by Daniel
As Daniel notes
1 + 2 + 3
can be parsed as
(1 + 2) + 3
or
1 + (2 + 3)
So 1+2+3 is an example as you said is specified by a recursive DCG and as I noted a common way out of the problem is to use parenthesizes to start a new context. What is meant by starting a new context is that it is like getting a new clean slate to start over again. If you are creating an AST, you just put the new context, items in between the parenthesizes, as a new subtree at the current node.
With regards to write_canonical/1, this is also helpful but be aware of left and right associativity of operators. See Associative property
e.g.
+ is left associative
?- write_canonical(1+2+3).
+(+(1,2),3)
true.
^ is right associative
?- write_canonical(2^3^4).
^(2,^(3,4))
true.
i.e.
2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352
2^3^4 != (2^3)^4 = 8^4 = 4096
The point of this added info is to warn you that grammar design is full of hidden pitfalls and if you have not had a rigorous class in it and done some of it you could easily create a grammar that looks great and works great and then years latter is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, it had enough issues that when Python 3 was created, many of the issues were fixed. So Python 3 is not backward compatible with Python 2 (differences). Yes they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.
The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.
Here is what I mean:
"What is the smaller of two numbers?"
min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.
So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":
?- min_a(1, 2, Min).
Min = 1.
?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.
?- min_a(2, 1, 2).
false.
?- min_a(2, 1, 1).
true ; % crap...
false.
So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:
?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.
... and that's fine, but if you then ask for the numbers that are multiple of 3, you get:
?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...
The "low-hanging fruit" are the cases where you as a programmer would see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know that for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, there is a lot of memory that cannot be reclaimed, a lot of processing time going into book keeping, huge solution trees being traversed only to get (as expected) no solutions.... you get the point.
For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:
The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
The use of cuts, where necessary, to get rid of choice points that are useless;
The implementation of number//1 (towards the bottom) that can "generate extract a number".
(Hint: use the primitives in this library when you write your own parser!)
I hope you find this unnecessarily long answer useful.
I just started a tutorial in Rust and I can't get my head around the limitation of tuple printing:
fn main() {
// Tuple definition
let short = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11);
let long = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12);
println!("{:?}", short); // Works fine
println!("{:?}", long); // ({integer}...{integer})` cannot be formatted using `:?` because it doesn't implement `std::fmt::Debug`
}
In my ignorant view the printing could be easily achieved by iterating over the entire tuple — this would allow displaying without size constraint. If the solution would be that simple it would be implemented, what am I missing here?
Printing tuples is currently implemented using a macro that only works up to 12 elements.
Functionality to statically iterate/manipulate tuples has been proposed, but has been postponed (see e.g. this RFC). There was some concerns about the implementation of these (e.g. you'd expect to be able to get the head & tail of a tuple, but there is actually no guarantee that a tuple will be stored in the same order as you specified, because the compiler is allowed to optimize for space, which means getting the tail wouldn't be a trivial operation).
As for why you need special support for that, consider the following tuple:
let mixed = (42, true, 3.14, "foo");
How you would iterate this tuple, given that all its elements have a different type? This can't simply be done using regular iterators and a for loop. You would need some new type-level syntax, which Rust is currently lacking.
Debug is only implemented on tuples up to 12 elements. This is why printing short works, but long fails.
I have a consensus sequence with ambiguous bases, in which I want to search for a pattern. Start position should be 3.
Seq="ATGARTTTTTT" -- R is A or G
Pattern="AGT"
I found a tool called nt_search in SeqUtils that can do that but it doesn't give me the coordinates of the match as illustrated in scenario #2. Below are a few test runs to demonstrate the issue.
Scenario #1: No Ambiguous Bases
from Bio import SeqUtils
pattern="TAA"
Seq="ATGTAAAGGAGG"
m=SeqUtils.nt_search(Seq,pattern)
print m
['ACG', 3]
Scenario #2: Ambiguous Bases in sequence
pattern="AGT"
Seq="ATGARTTTTTT"
m=SeqUtils.nt_search(Seq,pattern)
print m
['AGT']
Scenario #3: Ambiguous Bases in pattern
pattern="ART"
Seq="ATGAGTTTTTT"
m=SeqUtils.nt_search(Seq,pattern)
print m
['A[AG]T', 3]
The source code for nt_search is here. I am not sure how to tweak it to get the start position, in this example, that would be 3.
The help text for nt_search could be much clearer, but it returns a list giving the regular expression used and the position of any matches. e.g.
>>> from Bio.SeqUtils import nt_search
>>> print(nt_search("ATGAGTTTTTT", "ART"))
['A[AG]T', 3]
>>> print(nt_search("ATGAGTTTTTTAGT", "ART"))
['A[AG]T', 3, 11]
>>> print(nt_search("ATGAGTTTTTTAGTTTTAAT", "ART"))
['A[AG]T', 3, 11, 17]
So, if you want the first match only as an integer, you need to pull out element one from the list:
>>> from Bio.SeqUtils import nt_search
>>> print(nt_search("ATGAGTTTTTTAGT", "ART")[0])
A[AG]T
>>> print(nt_search("ATGAGTTTTTTAGT", "ART")[1])
3
Element zero would be the regular expression used.
Update: Searching against an ambiguous sequence is not (currently) supported (scenario 2 in the question), in part I would think as it would be difficult to define a sensible implementation. e.g. searching as against "NNN"with any any three letter query like "AAA", "ART", or "NNN" would give you a match.
From the other languages I program in, I'm used to having ranges. In Python, if I want all numbers one up to 100, I write range(1, 101). Similarly, in Haskell I'd write [1..100] and in Scala I'd write 1 to 100.
I can't find something similar in Erlang, either in the syntax or the library. I know that this would be fairly simple to implement myself, but I wanted to make sure it doesn't exist elsewhere first (particularly since a standard library or language implementation would be loads more efficient).
Is there a way to do ranges either in the Erlang language or standard library? Or is there some idiom that I'm missing? I just want to know if I should implement it myself.
I'm also open to the possibility that I shouldn't want to use a range in Erlang (I wouldn't want to be coding Python or Haskell in Erlang). Also, if I do need to implement this myself, if you have any good suggestions for improving performance, I'd love to hear them :)
From http://www.erlang.org/doc/man/lists.html it looks like lists:seq(1, 100) does what you want. You can also do things like lists:seq(1, 100, 2) to get all of the odd numbers in that range instead.
You can use list:seq(From, TO) that's say #bitilly, and also you can use list comprehensions to add more functionality, for example:
1> [X || X <- lists:seq(1,100), X rem 2 == 0].
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,
44,46,48,50,52,54,56,58|...]
There is a difference between range in Ruby and list:seq in Erlang. Ruby's range doesn't create list and rely on next method, so (1..HugeInteger).each { ... } will not eat up memory. Erlang lists:seq will create list (or I believe it will). So when range is used for side effects, it does make a difference.
P.S. Not just for side effects:
(1..HugeInteger).inject(0) { |s, v| s + v % 1000000 == 0 ? 1 : 0 }
will work the same way as each, not creating a list. Erlang way for this is to create a recursive function. In fact, it is a concealed loop anyway.
Example of lazy stream in Erlang. Although it is not Erlang specific, I guess it can be done in any language with lambdas. New lambda gets created every time stream is advanced so it might put some strain on garbage collector.
range(From, To, _) when From > To ->
done;
range(From, To, Step) ->
{From, fun() -> range(From + Step, To, Step) end}.
list(done) ->
[];
list({Value, Iterator}) ->
[Value | list(Iterator())].
% ----- usage example ------
list_odd_numbers(From, To) ->
list(range(From bor 1, To, 2)).
By definition the integer division returns the quotient.
Why 4613.9145 div 100. gives an error ("bad argument") ?
For div the arguments need to be integers. / accepts arbitrary numbers as arguments, especially floats. So for your example, the following would work:
1> 4613.9145 / 100.
46.139145
To contrast the difference, try:
2> 10 / 10.
1.0
3> 10 div 10.
1
Documentation: http://www.erlang.org/doc/reference_manual/expressions.html
Update: Integer division, sometimes denoted \, can be defined as:
a \ b = floor(a / b)
So you'll need a floor function, which isn't in the standard lib.
% intdiv.erl
-module(intdiv).
-export([floor/1, idiv/2]).
floor(X) when X < 0 ->
T = trunc(X),
case X - T == 0 of
true -> T;
false -> T - 1
end;
floor(X) ->
trunc(X) .
idiv(A, B) ->
floor(A / B) .
Usage:
$ erl
...
Eshell V5.7.5 (abort with ^G)
> c(intdiv).
{ok,intdiv}
> intdiv:idiv(4613.9145, 100).
46
Integer division in Erlang, div, is defined to take two integers as input and return an integer. The link you give in an earlier comment, http://mathworld.wolfram.com/IntegerDivision.html, only uses integers in its examples so is not really useful in this discussion. Using trunc and round will allow you use any arguments you wish.
I don't know quite what you mean by "definition." Language designers are free to define operators however they wish. In Erlang, they have defined div to accept only integer arguments.
If it is the design decisions of Erlang's creators that you are interested in knowing, you could email them. Also, if you are curious enough to sift through the (remarkably short) grammar, you can find it here. Best luck!
Not sure what you're looking for, #Bertaud. Regardless of how it's defined elsewhere, Erlang's div only works on integers. You can convert the arguments to integers before calling div:
trunc(4613.9145) div 100.
or you can use / instead of div and convert the quotient to an integer afterward:
trunc(4613.9145 / 100).
And trunc may or may not be what you want- you may want round, or floor or ceiling (which are not defined in Erlang's standard library, but aren't hard to define yourself, as miku did with floor above). That's part of the reason Erlang doesn't assume something and do the conversion for you. But in any case, if you want an integer quotient from two non-integers in Erlang, you have to have some sort of explicit conversion step somewhere.