Menhir- get values between interval - parsing

I got this rule in parser.mly:
intervalue:
| c = CST(* True False 1 7 89 "sfr" *)
{ Ecst c }
| id = ident (* a-z [a-z]* *)
{ Eident id }
| iv = LSQ l = separated_list(TWOPoints, intervalue) RSQ /* [1..4]*/
{ Elist l }
;
I need to pass to list "l" the values ​​of [start .. end]. Example ([1..4]). I search in manual and separated_list(TWOPoints, intervalue) only get values 1 and 4. But i need all values between 1 and 4 including, something like this [1..2..3..4], but without having to do it exhaustively.

separated_list does not reflect your desired syntax, as far as I can see. But then, neither does using intervalue for the limits of the interval.
separated_list is not correct because it is used for an list of any positive number of elements separated by a delimiter. In particular, separated_list(TWOPoints, intervalue) will not just match 1..4, but also 1, and 1..4..7, among other things. Those other things include nested intervalues, such as 2..[4..7], which seems unlikely to be a desired construct (although since I don't know what your language looks like, perhaps it is).
You seem to be using separated_list in the mistaken belief that it is the only way to turn the reduction into an OCanl list. That's not true, since you have the full power of OCaml available to you; you could write that production as
| LSQ low = CST TWOPoints high = CST RSQ { [ low high] }
Or even
| LSQ low = CST TWOPoints high = CST RSQ { [ low .. high] }
although that won't work for all possible CST tokens (such as [1 .. "a"]). And, furthermore, it doesn't permit the use of non-constant limits, such as [1 .. limit].
But mixing syntax with run-time semantics like that is almost certainly not what you want. How would you deal with program text like the example above ([1 .. limit]), where limit is a variable which will be assigned a value during execution of the program? (Or even many values, as the program executes in a loop.) The parser should limit itself to producing a useful representation of the program to execute, and the most likely production rule will be something like this (where Value needs to be defined according to the actually desired syntax):+
| LSQ low = Value TWOPoints high = Value RSQ { Einterval low high }

Related

Parsing procedure calls for a toy language

I have a certain toy language that defines, amongst others, procedures and procedure calls, using EBNF syntax:
program = procedure, {procedure} ;
procedure = "procedure", NAME, bracedblock ;
bracedBlock = "{" , statementlist , "}" ;
statementlist = statement, { statement } ;
statement = define | if | while | call | // others omitted for brevity ;
define = NAME, "=", expression, ";"
if = "if", conditionalblock, "then", bracedBlock, "else", bracedBlock
call = "call" , NAME, ";" ;
// other definitions omitted for brevity
A tokeniser for a program in this language has been implemented, and returns a vector of tokens.
Now, parsing said program without the procedure calls, is fairly straightforward: one can define a recursive descent parser using the above grammar directly, and simply parse through the tokens. Some further notes:
Each procedure may call any other procedure except itself, directly or indirectly (i.e. no recursion), and these need not necessarily be in the order of appearance in the source code (i.e. B may be defined after A, and A may call B, or vice versa).
Procedure names need to be unique, and 'reserved keywords' may be used as variable/procedure names.
Whitespace does not matter, at least amongst tokens of different type: similar to C/C++.
There is no scoping rule: all variables are global.
The concept of a 'line number' is important: each statement has one or more line numbers associated with it: define statements have only 1 line number each, for instance, whereas an if statement, which is itself a parent of two statement lists, has multiple line numbers. For instance:
LN CODE
procedure A {
1. a = 5;
2. b = 7;
3. c = 3;
4. 5. if (b < c) then { call C; } else {
6. call B;
}
procedure B {
7. d = 5;
8. while (d > 2) {
9. d = d + 1; }
}
procedure C {
10. e = 10;
11. f = 8;
12. call B;
}
Line numbers are continuous throughout the program; only procedure definitions and the else keyword aren't assigned line numbers. The line numbers are defined by grammar, rather than their position in source code: for instance, consider 'lines' 4 and 5.
There are some relationships that need to be set in a database given each statement and its line number, variables used, variables set, and child containers. This is a key consideration.
My question is therefore this: how can I parse these function calls, maintain the integrity of the line numbers, and set the relationships?
I have considered the 'OS' way of doing things: upon encounter of a procedure call, look ahead for a procedure that matches said called procedure, parse the callee, and unroll the call stack back to the caller. However, this ruins the line number ordering: if the above program were to be parsed this way, C would have line numbers 6 to 8 inclusive, rather than 10 to 12 inclusive.
Another solution is to parse the entire program once in order, maintain a toposort of procedure calls, and then parse a second time by following said toposort. This is problematic because of implementation details.
Is there a possibly better way to do this?
It's always tempting to try to completely process a program text in a single on-line pass. Unfortunately, it is practically never the simplest solution. Trying to do everything at once in a linear progression results in a kind of spaghetti of intertwined computations, and making it all work almost always involves unnecessary restrictions on the language which will later prove to be unfortunate.
So I'd encourage you to reconsider some of your design decisions. If you use the parser just to build up some kind of structural representation of the program -- whether it's an abstract syntax tree or a vector of three-address code, or some other alternative -- and then do further processing in a series of single-purpose passes over that structural representations, you'll likely find that the code is:
much simpler, because computations don't have to be intermingled;
more general, because each pass can be done in the most convenient order rather than restricting inputs to fit a linear ordering;
more readable and more maintainable.
Persisting data structures over multiple passes might increase storage requirements slightly. But the structures are unlikely to occupy enough storage that this will be noticeable. And it probably will not increase the computation time; indeed, it might even reduce the time because the individual passes are simpler and easier to optimise.

Is it appropriate for a parser DCG to not be deterministic?

I am writing a parser for a query engine. My parser DCG query is not deterministic.
I will be using the parser in a relational manner, to both check and synthesize queries.
Is it appropriate for a parser DCG to not be deterministic?
In code:
If I want to be able to use query/2 both ways, does it require that
?- phrase(query, [q,u,e,r,y]).
true;
false.
or should I be able to obtain
?- phrase(query, [q,u,e,r,y]).
true.
nevertheless, given that the first snippet would require me to use it as such
?- bagof(X, phrase(query, [q,u,e,r,y]), [true]).
true.
when using it to check a formula?
The first question to ask yourself, is your grammar deterministic, or in the terminology of grammars, unambiguous. This is not asking if your DCG is deterministic, but if the grammar is unambiguous. That can be answered with basic parsing concepts, no use of DCG is needed to answer that question. In other words, is there only one way to parse a valid input. The standard book for this is "Compilers : principles, techniques, & tools" (WorldCat)
Now you are actually asking about three different uses for parsing.
A recognizer.
A parser.
A generator.
If your grammar is unambiguous then
For a recognizer the answer should only be true for valid input that can be parsed and false for invalid input.
For the parser it should be deterministic as there is only one way to parse the input. The difference between a parser and an recognizer is that a recognizer only returns true or false and a parser will return something more, typically an abstract syntax tree.
For the generator, it should be semi-deterministic so that it can generate multiple results.
Can all of this be done with one, DCG, yes. The three different ways are dependent upon how you use the input and output of the DCG.
Here is an example with a very simple grammar.
The grammar is just an infix binary expression with one operator and two possible operands. The operator is (+) and the operands are either (1) or (2).
expr(expr(Operand_1,Operator,Operand_2)) -->
operand(Operand_1),
operator(Operator),
operand(Operand_2).
operand(operand(1)) --> "1".
operand(operand(2)) --> "2".
operator(operator(+)) --> "+".
recognizer(Input) :-
string_codes(Input,Codes),
DCG = expr(_),
phrase(DCG,Codes,[]).
parser(Input,Ast) :-
string_codes(Input,Codes),
DCG = expr(Ast),
phrase(DCG,Codes,[]).
generator(Generated) :-
DCG = expr(_),
phrase(DCG,Codes,[]),
string_codes(Generated,Codes).
:- begin_tests(expr).
recognizer_test_case_success("1+1").
recognizer_test_case_success("1+2").
recognizer_test_case_success("2+1").
recognizer_test_case_success("2+2").
test(recognizer,[ forall(recognizer_test_case_success(Input)) ] ) :-
recognizer(Input).
recognizer_test_case_fail("2+3").
test(recognizer,[ forall(recognizer_test_case_fail(Input)), fail ] ) :-
recognizer(Input).
parser_test_case_success("1+1",expr(operand(1),operator(+),operand(1))).
parser_test_case_success("1+2",expr(operand(1),operator(+),operand(2))).
parser_test_case_success("2+1",expr(operand(2),operator(+),operand(1))).
parser_test_case_success("2+2",expr(operand(2),operator(+),operand(2))).
test(parser,[ forall(parser_test_case_success(Input,Expected_ast)) ] ) :-
parser(Input,Ast),
assertion( Ast == Expected_ast).
parser_test_case_fail("2+3").
test(parser,[ forall(parser_test_case_fail(Input)), fail ] ) :-
parser(Input,_).
test(generator,all(Generated == ["1+1","1+2","2+1","2+2"]) ) :-
generator(Generated).
:- end_tests(expr).
The grammar is unambiguous and has only 4 valid strings which are all unique.
The recognizer is deterministic and only returns true or false.
The parser is deterministic and returns a unique AST.
The generator is semi-deterministic and returns all 4 valid unique strings.
Example run of the test cases.
?- run_tests.
% PL-Unit: expr ........... done
% All 11 tests passed
true.
To expand a little on the comment by Daniel
As Daniel notes
1 + 2 + 3
can be parsed as
(1 + 2) + 3
or
1 + (2 + 3)
So 1+2+3 is an example as you said is specified by a recursive DCG and as I noted a common way out of the problem is to use parenthesizes to start a new context. What is meant by starting a new context is that it is like getting a new clean slate to start over again. If you are creating an AST, you just put the new context, items in between the parenthesizes, as a new subtree at the current node.
With regards to write_canonical/1, this is also helpful but be aware of left and right associativity of operators. See Associative property
e.g.
+ is left associative
?- write_canonical(1+2+3).
+(+(1,2),3)
true.
^ is right associative
?- write_canonical(2^3^4).
^(2,^(3,4))
true.
i.e.
2^3^4 = 2^(3^4) = 2^81 = 2417851639229258349412352
2^3^4 != (2^3)^4 = 8^4 = 4096
The point of this added info is to warn you that grammar design is full of hidden pitfalls and if you have not had a rigorous class in it and done some of it you could easily create a grammar that looks great and works great and then years latter is found to have a serious problem. While Python was not ambiguous AFAIK, it did have grammar issues, it had enough issues that when Python 3 was created, many of the issues were fixed. So Python 3 is not backward compatible with Python 2 (differences). Yes they have made changes and libraries to make it easier to use Python 2 code with Python 3, but the point is that the grammar could have used a bit more analysis when designed.
The only reason why code should be non-deterministic is that your question has multiple answers. In that case, you'd of course want your query to have multiple solutions. Even then, however, you'd like it to not leave a choice point after the last solution, if at all possible.
Here is what I mean:
"What is the smaller of two numbers?"
min_a(A, B, B) :- B < A.
min_a(A, B, A) :- A =< B.
So now you ask, "what is the smaller of 1 and 2" and the answer you expect is "1":
?- min_a(1, 2, Min).
Min = 1.
?- min_a(2, 1, Min).
Min = 1 ; % crap...
false.
?- min_a(2, 1, 2).
false.
?- min_a(2, 1, 1).
true ; % crap...
false.
So that's not bad code but I think it's still crap. This is why, for the smaller of two numbers, you'd use something like the min() function in SWI-Prolog.
Similarly, say you want to ask, "What are the even numbers between 1 and 10"; you write the query:
?- between(1, 10, X), X rem 2 =:= 0.
X = 2 ;
X = 4 ;
X = 6 ;
X = 8 ;
X = 10.
... and that's fine, but if you then ask for the numbers that are multiple of 3, you get:
?- between(1, 10, X), X rem 3 =:= 0.
X = 3 ;
X = 6 ;
X = 9 ;
false. % crap...
The "low-hanging fruit" are the cases where you as a programmer would see that there cannot be non-determinism, but for some reason your Prolog is not able to deduce that from the code you wrote. In most cases, you can do something about it.
On to your actual question. If you can, write your code so that there is non-determinism only if there are multiple answers to the question you'll be asking. When you use a DCG for both parsing and generating, this sometimes means you end up with two code paths. It feels clumsy but it is easier to write, to read, to understand, and probably to make efficient. As a word of caution, take a look at this question. I can't know that for sure, but the problems that OP is running into are almost certainly caused by unnecessary non-determinism. What probably happens with larger inputs is that a lot of choice points are left behind, there is a lot of memory that cannot be reclaimed, a lot of processing time going into book keeping, huge solution trees being traversed only to get (as expected) no solutions.... you get the point.
For examples of what I mean, you can take a look at the implementation of library(dcg/basics) in SWI-Prolog. Pay attention to several things:
The documentation is very explicit about what is deterministic, what isn't, and how non-determinism is supposed to be useful to the client code;
The use of cuts, where necessary, to get rid of choice points that are useless;
The implementation of number//1 (towards the bottom) that can "generate extract a number".
(Hint: use the primitives in this library when you write your own parser!)
I hope you find this unnecessarily long answer useful.

Other ways to call/eval dynamic strings in Lua?

I am working with a third party device which has some implementation of Lua, and communicates in BACnet. The documentation is pretty janky, not providing any sort of help for any more advanced programming ideas. It's simply, "This is how you set variables...". So, I am trying to just figure it out, and hoping you all can help.
I need to set a long list of variables to certain values. I have a userdata 'ME', with a bunch of variables named MVXX (e.g. - MV21, MV98, MV56, etc).
(This is all kind of background for BACnet.) Variables in BACnet all have 17 'priorities', i.e., every BACnet variable is actually a sort of list of 17 values, with priority 16 being the default. So, typically, if I were to say ME.MV12 = 23, that would set MV12's priority-16 to the desired value of 23.
However, I need to set priority 17. I can do this in the provided Lua implementation, by saying ME.MV12_PV[17] = 23. I can set any of the priorities I want by indexing that PV. (Corollaries - what is PV? What is the underscore? How do I get to these objects? Or are they just interpreted from Lua to some function in C on the backend?)
All this being said, I need to make that variable name dynamic, so that i can set whichever value I need to set, based on some other code. I have made several attempts.
This tells me the object(MV12_PV[17]) does not exist:
x = 12
ME["MV" .. x .. "_PV[17]"] = 23
But this works fine, setting priority 16 to 23:
x = 12
ME["MV" .. x] = 23
I was trying to attempt some sort of what I think is called an evaluation, or eval. But, this just prints out function followed by some random 8 digit number:
x = 12
test = assert(loadstring("MV" .. x .. "_PV[17] = 23"))
print(test)
Any help? Apologies if I am unclear - tbh, I am so far behind the 8-ball I am pretty much grabbing at straws.
Underscores can be part of Lua identifiers (variable and function names). They are just part of the variable name (like letters are) and aren't a special Lua operator like [ and ] are.
In the expression ME.MV12_PV[17] we have ME being an object with a bunch of fields, ME.MV12_PV being an array stored in the "MV12_PV" field of that object and ME.MV12_PV[17] is the 17th slot in that array.
If you want to access fields dynamically, the thing to know is that accessing a field with dot notation in Lua is equivalent to using bracket notation and passing in the field name as a string:
-- The following are all equivalent:
x.foo
x["foo"]
local fieldname = "foo"
x[fieldname]
So in your case you might want to try doing something like this:
local n = 12
ME["MV"..n.."_PV"][17] = 23
BACnet "Commmandable" Objects (e.g. Binary Output, Analog Output, and o[tionally Binary Value, Analog Value and a handful of others) actually have 16 priorities (1-16). The "17th" you are referring to may be the "Relinquish Default", a value that is used if all 16 priorities are set to NULL or "Relinquished".
Perhaps your system will allow you to write to a BACnet Property called "Relinquish Default".

Lua base converter

I need a base converter function for Lua. I need to convert from base 10 to base 2,3,4,5,6,7,8,9,10,11...36 how can i to this?
In the string to number direction, the function tonumber() takes an optional second argument that specifies the base to use, which may range from 2 to 36 with the obvious meaning for digits in bases greater than 10.
In the number to string direction, this can be done slightly more efficiently than Nikolaus's answer by something like this:
local floor,insert = math.floor, table.insert
function basen(n,b)
n = floor(n)
if not b or b == 10 then return tostring(n) end
local digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
local t = {}
local sign = ""
if n < 0 then
sign = "-"
n = -n
end
repeat
local d = (n % b) + 1
n = floor(n / b)
insert(t, 1, digits:sub(d,d))
until n == 0
return sign .. table.concat(t,"")
end
This creates fewer garbage strings to collect by using table.concat() instead of repeated calls to the string concatenation operator ... Although it makes little practical difference for strings this small, this idiom should be learned because otherwise building a buffer in a loop with the concatenation operator will actually tend to O(n2) performance while table.concat() has been designed to do substantially better.
There is an unanswered question as to whether it is more efficient to push the digits on a stack in the table t with calls to table.insert(t,1,digit), or to append them to the end with t[#t+1]=digit, followed by a call to string.reverse() to put the digits in the right order. I'll leave the benchmarking to the student. Note that although the code I pasted here does run and appears to get correct answers, there may other opportunities to tune it further.
For example, the common case of base 10 is culled off and handled with the built in tostring() function. But similar culls can be done for bases 8 and 16 which have conversion specifiers for string.format() ("%o" and "%x", respectively).
Also, neither Nikolaus's solution nor mine handle non-integers particularly well. I emphasize that here by forcing the value n to an integer with math.floor() at the beginning.
Correctly converting a general floating point value to any base (even base 10) is fraught with subtleties, which I leave as an exercise to the reader.
you can use a loop to convert an integer into a string containting the required base. for bases below 10 use the following code, if you need a base larger than that you need to add a line that mapps the result of x % base to a character (usign an array for example)
x = 1234
r = ""
base = 8
while x > 0 do
r = "" .. (x % base ) .. r
x = math.floor(x / base)
end
print( r );

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

Resources