I'm making a language and I'm having a little trouble with implementing casting in the grammar. The syntax for casting is on line 61 of the grammar file. Currently it will take something like (Int) 5.4 + 7 and turn it into (Int) (5.4 + 7). I want it to look like ((Int) 5.4) + 7, but I haven't been able to get it to do so. Any ideas on what I need to do to fix this or where I need to go to fix it?
I stripped out the excess rules from the grammar file that weren't referenced by the problematic code.
I found a solution: I moved the three unary expressions into their own rule and then made casting accept either a primary or a unary expression instead of a full expression.
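For reference, the shape of the fix looks something like the sketch below. The rule names are hypothetical (the stripped grammar isn't shown), but the point is that the operand of a cast is a unary or primary expression rather than a full expression, so the cast binds tighter than the binary operators:

Expression      <- CastExpression (AddOp CastExpression)*
AddOp           <- '+' / '-'
CastExpression  <- '(' Type ')' CastExpression / UnaryExpression
UnaryExpression <- ('-' / '!' / '~') UnaryExpression / Primary

With this arrangement, (Int) 5.4 + 7 parses as ((Int) 5.4) + 7, because the + can no longer be swallowed into the cast's operand.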
I have a fully functional cpp-peglib grammar that works pretty well for parsing expressions like:
SomeObj.value + (Math.random(0, 32)) * 3.14 - ("abcdef".find("c", 0)).toFloat()
I would like to implement an autocomplete system where it parses partial expressions from the current cursor position and backwards until it has a valid expression.
Basically if I attempt to parse something like
something * ( 4 + "test".leng
I would get out the smallest possible rightmost valid expression which is "test".leng in this case.
The naive attempt is to make a root grammar rule that prefixes the Expression with a Garbage rule, in an attempt to find the rightmost valid expression.
PartialExpression <- Garbage Expression
Garbage <- < (.*) >
Expression <- .... normal grammar rules following here ....
This doesn't seem to work, and I think I know why: it would require the PEG system to try all possible lengths of (.*), starting with the full string and working backwards until the remainder parses as a valid expression. But PEG repetition is greedy and never backtracks, so (.*) always consumes the entire input and Expression is left with nothing to match. The PEG engine just isn't suited for this.
I could of course just try to do all this manually by incrementally throwing longer and longer substrings of the expression ending at the caret position at the parser and seeing when it finally parses. It could be done semi-intelligently to avoid doing it for every character.
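That manual approach can be fairly cheap in practice. Here is a minimal sketch with cpp-peglib, assuming your grammar's start rule is the expression rule; the helper name and the keep-the-longest-suffix policy are mine, and whether you keep the longest or the shortest successful suffix is an autocomplete policy choice:

#include <peglib.h>

#include <optional>
#include <string>

// Hypothetical helper (not part of cpp-peglib): walk backwards from the
// caret, hand each suffix to the parser, and remember the longest suffix
// that still parses as a complete expression.
std::optional<std::string> rightmostValid(peg::parser &parser,
                                          const std::string &text,
                                          size_t caret) {
    std::optional<std::string> best;
    for (size_t start = caret; start-- > 0;) {
        std::string candidate = text.substr(start, caret - start);
        if (parser.parse(candidate.c_str())) {
            best = candidate;  // a longer suffix also parsed; keep it
        }
        // A smarter version would step back a token at a time instead of
        // one character at a time.
    }
    return best;
}

Note that, as far as I know, packrat memoisation in cpp-peglib is per parse() call on a given input, so it won't amortise work across these repeated attempts.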
A crazy idea came to my head:
Maybe it would be possible to completely "reverse" the grammar and run it on the reversed input string gnel."tset" + 4 ( * gnihtemos
That would allow me to parse by ignoring the garbage at the end instead of at the front.
Getting out the "normal" valid AST would require post-processing it before it could be used. While this could maybe work, it seems rather convoluted.
Is there another more clever/performant approach with the cpp-peglib system I could use to do this without changing too many things in the grammar?
Is there some implementation detail of the packrat parser versus its normal mode that would affect this?
I would like to write a function that can parse and multiply two algebraic expressions in GF(2), i.e. any variable in the expression can only take on two possible values, 0 or 1, so a^2 = a (0^2 = 0, 1^2 = 1).
As an example, if we expand (a+b)*(a+c) in GF(2), we should get
(a + b)*(a + c) = a^2 + a*b + a*c + b*c = a + a*b + a*c + b*c.
However, I am not sure how to go about parsing the two algebraic expressions from strings. Any suggestion/help is appreciated. Thanks!
I would recommend taking a look at OMeta, by Alex Warth, and/or PetitParser, by Lukas Renggli. Both are excellent frameworks for writing parsers. The first one is for JS, the second for Smalltalk.
Here are some few initial lines of code showing how to write your parser in PetitParser. Every fragment is a method of your own subclass of PPCompositeParser.
constant
^$0 asParser / $1 asParser
variable
^#letter asParser
timesOp
^#blank asParser star , $* asParser , #blank asParser star
sumOp
^#blank asParser star , $+ asParser , #blank asParser star
element
^self constant / self variable
term
^self element , (self timesOp , self element) star
etc.
I'm not saying this is trivial. I'm only saying that this is where I would start. Note also that once you have your grammar in place you might want to subclass it so you can generate more appropriate productions, etc.
Writing parsers for big complicated languages can be hard. But writing parsers for algebraic expressions (GF(2) or otherwise) is pretty easy.
See my SO answer on how to write such parsers easily: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
The GF(2) bit is about semantic interpretation of what such a formula means. It doesn't matter at all for parsing, which is purely about syntax.
Where meaning comes into play is when you want to interpret the formula.
At some point, you may want to evaluate the expression using values for the variables. To do that, you have to capture the formula as a data structure (usually called an (abstract) syntax tree), and then walk that tree to compute the desired result. That link also discusses how to do that.
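As a concrete starting point, here is a minimal sketch of that approach in C++ (not the linked answer's code; the names are mine): a recursive-descent parser that evaluates a GF(2) formula directly against an assignment of the variables, using XOR for + and AND for *. Exponents are omitted since a^n = a in GF(2).

#include <cctype>
#include <map>
#include <stdexcept>
#include <string>

// Grammar: expr := term ('+' term)* ; term := factor ('*' factor)* ;
//          factor := '0' | '1' | variable | '(' expr ')'
struct Gf2Eval {
    const std::string &src;
    size_t pos;
    const std::map<char, int> &vars;   // variable assignments (0 or 1)

    void skipWs() {
        while (pos < src.size() && std::isspace((unsigned char)src[pos])) ++pos;
    }
    bool eat(char c) {
        skipWs();
        if (pos < src.size() && src[pos] == c) { ++pos; return true; }
        return false;
    }
    int expr() {                       // '+' is XOR in GF(2)
        int v = term();
        while (eat('+')) v ^= term();
        return v;
    }
    int term() {                       // '*' is AND in GF(2)
        int v = factor();
        while (eat('*')) v &= factor();
        return v;
    }
    int factor() {
        if (eat('(')) {
            int v = expr();
            if (!eat(')')) throw std::runtime_error("expected ')'");
            return v;
        }
        skipWs();
        if (pos >= src.size()) throw std::runtime_error("unexpected end of input");
        char c = src[pos++];
        if (c == '0') return 0;
        if (c == '1') return 1;
        if (std::isalpha((unsigned char)c)) return vars.at(c) & 1;
        throw std::runtime_error(std::string("unexpected character: ") + c);
    }
};

// Usage:
//   std::map<char, int> vars{{'a', 1}, {'b', 0}, {'c', 1}};
//   std::string f = "(a+b)*(a+c)";
//   Gf2Eval ev{f, 0, vars};
//   int result = ev.expr();   // (1^0) & (1^1) == 0

Symbolic expansion into a + a*b + a*c + b*c is then a matter of having factor/term/expr build a tree instead of evaluating, and multiplying out sets of terms with the a*a = a and t + t = 0 simplifications.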
If you want to manipulate the formula symbolically, you're in an entirely different ball game. Parsing is still easy, but formula manipulation is not, and you'll want to use tools that are designed to do such symbolic manipulation; they generally define their own parsing machinery (and make it easy to use) to ensure that the captured parse can be manipulated. And of course, you'll have to define what the rules of your symbolic manipulation are.
You can see an example of how to write something pretty close to your needs at Symbolic Algebra with a program transformation system. (This is a tool that my company builds.)
OK, so here's a question: Given that Haskell allows you to define new operators with arbitrary operator precedence... how is it possible to actually parse Haskell source code?
You cannot know what operator precedences are set until you parse the source. But you cannot parse the source until you know the correct operator precedences. So... um, how?
Consider, for example, the expression
x *** y +++ z
Until we finish parsing the module, we don't know what other modules are imported, and hence what operators (and other identifiers) might be in scope. We certainly don't know their precedences yet. But the parser has to return something. Should it return
(x *** y) +++ z
Or should it return
x *** (y +++ z)
The poor parser has no way to know. This can only be determined once you hunt down the import that brings (+++) and (***) into scope, load that file off disk, and discover what the operator precedences are. Clearly the parser itself isn't going to do all that I/O; a parser just turns a stream of characters into an AST.
Clearly somebody somewhere has figured out how to do this. But I can't work it out... Any hints?
Quoting the page on GHC trac for the parser:
Infix operators are parsed as if they were all left-associative. The
renamer uses the fixity declarations to re-associate the syntax tree.
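To make that second stage concrete, here is a rough sketch with invented types (this illustrates the idea, not GHC's renamer): the parser leaves behind a flat chain of operands and operator names, and once fixities are known, the tree is rebuilt by precedence climbing.

#include <map>
#include <memory>
#include <string>
#include <vector>

struct Expr {
    std::string op;                 // operator name; empty for leaves
    std::string atom;               // leaf payload (identifier, literal)
    std::unique_ptr<Expr> lhs, rhs;
};

struct Fixity { int prec; bool rightAssoc; };

// ops[i] sits between operands[i] and operands[i + 1].
std::unique_ptr<Expr> rebuild(std::vector<std::unique_ptr<Expr>> &operands,
                              const std::vector<std::string> &ops,
                              size_t &opIdx, size_t &argIdx,
                              const std::map<std::string, Fixity> &fix,
                              int minPrec) {
    auto lhs = std::move(operands[argIdx++]);
    while (opIdx < ops.size() && fix.at(ops[opIdx]).prec >= minPrec) {
        Fixity f = fix.at(ops[opIdx]);
        std::string op = ops[opIdx++];
        // Left-associative: the next operator of equal precedence must not
        // be absorbed into our right-hand side, so raise the threshold.
        int nextMin = f.rightAssoc ? f.prec : f.prec + 1;
        auto rhs = rebuild(operands, ops, opIdx, argIdx, fix, nextMin);
        auto node = std::make_unique<Expr>();
        node->op = op;
        node->lhs = std::move(lhs);
        node->rhs = std::move(rhs);
        lhs = std::move(node);
    }
    return lhs;
}

For x *** y +++ z, giving *** precedence 7 and +++ precedence 6 produces (x *** y) +++ z; swap the precedences and the same flat chain rebuilds as x *** (y +++ z).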
András Kovács's answer tells what's really done in GHC, but there's some history to this.
There was actually a somewhat hypothetical change from the Haskell 98 to the Haskell 2010 standard. In the former's BNF grammar, operator fixity and parsing were intertwined in such a way that you could in theory have some very strange interactions between the rules for fixity and the rules for when expressions and indentation blocks end. (For the latter two, the rules are essentially, "keep on going until you have to stop".)
In particular, you could redefine a local operator and its fixity such that a use of it belonged in the redefining inner where block exactly when it didn't, so you got a parser paradox. I cannot find any of the old examples, but this may be one:
let (+) = (Prelude.+)
    infix 9 +  -- make the inner + high precedence and non-associative
in 2 + 3 + 4
-- ^ this + cannot parse here as the inner operator, which means
--   the let ... in ... expression should end automatically first,
--   but then it's the standard +, and its fixity says it should parse
--   as part of the inner expression...
In Haskell 2010 they officially changed that so that operator fixities are determined in a separate stage after the parsing proper.
So why was this a hypothetical change? Because all the compiler writers already did it the Haskell 2010 way, and always had, for their own sanity.
Summarising the comments so far, it seems the possibilities are thus:
1. Return a parse tree where any infix operators are left as some kind of "list" structure, and then rearrange once precedences become known.
2. Pretend you know the operator precedences, and then rearrange the parse tree after the fact.
3. Do a first parse that only reads imports and fixity declarations, load the imports, and then do a full parse with known precedences.
This is a question I've been mildly irritated about for some time and just never got around to searching for an answer to.
However I thought I might at least ask the question and perhaps someone can explain.
Basically many languages I've worked in utilize syntactic sugar to write (using syntax from C++):
int main() {
    int a = 2;
    a += 3; // a = a + 3
}
while in Lua += is not defined, so I would have to write a = a + 3, which again is all about syntactic sugar. When using a more "meaningful" variable name such as bleed_damage_over_time it starts getting tedious to write:
bleed_damage_over_time = bleed_damage_over_time + added_bleed_damage_over_time
instead of:
bleed_damage_over_time += added_bleed_damage_over_time
So I would like to know why Lua doesn't implement this syntactic sugar; I'm not asking how to solve this, though if you do have a nice solution I would of course be interested in hearing it.
This is just guesswork on my part, but:
1. It's hard to implement this in a single-pass compiler
Lua's bytecode compiler is implemented as a single-pass recursive descent parser that immediately generates code. It does not parse to a separate AST structure and then in a second pass convert that to bytecode.
This forces some limitations on the grammar and semantics. In particular, anything that requires arbitrary lookahead or forward references is really hard to support in this model. This means assignments are already hard to parse. Given something like:
foo.bar.baz = "value"
When you're parsing foo.bar.baz, you don't realize you're actually parsing an assignment until you hit the = after you've already parsed and generated code for that. Lua's compiler has a good bit of complexity just for handling assignments because of this.
Supporting self-assignment would make that even harder. Something like:
foo.bar.baz += "value"
Needs to get translated to:
foo.bar.baz = foo.bar.baz + "value"
But at the point that the compiler hits the =, it's already forgotten about foo.bar.baz. It's possible, but not easy.
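A hedged illustration of the pattern being described, with invented names (this is the general single-pass shape, not Lua's actual compiler): all the parser can afford to keep after parsing foo.bar.baz is a small descriptor of the last expression.

#include <string>

// After code for "foo.bar" has already been emitted, the compiler only
// remembers which register holds the table and which key comes next.
struct ExprDesc {
    enum Kind { Local, TableField } kind;
    int reg;            // register holding the local / the table
    std::string key;    // field name when kind == TableField
};

// On '=', a plain store is easy: emit SETFIELD(desc.reg, desc.key, value).
// On '+=', the compiler would additionally have to reuse the descriptor to
// emit a read, GETFIELD(desc.reg, desc.key), add the right-hand side, and
// store the result back. That is possible, but the descriptor machinery has
// to survive longer, which is exactly the extra complexity noted above.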
2. It may not play nice with the grammar
Lua doesn't actually have any statement or line separators in the grammar. Whitespace is ignored and there are no mandatory semicolons. You can do:
io.write("one")
io.write("two")
Or:
io.write("one") io.write("two")
And Lua is equally happy with both. Keeping a grammar like that unambiguous is tricky. I'm not sure, but self-assignment operators may make that harder.
3. It doesn't play nice with multiple assignment
Lua supports multiple assignment, like:
a, b, c = someFnThatReturnsThreeValues()
It's not even clear to me what it would mean if you tried to do:
a, b, c += someFnThatReturnsThreeValues()
You could limit self-assignment operators to single assignment, but then you've just added a weird corner case people have to know about.
With all of this, it's not at all clear that self-assignment operators are useful enough to be worth dealing with the above issues.
I think you could just rewrite this question as
Why doesn't <languageX> have <featureY> from <languageZ>?
Typically it's a trade-off that the language designers make based on their vision of what the language is intended for, and their goals.
In Lua's case, the language is intended to be an embedded scripting language, so any changes that make the language more complex or potentially make the compiler/runtime even slightly larger or slower may go against this objective.
If you implement each and every tiny feature, you can end up with a 'kitchen sink' language: Ada, anyone?
And as you say, it's just syntactic sugar.
Another reason why Lua doesn't have self-assignment operators is that table access can be overloaded with metatables to have arbitrary side effects. For self assignment you would need to choose to desugar
foo.bar.baz += 2
into
foo.bar.baz = foo.bar.baz + 2
or into
local tmp = foo.bar
tmp.baz = tmp.baz + 2
The first version runs the __index metamethod for foo twice, while the second one does so only once. Not including self-assignment in the language and forcing you to be explicit helps avoid this ambiguity.
I'm trying to parse a syntax using the Shunting Yard (SY) algorithm. The syntax includes the following commands (there are many, many others though!):
a + b // a and b are numbers
setxy c d //c,d can be numbers
setxy c+d b+a //all numbers
Essentially, setxy is a function but it doesn't expect any function argument separators. This makes it very difficult (impossible?) to do via SY due to the lack of parens and function argument separators.
Any idea if SY can be used to parse a parentheses-less/function argument separator-less function or should I move on to a different parsing algorithm? If so, which one would you recommend?
Thanks!
djs22
Having defined a correct grammar, you can have ANTLR (http://www.antlr.org/) generate a parser for you. Whether that is an appropriate solution depends on your homework "requirements".
At least you can generate it and look inside for some hints.
I don't fully understand what you are trying to do, but perhaps you could use some regex? What are you trying to do: write a simple command line program?