I'm reading a language specification, and trying to write a parser for it. In places the specification uses multiple names for the same thing. I tried to copy this to the Happy grammar like so:
foo : ...whatever...
bar : foo { $1 }
baz : foo { $1 }
Unfortunately, this causes Happy to flip out and start whining about "reduce/reduce conflicts". Basically, the problem seems to be that when it sees a foo, it doesn't know whether to reduce it to a bar or a baz. The truth is of course that I don't care, because they're all identical. But it still upsets Happy — which is linguistically ironic if nothing else.
(I'm also slightly scared that if I have a rule that mentions bar and another that mentions baz, the correct rule may fail to fire if the "wrong" reduction is picked. In other words, the parser will have a problem which is 100% impossible to ever debug.)
Is there some way I can tell Happy "wherever I say bar, just pretend I said foo and move on with your life"?
Obviously I could just use some external tool to do a textual find-and-replace, but I'd really prefer to not have to add yet another build step…
I have a project that doesn't have -spec or -type in the code. Currently Dialyzer can find some warnings; most of them are in machine-generated code.
Will adding type specs to the code make Dialyzer find more errors?
Off topic, but is there any tool to check whether the specs have been violated or not?
Adding typespecs will dramatically improve the accuracy of Dialyzer.
Because Erlang is a dynamic language, Dialyzer must default to a rather broad interpretation of types unless you give it hints to narrow the "success" typing that it will go by. Think of it like giving Dialyzer a filter by which it can transform a set of possible successes into a subset of explicit types that should ever work.
This is not the same as Haskell, where the default assumption is failure and all code must be written with successful typing to be compiled at all -- Dialyzer must default to assume success unless it knows for sure that a type is going to fail.
Typespecs are the main part of this, but Dialyzer also checks guards, so a function like
increment(A) -> A + 1.
is not the same as
increment(A) when A > 100 -> A + 1.
though both may be typed as
-spec increment(integer()) -> integer().
Most of the time you only care about integer values being integer(), pos_integer(), neg_integer(), or non_neg_integer(), but occasionally you need an arbitrary range bounded only on one side -- and the type language has no way to represent this currently (though personally I would like to see a declaration of 100..infinity work as expected).
The unbounded-range of when A > 100 requires a guard, but a bounded range like when A > 100 and A < 201 could be represented in the typespec alone:
-spec increment(101..200) -> pos_integer().
increment(A) -> A + 1.
Guards are fast with the exception of calling length/1 (which you should probably never actually need in a guard), so don't worry about the performance overhead until you actually know and can demonstrate that you have a performance problem that comes from guards. Using guards and typespecs to constrain Dialyzer is extremely useful. It is also very useful as documentation for yourself, and especially if you use edoc, as the typespec will be shown there, making APIs less mysterious and easier to toy with at a glance.
There is some interesting literature on the use of Dialyzer in existing codebases. A well-documented experience is here: Gradual Typing of Erlang Programs: A Wrangler Experience. (Unfortunately some of the other links I learned a lot from previously have disappeared or moved. (!.!) A careful read of the Wrangler paper, skimming over the User's Guide and man page, playing with Dialyzer, and some prior experience in a type system like Haskell's will more than prepare you for getting a lot of mileage out of Dialyzer, though.)
[On a side note, I've spoken with a few folks before about specifying "pure" functions that could be guaranteed as strongly typed either with a notation or by using a different definition syntax (maybe Prolog's :- instead of Erlang's ->... or something), but though that would be cool, and it is very possible even now to concentrate side-effects in a tiny part of the program and pass all results back in a tuple of {Results, SideEffectsTODO}, this is simply not a pressing need and Erlang works pretty darn well as-is. But Dialyzer is indeed very helpful for showing you where you've lost track of yourself!]
This is a question I've been mildly irritated about for some time and just never got around to searching for the answer to.
However I thought I might at least ask the question and perhaps someone can explain.
Basically many languages I've worked in utilize syntactic sugar to write (using syntax from C++):
int main() {
    int a = 2;
    a += 3; // a = a + 3
}
while in Lua += is not defined, so I would have to write a = a + 3, which again is all about syntactic sugar. When using a more "meaningful" variable name such as bleed_damage_over_time or something, it starts getting tedious to write:
bleed_damage_over_time = bleed_damage_over_time + added_bleed_damage_over_time
instead of:
bleed_damage_over_time += added_bleed_damage_over_time
So what I would like to know is not how to solve this (though if you have a nice solution I would of course be interested in hearing it), but rather why Lua doesn't implement this syntactic sugar.
This is just guesswork on my part, but:
1. It's hard to implement this in a single-pass compiler
Lua's bytecode compiler is implemented as a single-pass recursive descent parser that immediately generates code. It does not parse to a separate AST structure and then in a second pass convert that to bytecode.
This forces some limitations on the grammar and semantics. In particular, anything that requires arbitrary lookahead or forward references is really hard to support in this model. This means assignments are already hard to parse. Given something like:
foo.bar.baz = "value"
When you're parsing foo.bar.baz, you don't realize you're actually parsing an assignment until you hit the = after you've already parsed and generated code for that. Lua's compiler has a good bit of complexity just for handling assignments because of this.
Supporting self-assignment would make that even harder. Something like:
foo.bar.baz += "value"
Needs to get translated to:
foo.bar.baz = foo.bar.baz + "value"
But at the point that the compiler hits the =, it's already forgotten about foo.bar.baz. It's possible, but not easy.
2. It may not play nice with the grammar
Lua doesn't actually have any statement or line separators in the grammar. Whitespace is ignored and there are no mandatory semicolons. You can do:
io.write("one")
io.write("two")
Or:
io.write("one") io.write("two")
And Lua is equally happy with both. Keeping a grammar like that unambiguous is tricky. I'm not sure, but self-assignment operators may make that harder.
3. It doesn't play nice with multiple assignment
Lua supports multiple assignment, like:
a, b, c = someFnThatReturnsThreeValues()
It's not even clear to me what it would mean if you tried to do:
a, b, c += someFnThatReturnsThreeValues()
You could limit self-assignment operators to single assignment, but then you've just added a weird corner case people have to know about.
With all of this, it's not at all clear that self-assignment operators are useful enough to be worth dealing with the above issues.
I think you could just rewrite this question as
Why doesn't <languageX> have <featureY> from <languageZ>?
Typically it's a trade-off that the language designers make based on their vision of what the language is intended for, and their goals.
In Lua's case, the language is intended to be an embedded scripting language, so any changes that make the language more complex or potentially make the compiler/runtime even slightly larger or slower may go against this objective.
If you implement each and every tiny feature, you can end up with a 'kitchen sink' language: Ada, anyone?
And as you say, it's just syntactic sugar.
Another reason why Lua doesn't have self-assignment operators is that table access can be overloaded with metatables to have arbitrary side effects. For self assignment you would need to choose to desugar
foo.bar.baz += 2
into
foo.bar.baz = foo.bar.baz + 2
or into
local tmp = foo.bar
tmp.baz = tmp.baz + 2
The first version runs the __index metamethod for foo twice, while the second one does so only once. Not including self-assignment in the language and forcing you to be explicit helps avoid this ambiguity.
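As a rough analogy in Python rather than Lua (the classes and names below are invented purely for illustration), you can see that the two desugarings are observably different as soon as reading the prefix expression has a side effect:
class Inner:
    baz = 0

class Outer:
    # Reading .bar has a side effect, standing in for a Lua __index metamethod.
    def __init__(self):
        self._bar = Inner()
        self.lookups = 0

    @property
    def bar(self):
        self.lookups += 1
        return self._bar

# Desugaring 1: the prefix foo.bar is evaluated twice.
foo = Outer()
foo.bar.baz = foo.bar.baz + 2
print(foo.lookups)   # 2

# Desugaring 2: the prefix foo.bar is evaluated once.
foo = Outer()
tmp = foo.bar
tmp.baz = tmp.baz + 2
print(foo.lookups)   # 1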
This is a follow-up to a previous question I asked, How to encode FIRST & FOLLOW sets inside a compiler, but this one is more about the design of my program.
I am implementing the Syntax Analysis phase of my compiler by writing a recursive descent parser. I need to be able to take advantage of the FIRST and FOLLOW sets so I can handle errors in the syntax of the source program more efficiently. I have already calculated the FIRST and FOLLOW sets for all of my non-terminals, but am having trouble deciding where to logically place them in my program and what the best data structure would be for doing so.
Note: all code will be pseudo code
Option 1) Use a map, and map all non-terminals by their name to two Sets that contains their FIRST and FOLLOW sets:
class ParseConstants
Map firstAndFollowMap = #create a map .....
firstAndFollowMap.put("<program>", FIRST_SET, FOLLOW_SET)
end
This seems like a viable option, but inside my parser I would then need sorta ugly code like this to retrieve the FIRST and FOLLOW sets and pass them to the error function:
#processes the <program> non-terminal
def program
List list = firstAndFollowMap.get("<program>")
Set FIRST = list.get(0)
Set FOLLOW = list.get(1)
error(current_symbol, FOLLOW)
end
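For reference, here is roughly what Option 1 would look like as concrete Python rather than pseudocode; the grammar symbols, token names, and the error helper are all invented placeholders:
from collections import namedtuple

FirstFollow = namedtuple("FirstFollow", ["first", "follow"])

# One entry per non-terminal; the sets shown are placeholders.
FIRST_FOLLOW = {
    "<program>": FirstFollow(first={"def", "class"}, follow={"$"}),
    "<stmt>":    FirstFollow(first={"id", "if"},     follow={";", "$"}),
}

def error(symbol, follow_set):
    # Report the error and let the caller resynchronize on the FOLLOW set.
    print(f"unexpected {symbol!r}; skipping to one of {sorted(follow_set)}")

def program(current_symbol):
    # Processes the <program> non-terminal.
    ff = FIRST_FOLLOW["<program>"]
    if current_symbol not in ff.first:
        error(current_symbol, ff.follow)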
Option 2) Create a class for each non-terminal and have a FIRST and FOLLOW property:
class Program
FIRST = .....
FOLLOW = ....
end
This leads to code that looks a little nicer:
#processes the <program> non-terminal
def program
error(current_symbol, Program.FOLLOW)
end
These are the two options I thought up. I would love to hear any other suggestions for ways to encode these two sets; any critiques of or additions to the two approaches I posted would also be helpful.
Thanks
I have also posted this question here: http://www.coderanch.com/t/570697/java/java/Encode-FIRST-FOLLOW-sets-recursive
You don't really need the FIRST and FOLLOW sets. You need to compute those to get the parse table. That is a table of {<non-terminal, token> -> <action, rule>} if LL(k) (which means: seeing a non-terminal on the stack and a token in the input, which action to take and, if it applies, which rule to apply), or a table of {<state, token> -> <action, state>} if (C|LA|)LR(k) (which means: given the state on the stack and the token in the input, which action to take and which state to go to).
After you get this table, you don't need FIRST and FOLLOW any more.
If you are writing a semantic analyzer, you must assume the parser is working correctly. Phrase level error handling (which means handling parse errors) is totally orthogonal to semantic analysis.
This means that in case of parse error, the phrase level error handler (PLEH) would try to fix the error. If it couldn't, parsing stops. If it could, the semantic analyzer shouldn't know if there was an error which was fixed, or there wasn't any error at all!
You can take a look at my compiler library for examples.
About phrase level error handling, you again don't need FIRST and FOLLOW. Let's talk about LL(k) for now (simply because I haven't thought much about LR(k) yet). After you build the table, you have many entries which, like I said, look like this:
<non-terminal, token> -> <action, rule>
Now, when you parse, you take whatever is on top of the stack. If it is a terminal, you must match it with the input. If it doesn't match, the phrase level error handler kicks in:
Role one: handle missing terminals - simply generate a fake terminal of the type you need in your lexer and have the parser retry. You can do other stuff as well (for example, check ahead in the input: if the token you want is there, drop one token from the lexer).
If what you get from the stack is a non-terminal (T) instead, you must look at your lexer, get the lookahead, and look at your table. If the entry <T, lookahead> exists, then you're good to go: follow the action and push to/pop from the stack. If, however, no such entry exists, again the phrase level error handler kicks in:
Role two: handle unexpected terminals - you can do many things to get past this. What you do depends on what T and the lookahead are, and on your expert knowledge of your grammar.
Examples of the things you can do are:
Fail! You can do nothing
Ignore this terminal. This means that you push the lookahead onto the stack (after pushing T back again) and have the parser continue. The parser will first match the lookahead, throw it away, and continue. Example: in an expression like *1+2/0.5, you can drop the unexpected * this way.
Change the lookahead to something acceptable, push T back, and retry. For example, an expression like 5id = 10; could be illegal because you don't accept ids that start with numbers. You could replace it with _5id, for example, to continue.
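Here is a rough sketch (not from any particular library) of a table-driven LL(1) loop with those two roles wired in; the grammar, parse table, and token names are all made up for illustration:
# Grammar: E -> n E'   ;   E' -> + n E' | (empty)
TABLE = {
    ("E",  "n"): ["n", "E'"],
    ("E'", "+"): ["+", "n", "E'"],
    ("E'", "$"): [],                      # at end of input: E' -> empty
}
TERMINALS = {"n", "+", "$"}

def parse(tokens):
    tokens = list(tokens) + ["$"]         # lexer output plus end marker
    stack = ["$", "E"]                    # start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in TERMINALS:              # terminal on the stack: must match input
            if top == look:
                pos += 1
            else:
                # Role one: missing terminal, so pretend the lexer produced it
                print(f"error: expected {top!r}, got {look!r}; inserting {top!r}")
        else:                             # non-terminal: consult the table
            rule = TABLE.get((top, look))
            if rule is not None:
                stack.extend(reversed(rule))
            elif look == "$":
                print(f"error: unexpected end of input while expanding {top}")
                break
            else:
                # Role two: unexpected terminal, so skip it, push T back, retry
                print(f"error: unexpected {look!r} while expanding {top}; skipping it")
                pos += 1
                stack.append(top)
    else:
        print("parse finished")

parse(["n", "+", "n"])                    # clean parse
parse(["+", "n", "+", "n"])               # the leading '+' is reported and skipped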
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
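For what it's worth, for the Python case the scope structure a renamer would need is exactly what the standard symtable module computes. Here is a rough sketch that dumps it for the example above (the dump helper and the sample output shown are just illustrative):
import symtable

SRC = """
def foo(x):
    def bar(y):
        return x + y
    return bar
"""

def dump(table, indent=0):
    # List every symbol the scope binds or references, then recurse into nested scopes.
    names = ", ".join(sym.get_name() for sym in table.get_symbols())
    print(" " * indent + f"{table.get_type()} {table.get_name()!r}: {names}")
    for child in table.get_children():
        dump(child, indent + 2)

dump(symtable.symtable(SRC, "<example>", "exec"))
# prints something like:
#   module 'top': foo
#     function 'foo': x, bar
#       function 'bar': y, x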
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine:
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine program structure in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language's scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types will all pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point where an identifier is encountered in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures are different for PHP and COBOL, but they share lots of common ideas, so you likely need a library with support for the common concepts.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the tree you have to regenerate the source text from the modified AST. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the identifier is, and then read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x = 1;
  { local y;
    y = 2;
    { local z;
      z = y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping, and now the print statement, which referred conceptually to the outer x, refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This would be legal if the print statement printed z.)
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared.) We are about to start one for C++.
You could try to create Xtext-based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
# Set aside any existing uses of the new name first, then do the rename.
sed -E -i 's/\b(newTokenName)\b/TEMPTOKEN/g' file
sed -E -i 's/\b(oldTokenName)\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]), would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules will be incompatible, and at that point, you'll need to do more and more in the plugin.
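As a rough sketch of that plugin idea (Python here; the language table, patterns, and naive_rename helper are all invented for illustration and, like the sed version, ignore scoping entirely):
import re

# Per-language identifier patterns; a real tool would need real per-language rules.
IDENT = {
    "c":      r"[A-Za-z_][A-Za-z0-9_]*",
    "python": r"[A-Za-z_][A-Za-z0-9_]*",
    "php":    r"\$[A-Za-z_][A-Za-z0-9_]*",   # variables carry a sigil
}

def naive_rename(text, lang, old, new):
    # Validate the new name against the language's rules, then do a whole-token replace.
    if not re.fullmatch(IDENT[lang], new):
        raise ValueError(f"{new!r} is not a valid {lang} identifier")
    pattern = r"(?<![A-Za-z0-9_])" + re.escape(old) + r"(?![A-Za-z0-9_])"
    return re.sub(pattern, new, text)

print(naive_rename("int foo = foo2 + foo;", "c", "foo", "bar"))
# -> int bar = foo2 + bar;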
Seems like it's inconsistent in the lists module. For example, split has the number as the first argument and the list as the second, but sublists has the list as the first argument and the len as the second argument.
OK, a little history as I remember it and some principles behind my style.
As Christian has said the libraries evolved and tended to get the argument order and feel from the impulses we were getting just then. So for example the reason why element/setelement have the argument order they do is because it matches the arg/3 predicate in Prolog; logical then but not now. Often we would have the thing being worked on first, but unfortunately not always. This is often a good choice as it allows "optional" arguments to be conveniently added to the end; for example string:substr/2/3. Functions with the thing as the last argument were often influenced by functional languages with currying, for example Haskell, where it is very easy to use currying and partial evaluation to build specific functions which can then be applied to the thing. This is very noticeable in the higher order functions in lists.
The only influence we didn't have was from the OO world. :-)
Usually we at least managed to be consistent within a module, but not always. See lists again. We did try to have some consistency, so the argument order in the higher order functions in dict/sets match those of the corresponding functions in lists.
The problem was also aggravated by the fact that we, especially me, had a rather cavalier attitude to libraries. I just did not see them as a selling point for the language, so I wasn't that worried about it. "If you want a library which does something then you just write it" was my motto. This meant that my libraries were structured, just not always with the same structure. :-) That was how many of the initial libraries came about.
This, of course, creates unnecessary confusion and breaks the law of least astonishment, but we have not been able to do anything about it. Any suggestions of revising the modules have always been met with a resounding "no".
My own personal style is usually structured, though I don't know if it conforms to any written guidelines or standards.
I generally have the thing or things I am working on as the first arguments, or at least very close to the beginning; the order depends on what feels best. If there is a global state which is chained through the whole module, which there usually is, it is placed as the last argument and given a very descriptive name like St0, St1, ... (I belong to the church of short variable names). Arguments which are chained through functions (both input and output) I try to keep the same argument order as return order. This makes it much easier to see the structure of the code. Apart from that I try to group together arguments which belong together. Also, where possible, I try to preserve the same argument order throughout a whole module.
None of this is very revolutionary, but I find if you keep a consistent style then it is one less thing to worry about and it makes your code feel better and generally be more readable. Also I will actually rewrite code if the argument order feels wrong.
A small example which may help:
fubar({f,A0,B0}, Arg2, Ch0, Arg4, St0) ->
    {A1,Ch1,St1} = foo(A0, Arg2, Ch0, St0),
    {B1,Ch2,St2} = bar(B0, Arg4, Ch1, St1),
    Res = baz(A1, B1),
    {Res,Ch2,St2}.
Here Ch is a local chained through variable while St is a more global state. Check out the code on github for LFE, especially the compiler, if you want a longer example.
This became much longer than it should have been, sorry.
P.S. I used the word thing instead of object to avoid confusion about what I was talking about.
No, there is no consistently-used idiom in the sense that you mean.
However, there are some useful relevant hints that apply especially when you're going to be making deeply recursive calls. For instance, keeping whichever arguments will remain unchanged during tail calls in the same order/position in the argument list allows the virtual machine to make some very nice optimizations.