I'm writing an LALR parser generator as a pet project.
I'm using the purple dragon book to help me with the design, and what I gather from it is that there are four methods of error recovery in a parser:
Panic mode: Discard input symbols until a synchronizing symbol pre-selected by the compiler designer is found (sketched just after this list)
Phrase-level recovery: Modify the input string into something that allows the current production to reduce
Error productions: Anticipate errors by incorporating them into the grammar
Global correction: A whole-input, minimal-change version of phrase-level recovery, and far more complicated (as I understand it)
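To make the first of these concrete, here is a tiny sketch in Python (with a made-up token stream; not meant as real generator output) of panic-mode skipping:

```python
# Panic mode: on error, discard tokens until one of the synchronizing
# symbols chosen up front by the compiler designer appears, then resume.

SYNC_TOKENS = {";", "}"}  # designer-chosen synchronizing symbols (made up here)

def panic_mode_skip(tokens, pos):
    """Return the index of the first synchronizing token at or after pos."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1  # throw the offending symbols away
    return pos

# e.g. panic_mode_skip(["x", "+", "+", ";", "y"], 1) == 3
```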
Two of these require modifying the input string (which I'd like to avoid), and the other two require the compiler designer to anticipate errors and design the error recovery based on their knowledge of the language. But the parser generator also has knowledge about the language, so I'm curious if there's a better way to recover from parsing errors without pre-selecting synchronizing tokens or filling up the grammar with error productions.
Instead of picking synchronizing tokens, can't the parser just treat the symbols in the FOLLOW sets of all the nonterminals the current production could reduce to as synchronizing tokens? I haven't really worked out how well that would work; I picture the parser being partway down a chain of in-progress productions, though of course that's not literally how a bottom-up parser keeps track of things. Would it produce too many irrelevant errors while trying to find a workable state? Would it attempt to resume the parser in an invalid state? Is there a good way to pre-fill the parser table with valid error actions, so the actual parsing program doesn't have to reason about where to go next when an error is encountered?
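To make the idea concrete, here is a rough sketch in Python of the recovery loop I'm imagining (the table layout and names are made up; this is not output from any real generator):

```python
def recover(state_stack, tokens, pos, goto, follow):
    """Resynchronize a table-driven LR parser after an error.

    state_stack: parser states, top of stack last
    tokens:      the remaining input as terminal names
    pos:         index of the offending token
    goto:        dict mapping (state, nonterminal) -> next state
    follow:      dict mapping nonterminal -> set of terminals (FOLLOW sets)
    """
    while state_stack:
        state = state_stack[-1]
        # Nonterminals this state could accept after a reduction.
        candidates = [A for (s, A) in goto if s == state]
        for A in candidates:
            # Skip input until a token in FOLLOW(A) shows up.
            i = pos
            while i < len(tokens) and tokens[i] not in follow[A]:
                i += 1
            if i < len(tokens):
                state_stack.append(goto[(state, A)])
                return i            # resume parsing at tokens[i]
        state_stack.pop()           # nothing worked here; try an older state
    return None                     # recovery failed
```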
It's way too easy to get lost in a dead-end when you try to blindly follow all available productions. There are things that you know about your language which it would be very difficult for the parser generator to figure out. (Like, for example, that skipping to the next statement delimiter is very likely to allow the parse to recover.)
That's not to say that automated procedures haven't been tried. There is a long section about them in Parsing Theory (Sippu & Soisalon-Soininen). (Unfortunately, that text is paywalled, but if you have an ACM membership or access to a good library, you can probably find it.)
On the whole, the yacc strategy (error productions built around the special error token) has proven to be "not awful", and even "good enough". There is one well-known way of making it better: collect really bad syntax error messages (or failed error recoveries), trace each one to the parser state that is active when it occurs (which is easy to do), and attach a custom message or recovery procedure to that precise state and lookahead token. See, for example, Russ Cox's approach.
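A minimal sketch of that last idea, with made-up state numbers, token kinds and messages, just to show the shape of the table:

```python
from collections import namedtuple

# Hypothetical token shape; any real parser carries something equivalent.
Token = namedtuple("Token", "kind line column")

# (parser state, lookahead kind) pairs harvested from real bad inputs,
# each mapped to a hand-written diagnostic.
CUSTOM_ERRORS = {
    (123, "SEMICOLON"): "unexpected ';' -- is the block above missing its closing '}'?",
    (87, "RPAREN"): "missing operand before ')'",
}

def report_error(state, lookahead):
    message = CUSTOM_ERRORS.get(
        (state, lookahead.kind),
        "syntax error: unexpected " + lookahead.kind,  # generic fallback
    )
    print("%d:%d: %s" % (lookahead.line, lookahead.column, message))

report_error(123, Token("SEMICOLON", 10, 4))  # state 123 is made up
```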
Related
I know that syntax analysis is needed to determine whether the given series of tokens is valid in a language (by parsing these tokens to produce a syntax tree), and to detect errors that occur during the parsing of the input code, caused by grammatically incorrect statements.
I also know that semantic analysis is then performed on the syntax tree to produce an annotated tree, checking aspects that are not related to the syntactic form (like type correctness of expressions and declaration prior to use), and detecting errors that arise after the code has been parsed as grammatically correct.
However, the following issue is not clear to me:
If the syntax analyzer detects a syntax error, does that mean there should be no semantic analysis? Or should the recovery from errors (in syntax analysis) make it possible for semantic analysis to still be carried out?
When you compile an incorrect program, you generally want the compiler to inform you about as many problems as possible, so that you can fix them all before attempting to compile the program again. However, you don't want the compiler to report the same error many times, or to report things which are not really errors but rather the result of the compiler getting confused by previous errors.
Or am I projecting my expectations on you? Perhaps I should have written that whole paragraph in first person, since it is really about what I expect from a compiler. Perhaps you have different expectations. Or perhaps your expectations are similar to mine. Whatever they are, you should probably write your compiler to satisfy them. That's basically the point of writing your own compiler.
So, if you share my expectations, you probably want to do as much semantic analysis as you can feel reasonably confident about. You might, for example, be able to do type checking inside some functions, because there are no syntax errors within those functions. On the other hand, that's a lot of work and there's always the chance that the resulting error messages will not be helpful.
That's not very precise, because there really is no definitive answer. But you should at least be able to answer your own question on your own behalf. If your compiler does a lousy job of error reporting and you find that frustrating when you try to use your compiler, then you should work on making the reports better. (But, of course, time is limited and you might well feel that your compiler will be better if it optimises better, even though the error reports are not so great.)
I am actually trying to make a simple chatbot for information-retrieval purposes in Slack using Python, and I have come up with a context-free grammar (CFG) for syntax checking. Now that I have a grammar, I want to create a parsing table / parse tree for this grammar to validate my input string. It would be really helpful if you could let me know about some libraries/links/materials that can help me implement a parser to perform a syntax check for my chatbot.
Any help is appreciated. Thanks.
If you already wrote a context-free grammar, you can use the ChartParser of NLTK to parse any input, as described here: http://www.nltk.org/book/ch08.html
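For example, something along these lines (the toy grammar here is only a placeholder for your own rules):

```python
import nltk

# A small hand-written CFG; substitute your own grammar.
grammar = nltk.CFG.fromstring("""
    S -> VERB OBJECT
    VERB -> 'show' | 'find'
    OBJECT -> 'tickets' | 'users'
""")

parser = nltk.ChartParser(grammar)
tokens = "show tickets".split()

trees = list(parser.parse(tokens))
if trees:
    trees[0].pretty_print()  # the input is syntactically valid
else:
    print("input does not match the grammar")
```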
I think, however, that a hand-written grammar will not be robust enough to deal with the huge amount of variation your users could produce. Hand-written grammars went out of fashion decades ago due to their poor performance; in constituency parsing, one rather uses treebanks to generate grammars these days.
Depending on what exactly you want to achieve, I suggest you also take a look at a dependency parser, e.g. from spaCy. They are faster and allow you to easily navigate from the predicate of the sentence to its subject and objects.
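For instance, a minimal example (this assumes the small English model has been downloaded):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("show me the open tickets for project apollo")

root = next(tok for tok in doc if tok.dep_ == "ROOT")  # usually the predicate
print("predicate:", root.text)
for child in root.children:
    # nsubj / dobj / prep etc. lead straight to the subject and objects
    print(child.dep_, "->", child.text)
```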
I am trying to research on the possible parsers, as part of developing a PC application that can be used for parsing a Lin Descriptor File. The current parser application is based on flex-bison parsing approach. Now I need to redesign the parser, since the current one is incapable of detecting specific errors.
I have previously used Ragel (https://en.wikipedia.org/wiki/Ragel) for parsing regular-expression commands (regex: https://en.wikipedia.org/wiki/Regular_expression) and it proved very handy.
However, with the current complexity of an LDF file, I am unsure whether Ragel (with C++ as the host language) is the best possible approach for parsing it. The reason is that the LDF file contains a lot of data that is not fixed or constant, but varies by vendor. Also, the LDF fields must retain references to other fields in order to detect errors in the file.
Ragel is better suited to input whose structure is fixed (that's what I found while developing the regex parser).
Could anyone who has already worked on such a project provide some tips for selecting a suitable parser for the Lin Descriptor File?
Example for Lin Descriptor File : http://microchipdeveloper.com/lin:protocol-app-ldf
If you feel that an LALR(1) parser is not adequate for your parsing problem, then a finite automaton cannot possibly be better: the FA is strictly less powerful.
But without knowing much about the nature of the checks you want to implement, I'm pretty sure that the appropriate strategy is to parse the file into some simple hierarchical data structure (i.e. a tree of some form, usually called an AST in parsing literature) using a flex/bison grammar, and then walk the resulting data structure to perform whatever semantic checks seem necessary.
Attempting to do semantic checks while parsing usually leads to over-complicated, badly-factored and unscalable solutions. That is not a problem with the bison tool, but rather with a particular style of using it which does not take into account what we have learned about the importance of separation of concerns.
Refactoring your existing grammar so that it uses "just a grammar" -- that is, so that it just generates the required semantic representation -- is probably a much simpler task than reimplementing with some other parser generator (which is unlikely to provide any real advantage, in any case).
And you should definitely resist the temptation to abandon parser generators in favour of an even less modular solution: you might succeed in building something, but the probability is that the result will be even less maintainable and extensible than what you currently have.
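Schematically, and only as a sketch (shown in Python with invented node types; in your case the tree would be built by the bison actions and the walk written in C++), the parse-then-walk split looks like this:

```python
from dataclasses import dataclass, field

# Toy AST for one LDF-like cross-reference; real node types would mirror your grammar.
@dataclass
class NodeDecl:
    name: str

@dataclass
class Signal:
    name: str
    publisher: str  # must refer to a declared node

@dataclass
class LdfFile:
    nodes: list = field(default_factory=list)
    signals: list = field(default_factory=list)

def check(ldf):
    """Walk the finished tree and verify references between sections."""
    errors = []
    known = {n.name for n in ldf.nodes}
    for sig in ldf.signals:
        if sig.publisher not in known:
            errors.append("signal '%s' published by unknown node '%s'"
                          % (sig.name, sig.publisher))
    return errors

ldf = LdfFile(nodes=[NodeDecl("Master")],
              signals=[Signal("EngineSpeed", "Master"), Signal("DoorState", "Slave3")])
print(check(ldf))  # -> ["signal 'DoorState' published by unknown node 'Slave3'"]
```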
I just have to build myself some parsers for different computer languages. I thought about using ANTLR but what I really want is to explore this myself because I dislike the idea of generated code (yeah silly I know).
The question is how compile errors (missing identifiers, the wrong token for a certain rule, etc.) are handled and represented within ASTs and so on.
What I know from compiler lectures is that a parser usually tries to throw away tokens or to find the next matching code element (like a missing ';', so it takes the ';' of the next expression).
But then how is this expressed within the AST? Is there some malformed-expression object/type? I am a bit puzzled.
I do not want to just reject a certain input; I want to handle it.
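To illustrate what I mean by a malformed-expression object, here is a rough sketch (plain Python, invented node names) of the kind of thing I'm picturing:

```python
from dataclasses import dataclass

# Ordinary expression nodes...
@dataclass
class Identifier:
    name: str

@dataclass
class BinaryExpr:
    op: str
    left: object
    right: object

# ...plus an explicit placeholder for the part the parser could not make sense of.
@dataclass
class ErrorNode:
    expected: str    # what the parser was looking for
    skipped: list    # tokens discarded while resynchronizing
    position: tuple  # (line, column) for diagnostics

# "a + " parses into a well-formed BinaryExpr whose right operand is an ErrorNode,
# so later passes can still walk the tree and simply step over the hole.
tree = BinaryExpr("+", Identifier("a"), ErrorNode("expression", [], (1, 5)))
```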
I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.
The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).
I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.
Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?
It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.
I don't see what other problems there are. You don't need to determine whether identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that the identifiers used are valid. Follow this with one or more additional passes over the AST to generate the output code, or some other intermediate data structure, and you're done!
EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. It is actually written in C# rather than C or C++, but it uses a .NET port of Jay, which is very similar to yacc.
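Sketched in Python with invented node classes (the point is language-agnostic), those extra passes might look like this:

```python
# Pass 1 collects every declaration; pass 2 resolves uses against that table,
# so forward references cost nothing extra.

class ClassDecl:
    def __init__(self, name):
        self.name = name

class VarDecl:
    def __init__(self, name, type_name):
        self.name, self.type_name = name, type_name

def collect_symbols(decls):
    """Pass 1: record every declared class, regardless of order."""
    return {d.name for d in decls if isinstance(d, ClassDecl)}

def check_references(decls, symbols):
    """Pass 2: by now the symbol table is complete, so order no longer matters."""
    errors = []
    for d in decls:
        if isinstance(d, VarDecl) and d.type_name not in symbols:
            errors.append("unknown type '%s' in declaration of '%s'"
                          % (d.type_name, d.name))
    return errors

program = [VarDecl("window", "Widget"),  # forward reference: Widget is declared later
           ClassDecl("Widget")]
symbols = collect_symbols(program)
print(check_references(program, symbols))  # -> []
```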
One option is to deal with forward references by just scanning and caching tokens until you hit something you know how to deal with (sort of like "panic-mode" error recovery). Once you have run through the full file, go back and try to re-parse the bits that didn't parse before.
As to having to hand-write the lexer: don't. Use lex to generate a normal lexer and read from it via a hand-written shim that lets you go back and feed the parser from a cache as well as from what lex produces.
As to making several grammars: with a little preprocessor fun on the yacc file, you should be able to generate them all from the same original source.
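As a rough illustration of that shim idea (shown as stand-alone Python; with lex/yacc the same thing would be a small C wrapper around yylex):

```python
class CachingLexer:
    """Records every token from the real lexer so regions can be replayed later."""

    def __init__(self, real_lexer):
        self.real_lexer = real_lexer  # any iterator that yields tokens
        self.cache = []
        self.pos = 0

    def next_token(self):
        if self.pos == len(self.cache):
            tok = next(self.real_lexer, None)  # None stands in for EOF
            if tok is None:
                return None
            self.cache.append(tok)
        tok = self.cache[self.pos]
        self.pos += 1
        return tok

    def mark(self):
        return self.pos        # remember where a deferred region starts...

    def rewind(self, mark):
        self.pos = mark        # ...and come back to it on the second pass

# Usage: wrap the real lexer, parse once while mark()-ing anything unresolved,
# then rewind() to each saved mark and re-parse it now that declarations are known.
```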