I also want to compare the result of concurrent parsing with serial parsing.
How can I do this? Which approach is the simplest?
As others have said, YACC (based on LR parsing) by itself isn't "concurrent".
(It is my understanding that YACC isn't even re-entrant, which will make it pretty hard to use in a multithreaded context no matter what you do. The non-reentrancy can presumably be overcome with mere sweat, so this is an annoyance, not a show-stopper.)
One idea is to construct a pipeline, allowing a lexer to generate lexemes into a stream as fast as it can, and letting the actual parser read from that stream. This can get you at best a factor of 2. You might be able to do this with YACC relatively easily, modulo setting up the communicating threads.
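To make the pipeline idea concrete, here is a minimal C++ sketch of a token stream shared between a lexer thread and a parser thread. The Token type and the lex_next()/parser_consume() functions mentioned in the usage comment are hypothetical stand-ins, not part of any YACC API.

```cpp
// Sketch of a lexer->parser pipeline; Token, lex_next(), and
// parser_consume() are invented for illustration.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

struct Token { int kind; std::string text; };

class TokenStream {
    std::queue<Token> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(Token t) {                      // called by the lexer thread
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
        cv_.notify_one();
    }
    void close() {                            // lexer signals end of input
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<Token> pop() {              // called by the parser thread
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;  // stream exhausted
        Token t = std::move(q_.front());
        q_.pop();
        return t;
    }
};

// Usage sketch: one thread lexes as fast as it can, the other parses.
//   std::thread lexer([&] { Token t; while (lex_next(t)) ts.push(t); ts.close(); });
//   while (auto t = ts.pop()) parser_consume(*t);
```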
McKeeman et al have implemented parallel LR parsing by dividing the file into N chunks for N threads and claim to have gotten good results. The approach isn't simple, because dividing parsing of a single file into parallel chunks of about the same size, and stitching those chunks together, isn't easy. I doubt that one could easily hack up YACC to do this.
A screwy idea is to parse a file from both ends toward the middle.
It's easy enough to define the backward parser's grammar from the "natural" forward one: just reverse the content of every grammar rule. Nothing is truly easy, though; the reversal might introduce ambiguities not present in the forward parser. This paper combines McKeeman's idea of breaking the file into chunks with bidirectional parsing of each chunk, enabling one to find a lot of parallelism in a big file.
Easier to do is to parallelize the parsing of individual files, using whatever parsing technology you have. This parallelizes relatively well, although parsing times for individual files may be uneven, so it is best done with some kind of worklist and a team of parser threads taking work from that worklist. You could probably organize this with YACC.
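A minimal sketch of that worklist scheme in C++, where parse_file() is a hypothetical stand-in for whatever per-file parser entry point you actually have:

```cpp
// A team of threads drains a shared worklist of filenames; uneven
// per-file parse times balance out across the team.
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

void parse_file(const std::string& path);       // hypothetical, defined elsewhere

void parse_all(std::vector<std::string> files, unsigned nthreads) {
    std::mutex m;
    std::size_t next = 0;
    auto worker = [&] {
        for (;;) {
            std::string file;
            {
                std::lock_guard<std::mutex> lk(m);
                if (next == files.size()) return;  // worklist drained
                file = files[next++];
            }
            parse_file(file);
        }
    };
    std::vector<std::thread> team;
    for (unsigned i = 0; i < nthreads; ++i) team.emplace_back(worker);
    for (auto& t : team) t.join();
}
```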
Our DMS Software Reengineering Toolkit (DMS) uses a generalization of this when reading large systems of Java sources and classes. Parsing by itself isn't very useful; you almost always need to build symbol tables, too. When reading Java, DMS therefore parallelizes parsing, AST building, and symbol table construction.

Starting from a base set of filenames, it launches parses in parallel grains (multiplexed by DMS on top of OS threads); when a parse completes, the parsed tree is handed to the name resolver, which splits into one parallel grain per parallel scope encountered. Nested scopes cause a tree of grains to be generated. Further work is gated by treating the resolution of a scope as a future (event): while a scope is being resolved, more Java parse/name-resolution activities may be launched; when a scope is resolved, an event is signalled, and grains waiting for its completion can then inspect the scope's content to resolve their own names. The tangle of (potential) parallelism in the middle of this is almost frightening :-} but is managed by DMS's underlying parallel programming language, PARLANSE, which uses work stealing to balance the load across the threads.
Experience shows that doing this in production with 8 cores leads to a 5x speedup over sequential processing for a few thousand source files (which, Java being Java, are typically small). We think we know where the bottleneck is: parts of the name resolver are more expensive than they should be, owing to excess synchronization in an attribute grammar. I suspect we can get this closer to 8x. So, parallel parsing and name resolution can be pretty effective.
We don't do quite as well with C++14, because of each file's dependencies on the #includes it reads, often under different preprocessor configurations.
If the input allows splitting into chunks that can be parsed in parallel (log-file lines, for example), then you can parse a number of such chunks concurrently using a producer/consumer queue and then join the parse results.
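For example, a sketch using std::async futures to fan the chunks out and join the partial results; ParseResult and parse_chunk() are placeholders for your real types:

```cpp
// Parse independent chunks concurrently, then merge in input order.
#include <functional>
#include <future>
#include <string>
#include <vector>

struct ParseResult { std::vector<std::string> records; };

ParseResult parse_chunk(const std::vector<std::string>& lines);   // hypothetical

ParseResult parse_chunks(const std::vector<std::vector<std::string>>& chunks) {
    std::vector<std::future<ParseResult>> futures;
    for (const auto& chunk : chunks)
        futures.push_back(std::async(std::launch::async, parse_chunk,
                                     std::cref(chunk)));
    ParseResult merged;
    for (auto& f : futures) {                       // join in input order
        ParseResult part = f.get();
        merged.records.insert(merged.records.end(),
                              part.records.begin(), part.records.end());
    }
    return merged;
}
```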
I am researching possible parsers as part of developing a PC application that parses a LIN Descriptor File (LDF). The current parser application is based on the flex-bison approach. Now I need to redesign the parser, since the current one is incapable of detecting certain errors.
I have previously used the Ragel parser (https://en.wikipedia.org/wiki/Ragel) for parsing regular-expression commands (regex: https://en.wikipedia.org/wiki/Regular_expression), and it proved very handy.
However, given the complexity of an LDF file, I am unsure whether Ragel (with C++ as the host language) is the best approach for parsing it. The reason is that an LDF file contains a lot of data that is not fixed or constant but varies by vendor. Also, the LDF fields must retain references to other fields in order to detect errors in the file.
Ragel is better suited to parsing a fixed structure (that's what I found while developing the regex parser).
Could anyone who has already worked on such a project provide some tips on selecting a suitable parser for the LIN Descriptor File?
Example of a LIN Descriptor File: http://microchipdeveloper.com/lin:protocol-app-ldf
If you feel that an LALR(1) parser is not adequate for your parsing problem, then it is not possible that a finite automaton would be better: the FA is strictly less powerful.
But without knowing much about the nature of the checks you want to implement, I'm pretty sure the appropriate strategy is to parse the file into a simple hierarchical data structure (i.e., a tree of some form, usually called an AST in the parsing literature) using a flex/bison grammar, and then walk the resulting data structure to perform whatever semantic checks seem necessary.
Attempting to do semantic checks while parsing usually leads to over-complicated, badly factored, and unscalable solutions. That is not a problem with the bison tool itself, but with a particular style of using it that ignores what we have learned about the importance of separation of concerns.
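As a rough illustration of that separation, here is a minimal C++ sketch of the kind of tree a grammar's actions might build, with a semantic check done afterwards as a plain tree walk. The node kinds and the specific check are invented for illustration, not taken from any real LDF grammar.

```cpp
// Parse-then-check: grammar actions build nodes like these; a separate
// pass walks the tree and reports semantic errors.
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string kind;                             // e.g. "frame", "signal"
    std::string name;
    std::vector<std::unique_ptr<Node>> children;
};

// Semantic checking lives entirely outside the grammar: just a tree walk.
void check(const Node& n, std::vector<std::string>& errors) {
    if (n.kind == "frame" && n.children.empty())
        errors.push_back("frame '" + n.name + "' declares no signals");
    for (const auto& child : n.children)
        check(*child, errors);
}
```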
Refactoring your existing grammar so that it uses "just a grammar" -- that is, so that it just generates the required semantic representation -- is probably a much simpler task than reimplementing with some other parser generator (which is unlikely to provide any real advantage, in any case).
And you should definitely resist the temptation to abandon parser generators in favour of an even less modular solution: you might succeed in building something, but the probability is that the result will be even less maintainable and extensible than what you currently have.
I'm attempting to build my first C-like programming language, most likely as an interpreter, and I've just finished the first step: the lexer.
I've thought about taking the lazy route of simply lexing the entire source stream all at once and then having the parser process the resulting data.
I've noticed that many other compilers and interpreters only lex during parsing when the parser module asks for another token.
Is it quicker, in terms of performance, to lex the source code all at once and then parse the resulting tokens, or to lex and parse tokens one at a time?
"faster" is a bit of a fuzzy word. There are different kinds of speed (latency, absolute start-to-finish duration, compile speed, execution speed), and depending on how you implement your language's front-end and backend, either approach can be faster.
Also, faster is not always better. If your parser is technically faster, but uses too much memory, it could crash or at the least end up swapping, which would slow it down again. If your parser is lightning-fast but produces inefficient code, your users will pay for your faster development speed. You'll have to write actual code and run it in a profiler to be able to tell what is really better, and come up with which criteria are important to you.
Tokenizing/lexing everything at once up front means you might be able to optimize memory allocation, and thus spend less time resizing your token list and so on, but it also means the entire file has to be lexed before it can even be partially parsed.
OTOH, if you lex on demand, you may have to append to your arrays in small steps more often, so you'll pay a penalty in repeated reallocations; but for an interpreted language like JavaScript, you may only have to parse the parts that are actually used for this run-through.
So a lot depends on the details of your language and the hardware you expect to run on. On an embedded system with little memory and no swap, you may have no choice but to lex progressively, because the whole program source might not fit in memory. And if your language's syntax needs a lot of lookahead, you may not see any benefit from progressive lexing, because you're reading it all anyway...
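To make the two strategies concrete, here is a sketch of both interfaces over the same hypothetical scan_token() core; neither is "the" right design, they just show where the costs differ:

```cpp
// Batch vs. on-demand lexing over one assumed tokenizer core.
#include <optional>
#include <string>
#include <vector>

struct Token { int kind; std::string text; };

// Assumed tokenizer: advances the cursor, returns nothing at end of input.
std::optional<Token> scan_token(const char*& cursor);

// Batch: lex everything up front; lets you grow the token list efficiently,
// but the whole file is lexed before any parsing starts.
std::vector<Token> lex_all(const char* src) {
    std::vector<Token> tokens;
    while (auto t = scan_token(src))
        tokens.push_back(std::move(*t));
    return tokens;
}

// On demand: the parser pulls one token at a time; nothing is lexed
// until (unless) the parser actually asks for it.
class Lexer {
    const char* cursor_;
public:
    explicit Lexer(const char* src) : cursor_(src) {}
    std::optional<Token> next() { return scan_token(cursor_); }
};
```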
I'm making an interpreter for my own language as a hobby project. Currently my interpreter just executes the code as it sees it. I've heard you should make the parser generate an AST from the source code. So I was wondering, how does an AST actually make things faster than just executing the code linearly, as the parser sees it?
Because then you would have to do the parsing all the time. If you have a loop, for instance, you'd have to parse the commands in the loop body over and over again.
Also, I would argue that it's cleaner, since you break the problem down into two distinct tasks: deal with syntax, then deal with semantics.
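A minimal sketch of that point: the loop body below is parsed once into nodes and then merely re-walked on each iteration. The node shapes are invented for illustration.

```cpp
// The AST caches the result of parsing; executing a loop re-walks the
// cached nodes instead of re-parsing any text.
#include <memory>
#include <vector>

struct Expr { virtual ~Expr() = default; virtual int eval() = 0; };
struct Stmt { virtual ~Stmt() = default; virtual void exec() = 0; };

struct WhileStmt : Stmt {
    std::unique_ptr<Expr> cond;
    std::vector<std::unique_ptr<Stmt>> body;
    void exec() override {
        while (cond->eval())            // parsing happened once, earlier;
            for (auto& s : body)        // each iteration only walks the
                s->exec();              // cached nodes
    }
};
```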
It isn't specifically the "AST" that makes it faster.
It is the use of any data structure (AST, symbol tables, control-flow graph, triples, p-code, machine code) that caches the analysis of the source code to extract its intended meaning, plus as much precomputation of the answer ("optimization") as possible. In effect, anything that partially compiles the code should produce programs that run faster than an interpreter of the pure text.
An interesting tradeoff: if the amount of program executed before execution stops isn't very big, it may actually be cheaper to execute the text than to do any compiler-style analysis.
Given the speed of machines these days, one can sloppily compile a pretty big program in 100 milliseconds, which is about as fast as a human can react. Various versions of Turbo Pascal back in the 80s and 90s were pretty famous for this.
I'm building my own AST data structure by adding actions to a grammar (just as I did with yacc long ago, but this time using ANTLR 4). While the generated parser builds correct ASTs, it's a terrible memory hog. A little instrumentation shows that up to 95% of the objects constructed by my actions and returned by grammar rules are discarded and don't end up in the final tree. I suspect this is because the generated parser uses a backtracking strategy instead of relying solely on lookahead. Is there a way to disable backtracking so I can test this hypothesis? I'm using antlr-csharp-4.0.1 under Visual Studio.
ANTLR 4 does not use backtracking or speculative execution of actions in any scenario. Actions are only ever executed on paths that are actually parsed. Note that if you have semantic predicates, those may be executed during prediction, and may be executed multiple times.
Your version of ANTLR 4 is significantly out of date. The latest version of the C# target can be found on this page:
https://github.com/tunnelvisionlabs/antlr4cs/releases/latest
I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.
The only solution I know of is to lex the entire source and then do a "breadth-first" parse, so that higher-level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).
I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.
Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?
It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.
I don't see what other problems there are. You don't need to determine whether identifiers are valid during the parsing stage. You construct either a parse tree or an abstract syntax tree (the difference is subtle but irrelevant for this discussion) as you parse. After that, you build your nested symbol-table structures with a pass over the AST you generated during the parse. Then you do another pass over the AST to check that the identifiers used are valid. Follow this with one or more additional passes over the AST to generate the output code, or some other intermediate data structure, and you're done!
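Here is a rough sketch of that pass structure in C++; the node layout and the checks are illustrative only:

```cpp
// Pass 1 fills the symbol table; pass 2 checks uses against it, so
// forward references are harmless by construction.
#include <set>
#include <string>
#include <vector>

struct Ast {
    std::string kind;                   // "decl", "use", or anything else
    std::string name;
    std::vector<Ast> children;
};

// Pass 1: collect every declaration, wherever it appears in the file.
void collect_decls(const Ast& n, std::set<std::string>& symbols) {
    if (n.kind == "decl") symbols.insert(n.name);
    for (const auto& c : n.children) collect_decls(c, symbols);
}

// Pass 2: by now every declaration in the whole tree has been seen.
void check_uses(const Ast& n, const std::set<std::string>& symbols,
                std::vector<std::string>& errors) {
    if (n.kind == "use" && symbols.count(n.name) == 0)
        errors.push_back("undeclared identifier: " + n.name);
    for (const auto& c : n.children) check_uses(c, symbols, errors);
}
```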
EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. It is actually written in C# rather than C or C++, but it does use a .NET port of Jay, which is very similar to yacc.
One option is to deal with forward references by just scanning and caching tokens until you hit something you know how to deal with (sort of like "panic-mode" error recovery). Once you have run through the full file, go back and try to re-parse the bits that didn't parse before.
As for having to hand-write the lexer: don't. Use lex to generate a normal lexer, and just read from it via a hand-written shim that lets you go back and feed the parser from a cache as well as from what lex produces.
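A rough sketch of such a shim, assuming the generated lexer has been renamed to yylex_real() (e.g., via a #define or flex's %option prefix). A real shim would also have to cache yylval (and locations) alongside each token kind; that is omitted here for brevity.

```cpp
// Cache tokens from the generated lexer so a deferred region can be
// replayed into the parser later.
#include <cstddef>
#include <vector>

extern int yylex_real();                 // the lexer that lex/flex generated

static std::vector<int> cache;           // token kinds seen so far
static std::size_t replay_pos = 0;
static bool replaying = false;

int yylex() {                            // what the parser actually calls
    if (replaying && replay_pos < cache.size())
        return cache[replay_pos++];      // feed the parser from the cache
    replaying = false;                   // caught up; back to the real lexer
    int tok = yylex_real();
    cache.push_back(tok);
    return tok;
}

void replay_from(std::size_t pos) {      // re-parse a previously cached region
    replay_pos = pos;
    replaying = true;
}
```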
As for making several grammars: a little fun with a preprocessor on the yacc file, and you should be able to generate them all from the same original source.