I found this question here in which the OP asks for a way to profile an ANTLR grammar.
However, the answer is somewhat unsatisfying, as it is limited to grammars without actions and - even more importantly - it is an automated profiling that will (as I see it) use the default constructor of the generated lexer/parser to construct it.
I need to profile a grammar that does contain actions and that has to be constructed using a custom constructor. Therefore I'd need to be able to instantiate the lexer + parser myself and then profile it.
I was unable to find any information on this topic. I know there is a profiler for IntelliJ, but it works quite similarly to the one described in the linked question's answer (maybe it's even the same).
Does anyone know how I can profile my grammar with these special needs? I don't need any fancy GUI. I'd be satisfied if the result were printed to the console or something like that.
To wrap it up: I'm searching for either a tool or a hint on how to write some code that lets me profile my ANTLR grammar (with self-instantiated lexer/parser).
Btw my target language is Java so I guess the profiler has to be in Java as well.
A good start is calling Parser.setProfile(true) and examining what you get from Parser.getParseInfo() after a parse run. I haven't yet looked closely at what the profiling result provides in detail, but it's on the todo list for my vscode extension for ANTLR4 to provide profiling info for grammars to help improve them.
A hint for getting from the decision info to a specific rule: there's a decision number, which is an index into ATN.decisionToState. The DecisionState instance you get from that index is an ATNState descendant, from which you can read ATNState.ruleIndex. The rule index can then be used with your parser's ruleNames property to find the name of that rule. The value is also what is used for the rule's enum entry.
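Putting both hints together, here is a minimal sketch of a self-instantiated, profiled parse run. MyLexer, MyParser, the extra constructor arguments and the startRule entry point are placeholders for your own generated classes; the runtime calls (setProfile, getParseInfo, getATN, getRuleNames) are the regular ANTLR4 Java API.

```java
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.*;

public class GrammarProfiler {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName("input.txt");

        // Construct lexer and parser yourself - use your custom constructor here.
        MyLexer lexer = new MyLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MyParser parser = new MyParser(tokens /*, your custom arguments */);

        parser.setProfile(true);   // must be enabled before parsing
        parser.startRule();        // invoke your grammar's entry rule

        ParseInfo info = parser.getParseInfo();
        for (DecisionInfo di : info.getDecisionInfo()) {
            if (di.timeInPrediction == 0) continue;   // skip decisions that never fired
            // Map the decision number back to a rule name via the ATN.
            DecisionState state = parser.getATN().decisionToState.get(di.decision);
            String ruleName = parser.getRuleNames()[state.ruleIndex];
            System.out.printf("rule %-20s decision %3d: %8d ns in prediction, %d SLL lookahead ops%n",
                    ruleName, di.decision, di.timeInPrediction, di.SLL_TotalLook);
        }
    }
}
```

The DecisionInfo objects also carry ambiguity and fallback details (ambiguities, LL_Fallback, etc.) if you want a more detailed report than this console dump.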
I'm looking for steps/libraries/approaches to solve this problem statement.
Given a source file of a programming language, I need to parse it and subdivide it into components.
Example:
Given a Java file, I need to find the following in it:
list of Imports
Classes present in it
Attributes in the Class
Methods in it - along with the parameters, if any.
etc.
I need to extract these and store them separately.
Why do I want to do this?
I want to build an inverted index on top of these components.
Example queries to the inverted index:
1. Find the list of files with Class name: Sample
2. Find the positions where variable XXX is used within the class AAA.
I need to support queries like the above.
So my plan is: given a file, if I extract these components from it, it would be easy to build an inverted index on top of them.
Example: Sample -- Class -- Sample.java (keyword -- component -- file name)
I want to build an inverted index like the above.
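To make this concrete, here is a rough sketch (plain Java; all names are made up) of the kind of index structure I have in mind:

```java
import java.util.*;

// One occurrence of a keyword: the kind of component, the file it appears in, and the line.
record Posting(String componentType, String fileName, int line) { }

class InvertedIndex {
    private final Map<String, List<Posting>> index = new HashMap<>();

    // Called once per extracted component, e.g. add("Sample", "Class", "Sample.java", 3).
    void add(String keyword, String componentType, String fileName, int line) {
        index.computeIfAbsent(keyword, k -> new ArrayList<>())
             .add(new Posting(componentType, fileName, line));
    }

    // Query 1: files that declare a class with the given name.
    List<String> filesWithClass(String className) {
        return index.getOrDefault(className, List.of()).stream()
                    .filter(p -> p.componentType().equals("Class"))
                    .map(Posting::fileName)
                    .distinct()
                    .toList();
    }

    // Query 2: positions where a variable is used, restricted here to one file
    // (class-level scoping would need an extra field on Posting).
    List<Integer> positionsOfVariable(String variable, String fileName) {
        return index.getOrDefault(variable, List.of()).stream()
                    .filter(p -> p.componentType().equals("Variable"))
                    .filter(p -> p.fileName().equals(fileName))
                    .map(Posting::line)
                    .toList();
    }
}
```

The real work, of course, is the extraction pass that feeds add(), which is what I'm asking about.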
I see this implemented in many IDEs like IntelliJ. What I'm interested in is how much effort it would take to build something like this, and I want to try implementing it for at least one language.
Thanks in advance.
You can try to do this with "just" a parser; for your specific example, that might be enough.
But you'll need a parser for each language. If you stick to just Java, you can find Java parsers pretty easily; just reuse one, there is little point in you reinventing one more set of grammar rules to describe Java.
For more than one language, this starts to get tricky. You can:
try to find a separate parser for each language. This may be sort of successful for mainstream languages. As you get to less well known languages, these get a lot harder to find. If you succeed, you'll have the problem that the parsers are likely incompatible technology; now gluing them together to collectively collect your index information is going to be a mess.
pick one parsing technology and get grammars for all the languages you care about. You have only two realistic choices: YACC/Bison, and ANTLR.
As a practical matter, YACC and Bison have been used to implement LOTS of languages... but the grammar files are not collected in one place, so they are hard to find. ANTLR at least has a single grammar repository you can find at their web site. So that might kind of work.
It's going to be quite an effort to assemble all of these into an integrated whole.
A complication is that you may want more than just raw syntax; you might want to know the meaning of the symbols, and for each symbol, precisely where it is defined in which file. After all, you want your index to be accurate at scale, and this will require differentiating foo the variable name from foo the function name. Arguably you need symbol tables.
As a general rule, this is where pure-parsing of languages breaks down;
there is serious Life After Parsing.
In that case, you want an integrated set of tools for extracting information from the different languages.
Our DMS Software Reengineering Toolkit is such a framework, and has some 40 languages predefined for it. We use something like OP's suggested process to build indexes of a code base for search tools based on DMS. Building something like DMS is an enormous effort.
I'm taking a compiler-design class where we have to implement our own compiler (using flex and bison). I have had experience in parsing (writing EBNF's and recursive-descent parsers), but this is my first time writing a compiler.
The language design is pretty open-ended (the professor has left it up to us). In class, the professor went over generating intermediate code. He said that it is not necessary for us to construct an Abstract Syntax Tree or a parse tree while parsing, and that we can generate the intermediate code as we go.
I found this confusing for two reasons:
What if you are calling a function before it is defined? How can you resolve the branch target? I guess you would have to make it a rule that you have to define functions before you use them, or maybe forward-declare them (like C does?)
How would you deal with conditionals? If you have an if-else or even just an if, how can you resolve the branch target for the if when the condition is false (if you're generating code as you go)?
I planned on generating an AST and then walking the tree after I create it, to resolve the addresses of functions and branch targets. Is this correct or am I missing something?
The general solution to both of your issues is to keep a list of addresses that need to be "patched." You generate the code and leave holes for the missing addresses or offsets. At the end of the compilation unit, you go through the list of holes and fill them in.
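A rough sketch of that bookkeeping in Java (the "instructions" are just integers here, and all names are made up):

```java
import java.util.*;

class CodeBuffer {
    private final List<Integer> code = new ArrayList<>();             // emitted instruction words
    private final Map<String, List<Integer>> holes = new HashMap<>(); // label -> slots awaiting a target
    private final Map<String, Integer> labels = new HashMap<>();      // label -> resolved address

    int emit(int word) { code.add(word); return code.size() - 1; }

    // Emit a jump whose target may not be known yet; remember the hole.
    void emitJump(String label) {
        int slot = emit(-1);   // -1 is the placeholder target
        holes.computeIfAbsent(label, l -> new ArrayList<>()).add(slot);
    }

    // Call when the label's address becomes known (function body emitted, else-branch reached, ...).
    void defineLabel(String label) { labels.put(label, code.size()); }

    // At the end of the compilation unit, fill every hole with the real address.
    void patchHoles() {
        for (Map.Entry<String, List<Integer>> e : holes.entrySet()) {
            int target = labels.get(e.getKey());
            for (int slot : e.getValue()) code.set(slot, target);
        }
    }
}
```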
In FORTH, the "list" of patches is kept on the control stack and is unwound as each control structure terminates. See FORTH Dimensions.
Anecdote: an early Lisp compiler (I believe it was Lisp) generated a list of machine-code instructions in symbolic format, with forward references to the list of machine code for each branch of a conditional. It then generated the binary code by walking the list backwards; that way, the code location of every forward branch was known by the time the branch instruction needed to be emitted.
The Crenshaw tutorial is a concrete example of not using an AST of any kind. It builds a working compiler (including conditionals, obviously) with immediate code generation targeting m68k assembly.
You can read through the document in an afternoon, and it is worth it.
Out of curiosity, I wonder what people can do with parsers, how they are applied, and what people usually create with them?
I know they're widely used in the programming-language industry; however, I think that's just a tiny portion of it, right?
Besides special-purpose languages, my most ambitious use of a parser generator yet (with good old yacc back in C, and again later with pyparsing in Python) was to extract, validate and possibly alter certain meta-info from SQL queries -- parsing SQL properly is a real challenge (especially if you hope to support more than one dialect!-); a parser generator (and the lexer it sits on top of) at least removes THAT part of the job!-)
They are used to parse text....
To give a more concrete example, where I work we use lex/yacc to parse strings coming over sockets.
Also, the name should give you an idea of what javacc is used for (Java compiler compiler!)
Generally to parse domain-specific languages or scripting languages, or to provide similar support for code snippets.
Previously I have seen it used to parse the command-line-based output of another software tool. This way the outer tool (VPN software) could re-use the base router IPSec code without modification, as a lot of what was being parsed was IP route tables and other structured, repeated text.
Using a parser allowed simple changes when the formatting changed, instead of trying to find and tweak a hand-written parser. And the output did change a few times over the life of the product.
I used parsers to help process +/- 800 Clipper source files into similar PRGs that could be compiled with Alaska Xbase 32.
You can use it to extend your favorite language by getting its language definition from their repository and then adding what you've always wanted to have. You can pass the regular syntax to your application and handle the extension in your own program.
I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.
The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).
I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.
Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?
It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.
I don't see what other problems there are. You don't need to determine whether identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that, you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that the identifiers used are valid. Follow this with one or more additional passes over the AST to generate the output code, or some other intermediate data structure, and you're done!
EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. This is actually written in C# rather than C or C++, but it does use a .NET port of Jay, which is very similar to yacc.
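To illustrate the multi-pass idea with a toy example (plain Java, not tied to yacc; the AST and symbol table are deliberately oversimplified):

```java
import java.util.*;

// Toy AST fragments: declared functions and the call sites that reference them.
record FuncDecl(String name) { }
record Call(String name, int line) { }

class TwoPassChecker {
    // Pass 1: walk the tree and collect every declared name into a symbol table.
    static Set<String> collectSymbols(List<FuncDecl> decls) {
        Set<String> table = new HashSet<>();
        for (FuncDecl d : decls) table.add(d.name());
        return table;
    }

    // Pass 2: forward references are no problem, because the whole table already exists.
    static void checkCalls(List<Call> calls, Set<String> table) {
        for (Call c : calls) {
            if (!table.contains(c.name())) {
                System.err.printf("line %d: unknown identifier '%s'%n", c.line(), c.name());
            }
        }
    }

    public static void main(String[] args) {
        List<FuncDecl> decls = List.of(new FuncDecl("bar"));
        List<Call> calls = List.of(new Call("bar", 1), new Call("baz", 2)); // "baz" is undeclared
        checkCalls(calls, collectSymbols(decls));
    }
}
```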
One option is to deal with forward references by just scanning and caching tokens till you hit something you know how to deal with (sort of like "panic-mode" error recovery). Once you have run through the full file, go back and try to re-parse the bits that didn't parse before.
As to having to hand-write the lexer: don't. Use lex to generate a normal lexer and just read from it via a hand-written shim that lets you go back and feed the parser from a cache as well as from what lex produces.
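A sketch of what such a shim could look like (in Java rather than C, with made-up Token/Lexer interfaces standing in for what lex actually gives you):

```java
import java.util.*;

interface Token { }
interface Lexer { Token nextToken(); }   // stands in for the lex-generated scanner

// Sits between the real lexer and the parser: records every token it hands out,
// and can be rewound so a cached region can be fed to the parser again later.
class ReplayableLexer implements Lexer {
    private final Lexer inner;
    private final List<Token> cache = new ArrayList<>();
    private int pos = 0;

    ReplayableLexer(Lexer inner) { this.inner = inner; }

    @Override
    public Token nextToken() {
        if (pos == cache.size()) cache.add(inner.nextToken());
        return cache.get(pos++);
    }

    int mark() { return pos; }            // remember where a skipped region started
    void rewind(int mark) { pos = mark; } // go back and re-feed that region to the parser
}
```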
As to making several grammars: with a little preprocessor fun on the yacc file, you should be able to generate them all from the same original source.
This question is intended to be a discussion of people's personal opinions in handling user input.
This portion of the project that I am working on handles user input in a manner similar to an IRC chat. For instance, there are set commands and whatnot, for chatting, executing actions, etc.
Now, I have several options to choose from for parsing this input. I could go with regular expressions, I could parse it directly (i.e. a large switch statement with all supported commands, simply checking the first x characters of the user input), or I could even go crazy and add a parser similar to Flex/Bison implementations. One other option I was considering was defining all commands in an XML file to separate them from the code implementation.
So, what are the thoughts of the community?
I'd go with a nice mixed bag of all.
Obviously you'll have to sanitize the input. Make sure there's no nasty stuff in there; depending on where the input is going, you'll want to prevent SQL injection, XSS, CSRF, etc.
But when your input is clean, you could go with a regexp that catches the ones intended as commands and captures all the necessary submatches (command parameters, etc.), and then have some sort of dispatcher switch statement or similar.
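For example, the regexp-plus-dispatcher part might look something like this in Java (the command names and syntax are made up):

```java
import java.util.regex.*;

public class CommandDispatcher {
    // Matches "/command arg1 arg2 ..." and captures the command name and the rest.
    private static final Pattern COMMAND = Pattern.compile("^/(\\w+)(?:\\s+(.*))?$");

    public static void handle(String input) {
        Matcher m = COMMAND.matcher(input.trim());
        if (!m.matches()) {
            System.out.println("(plain chat) " + input);   // not a command at all
            return;
        }
        String name = m.group(1);
        String args = m.group(2) == null ? "" : m.group(2);
        switch (name) {
            case "msg"  -> System.out.println("private message: " + args);
            case "kick" -> System.out.println("kicking: " + args);
            default     -> System.out.println("unknown command: " + name);
        }
    }

    public static void main(String[] args) {
        handle("/msg alice hello there");
        handle("just chatting");
    }
}
```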
There really is no cover-all best practice here, apart from always always and quadruple-always making sure user input is sanitized. Apart from that, go with what seems to fit best for your case.
Obviously there are those who say that if you've got a problem and you're thinking of using regexps to solve said problem, you've got two problems, but used cautiously they're the best thing ever. Just remember that regexp monsters can lead to really poor readability really quickly.