I want to build a parser for a C like language. The interesting aspect about it is that I want to build it in such a way that someone who has access to the source can easily modified it to extend the language (a new expression type of instance) with the extensions being runtime configurable (they can be turned on and off).
My current intent is to build a recursive decent parser as an object. Each production will be a method of an object. The method of extension will be to derive classes from this base replacing methods (and production definitions) as needed. I'm still trying to figure out how to mix and match extensions. One idea is to play games with the v-tbl. Objects would be constructed with a v-tbl that is a copy of the base but with methods replaced from derived classes.
Aside from the bit-twiddling nature of the solution the only issues I have with it is
a reasonable way to do the v-tbl mixup
what to do when 2 extensions alter the same productions (as most replacements will end up calling the original having one replacement call the other would work but the mechanics of setting this up are the issue)
how to allow the extension of extensions (this might end up looking like a standard MI system, but I've never got how they work)
Another solution (a slightly more mundane version of the same same approach) would be to use static member variables to store function-pointers and call them for the same effect.
Edit: I have already built a system that lets me build productions from BNF definitions. I can alter it to support whatever I decide on.
These are some of the challenges the Perl 6 design effort has faced. You may find it worthwhile looking into some of the solutions they came up with. Or you may find that to be gross overkill.
I made a configurable parser I uploadei it some time ago at
http://code.google.com/p/compparser/
The project there is not up-to-date but is working fine.
If I recall my university courses correctly, recursive descent parsers have some limitations that might bite you, especially since you're allowing extensions - somebody elses language extension could cause issues.
A proper compiler toolkit - such as the open source ANTLR - might make things easier, and might also provide some different approaches for you.
another option is to express the parsing rules in XML or something, instead of in code; less efficient, but far more dynamically configurable; each language or variant can just use its own (XML) file, and even include/reference other files as 'base' files...
Frankly, I am not even sure I understood everything you wrote... :-)
But when I see parser and flexibility, I think about LPeg - Parsing Expression Grammars For Lua. It might not fit your needs but it is well worth a look... ;-)
Related
I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of intofmration.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by #Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from #tfb, but I don't have time for that shit.
I'm looking for steps/libraries/approaches to solve this Problem statement.
Given a source file of a Programming language, I need to parse it and Subdivide it into components.
Example:
Given a Java File, I need to find the following in it.
list of Imports
Classes present in it
Attributes in the Class
Methods in it - along the Parameters if any.
etc.
I need to extract these and store it separately.
Reason Why I want to do it?
I want to build an Inverted Index on the top of these Components.
Example queries to Inverted index
1. Find the list of files with Class name: Sample
2. Find the positions where variable XXX is used within the class AAA.
I need to support queries likes the above
So, my plan is given a file, if I build these components from it, It would be easy to build an Inverted index on the top of it.
Example: Sample -- Class - Sample.java(Keyword - Component - FileName )
I want to build an Inverted index like above.
I see it is being implemented in many IDEs like IntelliJ.What I'm interested it how much effort it would take to build something like this. And I want to try implementing the same for at least one language.
Thanks in advance.
You can try to do this "just" a parser; for your specific example, that might be enough.
But you'll need a parser for each language. If you stick to just Java, you can find Java parsers pretty easily; just reuse one, there is little point in you reinventing one more set of grammar rules to describe Java.
For more than one language, this starts to get tricky. You can:
try to find a separate parser for each language. This may be sort of successful for mainstream languages. As you get to less well known languages, these get a lot harder to find. If you succeed, you'll have the problem that the parsers are likely incompatible technology; now gluing them together to collectively collect your index information is going to be a mess.
pick one parsing technology and get grammars for all the languages you care about. You have only two realistic choices: YACC/Bison, and ANTLR.
As a practical matter the YACC and Bison have been used to implement LOTS of languages... but the grammar files are not collected in one place, so they are hard to find. ANTLR at least has a single repository you can find at their web site. So that might kind of work.
Its going to be quite the effort to assemble all these into an integrated whole.
A complication is that you may want more than just raw syntax; you might want to know the meaning of the symbols, and for each symbol, precisely where it is defined in which file. After all, you want your index to be accurate at scale, and this will require differentiating foo the variable name from foo the function name. Arguably you need symbol tables.
As a general rule, this is where pure-parsing of languages breaks down;
there is serious Life After Parsing.
In that case, you want an integrated set of tools for extracting information from the different languages.
Our DMS Software Reengineering Toolkit is such a framework, and has some 40 languages predefined for it. We use something like OP's suggested process to build indexes of a code base for search tools based on DMS. Building something like DMS is an enormous effort.
When writing proofs I noticed that Agda's auto proof search frequently wouldn't find solutions that seem obvious to me. Unfortunately coming up with a small example, that illustrates the problem seems to be hard, so I try to describe the most common patterns instead.
I forgot to add -m to the hole to make Agda look at the module scope. Can I make that flag the default? What downsides would that have?
Often the current hole can be filled by a parameter of the function I am about to implement. Even when adding -m, Agda will not consider function parameters or symbols introduced in let or where clauses though. Is there something wrong with simply trying all of them?
When viewing a goal, symbols introduced in let or where clauses are not even displayed. Why?
What other habits can make using auto more effective?
Agda's auto proof search is hardwired into the compiler. That makes it fast,
but limits the amount of customization you can do. One alternative approach
would be to implement a similar proof search procedure using Agda's
reflection mechanism. With the recent beefed up version of reflection using
the TC monad,
you no longer need to implement your own unification procedure.
Carlos
Tome's been working on reimplementing these ideas (check out his code
https://github.com/carlostome/AutoInAgda ). He's been working on several
versions that try to use information from the context, print debugging info,
etc. Hope this helps!
Out of curiosity, I wonder what can people do with parsers, how they are applied, and what do people usually create with it?
I know it's widely used in programming language industry, however I think this is just a tiny portion of it, right?
Besides special-purpose languages, my most ambitious use of a parser generator yet (with good old yacc back in C, and again later with pyparsing in Python) was to extract, validate and possibly alter certain meta-info from SQL queries -- parsing SQL properly is a real challenge (especially if you hope to support more than one dialect!-), a parser generator (and a lexer it sits on top of) at least remove THAT part of the job!-)
They are used to parse text....
To give a more concrete example, where I work we use lexx/yacc to parse strings coming over sockets.
Also from the name it should give you an idea what javacc is used for (java compiler compiler!)
Generally to parse Domain Specific Languages or scripting languages, or similar support for code snipits.
Previously I have seen it used to parse the command line based output of another software tool. This way the outer tool (VPN software) could re-use the base router IPSec code without modification. As lots of what was being parsed was IP Route tables and other structured repeated text.
Using a parser allowed simple changes when the formatting changed, instead of trying to find and tweak the a hand written parser. And the output did change a few times of the life of the product.
I used parsers to help process +/- 800 Clipper source files into similar PRGs that could be compiled with Alaksa Xbase 32.
You can use it to extend your favorite language by getting its language definition from their repository and then adding what you've always wanted to have. You can pass the regular syntax to your application and handle the extension in your own program.
I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.
The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).
I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.
Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?
It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.
I don't see what other problems there are. You don't need to determine if identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that identifiers used are valid. Follow this with one or more additional parses over the AST to generate the output code, or some other intermediate datastructure and you're done!
EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. This is actually written in C# rather than C or C++, but it does use .NET port of Jay which is very similar to yacc.
One option is to deal with forward references by just scanning and caching tokens till you hit something you know how to real with (sort of like "panic-mode" error recovery). Once you have run thought the full file, go back and try to re parse the bits that didn't parse before.
As to having to hand write the lexer; don't, use lex to generate a normal parser and just read from it via a hand written shim that lets you go back and feed the parser from a cache as well as what lex makes.
As to making several grammars, a little fun with a preprocessor on the yacc file and you should be able to make them all out of the same original source