Is there such thing as a generic code parser? - parsing

There are tons of different programming languages, but lots of them adhere to some vaguely unifying rules about things like indentation and/or punctuation. Is there any library out there that serves as a "generic code parser"? For instance, as an input I would like to dump arbitrary code or even code fragments. The output wouldn't have to be a true abstract syntax tree, but I would like to have a good idea about of the hierarchy of the code, that is, what blocks exist at the top level, what blocks exist within those blocks.

Related

What exactly does the Lua Programming WikiBooks mean by "Instructions"

https://en.wikibooks.org/wiki/Lua_Programming/Statements says:
Statements are pieces of code that can be executed and that contain an
instruction and expressions to use with it. Some statements will also
contain code inside of themselves that may, for example, be run under
certain conditions. Dissimilarly to expressions, they can be put
directly in code and will execute.
What does it mean by instructions?
Am I looking too deep in to it
That article seems very poorly written in general. Containing gems like "Assignment is [...] used to assign". Your confusion is probably also just a result of this awkward style. The way I read it, the book separates between:
Statements: Do something specific, like adding two values into a variable.
Instructions: Things you can do in general, like adding any two values.
It seems to suggest a sort of abstraction-application relationship between the two.
That's a very specific way of dividing between the two and it's ultimately very inconsequential, so you can probably treat them as interchangeable while reading that book.

Tatsu: Rule Ordering

I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammer in TatSu?
Although for some languages it is possible to write grammars that verify if the same symbol appears on different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier
With those components in mind, checking if a symbol has ben previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried on the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.

Partial parsing with flex/antlr

I encountered a problem while doing my student research project. I'm an electrical engineering student, but my project has somewhat to do with theoretical computer science: I need to parse a lot of pascal sourcecode-files for typedefinitions and constants and visualize all occurrences. The typedefinitions are spread recursively over various files, i.e. there is type a = byte in file x, in file y, there is a record (struct) b, that contains type a and then there is even a type c in file z that is an array of type b.
My idea so far was to learn about compiler construction, since the compiler has to resolve all typedefinitions and break them down to the elemental types.
So, I've read about compiler construction in two books (one of which is even written by the pascal inventor), but I'm lacking so many basics of theoretical computer science that it took me one week alone to work my way halfway through. What I've learned so far is that for achieving my goal, lexer and parser should be sufficient. Since this software is only a really smart part of the whole project, I can't spend so much time with it, so I started experimenting with flex and later with antlr.
My hope was, that parsing for typedefinitions only was such an easy task, that I could manage to do it with only using a scanner and let it do some parser's work: The pascal-files consist of 5 main-parts, each one being optional: A header with comments, a const-section, a type-section, a var-section and (in least cases) a code-section. Each section has a start-identifier but no clear end-identifier. So I started searching for the start of the type- and const-section (TYPE, CONST), discarding everything else. In flex, this is fairly easy, because it allows "start conditions". They can be used as various states like "INITIAL", "TYPE-SECTION", "CONST-SECTION" and "COMMENT" with different rules for each state. I wanted to get back a string from the scanner with following syntax " = ". There was one thing that made this task difficult: Some type contain comments like in this example: AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;. The scanner can not extract such type-definition with a regular expression.
My next step was to do it properly with scanner AND parser, so I searched for a parsergenerator and found antlr. Since I write the tool in C# anyway, I decided to use its scannergenerator, too, so that I do not have to communicate between different programs. Now I encountered following Problem: AFAIK, antlr does not support "start conditions" as flex do. That means, I have to scan the whole file (okay, comments still get discarded) and get a lot of unneccessary (and wrong) tokens. Because I don't use rules for the whole pascal grammar, the scanner would identify most keywords of the pascal syntax as user-identifiers and the parser would nag about all those series of tokens, that do not fit to type- and constant-defintions
Now, finally my question(s): Can anyone of you tell me, which approach leads anywhere for my project? Is there a possibility to scan only parts of the source-files with antlr? Or do I have to connect flex with antlr for that purpose? Can I tell antlr's parser to ignore every token that is not in the const- or type-section? Are those tools too powerful for my task and should I write own routines instead?
You'd be better off to find a compiler for Pascal, and simply modify to report the information you want. Presumably there is such a compiler for your Pascal, and often the source code for such compilers is available.
Otherwise you essentially need to build a parser. Building lexer, and then hacking around with the resulting lexemes, is essentially building a bad parser by ad hoc methods. ANTLR is a good way to go; you can define the lexemes (including means to pick up and ignore comments) pretty easily, especially for older dialects of Pascal. You'll need good BNF rules for the type information that you want, and translate those rules to the parser generator. What you can do to minimize work, is to cheat on rules for the parts of the language you don't care about. For instance, you could write an accurate subgrammar for assignment statements. Since you don't care about them, you can write a sloppy subgrammar that treats assignment statements as anything that begins with an identifier, is followed by arbitrary other tokens, and ends with semicolon. This kind of a grammar is called an "island grammar"; it is only accurate where it needs to be accurate.
I don't know about the recursive bit. Is there a reason you can't just process each file separately? The answer may depend on what information you want to know about each type declaration, and if you go deep enough, you may need a symbol table as well as an island parser. Parser generators offer you no help for this.
First, there can be type and const blocks within other blocks (procedures, in later Delphi versions also classes).
Moreover, I'm not entirely sure that you can actually simply scan for a const token, and then start parsing. Const is also used for other purposes in most common (Borland) Pascal dialects. Some keywords can be reused in a different context, and if you don't parse the global blockstructure, and only look for const and type in specific places you will erroneously start parsing there.
A base problem of course is the comments. Scanners cut out comments as early as possible, and don't regard them further. You probably have to setup the scanner so that comments are attached to the adjacent tokens as field (associate with token before or save them up till a certain token follows).
As far antlr vs flex, no clue. The only parsergenerator I have some minor experience in parsing Pascal with is Coco/R (a parsergenerator popular by Wirthians), but in general I (and many pascalians) prefer handcoded.

Is there a relationship between untyped/typed code quotations in F# and macro hygiene?

I wonder if there is a relationship between untyped/typed code quotations in F# and the hygiene of macro systems. Do they solve the same issues in their respective languages or are they separate concerns?
The meta-programming aspect is the only similarity, and even in that regard, there is a big difference. You can think of the macro's transformer as a function from syntax to syntax like you can manipulate quotations, but the transformers are globally coordinated so that names used as binders follow a specific protocol:
1) Binders may not be the same as any free name in input to the macro (unless you use an unhygienic escape hatch)
2) Names bound in a macro definition's context that are free in the macro's expansion must point to the same thing at macro use time. (this needs global coordination)
Choices for names are made so that expansion does not fail if you used the wrong name (unless it turns out that name is unbound).
Transformers of typed quotations do not have this definition time context idea. You manipulate quotations to form a program that does not refer to any names in your program. They are not meant to provide a syntactic abstraction mechanism. Arbitrary shapes of syntax? Nope. It all has to be core AST shapes.
Open code in typed quotation systems can be closed with anything that fits the type structure of the expected context - there is no coordinated composition of several open components into a coherent structure.
Quotations are a form of meta-programming. They allow you to manipulate abstract syntax trees programmatically, which can be in turned spliced into code, and evaluated.
Typed quotations embed the reified type of the AST in the host language's type system, so they ensure you cannot generate ill-typed fragments of code. Untyped quotations do not offer that guarantee (it may fail with a runtime error).
As an aside, typed quotations are strongly similar to Template Haskell quasiquotations.
Hygenic macros in Lisp-like languages are related, in that they exist to support meta-programming. The hygiene however is for simple name capture confusion, something that typed quasi quotations already avoid (and more).
So yes, they are similar, in that they are mechanisms for meta-programming in typed and untyped languages, respectively. Both typed quasi quotes and hygenic macros add additional safety to fully untyped, unsound meta programming. The level of guarantee they offer the programmer though is different. The typed quotes are strictly stronger.

Parsing Source Code - Unique Identifiers for Different Languages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
it's clear to me that any algorithm I choose will not be 100% proof, certainly for very short input codes (e.g. that could be accepted by both Python and Ruby), in which cases I will still need the user's assistance, however I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks
Vim has a autodetect filetype feature. If you download vim sourcecode you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.
build a generic tokenizer and then use a Bayesian filter on them. Use the existing "user checks a box" system to train it.
Here is a simple way to do it. Just run the parser on every language. Whatever language gets the farthest without encountering any errors (or has the fewest errors) wins.
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar when using a fuzzy analysis (baysian for example), would likely have many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.
I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlyuing language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem enough problems to solve to be going on with.
One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was writting conforming to best practices. Probably a very weak rule
Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).
Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you code try to throw your analysis code at it for every language and if it fails really bad, it was the wrong language :)
My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has assigned a list of languages that the element is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)
How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way when a code snippet is entered into your app which cannot be 100% identified you can have it show the closest matches which the user can pick from - this can then be fed into your database.
Here's an idea for you. For each of your N languages, find some files in the language, something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append to each of he langX.txt files, producing langXapp.txt, and corresponding gzipped langXapp.txt.gz. For each X, find the difference between the size of langXapp.gz and langX.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side you don't need to know anything about the language, it's completely automatic. And it can detect natural languages and tell between French or Chinese as well. Just in case you need it :) But the main reason, I just think it's interesting thing to try :)
The most bulletproof but also most work intensive way is to write a parser for each language and just run them in sequence to see which one would accept the code. This won't work well if code has syntax errors though and you most probably would have to deal with code like that, people do make mistakes. One of the fast ways to implement this is to get common compilers for every language you support and just run them and check how many errors they produce.
Heuristics works up to a certain point and the more languages you will support the less help you would get from them. But for first few versions it's a good start, mostly because it's fast to implement and works good enough in most cases. You could check for specific keywords, function/class names in API that is used often, some language constructions etc. Best way is to check how many of these specific stuff a file have for each possible language, this will help with some syntax errors, user defined functions with names like this() in languages that doesn't have such keywords, stuff written in comments and string literals.
Anyhow you most likely would fail sometimes so some mechanism for user to override language choice is still necessary.
I think you never should rely on one single feature, since the absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate a key tokens to "other" identifiers ratio. Give each token a weight, since some might be a key indicator to disambiguate between dialects or versions.
Note that this is also a convenient way to allow users to plugin "known" keywords to increase the detection ratio, by e.g. providing identifiers of runtime library routines or types.
Very interesting question, I don't know if it is possible to be able to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single-quotes: In some languages, it is used as character wrapper, whereas in the others it can contain a whole string
A unary asterisk or a unary ampersand operator is a certain indication that it's either of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)
The class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminates java (there is no max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very differently across projects (#define boolean int) So you can never guarantee, that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.
There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true to every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language, these could be operators, commenting style (since most use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#", these include C, C++ and Perl. Do others than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
The problem with keywords is someone might use it as a normal variable or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user with the rest ordered which would also be selectable. So the user would accept your guess by simply clicking a button, or he can switch it easily.
In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)

Resources