Save Mathematica code in `FullForm` syntax - parsing

I need to do some metaprogramming on a large Mathematica code base (hundreds of thousands of lines of code) and don't want to have to write a full-blown parser so I was wondering how best to get the code from a Mathematica notebook out in an easily-parsed syntax.
Is it possible to export a Mathematica notebook in FullForm syntax, or to save all definitions in FullForm syntax?
The documentation for Save says that it can only export in the InputForm syntax, which is non-trivial to parse.
The best solution I have so far is to evaluate the notebook and then use DownValues to extract the rewrite rules with arguments (but this misses symbol definitions) as follows:
DVs[_] := {}
DVs[s_Symbol] := DownValues[s]
stream = OpenWrite["FullForm.m"];
WriteString[stream,
  DVs[Symbol[#]] & /@ Names["Global`*"] // Flatten // FullForm];
Close[stream];
I've tried a variety of approaches so far but none are working well. Metaprogramming in Mathematica seems to be extremely difficult because it keeps evaluating things that I want to keep unevaluated. For example, I wanted to get the string name of the infinity symbol using SymbolName[Infinity] but the Infinity gets evaluated into a non-symbol and the call to SymbolName dies with an error. Hence my desire to do the metaprogramming in a more suitable language.
EDIT
The best solution seems to be to save the notebooks as package (.m) files by hand and then translate them using the following code:
stream = OpenWrite["EverythingFullForm.m"];
WriteString[stream, Import["Everything.m", "HeldExpressions"] // FullForm];
Close[stream];

You can certainly do this. Here is one way:
exportCode[fname_String] :=
  Function[code,
    Export[fname, ToString@HoldForm@FullForm@code, "String"],
    HoldAllComplete]
For example:
fn = exportCode["C:\\Temp\\mmacode.m"];
fn[
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
Developer`ToPackedArray[words /. sym["Direct"]];
];
And importing this as a string:
In[623]:= Import["C:\\Temp\\mmacode.m","String"]//InputForm
Out[623]//InputForm=
"CompoundExpression[Clear[getWordsIndices], SetDelayed[getWordsIndices[Pattern[sym, Blank[]], \
Pattern[words, List[BlankSequence[String]]]], Developer`ToPackedArray[ReplaceAll[words, \
sym[\"Direct\"]]]], Null]"
However, going to another language to do metaprogramming for Mathematica sounds ridiculous to me, given that Mathematica is very well suited for it. There are many techniques available in Mathematica for doing meta-programming and avoiding premature evaluation. One that comes to mind is described in this answer, but there are many others. Since you can operate on parsed code and use Mathematica's pattern matching, you save a lot. You can browse the SO Mathematica tags (past questions) and find lots of examples of meta-programming and evaluation control.
EDIT
To ease your pain with auto-evaluating symbols (there are only a few actually, Infinity being one of them): if you just need to get the symbol name for a given symbol, then this function will help:
unevaluatedSymbolName = Function[sym, SymbolName@Unevaluated@sym, HoldAllComplete]
You use it as
In[638]:= unevaluatedSymbolName[Infinity]//InputForm
Out[638]//InputForm="Infinity"
Alternatively, you can simply add the HoldFirst attribute to the SymbolName function via SetAttributes. One way is to do that globally:
SetAttributes[SymbolName,HoldFirst];
SymbolName[Infinity]//InputForm
Modifying built-in functions globally is, however, dangerous, since it may have unpredictable effects in such a large system as Mathematica:
ClearAttributes[SymbolName, HoldFirst];
Here is a macro to use that locally:
ClearAll[withUnevaluatedSymbolName];
SetAttributes[withUnevaluatedSymbolName, HoldFirst];
withUnevaluatedSymbolName[code_] :=
Internal`InheritedBlock[{SymbolName},
SetAttributes[SymbolName, HoldFirst];
code]
Now,
In[649]:=
withUnevaluatedSymbolName[
{#,StringLength[#]}&[SymbolName[Infinity]]]//InputForm
Out[649]//InputForm= {"Infinity", 8}
You may also wish to do some replacements in a piece of code, say, replace a given symbol by its name. Here is some example code (which I wrap in Hold to prevent it from evaluating):
c = Hold[Integrate[Exp[-x^2], {x, -Infinity, Infinity}]]
The general way to do replacements in such cases is using Hold-attributes (see this answer) and replacements inside held expressions (see this question). For the case at hand:
In[652]:=
withUnevaluatedSymbolName[
c/.HoldPattern[Infinity]:>RuleCondition[SymbolName[Infinity],True]
]//InputForm
Out[652]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
This is not the only way to do it, though. Instead of using the above macro, we can also encode the modification to SymbolName into the rule itself (here I am using a more verbose form of in-place evaluation, the Trott-Strzebonski trick, but you can use RuleCondition as well):
ClearAll[replaceSymbolUnevaluatedRule];
SetAttributes[replaceSymbolUnevaluatedRule, HoldFirst];
replaceSymbolUnevaluatedRule[sym_Symbol] :=
  HoldPattern[sym] :> With[{eval = SymbolName@Unevaluated@sym}, eval /; True];
Now, for example:
In[629]:=
Hold[Integrate[Exp[-x^2],{x,-Infinity,Infinity}]]/.
replaceSymbolUnevaluatedRule[Infinity]//InputForm
Out[629]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
Actually, this entire answer is a good demonstration of various meta-programming techniques. From my own experience, I can direct you to several answers of mine (this, this, this, this and this), where meta-programming was essential to solve the problem I was addressing. You can also judge by the fraction of Mathematica functions carrying Hold-attributes - about 10-15 percent of all functions, if memory serves me well. All those functions are effectively macros, operating on code. To me, this is a very indicative fact, telling me that Mathematica heavily builds on its meta-programming facilities.

The full forms of expressions can be extracted from the Code and Input cells of a notebook as follows:
$exprs =
Cases[
Import["mynotebook.nb", "Notebook"]
, Cell[content_, "Code"|"Input", ___] :>
ToExpression[content, StandardForm, HoldComplete]
, Infinity
] //
Flatten[HoldComplete @@ #, 1, HoldComplete] & //
FullForm
$exprs is assigned the expressions read, wrapped in HoldComplete to prevent evaluation. $exprs could then be saved into a text file:
Export["myfile.txt", ToString[$exprs]]
Package files (.m) are slightly easier to read in this way:
Import["mypackage.m", "HeldExpressions"] //
Flatten[HoldComplete @@ #, 1, HoldComplete] &

Related

Elixir: rationale behind allowing rebinding of variables

What is the rationale behind allowing rebinding of variables in Elixir, when Erlang doesn't allow that?
Most functional languages don't allow rebinding of variables in the same scope, so Elixir allowing this definitely gives it a non-functional, imperative feel. Erlang's problem is rather a lack of scope, or to be more precise, that there is only one scope in a whole function clause. We did have serious discussions about whether to introduce scope, but in the end we decided against it as it was backwards incompatible with the existing system. And developers hate backwards-incompatible changes.
The Erlang way has one serious benefit: when you get it wrong you generally get an error, so you can SEE the mistake. Compare this with just getting strange behaviour when a variable doesn't have the value you expect it to have, which is MUCH harder to detect and correct.
Personally I think that the problem of new variable names, for example using the number scheme, is greatly overblown. Compared to the time it takes me to work out WHAT I am going to do, changing variable names is trivial. And after a while you just see it without reflecting on it. Honestly.
EDIT:
Also when chaining data through a sequence of functions the actual meaning of the data changes so reusing the same variable name can be very misleading. It can end up just meaning a generic "data I am passing from one function to another".
Here's the rationale straight from the horse's mouth:
http://blog.plataformatec.com.br/2016/01/comparing-elixir-and-erlang-variables/
Because it's simpler.
Take a look at this question posted to the Erlang mailing list in 2009. Specifically this part:
I like pattern matching in the majority of cases, but I find I write
enough code where I need to incrementally update a data structure, and
maintaining that code is a pain when I have code like:
X = foo(),
X1 = bar(X),
X2 = xyzzy(X1),
blah(X2).
and later want to change it to:
X = foo(),
X1 = whee(X),
X2 = bar(X1),
X3 = xyzzy(X2),
blah(X3).
Editor's note--this is the reply to that question.
This goes through IRC a lot. This is a result of poor variable naming
practice, and there is no need to introduce rebinding to "fix" it;
just stop using single letters and counters as variable names.
If for example that was written
FooStateX = foo(),
PostBarX = bar(FooStateX),
OnceXyzziedX = xyzzy(PostBarX),
blah(OnceXyzziedX).
The code demonstrated there isn't all that uncommon in Erlang (note the remark "this goes through IRC a lot"). Elixir's ability to simply rebind names saves us from having to generate new dummy names for things all the time. That's all. It's wise to bear in mind that the original creators of Erlang weren't trying to build a functional language. They were simply pragmatic developers who had a problem to solve. Elixir's approach is simple pragmatism.

Can you "teach" computers to do algebra using variable expressions (eg aX+bX=(a+b)X)

Let's say that in the examples lower case is a constant and upper case is a variable.
I'd like to have programs that can "intelligently" do specified tasks like algebra, but teaching the program new methods should be easy using symbols understood by humans. For example, if the program is told these facts:
aX+bX=(a+b)X
if a=bX then X=a/b
Then it should be able to perform these operations:
2a+3a=5a
3x+3x=6x
3x=1 therefore x=1/3
4x+2x=1 -> 6x=1 therefore x= 1/6
I was trying to do similar things with Prolog, as it can easily "understand" variables, but then I had too many complications, mainly because describing a relationship both ways results in a crash (not easy to sort out).
To summarise: I want to know whether a program can be taught algebra using mathematical symbols only. I'd like to know if other people have tried this and how complicated it is expected to be. The purpose of this is to make programming easier (runtime is not so important).
It depends on what you want the machine to do and how intelligent it should be.
Your question is mostly about AI, not ML. AI deals with the formalization of "human" tasks, while ML (though a subset of AI) is about building models from data.
The described program may be implemented like this:
Each fact forms a pattern. Given an expression and some patterns, the program can try to apply some of them to the expression and see what happens. If you want your program to be able to, for example, solve quadratic equations given a rule like ax² + bx + c = 0 → x = (-b ± sqrt(b²-4ac))/(2a), then it would be designed as follows:
Somebody gives a set of rules. A rule consists of a pattern and an outcome (solution or equivalent form). Think about the pattern as a kind of regular expression.
Then the program is asked to show some intelligence and prove its knowledge by doing something with a given expression. Here comes the major part:
you build a graph of expressions by applying possible rules (if a pattern is applicable to an expression, you add a new vertex with the corresponding outcome);
then you run a path-search algorithm (A*, for example) to find a sequence of transformations leading to a form like x = ...
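For illustration only, here is a minimal Python sketch of that idea, using breadth-first search instead of A* and crude regex rewrites over the string form of an expression instead of a real pattern language; the two rules and the goal test are hypothetical placeholders:
import re
from collections import deque

# Each "fact" becomes a rewrite rule: (regex pattern, replacement function).
RULES = [
    # aX + bX -> (a+b)X
    (r"(\d+)([a-z])\+(\d+)\2",
     lambda m: str(int(m.group(1)) + int(m.group(3))) + m.group(2)),
    # aX = b -> X = b/a
    (r"^(\d+)([a-z])=(\d+)$",
     lambda m: m.group(2) + "=" + m.group(3) + "/" + m.group(1)),
]

def solve(expression, is_goal=lambda e: re.fullmatch(r"[a-z]=.+", e)):
    # Breadth-first search over the graph of expressions reachable by the rules.
    seen = {expression}
    queue = deque([(expression, [expression])])
    while queue:
        expr, path = queue.popleft()
        if is_goal(expr):
            return path
        for pattern, rewrite in RULES:
            new_expr = re.sub(pattern, rewrite, expr, count=1)
            if new_expr != expr and new_expr not in seen:
                seen.add(new_expr)
                queue.append((new_expr, path + [new_expr]))
    return None

print(solve("4x+2x=1"))  # ['4x+2x=1', '6x=1', 'x=1/6']
A real system would match over parsed expression trees rather than strings, but the graph-plus-search structure stays the same.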
I think this is an interesting question, although it is off topic on SO (tool recommendation).
But nevertheless, because it captured my imagination, I wrote a couple of functions using R that can solve stuff like that quite easily.
First, you'll have to install R; afterwards you'll need to download a package called stringr.
So in R console run
install.packages("stringr")
library(stringr)
And then you can define the following functions that I wrote
FirstFunc <- function(temp){
  paste0(eval(parse(text = gsub("[A-Z]", "", temp))), unique(str_extract_all(temp, "[A-Z]")[[1]]))
}
SecondFunc <- function(temp){
  eval(parse(text = strsplit(temp, "=")[[1]][2])) / eval(parse(text = gsub("[[:alpha:]]", "", strsplit(temp, "=")[[1]][1])))
}
Now, the first function will solve equations like
aX+bX=(a+b)X
While the second will solve equations like
4x+2x=1
For example
FirstFunc("3X+6X-2X-3X")
will return
"4X"
Now, this function is pretty primitive (mostly for the purpose of illustration) and will only solve equations that contain one variable type; something like FirstFunc("3X-2X-2Y") won't give the correct result (but the function could easily be modified).
The second function will solve stuff like
SecondFunc("4x-2x=1")
will return
0.5
or
SecondFunc("4x+2x*3x=1")
will return
0.1
Note that this function also works only for one unknown variable (x), but it too could easily be modified.

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming a parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern-directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (as in our tool, the DMS Software Reengineering Toolkit), you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call with the barized version, carrying the additional intent "replace" to transform the list. The other two transformations realize that intent by transforming "replace" away: they process the elements of the list one at a time and push the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop.)
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it turns out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the StringTemplate library? It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit what you're looking for, i.e. outputting matched grammar rules as formatted text.
Here is an article on translation via ANTLR

Is there an idiomatic way to order function arguments in Erlang?

Seems like it's inconsistent in the lists module. For example, split has the number as the first argument and the list as the second, but sublist has the list as the first argument and the length as the second.
OK, a little history as I remember it and some principles behind my style.
As Christian has said the libraries evolved and tended to get the argument order and feel from the impulses we were getting just then. So for example the reason why element/setelement have the argument order they do is because it matches the arg/3 predicate in Prolog; logical then but not now. Often we would have the thing being worked on first, but unfortunately not always. This is often a good choice as it allows "optional" arguments to be conveniently added to the end; for example string:substr/2/3. Functions with the thing as the last argument were often influenced by functional languages with currying, for example Haskell, where it is very easy to use currying and partial evaluation to build specific functions which can then be applied to the thing. This is very noticeable in the higher order functions in lists.
The only influence we didn't have was from the OO world. :-)
Usually we at least managed to be consistent within a module, but not always. See lists again. We did try to have some consistency, so the argument order in the higher order functions in dict/sets match those of the corresponding functions in lists.
The problem was also aggravated by the fact that we, especially me, had a rather cavalier attitude to libraries. I just did not see them as a selling point for the language, so I wasn't that worried about it. "If you want a library which does something then you just write it" was my motto. This meant that my libraries were structured, just not always with the same structure. :-) That was how many of the initial libraries came about.
This, of course, creates unnecessary confusion and breaks the law of least astonishment, but we have not been able to do anything about it. Any suggestions of revising the modules have always been met with a resounding "no".
My own personal style is usually structured, though I don't know if it conforms to any written guidelines or standards.
I generally have the thing or things I am working on as the first arguments, or at least very close to the beginning; the order depends on what feels best. If there is a global state which is chained through the whole module, which there usually is, it is placed as the last argument and given a very descriptive name like St0, St1, ... (I belong to the church of short variable names). Arguments which are chained through functions (both input and output) I try to keep the same argument order as return order. This makes it much easier to see the structure of the code. Apart from that I try to group together arguments which belong together. Also, where possible, I try to preserve the same argument order throughout a whole module.
None of this is very revolutionary, but I find if you keep a consistent style then it is one less thing to worry about and it makes your code feel better and generally be more readable. Also I will actually rewrite code if the argument order feels wrong.
A small example which may help:
fubar({f,A0,B0}, Arg2, Ch0, Arg4, St0) ->
{A1,Ch1,St1} = foo(A0, Arg2, Ch0, St0),
{B1,Ch2,St2} = bar(B0, Arg4, Ch1, St1),
Res = baz(A1, B1),
{Res,Ch2,St2}.
Here Ch is a local chained through variable while St is a more global state. Check out the code on github for LFE, especially the compiler, if you want a longer example.
This became much longer than it should have been, sorry.
P.S. I used the word thing instead of object to avoid confusion about what I was talking about.
No, there is no consistently-used idiom in the sense that you mean.
However, there are some useful relevant hints that apply especially when you're going to be making deeply recursive calls. For instance, keeping whichever arguments will remain unchanged during tail calls in the same order/position in the argument list allows the virtual machine to make some very nice optimizations.

Parsing Source Code - Unique Identifiers for Different Languages? [closed]

I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
It's clear to me that any algorithm I choose will not be 100% foolproof, certainly for very short input codes (e.g. code that could be accepted by both Python and Ruby), in which case I will still need the user's assistance; however, I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks
Vim has an autodetect-filetype feature. If you download the Vim source code you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.
Build a generic tokenizer and then use a Bayesian filter on the tokens. Use the existing "user checks a box" system to train it.
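As a rough Python sketch of what that could look like (the tokenizer regex and the Laplace smoothing are arbitrary illustrative choices, not a prescription):
import math
import re
from collections import Counter, defaultdict

def tokenize(source):
    # Very generic tokenizer: identifiers, numbers, and runs of punctuation.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z0-9_]+", source)

class NaiveBayesGuesser:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.sample_counts = Counter()            # language -> training samples

    def train(self, language, source):
        # Call this whenever the user confirms a language via the existing checkbox.
        self.sample_counts[language] += 1
        self.token_counts[language].update(tokenize(source))

    def guess(self, source):
        if not self.sample_counts:
            return None
        tokens = tokenize(source)
        total_samples = sum(self.sample_counts.values())
        best_language, best_score = None, float("-inf")
        for language, counts in self.token_counts.items():
            total = sum(counts.values())
            vocabulary = len(counts)
            score = math.log(self.sample_counts[language] / total_samples)
            for token in tokens:
                # Laplace smoothing so unseen tokens don't zero out the score.
                score += math.log((counts[token] + 1) / (total + vocabulary + 1))
            if score > best_score:
                best_language, best_score = language, score
        return best_language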
Here is a simple way to do it: just run every language's parser on the input. Whichever language gets the farthest without encountering any errors (or has the fewest errors) wins (a sketch follows the list of advantages below).
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar under a fuzzy analysis (Bayesian, for example) would likely produce many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.
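A minimal Python sketch of that setup, assuming you already have a parser (or error-counting front end) for each supported language; only the Python entry below uses a real parser (ast.parse), the commented entries are hypothetical placeholders you would wire up to your existing handlers:
import ast

def python_error_count(source):
    try:
        ast.parse(source)
        return 0
    except SyntaxError:
        return 1

# Map each language to a callable returning the number of parse errors.
ERROR_COUNTERS = {
    "python": python_error_count,
    # "java": java_error_count,   # placeholder: plug in your existing parser
    # "php":  php_error_count,    # placeholder: plug in your existing parser
}

def guess_by_parsing(source):
    scores = {lang: count(source) for lang, count in ERROR_COUNTERS.items()}
    return min(scores, key=scores.get)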
I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlying language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem like enough problems to be going on with.
One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was written conforming to best practices. Probably a very weak rule.
Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP, you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach, however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what the operators are, or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match against a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).
Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you could try to throw your analysis code at it for every language, and if it fails really badly, it was the wrong language :)
My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has an assigned list of languages that it is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)
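A small Python sketch of that counting scheme (the indicator table below just encodes the handful of examples from this answer; a real one would be much longer, and the exact regexes are my own guesses):
import re
from collections import Counter

# (pattern, languages it hints at) -- extend as needed.
INDICATORS = [
    (re.compile(r"\bclass\b"),                        ["C++", "C#", "Java"]),
    (re.compile(r"\binterface\b"),                    ["C#", "Java"]),
    (re.compile(r"\bimplements\b"),                   ["Java"]),
    (re.compile(r"^\s*\[\w+\]", re.MULTILINE),        ["C#"]),
    (re.compile(r"\bprocedure\b", re.IGNORECASE),     ["Pascal", "Modula"]),
    (re.compile(r"\b(create\s+table|insert\s+into)\b", re.IGNORECASE), ["SQL"]),
]

def guess_by_indicators(source):
    hits = Counter()
    for pattern, languages in INDICATORS:
        matches = len(pattern.findall(source))
        for language in languages:
            hits[language] += matches
    return hits.most_common(1)[0][0] if hits else None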
How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way, when a code snippet that cannot be 100% identified is entered into your app, you can have it show the closest matches for the user to pick from - these can then be fed back into your database.
Here's an idea for you. For each of your N languages, find some files in the language; something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append it to each of the langX.txt files, producing langXapp.txt and the corresponding gzipped langXapp.txt.gz. For each X, find the difference between the sizes of langXapp.txt.gz and langX.txt.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side you don't need to know anything about the language; it's completely automatic. And it can detect natural languages and tell French from Chinese as well, just in case you need it :) But mainly, I just think it's an interesting thing to try :)
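Here is a rough Python sketch of that compression trick; the corpora dictionary (language name mapped to the concatenated sample files) is something you would build yourself from the langX.txt files described above:
import gzip

def gzip_size(data):
    return len(gzip.compress(data))

def guess_by_compression(snippet, corpora):
    # corpora: {"python": "<concatenated sample code>", "java": "...", ...}
    sample = snippet.encode("utf-8")
    best_language, best_delta = None, None
    for language, corpus in corpora.items():
        base = corpus.encode("utf-8")
        # How many extra compressed bytes does the snippet add to this corpus?
        delta = gzip_size(base + sample) - gzip_size(base)
        if best_delta is None or delta < best_delta:
            best_language, best_delta = language, delta
    return best_language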
The most bulletproof but also most work-intensive way is to write a parser for each language and just run them in sequence to see which one accepts the code. This won't work well if the code has syntax errors, though, and you most probably will have to deal with code like that; people do make mistakes. One of the fast ways to implement this is to get common compilers for every language you support, run them, and check how many errors they produce.
Heuristics work up to a certain point, and the more languages you support the less help you will get from them. But for the first few versions it's a good start, mostly because it's fast to implement and works well enough in most cases. You could check for specific keywords, function/class names in APIs that are used often, some language constructs, etc. The best way is to check how much of this specific stuff a file has for each possible language; this will help with some syntax errors, user-defined functions with names like this() in languages that don't have such keywords, and stuff written in comments and string literals.
Anyhow, you will most likely fail sometimes, so some mechanism for the user to override the language choice is still necessary.
I think you should never rely on one single feature, since its absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate the ratio of key tokens to "other" identifiers. Give each token a weight, since some might be a key indicator for disambiguating between dialects or versions.
Note that this is also a convenient way to allow users to plugin "known" keywords to increase the detection ratio, by e.g. providing identifiers of runtime library routines or types.
Very interesting question. I don't know if it is possible to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single quotes: in some languages they are used as a character wrapper, whereas in others they can contain a whole string.
A unary asterisk or a unary ampersand operator is a certain indication that it's one of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)
The class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminate Java (there is no max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very different across projects (#define boolean int), so you can never guarantee that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.
There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true for every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance, "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language; these could be operators or commenting style (since most use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#"; these include C, C++ and Perl. Do any languages other than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of a line, the language is probably one of those. If the character is in the middle of a line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found, and after the analysis I would simply count which language had the most "hits". If I wanted it to be really clever I would give each feature a weight signifying how likely or unlikely it is that this feature appears in a snippet of this language. For instance, if you can find a snippet that starts with /* and ends with */, it is more than likely that this is either C or C++.
The problem with keywords is that someone might use them as normal variables or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than in C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user, with the rest ordered and also selectable. So the user would accept your guess by simply clicking a button, or could switch it easily.
In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)
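For completeness, a tiny Python sketch of that check (the interpreter name it returns, e.g. "python3" or "perl", would still need mapping onto your own handlers):
def language_from_shebang(source):
    lines = source.lstrip().splitlines()
    if not lines or not lines[0].startswith("#!"):
        return None
    parts = lines[0][2:].split()
    if not parts:
        return None
    interpreter = parts[0].rsplit("/", 1)[-1]
    # Handle the common "#!/usr/bin/env python" form.
    if interpreter == "env" and len(parts) > 1:
        interpreter = parts[1]
    return interpreter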
