Parsing Source Code - Unique Identifiers for Different Languages? [closed] - parsing

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).
Currently I'm asking the user to input the programming language the code is written in, and this is error-prone: although users know the programming languages, a small percentage of them (on rare occasions) click the wrong option just due to recklessness, and that breaks the system (i.e. my analysis fails).
It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:
I'm receiving pure text and not file names, so I can't use the extension as a hint.
The user is not required to input complete source codes, and can also input code snippets (i.e. the include/import part may not be included).
it's clear to me that any algorithm I choose will not be 100% proof, certainly for very short input codes (e.g. that could be accepted by both Python and Ruby), in which cases I will still need the user's assistance, however I would like to minimize user involvement in the process to minimize mistakes.
Examples:
If the text contains "x->y()", I may know for sure it's C++ (?)
If the text contains "public static void main", I may know for sure it's Java (?)
If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)
My question:
Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?
What are the unique code "tokens" with which I could certainly differentiate one language from another?
I'm writing my code in Python but I believe the question to be language agnostic.
Thanks

Vim has a autodetect filetype feature. If you download vim sourcecode you will find a /vim/runtime/filetype.vim file.
For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.

build a generic tokenizer and then use a Bayesian filter on them. Use the existing "user checks a box" system to train it.

Here is a simple way to do it. Just run the parser on every language. Whatever language gets the farthest without encountering any errors (or has the fewest errors) wins.
This technique has the following advantages:
You already have most of the code necessary to do this.
The analysis can be done in parallel on multi-core machines.
Most languages can be eliminated very quickly.
This technique is very robust. Languages that might appear very similar when using a fuzzy analysis (baysian for example), would likely have many errors when the actual parser is run.
If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.

I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:
use of features like the C pre-processor can effectively mask the underlyuing language altogether
looking for keywords is not sufficient as the keywords can be used in other languages as identifiers
looking for actual language constructs requires you to parse the code, but to do that you need to know the language
what do you do about malformed code?
Those seem enough problems to solve to be going on with.

One program I know which even can distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.
In general you can look for distinctive patterns:
Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#
Keywords would be another one as probably no two languages have the same set of keywords
Casing rules for identifiers, assuming the piece of code was writting conforming to best practices. Probably a very weak rule
Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP you might just use a long list of standard library functions.
You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.
The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.
But since you want to perform analysis anyway you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take that as indicator which language it would be (as suggested by OregonGhost as well).

Some thoughts:
$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).
public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.
You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.
If your application is analyzing the source code anyway, you code try to throw your analysis code at it for every language and if it fails really bad, it was the wrong language :)

My approach would be:
Create a list of strings or regexes (with and without case sensitivity), where each element has assigned a list of languages that the element is an indicator for:
class => C++, C#, Java
interface => C#, Java
implements => Java
[attribute] => C#
procedure => Pascal, Modula
create table / insert / ... => SQL
etc. Then parse the file line-by-line, match each element of the list, and count the hits.
The language with the most hits wins ;)

How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way when a code snippet is entered into your app which cannot be 100% identified you can have it show the closest matches which the user can pick from - this can then be fed into your database.

Here's an idea for you. For each of your N languages, find some files in the language, something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.
Now, take the file in question and append to each of he langX.txt files, producing langXapp.txt, and corresponding gzipped langXapp.txt.gz. For each X, find the difference between the size of langXapp.gz and langX.gz. The smallest difference will correspond to the language of your file.
Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side you don't need to know anything about the language, it's completely automatic. And it can detect natural languages and tell between French or Chinese as well. Just in case you need it :) But the main reason, I just think it's interesting thing to try :)

The most bulletproof but also most work intensive way is to write a parser for each language and just run them in sequence to see which one would accept the code. This won't work well if code has syntax errors though and you most probably would have to deal with code like that, people do make mistakes. One of the fast ways to implement this is to get common compilers for every language you support and just run them and check how many errors they produce.
Heuristics works up to a certain point and the more languages you will support the less help you would get from them. But for first few versions it's a good start, mostly because it's fast to implement and works good enough in most cases. You could check for specific keywords, function/class names in API that is used often, some language constructions etc. Best way is to check how many of these specific stuff a file have for each possible language, this will help with some syntax errors, user defined functions with names like this() in languages that doesn't have such keywords, stuff written in comments and string literals.
Anyhow you most likely would fail sometimes so some mechanism for user to override language choice is still necessary.

I think you never should rely on one single feature, since the absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.
Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.
Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.
I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate a key tokens to "other" identifiers ratio. Give each token a weight, since some might be a key indicator to disambiguate between dialects or versions.
Note that this is also a convenient way to allow users to plugin "known" keywords to increase the detection ratio, by e.g. providing identifiers of runtime library routines or types.

Very interesting question, I don't know if it is possible to be able to distinguish languages by code snippets, but here are some ideas:
One simple way is to watch out for single-quotes: In some languages, it is used as character wrapper, whereas in the others it can contain a whole string
A unary asterisk or a unary ampersand operator is a certain indication that it's either of C/C++/C#.
Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)
The class initialization with a function could be a nice hint for Java.
Functions that do not belong to a class eliminates java (there is no max(), for example)
Naming of basic types (bool vs boolean)
Which reminds me: C++ can look very differently across projects (#define boolean int) So you can never guarantee, that you found the correct language.
If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl
Indentation is a good hint for Python
You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax
Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.

There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true to every language since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance "->" is used in many languages (at least C, C++ and Perl).
I would go for something like this:
Create a list of features for each language, these could be operators, commenting style (since most use some sort of easily detectable character or character combination).
For instance:
Some languages have lines that start with the character "#", these include C, C++ and Perl. Do others than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.
Also, if you find the pattern := this would narrow it down to some likely languages.
Etc.
I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.
The problem with keywords is someone might use it as a normal variable or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than C if everything else is equal), but you can't rely on them.
After the analysis I would offer the most likely language as the choice for the user with the rest ordered which would also be selectable. So the user would accept your guess by simply clicking a button, or he can switch it easily.

In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)

Related

Good practice to parse data in a custom format

I'm writing a program that takes in input a straight play in a custom format and then performs some analysis on it (like number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff.
The first step in that process is writing a parser for that format. It goes :
####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
It's quite a simple format. I read extensively about basic parser stuff like CFG, so I am now ready to get some work done.
I have written my grammar in EBNF and started playing with flex/bison but it raises some questions :
Is flex/bison too much for such a simple parser ? Should I just write it myself as described here : Is there an alternative for flex/bison that is usable on 8-bit embedded systems? ?
What is good practice regarding the respective tasks of the tokenizer and the parser itself ? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token ? Should "####" be a token too ? Should I create types that carry semantic information so I can directly identify for example a character ? Or should I just process it with flex the simplest way then let the grammar defined in bison decide what is what ?
With flex/bison, does it makes sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool ?
This got me really confused. I am looking for an elegant, perhaps simple solution. Any guideline ?
By the way, about the programing language, I don't care much. For now I am using C because of flex/bison but feel free to advise me on anything more practical as long as it is a widely used language.
It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.
However, a rough guess might be that you intend to produce a nested parse, where ##{i} indicates the nesting level (inversely), with i≥1, since a single # is not structural. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:
#play {
#act {
#scene {
#location: Elsinore. A platform before the castle.
#direction: FRANCISCO at his post. Enter to him BERNARDO
BERNARDO: Who's there?
FRANCISCO: Nay, answer me: stand, and unfold yourself.
BERNARDO: Long live the king!
FRANCISCO: Bernardo?
or even something XML-like. But that would be a different language :)
The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.
You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)
If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.
Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:
Is flex/bison too much for such a simple parser ? Should I just write it myself...
The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.
With flex/bison, does it makes sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?
Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.
Rescanning the input file is very unlikely to be either elegant or effective.

Partial parsing with flex/antlr

I encountered a problem while doing my student research project. I'm an electrical engineering student, but my project has somewhat to do with theoretical computer science: I need to parse a lot of pascal sourcecode-files for typedefinitions and constants and visualize all occurrences. The typedefinitions are spread recursively over various files, i.e. there is type a = byte in file x, in file y, there is a record (struct) b, that contains type a and then there is even a type c in file z that is an array of type b.
My idea so far was to learn about compiler construction, since the compiler has to resolve all typedefinitions and break them down to the elemental types.
So, I've read about compiler construction in two books (one of which is even written by the pascal inventor), but I'm lacking so many basics of theoretical computer science that it took me one week alone to work my way halfway through. What I've learned so far is that for achieving my goal, lexer and parser should be sufficient. Since this software is only a really smart part of the whole project, I can't spend so much time with it, so I started experimenting with flex and later with antlr.
My hope was, that parsing for typedefinitions only was such an easy task, that I could manage to do it with only using a scanner and let it do some parser's work: The pascal-files consist of 5 main-parts, each one being optional: A header with comments, a const-section, a type-section, a var-section and (in least cases) a code-section. Each section has a start-identifier but no clear end-identifier. So I started searching for the start of the type- and const-section (TYPE, CONST), discarding everything else. In flex, this is fairly easy, because it allows "start conditions". They can be used as various states like "INITIAL", "TYPE-SECTION", "CONST-SECTION" and "COMMENT" with different rules for each state. I wanted to get back a string from the scanner with following syntax " = ". There was one thing that made this task difficult: Some type contain comments like in this example: AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;. The scanner can not extract such type-definition with a regular expression.
My next step was to do it properly with scanner AND parser, so I searched for a parsergenerator and found antlr. Since I write the tool in C# anyway, I decided to use its scannergenerator, too, so that I do not have to communicate between different programs. Now I encountered following Problem: AFAIK, antlr does not support "start conditions" as flex do. That means, I have to scan the whole file (okay, comments still get discarded) and get a lot of unneccessary (and wrong) tokens. Because I don't use rules for the whole pascal grammar, the scanner would identify most keywords of the pascal syntax as user-identifiers and the parser would nag about all those series of tokens, that do not fit to type- and constant-defintions
Now, finally my question(s): Can anyone of you tell me, which approach leads anywhere for my project? Is there a possibility to scan only parts of the source-files with antlr? Or do I have to connect flex with antlr for that purpose? Can I tell antlr's parser to ignore every token that is not in the const- or type-section? Are those tools too powerful for my task and should I write own routines instead?
You'd be better off to find a compiler for Pascal, and simply modify to report the information you want. Presumably there is such a compiler for your Pascal, and often the source code for such compilers is available.
Otherwise you essentially need to build a parser. Building lexer, and then hacking around with the resulting lexemes, is essentially building a bad parser by ad hoc methods. ANTLR is a good way to go; you can define the lexemes (including means to pick up and ignore comments) pretty easily, especially for older dialects of Pascal. You'll need good BNF rules for the type information that you want, and translate those rules to the parser generator. What you can do to minimize work, is to cheat on rules for the parts of the language you don't care about. For instance, you could write an accurate subgrammar for assignment statements. Since you don't care about them, you can write a sloppy subgrammar that treats assignment statements as anything that begins with an identifier, is followed by arbitrary other tokens, and ends with semicolon. This kind of a grammar is called an "island grammar"; it is only accurate where it needs to be accurate.
I don't know about the recursive bit. Is there a reason you can't just process each file separately? The answer may depend on what information you want to know about each type declaration, and if you go deep enough, you may need a symbol table as well as an island parser. Parser generators offer you no help for this.
First, there can be type and const blocks within other blocks (procedures, in later Delphi versions also classes).
Moreover, I'm not entirely sure that you can actually simply scan for a const token, and then start parsing. Const is also used for other purposes in most common (Borland) Pascal dialects. Some keywords can be reused in a different context, and if you don't parse the global blockstructure, and only look for const and type in specific places you will erroneously start parsing there.
A base problem of course is the comments. Scanners cut out comments as early as possible, and don't regard them further. You probably have to setup the scanner so that comments are attached to the adjacent tokens as field (associate with token before or save them up till a certain token follows).
As far antlr vs flex, no clue. The only parsergenerator I have some minor experience in parsing Pascal with is Coco/R (a parsergenerator popular by Wirthians), but in general I (and many pascalians) prefer handcoded.

Profanity filter import

I am looking to write a basic profanity filter in a Rails based application. This will use a simply search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are submitting the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make the slow your filter will be.
Understand the scunthrope and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ascii art and unicode to replace characters (/ = v - those are slashes). There are a lot of unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

Determine Cobol coding style

I'm developing an application that parses Cobol programs. In these programs some respect the traditional coding style (programm text from column 8 to 72), and some are newer and don't follow this style.
In my application I need to determine the coding style in order to know if I should parse content after column 72.
I've been able to determine if the program start at column 1 or 8, but prog that start at column 1 can also follow the rule of comments after column 72.
So I'm trying to find rules that will allow me to determine if texts after column 72 are comments or valid code.
I've find some but it's hard to tell if it will work everytime :
dot after column 72, determine the end of sentence but I fear that dot can be in comments too
find the close character of a statement after column 72 : " ' ) }
look for char at columns 71 - 72 - 73, if there is not space then find the whole word, and check if it's a key word or a var. Problem, it can be a var from a COPY or a replacement etc...
I'd like to know what do you think of these rules and if you have any ideas to help me determine the coding style of a Cobol program.
I don't need an API or something just solid rules that I will be able to rely on.
I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.
So.... which compiler(s)?
And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.
There is no absolutely reliable way to determine if a COBOL program
is in fixed or free format based only on the source code. Heck it is sometimes difficult to identify
the programming language based only on source code. Check out
this classic polyglot - it is valid under 8 different language compilers. That
said, you could try a few heuristics that might yield
the correct answer more often than not.
Compiler directives imbedded in source code
Watch for certain compiler directives that determine code format.
Unfortunately, every compiler vendor uses their own flavour of directive.
For example, Microfocus COBOL uses the
SOURCEFORMAT directive. This directive will appear near the top of the program so a short pre-scan
could be used to find it. On the other hand, OpenCobol uses >>SOURCE FORMAT IS FREE and
>>SOURCE FORMAT IS FIXED to toggle between free and fixed format, different parts of the same program
could be formatted differently!
The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
Compiler switches
Source code format can be also be specified using a compiler switch. In this case, there are no concrete
clues to go on. However, you can be reasonably sure that the entire source program will be either
fixed or free. All you can do here is guess. Unless the programmer is out to "mess with
your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION, starting before column 8.
Every COBOL program will begin with these keywords so you can use them as the anchor point for determining code format in the
absence of imbedded compiler directives.
Warning - this is far from fool proof, but might be a good start.
There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.
But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.
You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.
EDIT:
Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:
Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
Natural languages are WAY harder to parse than something like COBOL.
The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.
All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.
If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.
Most COBOL compilers will allow you to generate and analyze the post text manipulation phase.
The text preprocessor output can be seen (using OpenCOBOL for the example)
cobc -E program.cob
The text manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string literal concatenations, comment line removal, among other things) to the actual free format that the compiler lexical analyzer needs. A lot of the OpenCOBOL toolkits (Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post processed source code files.

How to obfuscate C++ variables and functions

I'm trying to do some algorithm comparison for plagiarism. I've found many TEXT comparison for plagiarism.
But in an algorithm it's very different. Let's say that some algorithm uses an huge number of variables, functions and user defined structures. If some guy copy the source code from someone, he'll at least, change the variables and functions names. With an simple text comparison algorithm this difference in functions and variables letters will count as an "difference" making the algorithm gives an "false" for plagiarism.
What I want to do is "generalize" (I don't know if that's the right word) all the variables, functions and user-defined structures names in an C++ source code. So the varibles will be named like "a", "b", the same for functions "... fa(...)", "... fb(...)".
I have the c++ source algorithms in strings variables in PHP to be compared.
I know that many other things should be analysed for an accurate source code comparison, but that will be enough to me.
It's an interesting question. Depending on how complex the algorithm, however, it might be that variable names are what gives the plagiarism away. How many ways can you really code up a tree traversal for example?
I think there was a paper a few years ago on identifying coders through their style - looking at all the little things like whitespace, where {}s are placed, etc. Who knows but maybe that is the way to go, look for a negative match to the student's previous style rather than positive match to the known sources. Saying that, students aren't likely to have developed a very personal coding style at an early stage of learning.
One thought - what language are the examples written in? Can it be compiled? If you compile C and then do a binary comparison on the executables, then will identical programs with different local variable names have the exact same binary? (Global vars and functions wouldn't, though).
I've used MOSS in the past: http://theory.stanford.edu/~aiken/moss/ to detect plagiarized code. Since it works on a semantic level, it will detect the situations you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way in detecting code that has been modified through simple search-and-replace of variable and/or function names.
Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf
If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html

Resources