COBOL clone detection with ConQAT?

ConQAT's doc claims it can do clone detection on COBOL code, but I can't find any appropriate block in the list of Included blocks.
The only one that could be considered is StatementCloneAnalysis but it would get confused by the line numbers that precede each line:
016300******************************************************************0058

Interesting tool. I took a quick look and it seems to me that a simple fix might be to pre-process COBOL source to overwrite columns 1 through 6 with spaces and trim everything after column 72.
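A minimal sketch of that pre-processing step, assuming fixed-format source with no tab characters. The function names are made up for illustration and are not part of ConQAT:

```python
def normalize_fixed_format(line: str) -> str:
    """Blank the sequence area (columns 1-6) and drop everything past column 72."""
    body = line.rstrip("\r\n")
    # Pad short lines so that slicing by column position is always safe.
    padded = body.ljust(6)
    # Columns 7 to 72 correspond to string indices 6 to 71.
    return (" " * 6 + padded[6:72]).rstrip()


def normalize_source(text: str) -> str:
    """Apply the column clean-up to every line of a source file."""
    return "\n".join(normalize_fixed_format(line) for line in text.splitlines())
```

The cleaned text can then be written to a temporary file and handed to the clone detector in place of the original.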
After poking around for a while I came across the NextToken scanner definition file for COBOL. It looks like it will "happily" pick up tokens from the sequence number area as well as after column 72. The tokenizer looks like it only deals with COBOL source code after it has gone through the library processing phase of a compile (i.e. after compiler directives such as COPY/REPLACE have been processed). COPY/REPLACE were specified as keywords but I really don't see how this tokenizer would deal with them properly - particularly where pseudo text is involved.
If working with an IBM COBOL compiler, you can specify the MDECK option on a compile to generate a suitable source file for analysis. I am not familiar with other vendors, so I cannot comment further on how to generate a post-text-manipulation source deck.
The level of clone detection ConQAT provides for COBOL appears to be very limited relative to other languages (e.g. Java). I suspect you will have to put in a lot of hours to get anything more than trivial clone detection out of it for COBOL programs. However, this could be a very useful project given the heavy use of cut/paste coding in typical COBOL programs (COBOL programmers often make a joke out of it: only one COBOL program has ever been written, the rest are just modified copies of it). I wish you well.

Given that ConQAT deals with COBOL badly, you might look at our CloneDR tool.
It has a version that works explicitly with IBM Enterprise COBOL, using a precise parser, and it handles all that sequence number nonsense correctly. (It will even read the COBOL code in its native EBCDIC, meaning a literal string containing an ASCII newline character doesn't break the parser.)
[If your COBOL isn't IBM COBOL, this won't help you, but otherwise you won't "have to put in a lot of hours to get anything".]
We think the AST-based detection technique detects better clones more accurately than ConQAT's token-based detection. The site explains why in detail, and shows sample COBOL clones detected by CloneDR.
Specific to the OP who appears to be working in Japan: as a bonus, CloneDR handles Japanese character sets because it is implemented on top of an underlying tool infrastructure that is Unicode and Shift-JIS enabled. We haven't had a lot of experience with Japanese COBOL so there might be a remaining glitch; see G literals with Japanese characters.

Related

View code generated by IBM's Enterprise COBOL compiler

I have recently started doing some work with COBOL, where I have only ever done work in z/OS Assembler on a Mainframe before.
I know that COBOL will be translated into Mainframe machine-code, but I am wondering if it is possible to see the generated code?
I want to use this to better understand the under workings of COBOL.
For example, if I was to compile a COBOL program, I would like to see the assembly that results from the compile. Is something like this possible?
Relenting, only because of this: "I want to use this to better understand the under workings of Cobol".
The simple answer is that there is, for Enterprise COBOL on z/OS, a compiler option, LIST. LIST will provide what is known as the "pseudo assembler" output in your compile listing (and some other useful stuff for understanding the executable program). Another compiler option, OFFSET, shows the displacement from the start of the program of the code generated for each COBOL verb. LIST (which inherently has the offset already) and OFFSET are mutually exclusive. So you need to specify LIST and NOOFFSET.
Compiler options can be specified on the PARM of the EXEC PGM= for the compiler. Since the PARM is limited to 100 characters, compiler options can also be specified in a data set with a DDName of SYSOPTF (the use of which is itself requested with a compiler option, OPTFILE).
A third way to specify compiler options is to include them in the program source, using the PROCESS or (more common, since it is shorter) CBL statement.
It is likely that you have a "panel" to compile your programs. This may have a field allowing options to be specified.
However, be aware of a couple of things: it is possible, when installing the compiler, to "nail in" compiler options (which means they can't be changed by the application programmer); it is possible, when installing the compiler, to prevent the use of PROCESS/CBL statements.
The reason for the above is standardisation. There are compiler options which affect code generation, and using different code generation options within the same system can cause unwanted effects. Even across systems, different code generation options may not be desirable if programmers are prone to expect the "normal" options.
It is unlikely that listing-only options will be "nailed", but if you are prevented from specifying options, then you may need to make a special request. This is not common, but you may be unlucky. Not my fault if it doesn't work for you.
The compiler options, and how you can specify them, are documented in the Enterprise COBOL Programming Guide for your specific release. There you will also find the documentation of the pseudo-assembler (be aware that it appears in the document as "pseudo-assembler", "pseudoassembler" and "pseudo assembler", for no good reason).
When you see the pseudo-assembler, you will see that it is not in the same format as an Assembler statement (I've never discovered why, but as far as I know it has been that way for more than 40 years). The line with the pseudo-assembler will also contain the machine-code in the format you are already familiar with from the output of the Assembler.
Don't expect to see a compiled COBOL program looking like an Assembler program that you would write. Enterprise COBOL adheres to a language Standard (1985) with IBM Extensions. The answer to "why does it do it like that" will be "because", except for optimisations (see later).
What you see will depend heavily on the version of your compiler, because in the summer of 2013 IBM introduced V5, with entirely new code generation and optimisation. Up to V4.2, the code generator dated back to "ESA", which meant that the more than 600 machine instructions introduced since ESA, along with the extended registers, were not available to Enterprise COBOL programs. The same COBOL program compiled with V4.2 and with V6.1 (latest version at time of writing) will be markedly different, not only because of the different instructions, but also because the structure of an executable COBOL program was redesigned.
Then there's optimisation. With V4.2, there was one level of possible optimisation, and the optimised code was generally "recognisable". With V5+, there are three levels of optimisation (you get level zero without asking for it) and the optimisations are much more extreme, including, well, extreme stuff. If you have V5+, and want to know a bit more about what is going on, use OPT(0) to get a grip on what is happening, and then note the effects of OPT(1) and OPT(2) (and realise, with the increased compile times, how much work is put into the optimisation).
There's not really a substantial amount of official documentation of the internals. Search-engineing will reveal some stuff. IBM's Compiler Cafe (the COBOL Cafe Forum) is a good place if you want more knowledge of V5+ internals, as a couple of the developers attend there. For up to V4.2, here may be as good a place as any to ask further specific questions.

Learning incremental compilation design

There are a lot of books and articles about creating compilers which do the whole compilation job in one go. But what about the design of incremental compilers/parsers, which are used by IDEs? I'm familiar with the first class of compilers, but I have never worked with the second.
I tried to read some articles about Eclipse Java Development Tools, but they describe how to use the complete infrastructure (i.e. the APIs) instead of describing the internal design (i.e. how it works internally).
My goal is to implement an incremental compiler for my own programming language. Which books or articles would you recommend?
This book is worth a look: Building a Flexible Incremental Compiler Back-End.
Quote from Ch. 10 "Conclusions":
This paper has explored the design of
the back-end of an incremental
compilation system. Rather than
building a single fixed incremental
compiler, this paper has presented a
flexible framework for constructing such
systems in accordance with user needs.
I think this is what you are looking for...
Edit:
So you plan to create something that is known as a "cross compiler"?!
I started a new attempt. So far, I can't provide the ultimate reference. If you plan such a big project, I'm sure you are an experienced programmer, so it is possible that you already know these links.
Compilers.net
A list of compilers, including cross compilers (translators). Unfortunately some links are broken, but 'Toba' is still working and has a link to its source code. Maybe this can inspire you.
clang: a C language family frontend for LLVM
OK, it's for LLVM, but the source is available in an SVN repository and it seems to be a front end for a compiler (translator). Maybe this can inspire you as well.
I'm going to disagree with conventional wisdom on this one because most conventional wisdom makes unwritten assumptions about your goals, such as complete language designs and the need for extreme efficiency. From your question, I am assuming these goals:
learn about writing your own language
play around with your language until it looks elegant
try to emit code into another language or byte code for actual execution.
You want to build a hacking harness and a recursive descent parser.
Here is what you might want to build for a harness, using just a text based processor.
Change the code fragment (now "AT 0700 SET HALLWAY LIGHTS ON FULL")
Compile the fragment
Change the code file (now "tests.l")
Compile from file
Toggle Lexer output (now ON)
Toggle Emitter output (now ON)
Toggle Run on home hardware (now OFF)
Your command, sire?
You will probably want to write your code in Python or some other scripting language. You are optimizing your speed of play, not execution. A recursive descent parser might look like:
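def cmd_at():
    # Parses "AT <time> SET <location> ..." and emits an alarm registration.
    # next_token, next_num, emit and match_token belong to the harness.
    if next_token.type == cTIME:
        time = next_token.text  # e.g. "0700"; the attribute name is assumed
        num = next_num()
        emit("events.setAlarm(events.DAILY, converttime(" + time[0:2] + ", "
             + time[2:] + "), func_" + str(num) + ");")
        match_token(cTIME)
        match_token(LOCATION)
        ...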
So you need to write:
A little menu for hacking.
Some lexing routines, to return different tokens for numbers, reserved words, and the like.
A bunch of logic for what your language does.
This approach is aimed at speeding up the cycle for hacking together the language. When you have finished this approach, then you reach for BISON, test harnesses, etc.
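As a rough illustration of the lexing routines in that list, here is a small regex-based lexer for the home-automation example. Everything in it is an assumption made for the sketch: the reserved-word set, the token type names (plain strings rather than constants like cTIME), and the rule that a four-digit number is a time.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    type: str
    text: str


# Hypothetical reserved words for the toy language.
RESERVED = {"AT", "SET", "ON", "OFF", "FULL"}

WS_RE = re.compile(r"\s*")
# Order matters: a four-digit time is tried before a general number.
TOKEN_RE = re.compile(r"(?P<TIME>\d{4})\b|(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z_]\w*)")


def lex(source: str) -> Iterator[Token]:
    """Yield tokens for times, numbers, reserved words and identifiers."""
    pos = WS_RE.match(source, 0).end()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "WORD":
            kind = "RESERVED" if text.upper() in RESERVED else "IDENT"
        yield Token(kind, text)
        # Skip whitespace so the next match starts on a token.
        pos = WS_RE.match(source, m.end()).end()
```

Calling `list(lex("AT 0700 SET HALLWAY LIGHTS ON FULL"))` gives the parser a flat token list to walk, and the "Toggle Lexer output" menu item would simply print it.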
Making your own language can be a wonderful journey! Expect to learn. Do not expect to get rich.
I see that there is an accepted answer, but I think that some additional material could be usefully included on this page.
I read the Wikipedia article on this topic and it linked to a DDJ article from 1997:
http://www.drdobbs.com/cpp/codestore-and-incremental-c/184410345?pgno=1
The meat of the article is the first page. It explains that the code in the editor is divided into pieces that are "incorporated" into a "CodeStore" (database). The pieces are incorporated via a work queue which contains unincorporated pieces. A piece of code may be parsed and returned to the work queue multiple times, with some failure on each attempt, until it goes through successfully. The database includes dependencies between the pieces so that when the source code is edited the effects on the edited piece and other pieces can be seen and these pieces can be reprocessed.
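To make the work-queue idea concrete, here is a toy model of it. It is not the design from the article: the class and method names are invented, and "parsing succeeds" is simplified to "every piece this one depends on is already incorporated".

```python
from collections import deque


class CodeStore:
    """Toy model: pieces of code, their dependencies, and a work queue."""

    def __init__(self):
        self.source = {}           # piece name -> source text
        self.deps = {}             # piece name -> names it depends on
        self.incorporated = set()  # pieces processed successfully
        self.queue = deque()       # pieces waiting to be (re)processed

    def submit(self, name, text, deps=()):
        """Add or edit a piece; it and all of its dependents become stale."""
        self.source[name] = text
        self.deps[name] = set(deps)
        for stale in self._with_dependents(name):
            self.incorporated.discard(stale)
            if stale not in self.queue:
                self.queue.append(stale)

    def _with_dependents(self, name):
        """Return the piece plus everything that depends on it, transitively."""
        found, seen, todo = [], set(), [name]
        while todo:
            current = todo.pop()
            if current in seen:
                continue
            seen.add(current)
            found.append(current)
            todo.extend(n for n, d in self.deps.items() if current in d)
        return found

    def process(self):
        """Drain the queue; return the pieces that still cannot be incorporated."""
        failures_in_a_row = 0
        while self.queue and failures_in_a_row < len(self.queue):
            name = self.queue.popleft()
            if self.deps[name] <= self.incorporated:
                self.incorporated.add(name)
                failures_in_a_row = 0
            else:
                self.queue.append(name)  # retry later, as the article describes
                failures_in_a_row += 1
        return list(self.queue)
```

The loop stops once every queued piece has failed in a row, which is how the sketch reports code that cannot be compiled yet without spinning forever.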
I believe other systems approach the problem differently. Java presents different problems than C/C++ but has advantages as well, so Eclipse perhaps has a different design.

What is COBOL used for? [closed]

What is COBOL used for?
COmmon Business Oriented Language, 'invented' by Grace Murray Hopper (read about her; she is one of the pivotal people in the development of computing as we know it today). The general idea was to produce a language that was English-based, as opposed to mathematically based or expressed as such in the code.
Very simply put you would use a construct like
ADD YEARS TO AGE
as opposed to
age = age + years
or
age += years
Appearing in the early 1960s, it was massively adopted for business data processing. A large volume of applications built in COBOL is still running and maintained, and the language is still very much alive and kicking. Around 1997 Gartner reported that 80% of the world's business ran on COBOL, with over 200 billion lines of code in existence and an estimated 5 billion lines of new code annually. So you could do a lot worse than learn COBOL to ensure you have a job for life.
The structure of a COBOL program is summarised in the mnemonic "In Every Damn Program", meaning that there is an:
Identification Division giving information about the program
Environment Division describing the hardware
Data Division (in my day we used CODASYL, now better known and newly re-invented as NoSQL)
Procedure Division 'Here be code'
Because of the legacy from punch cards (yes, I used them as well) you always started the code by indenting 8 spaces, or else some compilers would not recognise it (shades of Python, where whitespace is significant).
It is of course a compiled language.
Where is it used? Governments, the military, and businesses of all sizes, but usually the larger corporates, so I suppose you could say everywhere. I believe the US's social welfare system runs on several million lines of COBOL written in the mid-60s. Experian, a large UK-based credit rating company, uses it throughout their operation, with interfaces to the web. Again in the UK, most of the building societies and banks run their core systems on it.
I could go on, but I won't; go and read about it. And by the way, you can even get Object Oriented COBOL if you want.
Ever use a credit card? Your transaction is probably touching COBOL code on the backend.
The Right Tool for the Right Job
Batch
The most salient point about COBOL is not its verboseness. It is that it was primarily designed, as a language, to do batch processing. Its I/O functionality in that regard is exceptionally efficient.
Even though it predates OOLs by a geological epoch, it is useful when speaking to a modern-day OO programmer to describe batch programming and COBOL from an OO point of view. Describing it like this, though historically incorrect, helps OO programmers conceptually.
To wit, the utterly fallacious, and yet very true:
COBOL has been "optimized" to iterate over large, nay, vast sequential "collections"
(i.e. batches, also known as files). In fact, it is so optimized, that all the OO
functionality has been stripped out, leaving a basic API that opens files, processes
records, and closes files. In more complex versions of the basic algorithm, multiple
files are opened, their records matched to each other and manipulated to produce one
or more output files (batches).
Where COBOL was co-opted for non-batch processes, for instance pseudo-conversational programming (backing up CICS "green" screens - aka BMS), it was least suitable. Not surprisingly, it is this functionality that has been most quickly replaced by GUI apps written in OOLs.
The Editor
The ISPF Editor on IBM mainframes has been optimized to handle the kind of coding COBOL requires. The basic unit of manipulation in the editor is the line. By default, vertical alignment is static and not flowed or shifted based on context; typing to the end of a line results in a keyboard lock. Because of this "conservation of vertical alignment," it is relatively easy to duplicate lines or blocks of lines, and align commands. With COBOL, vertical alignment, as a legibility issue, is of greater importance than in OO languages.
It is difficult to describe in a post, but having facility in both programming worlds and with both types of editors, I have to say that I would not want to edit COBOL in an IDE style editor, and I would not want to edit Java and C-family languages in an ISPF Editor. (I imagine you can plug-in an ISPF style editor into the various IDEs, but I haven't had the need to go there.)
N.B. OO COBOL has its uses, but not as a new way to re-engineer code that handles batch processing.
From my admittedly limited experience, COBOL is used a lot with IBM mainframe systems. So I believe COBOL is probably used in any situation where I/O is the emphasis (as mentioned above: financial systems, insurance companies, government, etc.), to the extent that a mainframe is needed or preferred, and where the system has been around for a while. I say "been around for a while" because these days I do not hear much of COBOL being a go-to language.
COBOL is used primarily for financial processing. Any time banks, brokerage houses, credit card vendors, et al. are doing business, there will be COBOL in the mix.
The ANSI standard for COBOL and some compilers have evolved considerably in the last 15 years and include libraries for creating and operating web frame contents and interactive sites, for data communications, for running on small processors and devices used in the hand. Well known versions are prefixed with characters like MF, CIS, RF, RM or the names of computer mainframe manufacturers old and new for versions in use primarily in Data Processing computer installations.
Today COBOL is used only because it used to be popular back in the day, and many old large businesses don't want to re-write their code into a modern language. (mainly cost + time)
A line of fixed-format COBOL code ends at column 72. Why, you ask? Because of punch cards: a card had 80 columns, and columns 73 to 80 were reserved for sequence numbers. Traditional fixed-format source still carries that limit...
COBOL is an evil, ancient language that has little use any more, unless you are extending OLD programs...
COBOL is used for Business applications. Fortran is for scientific apps. C and C++ for hardware and firmware. Java for the web.
Then you may ask, why COBOL? Well, COBOL is about ten times easier to program for business than any of the other languages.
For example, to move a numeric to a report field and format it as a currency:
MOVE VAL-A TO REPORT-FIELD-A.
There are no getter or setter methods needed. No need to program two methods for each MOVE statement.
And all the converting to string characters and formatting to $99,999.99 is automatic. Try that in any of the other languages.
The dirty little secret is that COBOL is really a glorified ASSEMBLER MACRO language. There is even a compiler option to print the assembler code. That makes it easy to understand and powerful.
COBOL: Easy, quick, accurate, readable and maintainable. Everything a boss could ask for.

Source of parsers for programming languages?

I'm dusting off an old project of mine which calculates a number of simple metrics about large software projects. One of the metrics is the length of files/classes/methods. Currently my code "guesses" where class/method boundaries are based on a very crude algorithm (traverse the file, maintaining a "current depth" and adjusting it whenever you encounter unquoted brackets; when you return to the level a class or method began on, consider it exited). However, there are many problems with this procedure, and a "simple" way of detecting when your depth has changed is not always effective.
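For reference, the crude depth-tracking approach described above might look roughly like this. It is a sketch with an invented function name; it handles quoted brackets and backslash escapes but deliberately ignores comments, character literals and language-specific string syntax, which is exactly where such guessing goes wrong.

```python
def block_spans(source: str, open_ch: str = "{", close_ch: str = "}"):
    """Return (start_line, end_line, depth) for each bracketed block.

    Brackets inside single- or double-quoted strings are ignored.
    depth is 0 for a top-level block, 1 for a block nested inside it, etc.
    """
    spans = []
    stack = []       # line numbers where the still-open blocks began
    quote = None     # the quote character we are currently inside, if any
    escaped = False  # True right after a backslash inside a string
    line = 1
    for ch in source:
        if ch == "\n":
            line += 1
        if quote:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
            continue
        if ch in "\"'":
            quote = ch
        elif ch == open_ch:
            stack.append(line)
        elif ch == close_ch and stack:
            start = stack.pop()
            spans.append((start, line, len(stack)))
    return spans
```

The length of a class or method is then `end_line - start_line + 1` for the matching span, with the caveat that nothing here knows whether a given block is a class, a method or just an `if`.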
To make this give accurate results, I need to use the canonical way (in each language) of detecting function definitions, class definitions and depth changes. This amounts to writing a simple parser to generate parse trees containing at least these elements for every language I want my project to be applicable to.
Obviously parsers have been written for all these languages before, so it seems like I shouldn't have to duplicate that effort (even though writing parsers is fun). Is there some open-source project which collects ready-to-use parser libraries for a bunch of source languages? Or should I just be using ANTLR to make my own from scratch? (Note: I'd be delighted to port the project to another language to make use of a great existing resource, so if you know of one, it doesn't matter what language it's written in.)
If you want language-accurate parsing, especially in the face of language complications such as macros and preprocessor conditionals, you need full language parsers. These are actually quite a lot of work to construct, and most languages don't lend themselves nicely to the various kinds of parser generators around. Nor are most authors of a language parser interested in other languages; they tend to choose some parser generator that isn't obviously a huge roadblock when they start, implement their parser for the specific purpose they intend, and move on.
Consequence: there are very few libraries of language definitions around that are defined using a single formalism or a shared foundation. The ANTLR crowd maintains one of the larger sets IMHO, although as far as I can tell most of those parsers are not quite production-capable. There's always Bison, which has been around long enough that you'd expect a library of language definitions to be collected somewhere, but I've never seen one.
I've spent the last 15 years defining foundation machinery for program analysis and transformation, and building another such library, called the DMS Software Reengineering Toolkit. It has production quality parsers for C, C++, C#, Java, COBOL (IBM Enterprise version), JCL, PHP, Python, etc. Your opinion may of course vary from mine but these are used daily with DMS to carry out mass change tasks on large bodies of code.
I don't know of any others where the set of language definitions is mature and built on a single foundation... it may be that IBM's compilers are such a set, but IBM doesn't offer the machinery or the language definitions.
If all you want to do is compute simple metrics, you might be able to live with just lexers and ad hoc nest-counting (as you've described). Even that's harder than it looks to make it work right in most cases (check out Python's, Perl's and PHP's crazy string syntaxes). When all is said and done, even C is a surprising amount of work just to define an accurate lexer: we have several thousand lines of sophisticated regular expressions to cover all the strange lexemes you find in Microsoft and/or GNU C.
Because DMS has consistently defined, mature parsers for many languages, it follows that DMS has consistently defined, mature lexers for the same languages. We actually build a Source Code Search Engine (SCSE) that provides fast search across large bodies of code in multiple languages; it works by lexing the languages it encounters and indexing those lexemes for fast lookup. The SCSE just so happens to compute the kind of metrics you are discussing, too, as it indexes the code base, pretty much the way you describe, except that it has these language-accurate lexers to use.
You might be interested in gcc-xml if you are parsing C++. Java CUP has grammars for the Java language.

Grammar/own-written parser?

I'm doing some small projects which involve having different syntaxes for something, however sometimes these syntaxes are so easy that using a parser generator might be overkill.
Now, when should I use a hand-made parser, and when should I use a parser generator?
Thanks,
William van Doorn
There is no hard-and-fast answer, other than "use whatever is easiest for the particular situation".
My experience is that parsers tend to get more complicated over their lifetimes, so using a parser generator up front usually pays off. Even if the language doesn't get more complicated, using a generator forces you to create a formal specification of the syntax, which is itself valuable.
The downsides are that other programmers may not know how to use the generator, so it makes it difficult for others to help out, and it makes your project dependent on that generator.
It's worth coding the parser by hand if, and only if, you're super-keen to have it be extremely fast even on a machine of very modest speed. For example, in this article on the history of Turbo Pascal from before it got its name, you can see how and why the prototype impressed the small (then Danish) firm "Borland" enough to hire the prototype's author (Anders Hejlsberg), fully develop the compiler, and launch it as its main product, and I quote...:
with no great expectations I hit the
compile key - AND THEN I WAS
COMPLETELY FLOORED! My test program,
that took minutes to compile and link
using Digital Research’s Pascal MT+,
was compiled and running before I
could blink an eye! That was a great
WOW moment!
Turbo Pascal's amazing compile speed -- coming first and foremost from a carefully hand-coded and highly tuned recursive descent parser coded in assembly language -- allowed it to use a very different strategy from most compilers: no separate compilation pass generating object files and libraries, and then a linker to put them together, rather, Turbo Pascal 1.0 was a single-pass compiler that directly turned source code into a single executable binary.
I remember just the same amazing experience on the tiny personal computers of that era (when a Z80, 64K of RAM, and two floppies was a lot;-) -- Turbo Pascal, with its amazing parser and the IDE and everything else, fit comfortably in memory together with a substantial program in both source and compiled form -- no floppies were needed, which meant many orders of magnitude of difference in program turnaround time.
If Hejlsberg had stuck to what was already the traditional wisdom at the time -- always use parser generators -- Turbo Pascal would probably never have emerged as a commercial product, and definitely not achieved the dominance in the Pascal world it enjoyed for years.
Of course, on a typical PC of today, such extreme parsing speed would not be needed for most compilers. Possible exceptions include compilers that must run seamlessly as part of an "interpreter-like" environment (the simple compilers for languages such as Perl and Python are typically hand-coded, to substantial extents, for that reason -- that was an implementation choice that made them viable in the '90s, although today it's not clear it's still needed), or compilers that run on very limited hardware resources, such as smartphones or low-end netbooks.
In the vast majority of cases in which you'll be writing a compiler, none of these performance considerations probably apply, and you'll be happier with a parser generator.
Your question title suggests that using a grammar is optional. It really isn't - even if I was going to implement a tiny language, I'd sketch out a grammar on a single sheet of paper.
As for when to use parser generators, this is really personal preference. Many people believe in hand-writing recursive descent parsers, rather than using the table-driven approach, for example. The important thing is to be comfortable in understanding the capabilities of the generator.
And don't be thinking that using parser generators is somehow the more professional, or even the easier approach. Bjarne Stroustrup when writing the first C++ compiler intended to use recursive descent, but got talked out of it by some keen colleagues at Bell Labs, much to his eventual chagrin. See section 3.3.2 of The Design and Evolution of C++ for more details.

Resources