Command line argument / program option parsing: styles and specifications?

I am curious whether there is any extensive overview, preferably a specification or technical report, of the GNU style and other commonly used styles for parsing command line arguments.
As far as I know, there are many catches, and it's not completely trivial to write a parsing library that is as compliant as, for example, C++'s boost::program_options, Python's argparse, GNU getopt, and others.
On the other hand, there might be libraries that are too liberal in accepting certain options or too restrictive. So, if one wants to aim for good compatibility/conformance with a de-facto standard (if such exists), is there a better way than simply reading a number of mature libraries' source code and/or test cases?

Posix provides guidelines for the syntax of utilities in Chapter 12 of XBD (the Base Definitions). It's certainly worth a read. As is noted there, backwards compatibility has meant that many standardized utilities do not conform to these guidelines, but nonetheless the standard recommends
... that all future utilities and applications use these guidelines to enhance user portability. The fact that some historical utilities could not be changed (to avoid breaking existing applications) should not deter this future goal.
You can also read the rationale for the syntax guidelines.
Posix provides a basic syntax but it's insufficient for utilities with a large number of arguments, and single-letter options are somewhat lacking in self-documentation. Some utilities -- test, find and tcpdump spring to mind -- essentially implement domain specific languages. Others -- ls and ps, for example -- have a bewildering pantheon of invocation options. To say nothing of compilers...
Over the years, a number of possible extension methods have been considered, and probably all of them are still in use in at least one common (possibly even standard) utility. Posix recommends the use of -W as an extension mechanism, but there are few uses of that. X Windows and TCL/Tk popularized the use of spelled-out multicharacter options, but those utilities expect long option names to still start with a single dash, which renders it impossible to condense non-argument options [Note 1]. Other utilities -- dd, make and awk, to name a few -- special-case arguments which have the form {id}={val} with no hyphens at all. The GNU approach of using a double-hyphen seems to have largely won, partly for this reason, but GNU-style option reordering is not universally appreciated.
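To make those conventions concrete, here is a minimal sketch using Python's standard getopt module in its GNU mode (the option names are made up for illustration):
import getopt

# GNU-style parsing: short options may coalesce ("-vf out.txt"), long options
# accept "--name value" or "--name=value", and gnu_getopt reorders options
# that appear after positional arguments.
argv = ["-vf", "out.txt", "input.txt", "--verbose", "--output=other.txt"]
opts, positional = getopt.gnu_getopt(argv, "vf:", ["verbose", "output="])
print(opts)        # [('-v', ''), ('-f', 'out.txt'), ('--verbose', ''), ('--output', 'other.txt')]
print(positional)  # ['input.txt']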
A brief discussion of GNU style is found in the GNU style guide (see also the list of long options), and a slightly less brief discussion is in Eric Raymond's The Art of Unix Programming [Note 2].
Google code takes command-line options to a new level; the internal library has now been open-sourced as gflags so I suppose it is now not breaking confidentiality to observe how much of Google's server management tooling is done through command-line options. Google flags are scattered indiscriminately throughout the code, so that library functions can define their own options without the calling program ever being aware of them, making it possible to tailor the behaviour of key libraries independently of the application. (It's also possible to modify the value of a gflag on the fly at runtime, another interesting tool for service management.) From a syntactic viewpoint, gflags allows both single- and double-hyphen long option presentation, indiscriminately, and it doesn't allow coalesced single-character-option calls. [Note 3]
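The Python incarnation of that library is available open source (python-gflags, nowadays absl.flags); here is a minimal sketch of the idiom described above. The flag name and values are invented, and in real code the DEFINE_* call would sit inside the library module that consumes the flag, not in the application:
from absl import app, flags

FLAGS = flags.FLAGS

# Imagine this definition buried inside some deeply nested library module;
# the application that links it never needs to mention the flag.
flags.DEFINE_integer("rpc_timeout_ms", 500, "Deadline for backend RPCs.")

def main(argv):
    # As noted above, gflags-style parsing accepts both --rpc_timeout_ms=900
    # and the single-hyphen spelling -rpc_timeout_ms=900.
    print("using timeout:", FLAGS.rpc_timeout_ms)
    FLAGS.rpc_timeout_ms = 900     # flags can also be reassigned at run time

if __name__ == "__main__":
    app.run(main)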
It's worth highlighting the observation in The Unix Programming Environment (Kernighan & Pike) that because the shell "must satisfy both the interactive and programming aspects of command execution, it is a strange language, shaped as much by history as by design." The requirements of these two aspects -- the desire for a concise interactive language and a precise programming language -- are not always compatible.
Syntax flexibility, while handy for the interactive user, can be disastrous for the script author. As an example, last night I typed -env=... instead of --env=... which resulted in my passing nv=... to the -e option rather than passing ... to the --env option, which I didn't notice until someone asked me why I was passing that odd string as an EOF indicator. On the other hand, my pet bugbear -- the fact that some prefer --long-option and others prefer --long_option and sometimes you find both styles in the same program (I'm looking at you, gcc) -- is equally annoying as an interactive user and as a scripter.
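That -env mishap is easy to reproduce with Python's getopt module ("prod" below is just a placeholder value, since the real one isn't important):
import getopt

# With a short option "-e" that takes an argument and a long option "--env",
# the single-hyphen spelling is parsed as "-e" with the argument "nv=prod".
opts, _ = getopt.getopt(["-env=prod"], "e:", ["env="])
print(opts)   # [('-e', 'nv=prod')]
opts, _ = getopt.getopt(["--env=prod"], "e:", ["env="])
print(opts)   # [('--env', 'prod')]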
Sadly, I don't know of any resource which would serve as an answer to this question, and I'm not sure that the above serves the need either. But perhaps we can improve it over time.
Notes:
1. Obviously a bad idea, since it would make impossible the pastime of constructing useful netstat invocations whose argument is a readable word.
2. The book and its author are commonly known as TAOUP and ESR, respectively.
3. It took me a while to get used to this, and very little time to revert to my old habits. So you can see where my biases lie.

Related

What makes libadalang special?

I have been reading about libadalang and I am very impressed by it. However, I was wondering whether this technique has already been used elsewhere; that is, whether any other language has a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual. (The other modules there can be useful as well; a quick sketch follows this list.)
Javascript: The ongoing ESTree effort is attempting to standardize parsing services over different Javascript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn").
I'm sure there are lots more; those ones just came off the top of my head.
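As a quick illustration of the Python entry above (the sketch promised in that list item), here is the ast module parsing a made-up fragment and reporting the functions it defines:
import ast

source = """
def greet(name):
    return "hello, " + name
"""
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print(node.name, [arg.arg for arg in node.args.args], "line", node.lineno)
# prints: greet ['name'] line 2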
For a practical and theoretical grounding, you should definitely (re)visit the classical textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of metacircular interpretation -- that is, treating a computer program as a formal data structure which can be interpreted (or otherwise analysed) programmatically.
You can see "libadalang" as ASIS Mark II. AdaCore seems to be attempting to rethink ASIS in a way that will support both what ASIS already can do, and more lightweight operations, where you don't require the source to compile, to provide an analysis of it.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

View code generated by IBM's Enterprise COBOL compiler

I have recently started doing some work with COBOL, where I have only ever done work in z/OS Assembler on a Mainframe before.
I know that COBOL will be translated into Mainframe machine-code, but I am wondering if it is possible to see the generated code?
I want to use this to better understand the inner workings of COBOL.
For example, if I was to compile a COBOL program, I would like to see the assembly that results from the compile. Is something like this possible?
Relenting, only because of this: "I want to use this to better understand the inner workings of COBOL".
The simple answer is that there is, for Enterprise COBOL on z/OS, a compiler option, LIST. LIST will provide what is known as the "pseudo assembler" output in your compile listing (and some other useful stuff for understanding the executable program). Another compiler option, OFFSET, shows the displacement from the start of the program of the code generated for each COBOL verb. LIST (which inherently has the offset already) and OFFSET are mutually exclusive. So you need to specify LIST and NOOFFSET.
Compiler options can be specified on the PARM of the EXEC PGM= for the compiler. Since the PARM is limited to 100 characters, compiler options can also be specified in a data set, with a DDName of SYSOPTF (whose use is, in turn, enabled by another compiler option).
A third way to specify compiler options is to include them in the program source, using the PROCESS or (more common, since it is shorter) CBL statement -- for example, CBL LIST,NOOFFSET.
It is likely that you have a "panel" to compile your programs. This may have a field allowing options to be specified.
However, be aware of a couple of things: it is possible, when installing the compiler, to "nail in" compiler options (which means they can't be changed by the application programmer); it is possible, when installing the compiler, to prevent the use of PROCESS/CBL statements.
The reason for the above is standardisation. There are compiler options which affect code generation, and using different code generation options within the same system can cause unwanted effects. Even across systems, different code generation options may not be desirable if programmers are prone to expect the "normal" options.
It is unlikely that listing-only options will be "nailed", but if you are prevented from specifying options, then you may need to make a special request. This is not common, but you may be unlucky. Not my fault if it doesn't work for you.
These compiler options, and how you can specify them, are documented in the Enterprise COBOL Programming Guide for your specific release. There you will also find the documentation of the pseudo-assembler (be aware that it appears in the document as "pseudo-assembler", "pseudoassembler" and "pseudo assembler", for no good reason).
When you see the pseudo-assembler, you will see that it is not in the same format as an Assembler statement (I've never discovered why, but as far as I know it has been that way for more than 40 years). The line with the pseudo-assembler will also contain the machine-code in the format you are already familiar with from the output of the Assembler.
Don't expect to see a compiled COBOL program looking like an Assembler program that you would write. Enterprise COBOL adheres to a language Standard (1985) with IBM Extensions. The answer to "why does it do it like that?" will be "because", except for optimisations (see later).
What you see will depend heavily on the version of your compiler, because in the summer of 2013, IBM introduced V5, with entirely new code generation and optimisation. Up to V4.2, the code generator dated back to "ESA", which meant that over 600 machine instructions introduced since ESA were not available to Enterprise COBOL programs, nor were the extended registers. The same COBOL program compiled with V4.2 and with V6.1 (the latest version at the time of writing) will be markedly different, not only because of the different instructions, but also because the structure of an executable COBOL program was redesigned.
Then there's optimisation. With V4.2, there was one level of possible optimisation, and the optimised code was generally "recognisable". With V5+, there are three levels of optimisation (you get level zero without asking for it) and the optimisations are much more extreme, including, well, extreme stuff. If you have V5+, and want to know a bit more about what is going on, use OPT(0) to get a grip on what is happening, and then note the effects of OPT(1) and OPT(2) (and realise, with the increased compile times, how much work is put into the optimisation).
There's not really a substantial amount of official documentation of the internals. Search-engining will reveal some stuff. IBM's Compiler Cafe (COBOL Cafe Forum) is a good place if you want more knowledge of V5+ internals, as a couple of the developers attend there. For up to V4.2, here may be as good a place as any to ask further specific questions.

Programmatic access to fslex and fsyacc

The fslex and fsyacc tools currently require 2-stage compilation, generating files that are then compiled by fsc. It seems to me that these tools would be much easier to use if the source files were embedded resources, fed to fslex and fsyacc programmatically and the generated code compiled on-the-fly using the CodeDom.
Is this feasible and, if so, what would be required to implement this?
Jon, this is a great question; in fact, one of the design goals I have for fsharp-tools (new lexer- and parser-generator implementations for F#) is for them to be embeddable, specifically to enable scenarios like this.
As of now, I haven't implemented (yet) the functionality which would let you do this easily in fsharplex, but don't let that deter you; I've written fsharplex (and the other tools in fsharp-tools) in a more-or-less purely-functional style, so there shouldn't be any issues with global state or anything like that. It should be relatively straightforward to hack up the compiler code so you can build a regex AST using some combinators, run the compiler to get a compiled DFA, then emit IL for your state machine into a dynamic assembly (which you could then "bake" and execute).
fsharpyacc currently uses an approach where I've put the bulk of the compilation logic into a purely-functional library, Graham; the idea there is that the grammar analysis/manipulation and parser DFA compilation algorithms should be generic, reusable, and easy to test, so anyone else wanting to build language tools with F# will have a common framework on which to build them. Likewise, contributions/improvements to Graham can easily flow back to fsharpyacc. Eventually, I will modify fsharplex to use this same approach, which will allow you to embed the regex compiler in your own code simply by referencing the NuGet package (you'd just need to write the code to generate IL from the DFA).
fsharplex and fsharpyacc use MEF to allow various backends to be plugged in; for now, they're only targeting fslex and fsyacc for compatibility reasons, but I'd like to implement code-based backends (as opposed to the current table-based backends) to get better performance in the future.
Update -- I just re-read your question and noticed you want to embed the *.fsl and *.fsy files themselves and invoke the respective compilers at run-time. You could accomplish this by compiling the tools and referencing the assemblies from your own projects. IIRC, I exposed an entry point in both compilers so they could be called from outside code; the main entry points (e.g., what gets executed when you invoke the tools from a console) simply parse the command-line arguments then pass them into this "external" entry point.
There is one problem with directly embedding the *.fsl and *.fsy files though; if you embed them, then run them through fsharplex and fsharpyacc at run-time, your user-defined actions (e.g., the code executed when a lexer or parser rule is matched) will still be specified as F# source code -- you'd need to decide how you want to compile them into executable code.
It should be feasible to provide a parser combinator-like interface with a backend that uses expression trees (the LISP "eval" of F#) or something similar, for full integration with the language. Or else a TypeProvider. There are many options. If table generation is an expensive computation, it could be cached by providing a Cache, for example a disk cache.
I think nothing except lack of time, dedication and expertise, prevents us from having tools with (non-monadic) parser combinator-like interface, yet efficient compiled implementation.
Sometimes I get back to this pet project of mine, playing with an algebraic approach to optimizing regular expressions (and lexers) specified in source using combinators and then compiled to a state machine. It still lacks a few key pieces for efficiency, but there it is:
https://github.com/toyvo/ocaml-regex-algebraic
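To make the "combinators build a regex AST, which is then compiled or interpreted" idea concrete, here is a toy Python sketch (it is not taken from the F# or OCaml projects above, and a real tool would compile the AST to a DFA rather than interpret it): regexes are built by combinator functions that return a small AST, and matching is done with Brzozowski derivatives.
EMPTY, EPS, CHAR, ALT, SEQ, STAR = "empty", "eps", "char", "alt", "seq", "star"

def char(c):   return (CHAR, c)
def alt(r, s): return (ALT, r, s)
def seq(r, s): return (SEQ, r, s)
def star(r):   return (STAR, r)

def nullable(r):
    # Can r match the empty string?
    tag = r[0]
    if tag in (EPS, STAR):   return True
    if tag in (EMPTY, CHAR): return False
    if tag == ALT: return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])        # SEQ

def deriv(r, c):
    # The derivative of r with respect to character c.
    tag = r[0]
    if tag in (EMPTY, EPS): return (EMPTY,)
    if tag == CHAR: return (EPS,) if r[1] == c else (EMPTY,)
    if tag == ALT:  return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == STAR: return seq(deriv(r[1], c), r)
    first = seq(deriv(r[1], c), r[2])               # SEQ
    return alt(first, deriv(r[2], c)) if nullable(r[1]) else first

def matches(r, s):
    for c in s:
        r = deriv(r, c)
    return nullable(r)

ab_star = star(seq(char("a"), char("b")))           # the language (ab)*
print(matches(ab_star, "abab"))                     # True
print(matches(ab_star, "aba"))                      # False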

Learning incremental compilation design

There are a lot of books and articles about creating compilers which do the whole compilation job in one go. What about the design of incremental compilers/parsers, such as those used by IDEs? I'm familiar with the first class of compilers, but I have never worked with the second.
I tried to read some articles about Eclipse Java Development Tools, but they describe how to use the complete infrastructure (i.e., the APIs) instead of describing the internal design (i.e., how it works internally).
My goal is to implement an incremental compiler for my own programming language. Which books or articles would you recommend?
This paper is worth a look: Building a Flexible Incremental Compiler Back-End.
Quote from Ch. 10 "Conclusions":
This paper has explored the design of the back-end of an incremental compilation system. Rather than building a single fixed incremental compiler, this paper has presented a flexible framework for constructing such systems in accordance with user needs.
I think this is what you are looking for...
Edit:
So you plan to create something that is known as a "cross compiler"?!
I started a new attempt. So far, I can't provide the ultimate reference. If you plan such a big project, I'm sure you are an experienced programmer, so it is possible that you already know these links.
Compilers.net
A list of various compilers, including cross compilers (translators). Unfortunately it has some broken links, but 'Toba' is still working and has a link to its source code. Maybe this can inspire you.
clang: a C language family frontend for LLVM
OK, it's for LLVM, but the source is available in an SVN repository and it is a front end for a compiler (translator). Maybe this can inspire you as well.
I'm going to disagree with conventional wisdom on this one because most conventional wisdom makes unwritten assumptions about your goals, such as complete language designs and the need for extreme efficiency. From your question, I am assuming these goals:
learn about writing your own language
play around with your language until it looks elegant
try to emit code into another language or byte code for actual execution.
You want to build a hacking harness and a recursive descent parser.
Here is what you might want to build for a harness, using just a text based processor.
Change the code fragment (now "AT 0700 SET HALLWAY LIGHTS ON FULL")
Compile the fragment
Change the code file (now "tests.l")
Compile from file
Toggle Lexer output (now ON)
Toggle Emitter output (now ON)
Toggle Run on home hardware (now OFF)
Your command, sire?
You will probably want to write your code in Python or some other scripting language. You are optimizing your speed of play, not execution. A recursive descent parser might look like:
def cmd_at():
    # Handles "AT hhmm ..." -- assuming next_token carries its text in a
    # .text field (the original fragment leaves these helpers implied).
    if next_token.type == cTIME:
        time = next_token.text                  # e.g. "0700"
        num = next_num()                        # number for the generated handler
        emit("events.setAlarm(events.DAILY, converttime(" + time[0:2] + ", "
             + time[2:] + "), func_" + str(num) + ");")
        match_token(cTIME)
        match_token(LOCATION)
        ...
So you need to write:
A little menu for hacking.
Some lexing routines, to return different tokens for numbers, reserved words, and the like (a sketch follows this list).
A bunch of logic for what your language actually does.
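Here is the sketch of the lexing item promised above, in the same quick-hacking spirit; the token names follow the parser fragment earlier, and the keyword/location vocabularies are made up:
import re

KEYWORDS = {"AT", "SET", "ON", "OFF"}
LOCATIONS = {"HALLWAY", "KITCHEN", "GARAGE"}

def tokenize(line):
    # Classify whitespace-separated words into (type, text) pairs.
    tokens = []
    for word in line.split():
        if re.fullmatch(r"\d{4}", word):
            tokens.append(("cTIME", word))      # e.g. "0700"
        elif word in KEYWORDS:
            tokens.append((word, word))         # keywords are their own type
        elif word in LOCATIONS:
            tokens.append(("LOCATION", word))
        else:
            tokens.append(("WORD", word))
    return tokens

print(tokenize("AT 0700 SET HALLWAY LIGHTS ON FULL"))
# [('AT', 'AT'), ('cTIME', '0700'), ('SET', 'SET'), ('LOCATION', 'HALLWAY'),
#  ('WORD', 'LIGHTS'), ('ON', 'ON'), ('WORD', 'FULL')]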
This approach is aimed at speeding up the cycle for hacking together the language. When you have finished this approach, then you reach for BISON, test harnesses, etc.
Making your own language can be a wonderful journey! Expect to learn. Do not expect to get rich.
I see that there is an accepted answer, but I think that some additional material could be usefully included on this page.
I read the Wikipedia article on this topic and it linked to a DDJ article from 1997:
http://www.drdobbs.com/cpp/codestore-and-incremental-c/184410345?pgno=1
The meat of the article is the first page. It explains that the code in the editor is divided into pieces that are "incorporated" into a "CodeStore" (database). The pieces are incorporated via a work queue which contains unincorporated pieces. A piece of code may be parsed and returned to the work queue multiple times, with some failure on each attempt, until it goes through successfully. The database includes dependencies between the pieces so that when the source code is edited the effects on the edited piece and other pieces can be seen and these pieces can be reprocessed.
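Here is a toy Python sketch of that work-queue scheme, just to make the mechanics concrete; the "parsing" step is reduced to checking that referenced pieces already exist, and all names are illustrative rather than how the actual CodeStore was implemented:
from collections import deque

class CodeStore:
    def __init__(self):
        self.incorporated = {}   # piece name -> analysed source
        self.dependents = {}     # piece name -> pieces that reference it

    def incorporate(self, name, source):
        # A piece references another piece by writing "@name".
        refs = [w[1:] for w in source.split() if w.startswith("@")]
        missing = [r for r in refs if r not in self.incorporated]
        if missing:
            return missing                    # failure: unmet dependencies
        for r in refs:
            self.dependents.setdefault(r, set()).add(name)
        self.incorporated[name] = source
        return []

def process(store, pieces):
    queue, stalled = deque(pieces.items()), 0
    while queue and stalled <= len(queue):
        name, source = queue.popleft()
        if store.incorporate(name, source):
            queue.append((name, source))      # not ready yet, retry later
            stalled += 1
        else:
            stalled = 0

def edit(store, pieces, name, new_source):
    # Editing a piece re-queues it together with everything that depends on it.
    pieces[name] = new_source
    affected = {name} | store.dependents.get(name, set())
    process(store, {n: pieces[n] for n in affected})

pieces = {"util": "def helper", "main": "@util call helper"}
store = CodeStore()
process(store, pieces)
edit(store, pieces, "util", "def helper2")    # "main" is reprocessed as well
print(sorted(store.incorporated))             # ['main', 'util']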
I believe other systems approach the problem differently. Java presents different problems than C/C++ but has advantages as well, so Eclipse perhaps has a different design.

Appropriate uses for yacc/byacc/bison and lex/flex

Most of the posts that I read pertaining to these utilities usually suggest using some other method to obtain the same effect. For example, questions mentioning these tools usually have at least one answer containing some of the following:
Use the boost library (insert appropriate boost library here)
Don't create a DSL; use (insert favorite scripting language here)
Antlr is better
Assuming the developer ...
... is comfortable with the C language
... does know at least one scripting language (e.g., Python, Perl, etc.)
... must write some parsing code in almost every project worked on
So my questions are:
What are appropriate situations which are well suited for these utilities?
Are there any (reasonable) situations where there is not a better alternative to a problem than yacc and lex (or derivatives)?
How often in actual parsing problems can one expect to run into any shortcomings in yacc and lex which are better addressed by more recent solutions?
For a developer who is not already familiar with these tools, is it worth it for them to invest time in learning their syntax/idioms? How do these compare with other solutions?
The reasons why lex/yacc and derivatives seem so ubiquitous today are that they have been around for much longer than other tools, that they have far more coverage in the literature and that they traditionally came with Unix operating systems. It has very little to do with how they compare to other lexer and parser generator tools.
No matter which tool you pick, there is always going to be a significant learning curve. So once you have used a given tool a few times and become relatively comfortable in its use, you are unlikely to want to incur the extra effort of learning another tool. That's only natural.
Also, in the early to mid-1970s, when lex and yacc were created, hardware limitations posed a serious challenge to parsing. The table-driven LR parsing method used by Yacc was the most suitable at the time because it could be implemented with a small memory footprint by using a relatively small general program logic and by keeping state in files on tape or disk. Code-driven parsing methods such as LL had a larger minimum memory footprint because the parser program's code itself represents the grammar, and therefore it needs to fit entirely into RAM to execute and it keeps state on the stack in RAM.
When memory became more plentiful a lot more research went into different parsing methods such as LL and PEG and how to build tools using those methods. This means that many of the alternative tools that have been created after the lex/yacc family use different types of grammars. However, switching grammar types also incurs a significant learning curve. Once you are familiar with one type of grammar, for example LR or LALR grammars, you are less likely to want to switch to a tool that uses a different type of grammar, for example LL grammars.
Overall, the lex/yacc family of tools is generally more rudimentary than more recent arrivals which often have sophisticated user interfaces to graphically visualise grammars and grammar conflicts or even resolve conflicts through automatic refactoring.
So, if you have no prior experience with any parser tools, if you have to learn a new tool anyway, then you should probably look at other factors such as graphical visualisation of grammars and conflicts, auto-refactoring, availability of good documentation, languages in which the generated lexers/parsers can be output etc etc. Don't pick any tool simply because "this is what everybody else seems to be using".
Here are some reasons I could think of for using lex/yacc or flex/bison:
the developer is already familiar with lex/yacc or flex/bison
the developer is most familiar and comfortable with LR/LALR grammars
the developer has plenty of books covering lex/yacc but no books covering others
the developer has a prospective job offer coming up and has been told that lex/yacc skills would increase his chances to get hired
the developer could not get buy-in from project members/stakeholders for the use of other tools
the environment has lex/yacc installed and for some reason it is not feasible to install other tools
Whether it's worth learning these tools will depend heavily (almost entirely) on how much parsing code you write, or how interested you are in writing more code of that general sort. I've used them quite a bit, and find them extremely useful.
The tool you use doesn't really make as much difference as many would have you believe. For about 95% of the inputs I've had to deal with, there's little enough difference between one and another that the best choice is simply the one with which I'm most familiar and comfortable.
Of course, lex and yacc produce (and demand that you write your actions in) C (or C++). If you're not comfortable with them, a tool that uses and produces a language you prefer (e.g. Python or Java) will undoubtedly be a much better choice. I, for one, would not advise trying to use a tool like this with a language with which you're unfamiliar or uncomfortable. In particular, if you write code in an action that produces a compiler error, you'll probably get considerably less help from the compiler than usual in tracking down the problem, so you really need to be familiar enough with the language to recognize the problem with only a minimal hint about where the compiler noticed something being wrong.
In a previous project, I needed a way to be able to generate queries on arbitrary data in a way that was easy for a relatively non-technical person to be able to use. The data was CRM-type stuff (so First Name, Last Name, Email Address, etc) but it was meant to work against a number of different databases, all with different schemas.
So I developed a little DSL for specifying the queries (e.g. [FirstName]='Joe' AND [LastName]='Bloggs' would select everybody called "Joe Bloggs"). It had some more complicated options, for example there was the "optedout(medium)" syntax which would select all people who had opted-out of receiving messages on a particular medium (email, sms, etc). There was "ingroup(xyz)" which would select everybody in a particular group, etc.
Basically, it allowed us to specify queries like "ingroup('GroupA') and not ingroup('GroupB')" which would be translated to an SQL query like this:
SELECT
    *
FROM
    Users
WHERE
    Users.UserID IN (SELECT UserID FROM GroupMemberships WHERE GroupID=2) AND
    Users.UserID NOT IN (SELECT UserID FROM GroupMemberships WHERE GroupID=3)
(As you can see, the queries aren't as efficient as possible, but that's what you get with machine generation, I guess).
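To give a flavour of that translation (this is not how the original system worked -- as noted below, it used a parser generator), here is a deliberately naive Python sketch that rewrites ingroup(...) terms into subqueries like the ones shown above; the group-name-to-ID mapping is made up:
import re

GROUP_IDS = {"GroupA": 2, "GroupB": 3}        # made-up lookup table

def translate(dsl):
    def to_subquery(match):
        group_id = GROUP_IDS[match.group(1)]
        return ("Users.UserID IN (SELECT UserID FROM GroupMemberships "
                "WHERE GroupID=%d)" % group_id)
    where = re.sub(r"ingroup\('([^']+)'\)", to_subquery, dsl)
    where = re.sub(r"\bnot\b", "NOT", where)
    where = re.sub(r"\band\b", "AND", where)
    return "SELECT * FROM Users WHERE " + where

print(translate("ingroup('GroupA') and not ingroup('GroupB')"))
# SELECT * FROM Users WHERE Users.UserID IN (SELECT UserID FROM GroupMemberships
# WHERE GroupID=2) AND NOT Users.UserID IN (SELECT UserID FROM GroupMemberships
# WHERE GroupID=3)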
I didn't use flex/bison for it, but I did use a parser generator (the name of which has escaped me at the moment...)
I think it's pretty good advice to eschew the creation of new languages just to support a domain-specific language. It's going to be a better use of your time to take an existing language and extend it with domain functionality.
If you are trying to create a new language for some other reason, perhaps for research into language design, then these tools are a bit outdated. Newer generators such as antlr, or even newer implementation languages like ML, make language design a much easier affair.
If there's a good reason to use these tools, it's probably because of their legacy. You might already have a skeleton of a language you need to enhance, which is already implemented in one of these tools. You might also benefit from the huge volumes of tutorial information written about these old tools, for which there is not so great a corpus written for newer and slicker ways of implementing languages.
We have a whole programming language implemented in my office; we used these tools for that. I think they're meant to be a quick and easy way to write interpreters for things. You could conceivably write almost any sort of text parser using them, but a lot of times it's either A) easier to write it yourself quickly or B) you need more flexibility than they provide.

Resources