How should I parse bundled command-line options with ambiguities?

I'm creating a command line parser and want to support option bundling. However, I'm not sure how to handle ambiguities and conflicts that can arise. Consider the three following cases:
1. -I accepts a string
   "-Iinclude" -> Would be parsed as "-I include"
2. -I accepts a string
   -n accepts an integer
   "-Iincluden10" -> Would be parsed as "-I include -n 10" because the 'cluden10' after the first occurrence of 'n' cannot be parsed as an integer.
3. -I accepts a string
   -n accepts an integer
   -c accepts a string
   "-Iin10clude" -> ??? What now ???
How do I handle the last string? There are multiple ways of parsing it, so do I just throw an error informing the user about the ambiguity or do I choose to parse the string that yields the most, i.e. as "-I i -n 10 -c lude"?
I could not find any detailed conventions online, but personally, I'd flag this as an ambiguity error.

As far as I know, there is no standard on command-line parameter parsing, nor even a cross-platform consensus. So the best we can do is appeal to common sense and the principle of least astonishment.
The Posix standard suggests some guidelines for parsing command-line parameters. They are just guidelines; as the linked section indicates, some standard shell utilities don't conform. And while Gnu utilities are expected to conform to the Posix guidelines, they also typically deviate in some respects, including the use of "long" parameters.
In any event, what Posix says about grouping is:
One or more options without option-arguments, followed by at most one option that takes an option-argument, should be accepted when grouped behind one '-' delimiter.
Note that Posix options are all single character options. Note also that the guideline is clear that only the last option in an option group is permitted to be an option which might accept an argument.
With respect to Gnu-style long options, I don't know of a standard other than the behaviour of the getopt_long function. It implements Posix style for single-character options, including the above-mentioned grouped option syntax; it allows a single-character option that takes an argument either to be immediately followed by the argument, or to be at the end of a (possibly singular) option group with the argument as the following word.
For long options, grouping is not allowed, regardless of whether the option accepts arguments. If the option does accept arguments, two styles are allowed: either the option is immediately followed by an = and then the argument, or the argument is the following word.
In Gnu style, long options cannot be confused with single-character options, because the long options must be specified with two dashes (--).
By contrast, many Tcl/Tk-based utilities (and some other command-line parsers) allow long options with a single -, but do not allow option grouping.
In all of these styles, options are divided into two disjoint sets: those that take arguments, and those that do not.
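As a rough illustration of these conventions, here is a minimal sketch using Python's standard getopt module, which follows the same rules as the C functions. Python is not part of the question, and the option letters and long names are only examples.

```python
import getopt

# Short options: "a" and "b" take no argument; "I" and "n" do (trailing colon).
# A group may end with one option that takes an argument, found in the next word.
opts, rest = getopt.getopt(["-abI", "include", "file.c"], "abI:n:")

# Once an option that takes an argument is seen, the rest of the word is its
# argument, so the third case from the question is not ambiguous in this style.
opts2, _ = getopt.getopt(["-Iin10clude"], "I:n:c:")

# Long options: either "--name=value" or "--name value"; no grouping.
opts3, _ = getopt.getopt(["--include=dir", "--count", "10"], "", ["include=", "count="])

print(opts, rest)
print(opts2)
print(opts3)
```

Note that the option specification itself states which options take arguments; the parser never has to guess from the contents of the word.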
None of these systems are ambiguous, although a random mix of styles, as you seem to be proposing, would be. Even with formal disambiguation rules, ambiguity is dangerous, particularly in console applications where a command line can be irreversible. Furthermore, contextual disambiguation can (even silently) change meaning if the set of available options is extended in the future, which would be a source of hard-to-predict errors in scripts.
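To make the last point concrete, here is a small sketch of a hypothetical "smart" splitter (naive_split is an invented name, not any real library's behaviour) that starts a new option at every known option letter. The same command line silently changes meaning once an option -c is added to the program.

```python
def naive_split(word, known):
    """Hypothetical bundle splitter: every known option letter starts a new option."""
    result = []
    for ch in word[1:]:  # skip the leading '-'
        if ch in known:
            result.append([ch, ""])
        elif result:
            result[-1][1] += ch  # extend the current option's argument
        else:
            raise ValueError(f"unknown option -{ch}")
    return [(name, value) for name, value in result]

# Version 1 of the program only knows -I.
before = naive_split("-Iinclude", known={"I"})

# Version 2 adds -c; the identical command line now parses differently.
after = naive_split("-Iinclude", known={"I", "c"})

print(before)
print(after)
```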
Consequently, I'd recommend sticking to a simple existing practice such as Gnu, and to not try too hard to interpret incorrect command lines which do not conform.

Related

How to declare and reuse a character class in flex lexer?

Normally, when you want to reuse a regular expression, you can declare it in the definitions section of the flex file. It will be enclosed in parentheses by default. E.g.:
num_seq [0-9]+
%%
{num_seq} return INT; // will become ([0-9]+)
{num_seq}\.{num_seq} return FLOAT; // will become ([0-9]+)\.([0-9]+)
But I want to reuse some character classes. Can I define custom classes like [:alpha:], [:alnum:], etc.? A toy example:
chars [a-zA-Z]
%%
// will become (([a-zA-Z]){-}[aeiouAEIOU])+ // ill-formed
// desired ([a-zA-Z]{-}[aeiouAEIOU])+ // correct
({chars}{-}[aeiouAEIOU])+ return ONLY_CONS;
({chars}{-}[a-z])+ return ONLY_UPPER;
({chars}{-}[A-Z])+ return ONLY_LOWER;
But currently this fails to compile because of the parentheses added around the definition. Is there a proper way, or at least a workaround, to achieve this?
This might be useful from time to time, but unfortunately it has never been implemented in flex. You could suppress the automatic parentheses around macro substitution by running flex in lex compatibility mode, but that has other probably undesirable effects.
Posix requires that regular expression bracket syntax includes, in addition to the predefined character classes,
…character class expressions of the form: [:name:] … in those locales where the name keyword has been given a charclass definition in the LC_CTYPE category.
Unfortunately, flex does not implement this requirement. It is not too difficult to patch flex to do this, but since there is no portable mechanism to allow the user to add charclasses to their locale --and, indeed, many standard C library implementations lack proper locale support-- there is little incentive to make this change.
Having looked at all these options, I eventually convinced myself that the simplest portable solution is to preprocess the flex input file to replace [:name:] with a set of characters based on name. Since that sequence of characters is unlikely to be present in a flex input file for any other reason, a simple-minded search and replace using sed or python is adequate; correctly parsing the flex input file seems to me to be more trouble than it is worth.
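A minimal sketch of such a preprocessor in Python might look like the following. The class names cons and vowel, and the function name expand_custom_classes, are made up for illustration; standard names such as [:alpha:] are left alone so that flex still handles them itself.

```python
import re

# Hypothetical custom classes: name -> characters to splice into the bracket expression.
CUSTOM_CLASSES = {
    "cons": "b-df-hj-np-tv-zB-DF-HJ-NP-TV-Z",
    "vowel": "aeiouAEIOU",
}

CLASS_PATTERN = re.compile(r"\[:([a-z_]+):\]")

def expand_custom_classes(text):
    """Replace [:name:] with its characters when name is a custom class."""
    def replace(match):
        # Unknown names (e.g. alpha) are returned unchanged.
        return CUSTOM_CLASSES.get(match.group(1), match.group(0))
    return CLASS_PATTERN.sub(replace, text)

rule = expand_custom_classes("[[:cons:]]+ return ONLY_CONS;")
untouched = expand_custom_classes("[[:alpha:]]+ return WORD;")
print(rule)
print(untouched)
```

The output of this step would then be fed to flex as usual, for example from a build rule that runs the script before flex.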

How should an application parse a list of filenames from cmdline args?

I'd like to write an app, my_app, that takes a list of named options and a list of filenames from the cmdline, e.g.,
% my_app --arg_1 arg_1_value filename_1 filename_2
The filenames are the last args and are not associated with any named options.
From the cmdline parsers that I've worked with, e.g., flag in Golang, it seems that the parsers only extract the args that are configured, and that I'd need to identify the list of filenames manually by walking through the original argv[] list.
I'd like to ask whether there are parsers (or parser options that I may have overlooked) that can also extract those filenames, or whether they only return the unprocessed args, in which case I could assume that these are the filenames.
The Golang flag package makes the trailing arguments available through flag.Args(), which returns the part of os.Args that follows the flags as a slice.
That's a pretty typical way for command-line argument parsers to work, although the details will vary according to language. The standard C library argument parser, getopt, for example, provides the global optind, which, once parsing has finished, is the index in argv of the first non-option argument.
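Some parsers go a step further and let you declare the trailing arguments explicitly. As a sketch of that style (using Python's argparse, which is not mentioned in the question; the option and file names are taken from the example command line above):

```python
import argparse

parser = argparse.ArgumentParser(prog="my_app")
parser.add_argument("--arg_1")               # a named option with a value
parser.add_argument("filenames", nargs="*")  # everything left over, as a list

args = parser.parse_args(["--arg_1", "arg_1_value", "filename_1", "filename_2"])
print(args.arg_1)
print(args.filenames)
```

Either way, the parser rather than your own loop over argv decides where the options end and the filenames begin.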

Bash - grep command inconsistent with man page

I am trying to understand and read the man page. Yet every day I find more inconsistent syntax, and I would like some clarification as to whether I am misunderstanding something.
Within the man page, it specifies the syntax for grep is grep [OPTIONS] [-e PATTERN]... [-f FILE]... [FILE...]
I got a working example that recursively searches all files within a directory for a keyword.
grep -rnw . -e 'memes'
Now this example works, but I find it very inconsistent with the man page. The directory (which the man page writes as [FILE...], while separately describing what happens when a file is a directory) is located last. Yet in this example it is located after [OPTIONS] and before [-e PATTERN].... Why is this allowed, it does not follow the specified regex rule of using this command?
Why is this allowed, it does not follow the specified regex rule of using this command?
The lines in the SYNOPSIS section of a manpage are not to be understood as strict regular expressions, but as a brief description of the syntax of a utility's arguments.
Depending on the particular application, the parser might be more or less flexible in how it accepts its options. After all, each program can implement whatever grammar it likes for its arguments. Therefore, some might allow options at the beginning, at the end, or even in between files (typically with ways to handle any ambiguity that may arise, e.g. reading from the standard input with -, filenames starting with -...).
Now, of course, there are some ways to do it that are common. For instance, POSIX.1-2017 12.1 Utility Argument Syntax says:
This section describes the argument syntax of the standard utilities and introduces terminology used throughout POSIX.1-2017 for describing the arguments processed by the utilities.
In your particular case, your implementation of grep (probably GNU grep) allows options to be interleaved with the file list, as you have discovered.
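The difference between the two behaviours can be sketched with Python's getopt module, which offers both a strict POSIX-style parser and a GNU-style one that accepts options after operands. This is only an analogy for what grep does internally, and the option string below covers just the flags from the question.

```python
import getopt

argv = ["-rnw", ".", "-e", "memes"]

# POSIX style: option parsing stops at the first non-option argument (".").
posix_opts, posix_rest = getopt.getopt(argv, "rnwe:")

# GNU style: options and operands may be intermixed.
# (gnu_getopt falls back to POSIX behaviour if POSIXLY_CORRECT is set.)
gnu_opts, gnu_rest = getopt.gnu_getopt(argv, "rnwe:")

print(posix_opts, posix_rest)
print(gnu_opts, gnu_rest)
```

With the strict parser, -e and memes would be treated as two more file names, which is what the synopsis taken literally would suggest.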
For more information, see:
https://unix.stackexchange.com/questions/17833/understand-synopsis-in-manpage
Are there standards for Linux command line switches and arguments?
https://www.gnu.org/software/libc/manual/html_node/Getopt-Long-Options.html
You can also put the options after the file arguments:
grep 'string' * -lR

Parsing text that requires lookahead using nom

tl;dr: I'm struggling to find documentation or examples of text parsers that require lookahead using nom.
Long version
I'm using nom to parse 6502 assembly. I'm struggling with creating a parser that can parse the various addressing modes. Any given opcode will have the following format:
XXX AM
Where XXX is a three-character mnemonic and AM is the operand. The operand can take many forms and is referred to as the "addressing mode." I've defined an enum for the operands, an enum for the addressing modes, and an OpCode tuple struct containing these values, which is ultimately the result returned when parsing.
The operand can be omitted completely, in which case the addressing mode is Implied, or it can have a literal value of A, which is the Accumulator addressing mode.
Many of the addressing modes refer to memory locations, and it's these addressing modes I'm struggling to parse. In particular, if an addressing mode specifies a single byte in the form of $00, it is a ZeroPage addressing mode, whereas an operand specifying two bytes in the form of $0000 is an Absolute addressing mode. To complicate the matter, there are indexed variants of these addressing modes in the form of $00,X, $00,Y, $0000,X, etc.
Are there any good examples of existing text parsers that would illustrate the correct way to parse values that all start similarly ($00...) but are differentiated by how they end? The nom documentation is not very comprehensive, and the best example I've found is the INI parser, which isn't doing anything as complex as what I'm trying to accomplish. I've also looked at the syn source code, but it's using a lot of custom macros and is a pretty complex beast, making it hard to learn from.
One way of doing this is with the alt!() macro.
The idea is to have a parser which tries each alternative in sequence. So if you already have parsers for each of the addressing modes separately, you can combine them into a parser for any of them:
// The sub-parsers all return Operand too.
named!(parse_operand<&str, Operand>,
    alt!(parse_absolute_indexed |
         parse_absolute |
         parse_zeropage_indexed |
         parse_zeropage |
         parse_implied));
Some notes:
The order may be important; I've put parse_absolute after parse_absolute_indexed since the former would match the initial part of the operand and return too early.
A variant would be to include the end of line (including comments if applicable) matching into each sub parser. Then it couldn't match early.
If you're parsing to the end of the input without a byte/character which terminates the pattern (such as a newline) then you may need to use alt_complete!() instead of alt!(). The reason for this is that if you try matching ADD $00, the parser which might match ADD $0000 has to assume that it might still match if more input arrives, and alt!() won't then skip to the next case. Using alt_complete!(), or alternatively wrapping the inner matchers in complete!(), is saying that an incomplete match is a non-match.
If the parsers were very complicated, it might mean doing extra work (trying each parse in sequence) compared to a parser generated by, e.g., the venerable yacc, but I don't think it's an issue in this case.
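The ordering point is not specific to nom. As a language-neutral sketch of ordered choice (written in Python with regular expressions rather than nom, and with the mode names and patterns assumed from the question), each alternative is tried in turn and the first one that matches wins:

```python
import re

HEX = "[0-9A-Fa-f]"

# Longer forms come first so that a shorter form cannot match too early.
ALTERNATIVES = [
    ("AbsoluteIndexed", re.compile(rf"\$({HEX}{{4}}),([XY])")),
    ("Absolute", re.compile(rf"\$({HEX}{{4}})")),
    ("ZeroPageIndexed", re.compile(rf"\$({HEX}{{2}}),([XY])")),
    ("ZeroPage", re.compile(rf"\$({HEX}{{2}})")),
    ("Accumulator", re.compile(r"A")),
    ("Implied", re.compile(r"")),  # always matches, so it must be last
]

def parse_operand(text, alternatives=ALTERNATIVES):
    """Return (mode, captured groups, unparsed rest) for the first matching alternative."""
    for mode, pattern in alternatives:
        match = pattern.match(text)
        if match:
            return mode, match.groups(), text[match.end():]
    raise ValueError(f"cannot parse operand: {text!r}")

print(parse_operand("$1234,Y"))
print(parse_operand("$12"))

# Putting ZeroPage first shows the early-return problem: "34" is left unparsed.
wrong_order = [ALTERNATIVES[3]] + ALTERNATIVES
print(parse_operand("$1234", wrong_order))
```

A caller would normally also check that the unparsed rest is empty (or is a comment or newline), which corresponds to the variant of matching the end of line in each sub-parser.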

Enable/disable grammar rules in Yacc/Bison

Like the title says, I would like to enable/disable certain grammar rules in a yacc or bison grammar file.
Is there a way to do so?
If you mean at compile time: yacc uses standard C /* */ comment syntax, so you can comment the rules out.
If you mean at run time: you still have to work with the tables you have, so they need to include the entire grammar, including the optional phrases.
So I would suggest making a fake terminal symbol. Rules that are optional would be preceded by the fake terminal, and the lexer would only return this terminal when the optional productions are enabled.
A variation on this approach is to define two versions of a real terminal that actually exists. This only works for grammars whose phrases begin with a terminal, but if you have such input, then one version of the terminal can select just one set of rules while the other version appears in both sets, that is:
T_A dynamic_phrase_in_grammar;
always_on static_phrase_in_grammar;
always_on: T_A | T_B;
So, to enable the dynamic phrase, the lexer returns the real terminal as T_A; to disable it, the lexer returns it as T_B.
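The lexer side of this trick is small. Here is a sketch in Python rather than in a real lex/yacc scanner; the keyword "begin" and the WORD token are invented for the example, and only T_A and T_B come from the grammar fragment above.

```python
def tokenize(words, dynamic_enabled):
    """Yield token names; the leading keyword is reported as T_A or T_B."""
    for word in words:
        if word == "begin":  # hypothetical keyword standing in for the real terminal
            # T_A is accepted by both sets of rules, T_B only by the static one.
            yield "T_A" if dynamic_enabled else "T_B"
        else:
            yield "WORD"

enabled = list(tokenize(["begin", "x"], dynamic_enabled=True))
disabled = list(tokenize(["begin", "x"], dynamic_enabled=False))
print(enabled)
print(disabled)
```

The grammar tables never change; only the token stream handed to the parser does.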
