Parsing a date token

I would like to parse various SQL literals in ANTLR. Examples of the literal would be:
DATE '2020-01-01'
DATE '1992-11-23'
DATE '2014-01-01'
Would it be better to do the 'bare minimum' at the parsing stage and just put in something like:
Or, should I be doing any validation within the parser as well, for example something like:
If I do the latter I'll still need to validate...even if I do a longer regex, I'll need to check things like the number of days in the month, leap years, a valid date range, etc.
What is usually the preferable method to do this? As in, how much 'validation' do you want to do in your grammar and how much in the listeners where the actual programming would be done? Additionally, are there any performance differences between doing (small) validations 'within the grammar' vs doing it in listeners/followers?

These are actually two slightly different syntaxes (the second does not specify that the date should be surrounded by 's)
Based on your example, that may be an oversight, so I'll assume you mean both to require the 's, and that your STRINGs are ' delimited.
It's a design choice, but a couple of factors to consider.
If you use the more specific grammar, then, if the user input doesn't match, you'll get the default ANTLR error message (which will be "pretty good for a generated tool", but probably a bit obtuse to your user).
As you say, you'll still have to perform further edits.
I lean toward keeping the grammar as simple as possible and doing more validation in a listener (maybe a visitor). This allows you to be as clear with your error messages as possible.
The only reason I see to not use the 'DATE' STRING rule would be if there is some other string content that would NOT be a date_literal, but would be some other, valid syntax in your language. It might be an invalid date literal, in which case, I'd use your simple rules and do the edit.
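As a rough illustration of the "loose grammar, strict listener" split, here is a minimal Python sketch. It assumes the parser has already accepted a DATE 'string' literal and passes the literal's text to a validation step. The name validate_date_literal and the error wording are invented for this example and are not part of ANTLR.

```python
import re
from datetime import date

# Deliberately loose shape check, mirroring a simple DATE STRING grammar rule.
LITERAL = re.compile(r"DATE\s+'([^']*)'")
# Stricter shape check done after parsing, where a clear message can be produced.
ISO_BODY = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)

def validate_date_literal(text):
    """Return a datetime.date, or raise ValueError with a user-facing message."""
    literal = LITERAL.fullmatch(text.strip())
    if literal is None:
        raise ValueError(f"not a DATE literal: {text!r}")
    body = literal.group(1)
    shape = ISO_BODY.fullmatch(body)
    if shape is None:
        raise ValueError(f"expected DATE 'YYYY-MM-DD', got {body!r}")
    year, month, day = (int(part) for part in shape.groups())
    try:
        # date() itself rejects bad months, day counts and non-leap 29 Feb.
        return date(year, month, day)
    except ValueError as exc:
        raise ValueError(f"invalid calendar date {body!r}: {exc}") from None
```

Because the calendar check has to happen somewhere anyway, keeping the shape check next to it means all date-related error messages come from one place.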


Understanding 'strictness' in regex grammar

In writing a grammar/parser for a regex, I'm wondering why the following constructions are both syntactically and semantically valid in the regex syntax (at least as far as I can understand it)?
Repetitions of a character class, such as:
Repetitions of a zero-width assertion, such as:
Assertions at a bogus position, such as:
To me, this sort of seems like having a syntax that would allow having a construction like SELECT SELECT SELECT SELECT SELECT col FROM tbl. In other words, why isn't the regex syntax defined as more strict than it is in practice?
To start with, that's not a very good analogy. Your statement with multiple SELECT keywords is not, as far as I know, part of the SQL grammar, so it's simply ungrammatical. Repeated elements in a character class are more like the SQL construct:
SELECT * FROM table WHERE Value IN (1, 2, 3, 2, 3, 2, 3)
I think most (if not all) SQL processors would allow that. You could argue that it would be nice if a warning message were issued, but the usual SQL interface (where a query is sent from client to server and a result returned) does not leave space for a warning.
It's certainly the case that repeated characters in a character class are often an indication that the regular expression was written by a novice with only a fuzzy idea of what a character class is. If you hang out in SO's flex-lexer long enough, you'll see how often students write regular expressions like [a-z|A-Z|0-9], or even [begin|end]. If flex detected duplicate characters in character classes, those mistakes would receive warnings, which might or might not be useful to the student coder. (Reading and understanding warning messages is not, apparently, an innate skill.) But it needs to be asked, Who is the target audience for a tool like Flex, and I think the answer is not "impatient beginners who won't read documentation". Certainly, a non-novice programmer might also make that kind of mistake, usually as a result of a typo, but it's not common and it will probably easily be detected during debugging.
If you've already started to work on a parser for regular expressions, you should already know why these rules are not usually a feature of regular expression libraries. Every rule you place on a syntax must be:
precisely defined,
acted upon appropriately (which may mean complicating the interface in order to allow for warnings).
and all of that needs to be thoroughly tested.
That's a lot of work to prevent something which is not even technically an error. And catching those errors (if they are errors) will probably make the "grammar" for the regular expressions context-sensitive. (In other words, you can't write a context-free grammar which forbids duplication inside a character class.)
Moreover, practically all of those expressions might show up in the wild, typically in the case that the regular expression was programmatically generated. For example, suppose you have a set of words, and you want to write a regular expression which will match any sequence of characters not in any of the words. The simple solution is [^firstsecondthirdwordandtherest]+. Of course, you could have gone to the trouble of deduping the individual letters in the words, -- a simple task in some programming languages, but more complicated in others -- but should it be necessary?
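To make the generated-regex case concrete, here is a small Python sketch (Python's re module is just a convenient stand-in; the helper name negated_class is invented). It builds the negated class from a word list both with and without deduplication; the two patterns are expected to match the same text.

```python
import re

def negated_class(words, dedupe=False):
    """Build a regex matching runs of characters that occur in none of the words."""
    chars = "".join(words)
    if dedupe:
        # Optional clean-up step; it does not change which strings match.
        chars = "".join(sorted(set(chars)))
    # re.escape protects characters such as ']', '^', '-' and '\' inside the class.
    return "[^" + re.escape(chars) + "]+"

words = ["first", "second", "third"]
raw = negated_class(words)
deduped = negated_class(words, dedupe=True)
sample = "xyz first"
same = re.findall(raw, sample) == re.findall(deduped, sample)
```

Here raw contains many repeated letters and still compiles without complaint, which is exactly the convenience a stricter syntax would take away.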
With respect to your other two examples, with repeated and non-terminal $, there are actual regex libraries in which those are interpreted in some different way. Many older regex libraries (including (f)lex) only treat $ as a zero-length end-of-string assertion if it appears at the very end of the regex; such libraries would treat all of those $s other than the last one in a$$$$$$$$ as matching a literal $. Others treat $ as an end-of-line assertion, rather than an end-of-string assertion. But I don't know of any which treat those as errors.
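As one data point, Python's re module (chosen here only as an example; it is not one of the libraries discussed above) accepts a run of $ assertions and lets a flag switch $ between the two meanings. A minimal sketch:

```python
import re

# A run of '$' assertions is accepted; each one asserts the same position.
repeated = re.search(r"a$$$", "a")

# By default '$' matches at the end of the string (or just before a final newline),
# so the 'a' before an interior newline does not qualify.
default = re.search(r"a$", "a\nb")

# With re.MULTILINE, '$' becomes an end-of-line assertion.
multiline = re.search(r"a$", "a\nb", re.MULTILINE)
```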
There's a pretty good argument that a didactic tool, such as regex101, might usefully issue warnings for doubtful regular expressions, a task usually called "linting". But for a production regex library, there is little to be gained, and a lot of pain.
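A linter of that kind does not have to be large. The following Python sketch flags duplicated members in the body of a character class (the text between the brackets). It is a simplification under stated assumptions: it understands only literal characters and simple x-y ranges, with no escapes, negation or named classes, and the function name is invented for the example.

```python
def duplicate_class_members(body):
    """Return the characters that occur more than once in a character-class body."""
    seen = set()
    duplicates = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # A range such as a-z contributes every character it spans.
            members = [chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1)]
            i += 3
        else:
            members = [body[i]]
            i += 1
        for ch in members:
            if ch in seen and ch not in duplicates:
                duplicates.append(ch)
            seen.add(ch)
    return duplicates
```

Run on the two beginner patterns quoted earlier, duplicate_class_members("a-z|A-Z|0-9") reports the repeated | and duplicate_class_members("begin|end") reports e and n. That is enough to issue a warning without changing what the regex library accepts.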

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
: queryStatement
| undefinedStatement
// ...
: (.)+?
// ...
: (.)+?
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind, however, that subrules are usually not terminated with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you, then add a copy of the subrule you want to parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
dataType: // type in sql_yacc.yy
type = (
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
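The "sea of recognised but unparsed tokens" idea can be sketched in a few lines of Python. The token set below is a made-up, much reduced stand-in for a real SQL or U-SQL lexer (line comments, single-quoted strings, words, ; and "anything else"), and nested blocks are ignored. The point is only that SELECT inside a string or comment never becomes a keyword token.

```python
import re

TOKEN = re.compile(
    r"""
    (?P<comment>--[^\n]*)              # line comment
  | (?P<string>'(?:[^']|'')*')         # single-quoted string, '' is an escaped quote
  | (?P<word>[A-Za-z_][A-Za-z_0-9]*)   # keyword or identifier
  | (?P<semi>;)                        # statement terminator
  | (?P<other>\S)                      # any other non-blank character
    """,
    re.VERBOSE,
)

def split_statements(text):
    """Tokenise everything, then cut the token stream at each ';'."""
    statements, current = [], []
    for match in TOKEN.finditer(text):
        kind = match.lastgroup
        if kind == "comment":
            continue
        if kind == "semi":
            if current:
                statements.append(current)
            current = []
        else:
            current.append((kind, match.group()))
    if current:
        statements.append(current)
    return statements

def query_statements(text):
    """Keep only the statements whose first token is the word SELECT."""
    return [s for s in split_statements(text)
            if s[0][0] == "word" and s[0][1].upper() == "SELECT"]
```

Only the statements returned by query_statements would then be handed to the real query parser; the others stay as flat token lists.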

Should a Lexer be able to distinguish between Syntax Tokens contained in a "variable" and actual Syntax Tokens

I am writing a lexer for a simple language (Gherkin).
While some of the lexer is done, I am struggling with a design decision.
Currently, the lexer has an examples and a step mode.
That means it has to track context, which I would rather not do.
I want to make the lexer as dumb as possible, so that most of the work is done by the parser.
My problem with the current approach is that I don't know if the lexer should distinguish Syntax and Literals in certain cases.
For a better understanding, here is a brief overview of the language.
The language has syntax tokens like: : < > | #.
The language can have variables, written as <Name>.
The language has an examples section, where syntax tokens differ from the rest of the test case
An example table looks like this:
| Name | Last Name |
| John | Doe |
A full test written in Gherkin (with unneeded information stripped out) looks like this:
Scenario Outline: User logs in
Given user is on login_view
And user enters <Username> in username_field
And user enters <Password> in password_field
And user answers <Qu|estion>
When user clicks on login_button
Then user is logged in
|JohnDoe11|Test<Pass>##Word|Who am I|
Note how I escaped | in the first Examples column.
Also take note of all the syntax characters in the password example.
By escaping the | character, I can use it in the examples part of the test without it getting detected as a Syntax Token.
But for the variable in line And user answers <Qu|estion> I don't need or want to escape it.
By language specification, the example entries can contain any character, except |, unless escaped, as it marks the end of a column.
That means no other syntax character should be detected as a Syntax Token.
Without two modes, all the syntax characters in the password example would be detected as such tokens.
The opposite is the case for the other part of the tests.
Unless at the start of a new line(where # and : are Syntax Tokens),
only <> should be considered part of the syntax
The current implementation prevents this by having the two modes mentioned, which is not the best solution.
My question therefore is:
Should the lexer just detect them as Syntax Tokens, which then get picked up by the Parser, which figures out that those are actually part of the literal?
Or is having context the preferable way.
Thank you for answering.
If you have two different lexical environments, then you have two different lexical environments. They need to be handled differently. Almost all real-world programming languages feature this kind of complication, and most lexical generators have mechanisms designed to help maintain a moderate amount of lexical state.
The problem is figuring out how to do the transitions between the different lexical contexts. As you note, that can be a lot of work, which is ugly. If it's really ugly, you might want to revisit your language design, because it is not just your parser which has to be able to predict which lexical context applies where: any human being reading the code also needs to understand that, and all of the subtleties built in to the algorithm. If you can't describe the algorithm in a couple of clear sentences, you'll be putting quite a burden on code readers.
In the case of Gherkin, it looks to me like the tables are fairly easy to recognise: they start with a line whose first token is | and presumably continue until you reach a line whose first token is not a |. So it should be pretty straight-forward to switch lexical contexts, particularly as your lexer probably already needs to be aware of line-endings.
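A minimal Python sketch of that line-based switch follows. It assumes that a table row both starts and ends with |, that \| is the escape for a literal pipe inside a cell, and that keyword handling at the start of step lines is done elsewhere. The token names CELL, TEXT and VAR are invented for the example.

```python
import re

CELL_SEPARATOR = re.compile(r"(?<!\\)\|")   # a pipe not preceded by a backslash
VARIABLE = re.compile(r"<([^<>]+)>")        # <Name> in a step line

def lex_line(line):
    """Choose the lexical context from the first character of the line."""
    stripped = line.strip()
    if stripped.startswith("|"):
        # Table context: only unescaped '|' is syntax, everything else is cell text.
        cells = CELL_SEPARATOR.split(stripped)[1:-1]
        return [("CELL", cell.strip().replace("\\|", "|")) for cell in cells]
    # Step context: only <...> is syntax.
    tokens, pos = [], 0
    for match in VARIABLE.finditer(stripped):
        if match.start() > pos:
            tokens.append(("TEXT", stripped[pos:match.start()]))
        tokens.append(("VAR", match.group(1)))
        pos = match.end()
    if pos < len(stripped):
        tokens.append(("TEXT", stripped[pos:]))
    return tokens
```

With this split, the row |JohnDoe11|Test<Pass>##Word|Who am I| yields three plain cells, while And user answers <Qu|estion> yields a variable named Qu|estion without any escaping.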

Is Pug context free?

I was thinking of making a Pug parser. Besides the indentation, which is well known to be context-sensitive (though that can be trivially hacked around with a lexer feedback loop to make it almost context-free, as Python does), what otherwise makes it not context-free?
XML tags are definitely not context-free, since each starting tag needs to match an end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
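For reference, the usual indentation-stack trick looks roughly like this in Python. The sketch assumes space-only indentation and treats every non-blank line as a tag line; Pug's actual indentation rules are not specified here, so this follows the Python-style convention rather than anything confirmed for Pug.

```python
def indent_tokens(lines):
    """Turn leading whitespace into INDENT/DEDENT tokens around TAG tokens."""
    stack = [0]          # indentation widths of the currently open blocks
    tokens = []
    for line in lines:
        if not line.strip():
            continue     # blank lines do not affect nesting
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append(("INDENT", None))
        while width < stack[-1]:
            stack.pop()
            tokens.append(("DEDENT", None))
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        tokens.append(("TAG", line.strip()))
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", None))
    return tokens
```

Once the scanner emits these tokens, the nesting rule for the parser is an ordinary context-free production of the form tag, INDENT, children, DEDENT.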
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, then you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
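A sketch of the "parse first, check names afterwards" approach in Python is given below. It assumes the context-free parse has already produced a tree in which each element records the tagname from its open tag and from its close tag. The Element class and the function name are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    open_name: str                       # tagname as written in the open tag
    close_name: str                      # tagname as written in the close tag
    children: list = field(default_factory=list)

def mismatched_tags(element):
    """Post-parse pass: collect (open, close) pairs whose names disagree."""
    problems = []
    if element.open_name != element.close_name:
        problems.append((element.open_name, element.close_name))
    for child in element.children:
        problems.extend(mismatched_tags(child))
    return problems
```

The parser pairs tags purely by nesting; the name comparison is an ordinary tree walk that can report every mismatch with a precise message.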
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

Good grammar for date data type for recursive descent parser LL(1)

I'm building a custom expression parser and evaluator for a production environment to provide a limited DSL to the users. The parser itself, like the DSL, needs to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing and has no parser generator tools available.
My current decision is to go for a recursive descent approach with an LL(1) grammar, so that even programmers with no previous experience in evaluating expressions could quickly learn how the code works.
It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format of dd/mm/yyyy are easy to confuse with a string of division ops.
Is there a good solution to this problem?
My own solution, which is aimed at keeping the parser simple, involves prefixing dates with a special symbol, let's say an apostrophe:
<date> ::= <apostr><digit><digit>/<digit><digit>/<digit><digit><digit><digit>
<apostr> ::= '
<digit> ::= '0'..'9'
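The grammar above maps onto a recursive-descent routine fairly directly. The following is a sketch in Python rather than in the (unnamed) target language; the function name and the error messages are invented. It shows that a single character of lookahead, the apostrophe, is enough to commit to the date rule.

```python
DIGITS = "0123456789"

def parse_prefixed_date(text, pos=0):
    """<date> ::= <apostr> dd/mm/yyyy. Returns ((day, month, year), next_pos)."""
    def expect(char, at):
        if text[at:at + 1] != char:
            raise ValueError(f"expected {char!r} at position {at}")
        return at + 1

    def digits(count, at):
        chunk = text[at:at + count]
        if len(chunk) != count or any(c not in DIGITS for c in chunk):
            raise ValueError(f"expected {count} digits at position {at}")
        return int(chunk), at + count

    pos = expect("'", pos)          # the LL(1) decision point
    day, pos = digits(2, pos)
    pos = expect("/", pos)
    month, pos = digits(2, pos)
    pos = expect("/", pos)
    year, pos = digits(4, pos)
    return (day, month, year), pos
```

Range checks on day and month are left to a later validation step, in keeping with the goal of a simple parser.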
First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather than restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.
I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context if it is an integer expression or a date constant. However, assuming you can't, yeah you'd need some kind of way to tell the difference. The only other thing I can think of would be to use backslash as a separator instead of slash, but that's kinda ugly.
An LL-like lexerless parser with an infinite lookahead is what you need. And, namely, it is PEG.
With an ordered choice it is quite easy to avoid this date vs. constant literals division confusion.
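Ordered choice can be imitated without a PEG tool by simply trying alternatives in a fixed order. Here is a minimal Python sketch, with a made-up token set, in which the date alternative is tried before the number alternative:

```python
import re

# Alternatives are tried top to bottom; the first one that matches wins.
ALTERNATIVES = [
    ("DATE", re.compile(r"\d{2}/\d{2}/\d{4}(?!\d)")),
    ("NUMBER", re.compile(r"\d+(?:\.\d+)?")),
    ("OP", re.compile(r"[-+*/()]")),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in ALTERNATIVES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens
```

Note the consequence of this design: anything shaped like dd/mm/yyyy is always read as a date, so a user who really means the division 10/02/2020 has to add spaces or parentheses.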
When a language is intended for human input, defining it is as much a matter of
adding syntax constraints to ensure unambiguous and easy parsing
removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.
Satisfying the second requirement is much harder than the first one, and requires insight into the
intended use cases of the language
Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
Which tokens / expression will be frequently used, which will be required only occasionally?
Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
... etc.
background/culture of the intended audience
Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
... etc.
Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:
Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe.
For example, one that uses a 3 letters abbreviation for the month (downside DSL becomes tied to English or other tongue, but also advantage, the ambiguity for humans about which is day and which is month is removed). Tentatively:
dd-mmm-yyyy (may seem unnatural in cultures where the prevailing date order
starts with the month maybe yyyy-mmm-dd then ?)
mmm-dd-yyyy (better for the above mentioned cultures)
ddmmmyyyy (avoid the dashes, but impose leading zeros)
MnnDnnYyyyy (using "M", "D" and "Y" (or others) as explicit prefixes; now,
this is completely culture neutral, but maybe a bit awkward...)
Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix for example), to avoid possible conflict with [odd, but possible] variable identifiers.
In a nutshell, the idea is to replace the need for a special character prefix (which may then collide with the use of these characters for mathematical and other expressions) by making the 12 month tags a good enough discriminator for the parser.
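To illustrate how the month tag can act as the discriminator, here is a small Python sketch for the dd-mmm-yyyy variant. It assumes English three-letter abbreviations; the function name is invented.

```python
import re
from datetime import date

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# The letters in the middle are what separate a date from a subtraction like 23-11-1992.
DATE_TOKEN = re.compile(
    r"(\d{2})-(" + "|".join(MONTHS) + r")-(\d{4})",
    re.IGNORECASE | re.ASCII,
)

def parse_month_name_date(text):
    """Return a date for dd-mmm-yyyy input, or None if the text is not date-shaped."""
    match = DATE_TOKEN.fullmatch(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    # date() raises ValueError for impossible days such as 31-feb.
    return date(int(year), MONTHS.index(month_name.lower()) + 1, int(day))
```

As the answer points out, identifiers that happen to look like month tags would still need separate handling.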
