Simplifying HTTP header parsing

RFC 2616 says that method names are case-sensitive.
While trying to simplify the parser routine I'm writing, I've got a question: what could happen if I treat these names as case-insensitive?
There are some statements in the standard saying that programs SHOULD be tolerant. As far as I can see, this looks like a case where tolerance applies.
One more question I have is about leading and trailing spaces and tabs in places where the standard forbids them. For example, inside the Request-Line only spaces are allowed.
What if my parser allows tabs as separators? What about leading spaces before the Request-Line?

One rule of thumb says: Be conservative in what you do, be liberal in what you accept from others.
So go for it: be as tolerant as you can as long as the intent of the input is clear, and if that also simplifies your parser, even better.
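As a rough illustration of that tolerant reading (a minimal sketch, not a complete parser; the function name is made up), the request line could be split on any run of spaces or tabs, leading whitespace ignored, and the method normalised to upper case:

import re

def parse_request_line_tolerant(line):
    # Strip leading/trailing whitespace, split on runs of spaces or tabs,
    # and upper-case the method before comparing it against known methods.
    parts = re.split(r"[ \t]+", line.strip())
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, target, version = parts
    return method.upper(), target, version

print(parse_request_line_tolerant("  get\t/index.html\tHTTP/1.1"))
# -> ('GET', '/index.html', 'HTTP/1.1')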

1) RFC 2616 is obsolete. You should be looking at RFC 7230.
2) If you treat method names case-insensitively, you'll fail once there are two different names that are the same when compared case-insensitively. Unlikely? Yes. Impossible? No.
3) WRT request-line parsing: there's absolutely no point in being "liberal" here. In the best case, you'll accept requests that are never made. In the worst case, you'll introduce security holes because you don't know what you're doing.
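For contrast, a strict request-line parser is hardly more work. A sketch (RFC 7230 defines the line as method SP request-target SP HTTP-version; the method pattern below follows the token/tchar rule):

import re

# Exactly one space between fields, no tabs, no leading whitespace,
# and the method is matched case-sensitively as an RFC 7230 token.
REQUEST_LINE = re.compile(
    r"^([!#$%&'*+\-.^_`|~0-9A-Za-z]+) (\S+) (HTTP/\d\.\d)$"
)

def parse_request_line_strict(line):
    m = REQUEST_LINE.match(line)
    if m is None:
        raise ValueError("malformed request line")
    return m.groups()

print(parse_request_line_strict("GET /index.html HTTP/1.1"))
# ('GET', '/index.html', 'HTTP/1.1'); "  get\t/ HTTP/1.1" would be rejected.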

Related

Understanding 'strictness' in regex grammar

In writing a grammar/parser for regexes, I'm wondering why the following constructions are both syntactically and semantically valid in regex syntax (at least as far as I can understand it):
Repetitions of a character class, such as duplicated characters inside a class ([aa], say).
Repetitions of a zero-width assertion, such as a$$$$$$$$.
Assertions at a bogus position, such as a $ that is not at the end of the pattern (a$b, say).
To me, this sort of seems like having a syntax that would allow a construction like SELECT SELECT SELECT SELECT SELECT col FROM tbl. In other words, why isn't regex syntax defined to be stricter than it is in practice?
To start with, that's not a very good analogy. Your statement with multiple SELECT keywords is not, as far as I know, part of the SQL grammar, so it's simply ungrammatical. Repeated elements in a character class are more like the SQL construct:
SELECT * FROM table WHERE Value IN (1, 2, 3, 2, 3, 2, 3)
I think most (if not all) SQL processors would allow that. You could argue that it would be nice if a warning message were issued, but the usual SQL interface (where a query is sent from client to server and a result returned) does not leave space for a warning.
It's certainly the case that repeated characters in a character class are often an indication that the regular expression was written by a novice with only a fuzzy idea of what a character class is. If you hang out in SO's flex-lexer tag long enough, you'll see how often students write regular expressions like [a-z|A-Z|0-9], or even [begin|end]. If flex detected duplicate characters in character classes, those mistakes would receive warnings, which might or might not be useful to the student coder. (Reading and understanding warning messages is not, apparently, an innate skill.) But it needs to be asked who the target audience for a tool like flex is, and I think the answer is not "impatient beginners who won't read the documentation". Certainly, a non-novice programmer might also make that kind of mistake, usually as a result of a typo, but it's not common and it will probably be detected easily during debugging.
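For what it's worth, the pipes in a class like [a-z|A-Z|0-9] are just ordinary members of the set rather than alternation operators, which a quick check (here with Python's re module) makes obvious:

import re

# The '|' inside a character class is a literal character, not alternation,
# so this class matches letters, digits and '|'.
print(re.fullmatch(r"[a-z|A-Z|0-9]", "|"))     # matches
print(re.fullmatch(r"[a-z|A-Z|0-9]+", "a|b"))  # also matches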
If you've already started to work on a parser for regular expressions, you should already know why these rules are not usually a feature of regular expression libraries. Every rule you place on a syntax must be:
precisely defined,
documented,
implemented,
acted upon appropriately (which may mean complicating the interface in order to allow for warnings).
and all of that needs to be thoroughly tested.
That's a lot of work to prevent something which is not even technically an error. And catching those errors (if they are errors) will probably make the "grammar" for the regular expressions context-sensitive. (In other words, you can't write a context-free grammar which forbids duplication inside a character class.)
Moreover, practically all of those expressions might show up in the wild, typically when the regular expression was programmatically generated. For example, suppose you have a set of words and you want to write a regular expression which will match any sequence of characters not in any of the words. The simple solution is [^firstsecondthirdwordandtherest]+. Of course, you could have gone to the trouble of deduping the individual letters in the words (a simple task in some programming languages, more complicated in others), but should that be necessary?
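A sketch of such a generator (in Python, with a made-up function name) shows why nobody bothers to dedupe; the duplicates are harmless:

import re

def not_in_words(words):
    # Concatenate all the words and drop them into a negated character
    # class; duplicated letters change nothing about what it matches.
    chars = "".join(words)
    return re.compile("[^" + re.escape(chars) + "]+")

pattern = not_in_words(["first", "second", "third"])
print(pattern.pattern)                           # [^firstsecondthird]+
print(pattern.findall("first? second! third."))  # ['? ', '! ', '.']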
With respect to your other two examples, with repeated and non-terminal $, there are actual regex libraries in which those are interpreted differently. Many older regex libraries (including (f)lex) only treat $ as a zero-length end-of-string assertion if it appears at the very end of the regex; such libraries would treat every $ other than the last one in a$$$$$$$$ as matching a literal $. Others treat $ as an end-of-line assertion rather than an end-of-string assertion. But I don't know of any which treat those constructions as errors.
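Python's re module, for instance, treats every $ as a zero-width assertion wherever it appears, so the repetitions are merely redundant and a misplaced $ just makes the pattern harder (or impossible) to match:

import re

print(bool(re.fullmatch(r"a$$$$$$$$", "a")))  # True: the extra $s are redundant
print(bool(re.search(r"a$b", "ab")))          # False: $ must match at the end
print(bool(re.search(r"a$b", "a$b")))         # False: the $ is not a literal here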
There's a pretty good argument that a didactic tool, such as regex101, might usefully issue warnings for doubtful regular expressions, a task usually called "linting". But for a production regex library, there is little to be gained, and a lot of pain.

Practical uses of inclusive start conditions in scanner generator

What are some real-world (not-contrived) lexical-scanning problems where "inclusive scan conditions" (as opposed to "exclusive" ones) are a better solution?
That is, when is %s FOO any better than %x FOO in a (f)lex definition?
I understand the difference in function as well as how to implement the difference in a scanner generator. I'm just trying to get a sense of the kinds of situations where you would legitimately want to mash together different groups of scan rules into a single scan condition.
Full disclosure: Answers will inspire example code for this project.
"Mashing together" lexical rules is pretty common. Consider backslash-escape handling, which you might want to do more or less the same way in different quoting syntaxes and even regex literals. But those are likely to be combined explicitly.
For an only slightly contrived example of where implicit combination with the INITIAL state might be used, consider lexical analysis of a Python-like language with contextually meaningful leading whitespace. Here, there are two almost-identical lexical contexts: the normal context, in which a newline character is a syntactic marker and leading whitespace needs to be turned into INDENT/DEDENT sequences, and the parenthesised context in which newlines and leading whitespace are both just whitespace, which is not forwarded to the parser. These contexts will only differ in a couple of patterns; the vast majority of rules will be shared. So having an inclusive state which contains only something like:
<IN_PAREN>[[:space:]]+ /* Ignore all whitespace */
might be a simple solution. Of course, that rule would have to be placed before normal whitespace handling so that it overrides while IN_PAREN is active.
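Outside (f)lex, the same idea can be mimicked in a hand-rolled tokenizer: an inclusive state keeps the whole default rule set and merely prepends a few overriding rules, whereas an exclusive state would replace the rule set entirely. A rough Python sketch of the IN_PAREN example above (simplified: no INDENT/DEDENT handling, no nesting depth):

import re

# The rules every state shares (the INITIAL rules, in flex terms).
DEFAULT_RULES = [
    (re.compile(r"\n"),           "NEWLINE"),
    (re.compile(r"[ \t]+"),       None),        # skip ordinary whitespace
    (re.compile(r"[A-Za-z_]\w*"), "NAME"),
    (re.compile(r"[()]"),         "PAREN"),
]

# An inclusive state adds overriding rules in front of the shared ones
# instead of replacing them wholesale (which is what an exclusive state does).
STATE_RULES = {
    "INITIAL":  DEFAULT_RULES,
    "IN_PAREN": [(re.compile(r"[ \t\n]+"), None)] + DEFAULT_RULES,
}

def tokenize(text):
    state, pos = "INITIAL", 0
    while pos < len(text):
        for pattern, kind in STATE_RULES[state]:
            m = pattern.match(text, pos)
            if m:
                if kind == "PAREN":
                    state = "IN_PAREN" if m.group() == "(" else "INITIAL"
                if kind is not None:
                    yield kind, m.group()
                pos = m.end()
                break
        else:
            raise SyntaxError("unexpected character %r" % text[pos])

# Inside parentheses a newline is just whitespace; outside it is a token.
print(list(tokenize("f(a\n  b)\nc")))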

Is Pug context free?

I was thinking of writing a Pug parser. Indentation is well known to be context-sensitive (though it can be trivially hacked with a lexer feedback loop to make the rest almost context-free, the approach Python takes), but what else, if anything, makes Pug not context-free?
XML tags are definitely not context-free, since each start tag needs to match its end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
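As a reminder of how little machinery that takes, here is the classic stack-based indentation-to-token transformation (a Python sketch; Pug's exact indentation rule may differ, as noted above):

def indent_tokens(lines):
    # Maintain a stack of indentation widths; emit INDENT when the width
    # grows and one DEDENT for every level that is closed.
    stack = [0]
    for line in lines:
        if not line.strip():
            continue                      # blank lines carry no indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", None)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", None)
        if width != stack[-1]:
            raise IndentationError("inconsistent dedent")
        yield ("LINE", line.strip())
    while stack[-1] > 0:                  # close whatever is open at EOF
        stack.pop()
        yield ("DEDENT", None)

for tok in indent_tokens(["html", "  body", "    p Hello", "  footer"]):
    print(tok)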
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants) either. If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid, but the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except for a finite list of special-case tagnames, you could theoretically do the parse with a CFG. (However, the various error-recovery rules in HTML-5 are far from context-free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

Antlr: lookahead and lookbehind examples

I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and is followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
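To illustrate the intended behaviour in ordinary regex notation (not ANTLR lexer syntax), lookarounds assert the surrounding context without consuming it, so the parentheses stay available as separate tokens:

import re

# (?<=...) and (?=...) look at the neighbouring characters without
# consuming them; ^ and $ cover the start/end-of-input cases.
OP = re.compile(r"(?:(?<=[\s()])|^)(AND|OR|NOT)(?=[\s()]|$)")

for text in ["x AND y", "(x)AND(y)", "NOT x", "NOT(x)", "xANDy", "abcNOTdef"]:
    print(text, "->", OP.findall(text))
# The first four inputs yield a match; the last two yield none.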
EDIT:
Per the comments, here's some context. The problem is related to this one: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered and run a totally different tokenizer on them. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't describe in advance what an ID is; each human language is different. I want to apply, in stages, a single query-language tokenizer and then a human-language tokenizer to whatever is left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider instead is NLP, which takes a different approach to processing natural language. It's more than just skipping things between two known tokens.

Standardizing "character set ranges" as internationally defined values

Let's say I have a field which accepts A-Z,a-z,0-9. If I'm trying to communicate to someone, via documentation or API creation, "what" my code can accept, I HAVE to say:
A-Z,a-z,0-9
Now, in my mind this is restrictive and error-prone.
Compare that to what I'm proposing.
Suppose A-Z,a-z,0-9 were allocated the "code" ANSI456.
When I'm communicating that to someone, I can say that my code accepts ANSI456. If someone else were developing a check, there would be no confusion about what my code can or cannot accept.
To those who will suggest just specifying character ranges: please note that what I'm envisioning will handle scenarios where even this is defined as a valid "code":
0-9, +, -, *, /
In fact, if it's done properly, we could have a site generate code automatically in various languages to accommodate the different "codes".
Okay, I KNOW there are ~infinite possible values, e.g.:
a-z
is different from
a-l,n-z
And these would have two different codes in this "system".
I'm not proposing a HUMAN-moderated system; it can be a completely automatic BUT systematic way of generating these "codes".
There already is such a standard, although it doesn't have the word "standard" in its name. It is called Perl 5 compatible regular expressions, and it is used in Perl 5, Java, JavaScript, libpcre and many other contexts.
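In other words, the pattern itself is the portable "code". A sketch of how the two sets from the question would be documented and checked (here with Python's re module, which is largely Perl-compatible):

import re

# The documented contract: the field accepts only A-Z, a-z, 0-9.
FIELD = re.compile(r"\A[A-Za-z0-9]+\Z")

# The second set from the question: digits plus the four operators.
CALC = re.compile(r"\A[0-9+\-*/]+\Z")

print(bool(FIELD.match("Abc123")))   # True
print(bool(FIELD.match("Abc 123")))  # False: the space is not in the set
print(bool(CALC.match("3+4*2")))     # True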
