I'm creating a lexer in the vein of StdLexical with a few changes in behavior (but for the purposes of my question, a demonstration of how to add it to StdLexical would be fine). I'm trying to add support for recording the positions of tokens, but I'm running into issues. If I simply try to add positioned, I get the not entirely unexpected error indicating that I can't use positioned with parsers whose results aren't Positional.
So: how do I constrain the input to my lexer so that it can use positional parsers? Or, if that is the wrong question to be asking: what is the best way to add positional information to StdLexical?
I was thinking of writing a Pug parser, but aside from the indentation, which is well known to be context-sensitive (and can be trivially handled with a lexer feedback loop that makes it almost context-free, the approach adopted by Python), what else makes it not context-free?
XML tags are definitely not context-free, since each start tag needs to match an end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags can only be omitted for a finite list of special-case tagnames, you could theoretically do the parse with a CFG. (However, the various error-recovery rules in HTML-5 are far from context-free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)
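To make the C example concrete, here is a small sketch (the names T and x are purely illustrative):

/* If a "typedef int T;" is in scope, this line declares x as a pointer to T.
 * If T is instead an ordinary variable, the very same line is an expression
 * multiplying T by x and discarding the result.  A parser that keeps both
 * alternatives in its parse forest can later discard whichever one is
 * inconsistent with the declarations actually seen in the program. */
T * x;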
I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and is followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
EDIT:
Per the comments, here's some context. The problem is related to this question: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered and run a totally different tokenizer on them. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't describe in advance what an ID is. Each human language is different. I want to combine the two in stages: first a single query-language tokenizer, and then a human-language tokenizer applied to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider instead is NLP, which takes a different approach to processing natural language; it involves more than just skipping things between two known tokens.
I'm trying to figure out the best approach to improving the errors displayed to a user of a Grako-generated parser. It seems like the default parse errors displayed by the Grako-generated parser when it hits some parsing issue in the input file are not helpful. The errors often seem to imply the issue is in one part of the input file when the true error is somewhere different.
I've been looking into the Grako Semantics class to put in some checks which would display better error messages if the checks fail, but it also seems like there could be tons of edge cases that must be specified to be able to catch all of the possible ways the parsing of a rule can fail.
Does anyone have any recommendations or examples I can view?
A PEG parser will exhaust all options, sometimes leaving you at a failure corresponding to the last, and least likely option.
With Grako, you can add cut elements (~) to the grammar to have the parser commit to certain options when it can be sure they are the ones to match.
term = '(' ~ expression ')' | int ;
Cut elements also prune the memoization cache, which improves parser performance.
Suppose I already have a complete YACC grammar. Let that be C grammar for example. Now I want to create a separate parser for domain-specific language, with simple grammar, except that it still needs to parse complete C type declarations. I wouldn't like to duplicate long rules from the original grammar with associated handling code, but instead would like to call out to the original parser to handle exactly one rule (let's call it "declarator").
If it was a recursive descent parser, there would be a function for each rule, easy to call in. But what about YACC with its implicit stack automaton?
Basically, no. Composing LR grammars is not easy, and bison doesn't offer much help.
But all is not lost. Nothing stops you from including the entire grammar (except the %start declaration), and just using part of it, except for one little detail: bison will complain about useless productions.
If that's a show-stopper for you, then you can use a trick to make it possible to create a grammar with multiple start rules. In fact, you can create a grammar which lets you specify the start symbol every time you call the parser; it doesn't even have to be baked in. Then you can tuck that into a library and use whichever parser you want.
Of course, this also comes at a cost: the cost is that the parser is bigger than it would otherwise need to be. However, it shouldn't be any slower, or at least not much -- there might be some cache effects -- and the extra size is probably insignificant compared to the rest of your compiler.
The hack is described in the bison FAQ in quite a lot of detail, so I'll just do an outline here: for each start production you want to support, you create one extra production which starts with a pseudo-token (that is, a lexical code which will never be generated by the lexer). For example, you might do the following:
%start meta_start
%token START_C START_DSL
meta_start: START_C c_start | START_DSL dsl_start;
Now you just have to arrange for the lexer to produce the appropriate START token when it first starts up. There are various ways to do that; the FAQ suggests using a global variable, but if you use a re-entrant flex scanner, you can just put the desired start token in the scanner state (along with a flag which is set when the start token has been sent).
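As a rough sketch of that arrangement (a hedged example rather than the FAQ's exact code: real_yylex, parse_c, and parse_dsl are made-up names, and the header name is assumed):

#include "parser.tab.h"   /* bison-generated header assumed to define
                             START_C and START_DSL */

int yyparse(void);        /* generated by bison */
int real_yylex(void);     /* the ordinary scanner, e.g. a flex-generated
                             yylex renamed with %option prefix */

static int start_token;   /* set before each call to yyparse() */

/* bison calls yylex(); this wrapper emits the chosen pseudo-token exactly
   once, then hands over to the real scanner. */
int yylex(void) {
    if (start_token) {
        int t = start_token;
        start_token = 0;
        return t;
    }
    return real_yylex();
}

int parse_c(void)   { start_token = START_C;   return yyparse(); }
int parse_dsl(void) { start_token = START_DSL; return yyparse(); }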
There can be two productions from which we can do the reduction. After giving precedence and associativity as required, there will be only one handle. So is this statement true?
This is partially true: a reduce/reduce conflict is usually resolved by specifying precedence or by letting the parser builder choose which rule to apply before the other.
This means that the conflict is resolved, but not that the parser is going to behave exactly as intended. It is worth studying what is causing the conflict and deciding whether the grammar needs refactoring to express what you are trying to parse, or whether the automatic choice/precedence is enough.
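To see why the automatic resolution may not be what you intended, consider a minimal, artificial grammar in the same yacc/bison notation as the snippet in the previous answer:

%token ID
%%
stmt: expr | decl ;
expr: ID ;
decl: ID ;

After reading an ID, the parser could reduce it either to expr or to decl, which is a reduce/reduce conflict. yacc and bison resolve it in favor of the rule that appears first in the grammar (expr here), so decl can never be matched: the conflict is "solved", but the grammar almost certainly needs rethinking.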
If you have a grammar which has ambiguous rules, you get multiple interpretations. You don't have to insist that the grammar removes ambiguity; you can simply agree that something is ambiguous and parse it multiple ways:
fruit flies like an arrow.
The result of the parse is multiple interpretations.
Now, for such a language to be useful to a reader, either he has to be happy with the ambiguity, or you need to give him a way to resolve it. (In the example, I've decided for you that you are happy with the ambiguity, because I haven't given you a way to resolve it!) Or one can give the reader of something with ambiguous parses a way to choose which parses make sense, so that he can reject the inappropriate ones.
I can do that for the above case by telling you that I mean "fruit => watermelon".
Computer grammars are no different, but most programmers don't want ambiguous code. So in general, language designers like to define unambiguous grammars. In practice, they don't always succeed, and you get funny language rules like, "If this could be interpreted ambiguously, then interpret it this way."