Writing a lexer for a context-sensitive markup language that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
  * This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its
input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have an iterator of the lines of the source and match on a per-line basis, and if a line such as
  * this is an indented list item
is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
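As a rough illustration of that per-line idea, a minimal Rust sketch might look like the following. The `Token` variants and the function `lex_bullet_line` are hypothetical names chosen for this example, and only the ASCII bullets `*`, `+` and `-` are handled, which is a simplifying assumption rather than the full rST rule.

```rust
/// Hypothetical token kinds, modelled on the sequence named in the question.
#[derive(Debug, PartialEq)]
enum Token {
    Indent(usize),    // number of leading spaces
    Bullet(char),     // the bullet character, e.g. '*'
    BodyText(String), // the rest of the line
}

/// Lex one line as `Indent, Bullet, BodyText` if it looks like a bullet item.
/// Returns `None` otherwise; a fuller lexer would then try other line patterns.
fn lex_bullet_line(line: &str) -> Option<Vec<Token>> {
    let rest = line.trim_start_matches(' ');
    let indent = line.len() - rest.len();
    let mut chars = rest.chars();
    let bullet = chars.next()?;
    // Assumption: only ASCII bullets are recognised in this sketch.
    if !matches!(bullet, '*' | '+' | '-') {
        return None;
    }
    // The bullet must be followed by a space before the item text.
    let text = chars.as_str().strip_prefix(' ')?;
    Some(vec![
        Token::Indent(indent),
        Token::Bullet(bullet),
        Token::BodyText(text.to_string()),
    ])
}
```

Whether an `Indent` of a given width opens a nested list or continues the current one is deliberately not decided here; that decision needs knowledge of the enclosing structure.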
Any thoughts on this?

I don't know much about rST. But you say it has "recursive" structures. If that's the case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.
But this is the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found).
So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163
If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what you are building is a sloppy parser disguised as a lexer. (You will probably still want a real parser to process the output of this "lexer").
Having a pushdown stack is actually useful if the language has different atoms in different contexts, especially if the contexts nest; in this case what you want is a mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
As an example you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of { ... } [...] and (... ). (This has arguably a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. Javascript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)
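A mode stack of this kind can be sketched in Rust roughly as follows. Everything here is an assumption made for illustration: the toy language (words in code, double-quoted strings, and `{ ... }` inside a string switching back to code), the `Mode` and `Token` names, and the `lex` function. It is not the lexer described in this answer.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Code, // whitespace-separated words
    Str,  // raw text inside double quotes
}

#[derive(Debug, PartialEq)]
enum Token {
    Word(String), // produced only in Code mode
    Text(String), // produced only in Str mode
    Quote,        // '"' enters or leaves Str mode
    Open,         // '{' inside a string pushes Code mode
    Close,        // '}' pops back to the enclosing string
}

/// Emit the buffered characters as the lexeme type of the current mode.
fn flush(buf: &mut String, mode: Mode, out: &mut Vec<Token>) {
    if buf.is_empty() {
        return;
    }
    let s = std::mem::take(buf);
    out.push(match mode {
        Mode::Code => Token::Word(s),
        Mode::Str => Token::Text(s),
    });
}

fn lex(src: &str) -> Vec<Token> {
    let mut modes = vec![Mode::Code]; // the mode stack; never empty
    let mut out = Vec::new();
    let mut buf = String::new();
    for c in src.chars() {
        let mode = *modes.last().unwrap_or(&Mode::Code);
        match (mode, c) {
            (Mode::Code, '"') => {
                flush(&mut buf, mode, &mut out);
                out.push(Token::Quote);
                modes.push(Mode::Str);
            }
            (Mode::Str, '"') => {
                flush(&mut buf, mode, &mut out);
                out.push(Token::Quote);
                modes.pop();
            }
            (Mode::Str, '{') => {
                flush(&mut buf, mode, &mut out);
                out.push(Token::Open);
                modes.push(Mode::Code);
            }
            // Only a nested Code mode may be closed by '}'.
            (Mode::Code, '}') if modes.len() > 1 => {
                flush(&mut buf, mode, &mut out);
                out.push(Token::Close);
                modes.pop();
            }
            // Whitespace separates words in code but is kept inside strings.
            (Mode::Code, c) if c.is_whitespace() => flush(&mut buf, mode, &mut out),
            (_, c) => buf.push(c),
        }
    }
    let mode = *modes.last().unwrap_or(&Mode::Code);
    flush(&mut buf, mode, &mut out);
    out
}
```

The same character (a space, say) yields different lexemes depending on the mode on top of the stack, and because modes are pushed and popped, strings containing code containing strings nest without extra states.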
A special case of the stack occurs if you only have to track a single kind of "(" ... ")" pair or the equivalent. In this case you can use a counter to record how many "pushes" or "pops" you might have done on a real stack. If you have two or more kinds of token pairs like this, a counter is not enough, because it cannot remember which kind of pair was opened most recently.
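The difference between the counter and the stack can be shown with two small Rust functions. The names `parens_balanced` and `brackets_balanced` are made up for this sketch, and both functions ignore every character that is not a bracket.

```rust
/// One kind of bracket: a counter stands in for the stack depth.
fn parens_balanced(src: &str) -> bool {
    let mut depth: usize = 0;
    for c in src.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                if depth == 0 {
                    return false; // a "pop" on an empty stack
                }
                depth -= 1;
            }
            _ => {}
        }
    }
    depth == 0
}

/// Several kinds of brackets: the stack records which closer is expected next.
fn brackets_balanced(src: &str) -> bool {
    let mut expected = Vec::new();
    for c in src.chars() {
        match c {
            '(' => expected.push(')'),
            '[' => expected.push(']'),
            '{' => expected.push('}'),
            ')' | ']' | '}' => {
                if expected.pop() != Some(c) {
                    return false; // wrong closer, or nothing was open
                }
            }
            _ => {}
        }
    }
    expected.is_empty()
}
```

An input such as `([)]` has matching counts for each bracket kind, so per-kind counters would accept it; only the stack version sees that `)` arrives while `]` is expected.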

Related

Why would I use a lexer and not directly parse code?

I am trying to create a simple programming language from scratch (interpreter) but I wonder why I should use a lexer.
To me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g * (usually multiply), '(', ')', ;
keywords, e.g., "IF", "GOTO"
identifiers, e.g. FOO, count, ...
numbers, e.g. 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines, that are ignored
As a practical matter, it takes a specific chunk of code to scan for/collect the characters that make each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognize the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parser place, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call, that itself invokes the token recognizing code for all tokens, and returns an enum that identifies the specific token encountered. Now your parser starts to look reasonable: at each parser point, it makes one call on GetNextToken, and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
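In Rust, that "GetNextToken" interface might be sketched as below. The `Token` variants, the `Lexer` type and its `next_token` method are illustrative names, and the token set (unsigned integers, identifiers, single-character operators) is an assumption kept small on purpose.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, PartialEq)]
enum Token {
    Number(u64),
    Ident(String),
    Op(char),
    Eof,
}

struct Lexer<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { chars: src.chars().peekable() }
    }

    /// The single entry point the parser calls at every decision point.
    fn next_token(&mut self) -> Token {
        // Whitespace is skipped in exactly one place.
        while self.chars.next_if(|c| c.is_whitespace()).is_some() {}
        let c = match self.chars.peek() {
            Some(&c) => c,
            None => return Token::Eof,
        };
        if c.is_ascii_digit() {
            // Note: very long digit runs would overflow u64 in this sketch.
            let mut n = 0u64;
            while let Some(d) = self.chars.peek().and_then(|c| c.to_digit(10)) {
                n = n * 10 + u64::from(d);
                self.chars.next();
            }
            Token::Number(n)
        } else if c.is_alphabetic() || c == '_' {
            let mut s = String::new();
            while let Some(c) = self.chars.next_if(|c| c.is_alphanumeric() || *c == '_') {
                s.push(c);
            }
            Token::Ident(s)
        } else {
            self.chars.next();
            Token::Op(c)
        }
    }
}
```

A parser built on this would call `next_token` once per decision point and `match` on the returned enum, with no whitespace or character-level code of its own.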
One thing you will discover is the token-lexers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing") esp. white space suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or, you can use a tool like FLEX to do it for you, that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.
Well, for an intelligently simplified programming language you can get away without either lexer or parser :-) Not kidding. Look up Forth. You can start with tags here on SO (gforth is GNU's) and then go to the Standard's site which has pointers to a few interpreters, sites and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
Interpreter also compiles (when you invoke words that switch system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. It rarely absorbs one following word (== lookahead is max 1). The "words" (aka tokens) are at the same time keywords and variable names and they all live in a Dictionary. There's a whole online book at that site (plus pdf).
Control structures are also just words (they compile a few addresses and jumps on the fly).
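To give a feel for how little machinery this needs, here is a toy Forth-style evaluator in Rust. It is only a sketch under several assumptions: the dictionary holds three built-in words, there is no compilation mode or user-defined words, and names such as `run`, `Word` and `Stack` are invented for the example.

```rust
use std::collections::HashMap;

type Stack = Vec<i64>;
type Word = fn(&mut Stack);

// Built-in words. On stack underflow they do nothing useful; this sketch
// does not report errors.
fn add(s: &mut Stack) {
    if let (Some(b), Some(a)) = (s.pop(), s.pop()) {
        s.push(a + b);
    }
}

fn mul(s: &mut Stack) {
    if let (Some(b), Some(a)) = (s.pop(), s.pop()) {
        s.push(a * b);
    }
}

fn dup(s: &mut Stack) {
    if let Some(&top) = s.last() {
        s.push(top);
    }
}

/// Interpret whitespace-separated words: look each one up in the dictionary,
/// and fall back to treating it as a number. No grammar, no parse tree.
fn run(src: &str) -> Stack {
    let mut dict: HashMap<&str, Word> = HashMap::new();
    dict.insert("+", add);
    dict.insert("*", mul);
    dict.insert("dup", dup);

    let mut stack = Stack::new();
    for word in src.split_whitespace() {
        if let Some(&f) = dict.get(word) {
            f(&mut stack);
        } else if let Ok(n) = word.parse::<i64>() {
            stack.push(n);
        }
        // Unknown words are silently skipped in this sketch.
    }
    stack
}
```

Running `run("2 3 + dup *")` leaves a single value, 25, on the stack: the "lexer" is `split_whitespace` and the "parser" is a dictionary lookup.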
You can find old Journals there as well, covering a wide spectrum from machine code generation to object oriented extensions. Yes still without parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems which were reducing words to machine call instructions with immediate addressing (makes the engine run 2-4 times faster) but even plain interpreters were always considered to be fast. One is apparently still active - SwiftForth, but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Win, Linux and Mac, 32 and 64 bit, so you can just download and run. It claims to have a 16-bit build as well :-) For embedded systems I suppose.

Is Pug context free?

I was thinking to make a Pug parser but besides the indents are well-known to be context-sensitive (that can be trivially hacked with a lexer feedback loop to make it almost context-free which is adopted by Python), what otherwise makes it not context-free?
XML tags are definitely not context-free, that each starting tag needs to match an end tag, but Pug does not have such restriction, that makes me wonder if we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, then you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

Parsing special cases

If I understand correctly, parsing turns a sequence of symbols into a tree. My question is, is it possible to use some standard procedure (LR, LL, PEG, ..?) to parse the following two examples or is it necessary to write a specialized parser by hand?
Python source code, i.e. the whitespace-indented blocks
I think I read somewhere that the parser keeps track of the number of leading spaces, and pretends to replace them with curly brackets to delimitate the blocks. Is it fundamentally required because the standard parsing techniques are not powerful enough or is it for performance reasons?
PNG image format, where a block starts with a header and block size, after which there is the content of the block
The content could contain bytes which resemble some header so it is necessary to "know" that the next x bytes are not to be "parsed", i.e. they should be skipped. How to express this, say, with PEG? In other words, the "closing bracket" is represented by the length of the content.
Neither of the examples in the question are context-free, so strictly speaking they cannot be parsed with context-free grammars. But in practical terms, they are both pretty easy to parse.
The Python algorithm is well described in the Python reference manual (although you need to read that in context). What's described there is a pre-processing step in which whitespace at the beginning of a line is systematically replaced with INDENT and DEDENT tokens.
To clarify: It's not really a preprocessing step, and it's important to observe that it happens after implicit and explicit line joining. (There are previous sections in the reference manual which describe these procedures.) In particular, lines are implicitly joined inside parentheses, braces and brackets, so the process is intertwined with parsing.
In practical terms, both the line-joining and indentation algorithms can be accomplished programmatically; typically, these would be done inside a custom scanner (tokenizer) which maintains both a stack of parentheses and indent levels. The token stream can then be parsed with normal context-free algorithms, but the tokenizer -- although it might use regular expressions -- needs context-sensitive logic (counting spaces, for example). [Note 1]
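The indentation half of such a tokenizer can be sketched in Rust as follows. This is a simplified illustration, not Python's actual algorithm: it assumes indentation is made of spaces only, skips blank lines, and leaves out line joining inside brackets entirely. The names `Token` and `layout` are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Indent,
    Dedent,
    Line(String), // the line's content with leading spaces removed
}

/// Turn leading whitespace into explicit Indent/Dedent tokens
/// using a stack of indentation widths.
fn layout(src: &str) -> Result<Vec<Token>, String> {
    let mut levels = vec![0usize]; // the bottom entry 0 is never popped
    let mut out = Vec::new();
    for (i, raw) in src.lines().enumerate() {
        let text = raw.trim_start_matches(' ');
        if text.is_empty() {
            continue; // blank lines do not affect indentation
        }
        let width = raw.len() - text.len();
        if width > *levels.last().unwrap() {
            levels.push(width);
            out.push(Token::Indent);
        } else {
            // Close every block that is deeper than this line.
            while width < *levels.last().unwrap() {
                levels.pop();
                out.push(Token::Dedent);
            }
            // The line must land exactly on an enclosing level.
            if width != *levels.last().unwrap() {
                return Err(format!("line {}: inconsistent dedent", i + 1));
            }
        }
        out.push(Token::Line(text.to_string()));
    }
    // Close any blocks still open at end of input.
    for _ in 1..levels.len() {
        out.push(Token::Dedent);
    }
    Ok(out)
}
```

The resulting token stream uses `Indent` and `Dedent` as an ordinary bracket pair, which a context-free parser can handle; the counting of spaces stays in the tokenizer.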
Similarly, formats which contain explicit sizes (such as most serialization formats, including PNG files, Google protobufs, and HTTP chunked encoding) are not context-free, but are obviously easy to tokenize since the tokenizer simply has to read the length and then read that many bytes.
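A length-driven tokenizer of this kind might look like the Rust sketch below. It assumes a PNG-like chunk layout (4-byte big-endian length, 4-byte type, payload, 4-byte CRC) with the file signature already stripped, and it does not verify the CRC. `Chunk` and `read_chunks` are names invented for the example.

```rust
/// One chunk: a 4-byte type tag and a borrowed payload.
#[derive(Debug, PartialEq)]
struct Chunk<'a> {
    kind: [u8; 4],
    data: &'a [u8],
}

/// Split `input` into chunks laid out as
/// length (4 bytes, big-endian) | type (4 bytes) | data (length bytes) | CRC (4 bytes).
fn read_chunks(mut input: &[u8]) -> Result<Vec<Chunk<'_>>, String> {
    let mut chunks = Vec::new();
    while !input.is_empty() {
        if input.len() < 8 {
            return Err("truncated chunk header".to_string());
        }
        let len = u32::from_be_bytes([input[0], input[1], input[2], input[3]]) as usize;
        let kind = [input[4], input[5], input[6], input[7]];
        let rest = &input[8..];
        // The payload plus the 4 CRC bytes must fit in what is left.
        if rest.len() < 4 || rest.len() - 4 < len {
            return Err("truncated chunk body".to_string());
        }
        // The payload is taken by length and never inspected, so bytes
        // inside it that resemble a header cannot confuse the reader.
        chunks.push(Chunk { kind, data: &rest[..len] });
        input = &rest[len + 4..]; // skip payload and (unchecked) CRC
    }
    Ok(chunks)
}
```

The length field plays the role of the closing bracket: the reader never has to search for an end marker.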
There are a variety of context-sensitive formalisms, and these definitely have their uses, but in practical parsing the most common strategy is to use a Turing-equivalent formalism (such as any programming language, possibly augmented with a scanner-generator like flex) for the tokenizer and a context-free formalism for the parser. [Note 2]
Notes:
1. It may not be immediately obvious that Python indenting is not context-free, since context-free grammars can accept some categories of agreement. For example, {ωω⁻¹ | ω ∈ Σ*} (the language of all even-length palindromes) is context-free, as is {aⁿbⁿ}.
However, these examples can't be extended, because the only count-agreement possible in a context-free language is bracketing. So while palindromes are context-free (you can implement the check with a single stack), the apparently very similar {ωω | ω ∈ Σ*} is not, and neither is {aⁿbⁿcⁿ}.
2. One such formalism is back-references in "regular" expressions, which might be available in some PEG implementation. Back-references allow the expression of a variety of context-sensitive languages, but do not allow the expression of all context-free languages. Unfortunately, regular expressions with back-references really suck in practice, because the problem of determining whether a string matches a regex with back-references is NP complete. You might find this question on a sister SE site interesting. (And you might want to reformulate your question in a way that could be asked on that site, http://cs.stackexchange.com.)
As a practical matter, almost all parser construction requires some clever hacks around the edges to overcome the limitations of the parsing machinery.
Pure context free parsers can't do Python; all the parser technologies you have listed are weaker than pure-context free, so they can't do it either. A hack in the lexer to keep track of indentation, and generate INDENT/DEDENT tokens, turns the indenting problem into explicit "parentheses", which are easily handled by context-free parsers.
Most binary files can't be processed either, as they usually contain, somewhere, a list of length N, where N is provided before the list body is encountered (this is kind of the example you gave). Again, you can get around this, with a more complicated hack; something must keep a stack of nested list lengths, and the parser has to signal when it moves from one list element to the next. The top-most length counter gets decremented, and the parser gets back a signal "reduce" or "shift". Other more complex linked structures are generally pretty hard to parse this way. Getting the parser to cooperate this way isn't always easy.

Lexical Analysis of a Scripting Language

I am trying to create a simple script for a resource API. The resource API mainly creates game resources in a structured manner. What I want is to work with this API without creating C++ programs each time I want a resource. So we (my instructor from uni and I) decided to create a simple script language to create/edit resource files without compiling every time. There are also some other, irrelevant factors that mean I need a command line interface rather than a GUI program.
Anyway, here is script sample:
<path>.<command> -<options>
/Graphics[3].add "blabla.png"
I didn't design this script language; the owner of the API did. The part before '.', as you can guess, is the path, and the part after '.' is the actual command and some options, flags, etc. As a first step, I tried to create a grammar for the left part because I thought I could use it while searching for info about lexical analyzers and parsers. The problem is I am inexperienced when it comes to parsing and programming languages, and I am not sure if it's correct or not. Here are some more examples and the grammar of the left side.
dir -> '/' | '/' path
path -> object '/' path | object
object -> number | string '[' number ']'
The notation of this grammar may be a mess, I don't know. There are 5 different possibilities; they are:
String
"String"
Number
String[Number]
"String"[Number]
It has to start with '/' symbol and if it's the only symbol, I will accept it as Root.
Now my problem is: how can I lexically analyze this script? Is there a special method? What should my lexical analyzer do and not do? (I read that some lexical analysers also do syntactic analysis up to a point.) Do you think the grammar, etc. is technically appropriate? What kind of parsing method should I use (recursive descent, LL, etc.)? I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
What should my lexical analyzer do and not do?
It should:
recognize tokens
ignore ignorable whitespace and comments (if there are such things)
optionally, keep track of source location in order to produce meaningful error messages.
It should not attempt to parse the input, although that will be very tempting with such a simple language.
From what I can see, you have the following tokens:
punctuation: /, ., linear-white-space, new-line
numbers
unquoted strings (often called "atoms" or "ids")
quoted strings (possibly the same token type as unquoted strings)
I'm not sure what the syntax for -options is, but that might include more possibilities.
Choosing to return linear-white-space (that is, a sequence consisting only of tabs and spaces) as a token is somewhat questionable; it complicates the grammar considerably, particularly since there are probably places where white-space is ignorable, such as the beginning and end of a line. But I have the intuition that you do not want to allow whitespace inside of a path and that you plan to require it between the command name and its arguments. That is, you want to prohibit:
/left /right[3] .whimper "hello, world"
/left/right[3].whimper"hello, world"
But maybe I'm wrong. Maybe you're happy to accept both. That would be simpler, because if you accept both, then you can just ignore linear-whitespace altogether.
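For the simpler variant, in which linear whitespace is just skipped, a hand-written tokenizer could be sketched in Rust like this. Several things are assumptions of the sketch: `[` and `]` are treated as punctuation tokens, quoted strings have no escape sequences, atoms are ASCII alphanumeric only, and the `-options` syntax and new-line handling are left out because they are not specified above. `Token` and `tokenize` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Slash,
    Dot,
    LBracket,
    RBracket,
    Number(u32),
    Atom(String),   // unquoted string
    Quoted(String), // contents of a "..." string, without the quotes
}

fn tokenize(src: &str) -> Result<Vec<Token>, String> {
    let mut out = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            ' ' | '\t' => {} // linear whitespace is ignored everywhere
            '/' => out.push(Token::Slash),
            '.' => out.push(Token::Dot),
            '[' => out.push(Token::LBracket),
            ']' => out.push(Token::RBracket),
            '"' => {
                // Collect everything up to the closing quote.
                let mut s = String::new();
                loop {
                    match chars.next() {
                        Some('"') => break,
                        Some(ch) => s.push(ch),
                        None => return Err("unterminated string".to_string()),
                    }
                }
                out.push(Token::Quoted(s));
            }
            c if c.is_ascii_alphanumeric() => {
                let mut s = String::from(c);
                while let Some(ch) = chars.next_if(|ch| ch.is_ascii_alphanumeric()) {
                    s.push(ch);
                }
                // A run that parses as an integer is a Number, otherwise an Atom.
                match s.parse::<u32>() {
                    Ok(n) => out.push(Token::Number(n)),
                    Err(_) => out.push(Token::Atom(s)),
                }
            }
            other => return Err(format!("unexpected character {:?}", other)),
        }
    }
    Ok(out)
}
```

Note that the function only recognizes tokens; whether `Slash Atom LBracket Number RBracket` forms a valid path is left to the parser.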
By the way, experience has shown that using new-line to separate commands can be awkward; sooner or later you will need to break a command into two lines in order to avoid having to buy an extra monitor to see the entire line. The convention (used by bash and the C preprocessor, amongst others) of putting a \ as the last character on a line to be continued is possible, but can lead to annoying bugs (like having an invisible space following the \ and thus preventing it from really continuing the line).
From here down is 100% personal opinion, offered for free. So take it for what it's worth.
I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
There is a contradiction here, in my opinion. Or perhaps two contradictions.
A technically appropriate piece of work would use standard tools; at least a lexical generator and probably a parser generator. It would do that because, properly used, the lexical and grammatical descriptions provided to the tools document precisely the actual language, and the tools guarantee that the desired language is what is actually recognized. Writing ad hoc code, even simple lexical recognizers and recursive descent parsers, for all that it can be elegant, is less self-documenting, less maintainable, and provides fewer guarantees of correctness. Consequently, best practice is "use standard tools".
Secondly, I disagree with your instructor (if I understand their proposal correctly, based on your comments) that writing ad hoc lexers and parsers aids in understanding lexical and parsing theory. In fact, it may be counterproductive. Bottom-up parsing, which is incredibly elegant both theoretically and practically, is almost impossible to write by hand and totally impossible to read. Consequently, many programmers prefer to use recursive-descent or Pratt parsers, because they understand the code. However, such parsers are not as powerful as a bottom-up parser (particularly GLR or Earley parsers, which are fully general), and their use leads to unnecessary grammatical compromises.
You don't need to write a regular expression library to understand regular expressions. The libraries abstract away the awkward implementation details (and there are lots of them, and they really are awkward) and let you concentrate on the essence of creating and using regular expressions.
In the same way, you do not need to write a compiler in order to understand how to program in C. After you have a good basis in C, you can improve your understanding (maybe) by understanding how it translates into machine code, but unless you plan a career in compiler writing, knowing the details of obscure optimization algorithms is not going to make you a better programmer. Or, at least, it's not first on your agenda.
Similarly, once you really understand regular expressions, you might find writing a library interesting. Or not -- you might find it incredibly frustrating and give up after a couple of months of hard work. Either way, you will appreciate existing libraries more. But learn to use the existing libraries first.
And the same with parser generators. If you want to learn how to translate an idea for a programming language into something precise and implementable, learn how to use a parser generator. Only after you have mastered the theory of parsing should you even think of focusing on low-level implementations.

How can a lexer extract a token in ambiguous languages?

I wish to understand how a parser works. I learnt about the LL, LR(0), LR(1) parts, how to build NFAs, DFAs, parse tables, etc.
Now the problem is, I know that a lexer should extract tokens only on the parser's demand in some situations, when it's not possible to extract all the tokens in one separate pass. I don't exactly understand this kind of situation, so I'm open to any explanation about this.
The question now is, how should a lexer do its job? Should it base its recognition on the current "context", the current non-terminals supposed to be parsed? Is it something totally different?
What about GLR parsing: is it another case where a lexer could try different terminals, or is it only a syntactic business?
I would also like to understand what it's related to; for example, is it related to the kind of parsing technique (LL, LR, etc.) or only to the grammar?
Thanks a lot
The simple answer is that lexeme extraction has to be done in context. What one might consider to be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.
The lexer thus has to somehow know what part of the language is being processed, to know what lexemes to collect. This is often handled by having lexing states which each process some subset of the entire language set of lexemes (often with considerable overlap in the subset; e.g., identifiers tend to be pretty similar in my experience). These states form a high level finite state machine, with transitions between them when phase changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of the COBOL program. Modern languages like Java and C# minimize the need for this but most other languages I've encountered really need this kind of help in the lexer.
So-called "scannerless" parsers (you are thinking "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down to the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write for a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice gets eventually resolved. So what really happens in the case of "ambiguous tokens" is that the grammar rules for both "tokens" produce nonterminals for their respective "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state-transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.
As practical extensions, our lexers actually use push-down-stack automata for the high level state machine rather than mere finite state machines. This helps when one has a high level FSA whose substates are identical, and where it is helpful for the lexer to manage nested structures (e.g., match parentheses) to manage a mode switch (e.g., when the parentheses have all been matched).
A unique feature of our lexers: we also do a little tiny bit of what scannerless parsers do: sometimes when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (this simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy way to handle "keywords in context otherwise interpreted as identifiers", which occurs in many, many languages.
Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.
This isn't always so simple, so you have some tools to help you out:
Start conditions
A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.
A typical example of this is string literal lexing; when you parse a string literal, the rules for tokenising usually become completely different to the language containing them. This is an example of an exclusive start condition.
You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context.
Lexical tie-ins
This is a fancy name for carrying state in the lexer, and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in lexer actions returning different tokens. This should be avoided unless necessary, because it makes your lexer and parser both more difficult to reason about, and makes some things (like GLR parsers) impossible.
The upside is that you can do things that would otherwise require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can go some way to solving your problem of what you see as an "ambiguous" grammar.
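A classic instance of a lexical tie-in is the handling of typedef names in C, where the parser tells the lexer which identifiers currently name types. The Rust sketch below shows only that feedback channel; the `Lexer` type, its `declare_type` and `classify` methods, and the two-variant `Token` are hypothetical and far simpler than a real C front end.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Token {
    TypeName(String),
    Ident(String),
}

/// Lexer-side state that the parser is allowed to modify.
struct Lexer {
    type_names: HashSet<String>,
}

impl Lexer {
    fn new() -> Self {
        Lexer { type_names: HashSet::new() }
    }

    /// Called from a parser action, e.g. after a typedef-like
    /// declaration has been recognized.
    fn declare_type(&mut self, name: &str) {
        self.type_names.insert(name.to_string());
    }

    /// The same spelling yields a different token depending on
    /// what the parser has fed back so far.
    fn classify(&self, word: &str) -> Token {
        if self.type_names.contains(word) {
            Token::TypeName(word.to_string())
        } else {
            Token::Ident(word.to_string())
        }
    }
}
```

The cost mentioned above is visible even here: the token for a word now depends on how far the parser has progressed, so the lexer can no longer run ahead of the parser or be tested in isolation from it.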
Logic, reasoning
It's probable that it is possible to lex it in one pass, and the above tools should come second to thinking about how you should be tokenising the input and trying to convert that into the language of lexical analysis. :)
The fact is, your input is comprised of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.

Resources