building a lexer with very many tokens - parsing

I've been searching for two hours now and I don't really know what to do.
I'm trying to build an analyser which uses a lexer that can match several thousand words. Those are natural language words, which is why there are so many of them.
I first tried the simple way, with just 1000 different matches for one token:
TOKEN :
{
<VIRG: ",">
| <COORD: "et">
| <ADVERBE: "vraiment">
| <DET: "la">
| <ADJECTIF: "bonne">
| <NOM: "pomme"
| "émails"
| "émaux"
| "APL"
| "APLs"
| "Acide"
| "Acides"
| "Inuk"
[...]
After compiling with javac, it reports that the code is too large.
So, how could I manage thousands of tokens in my lexer?
1. I've read that it is more efficient to use n tokens for each word than to use one token for n words. But in this case I would have rules with 1000+ tokens, which doesn't look like a better idea.
2. I could modify the token manager, or build one, so it just matches words in a list. Here I know that the lexer is a finite state machine, and that this is why it's not possible, so is there any way to use another lexer?
3. I could automatically generate a huge regular expression which matches every word, but that wouldn't let me handle the words independently afterwards, and I'm not sure that writing a 60-line regular expression would be a great idea.
4. Maybe there is a way to load the tokens from a file? This solution is pretty close to solutions 2 and 3.
5. Maybe I should use another language? I'm trying to migrate from XLE (which can handle a lexicon of more than 70,000 tokens) to Java, and what is interesting here is to generate Java files!
So there it is: I can't find a way to handle several thousand tokens with a JavaCC lexer. It would be great if anyone is used to that and has an idea.
Best
Corentin

I don't know how javacc builds its DFA, but it's certain that a DFA capable of distinguishing thousands of words would be quite large. (But by no means unreasonably large: I've gotten flex to build DFAs with hundreds of thousands of states without major problems.)
The usual approach to lexicons with a huge number of fixed lexemes is to use the DFA to recognize a potential word (e.g., a sequence of alphabetic characters) and then look the word up in a dictionary to get the token type. That's also more flexible because you can update the dictionary without recompiling.
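As a rough sketch of that dictionary-lookup idea (the class and token names here are invented, not JavaCC output; in practice the dictionary entries would be loaded from a file):

import java.util.*;
import java.util.regex.*;

public class Lexicon {
    enum TokenType { VIRG, COORD, ADVERBE, DET, ADJECTIF, NOM, UNKNOWN }

    private final Map<String, TokenType> dictionary = new HashMap<>();

    Lexicon() {
        // These would normally come from a lexicon file, so the word list
        // can change without recompiling anything.
        dictionary.put("et", TokenType.COORD);
        dictionary.put("vraiment", TokenType.ADVERBE);
        dictionary.put("la", TokenType.DET);
        dictionary.put("bonne", TokenType.ADJECTIF);
        dictionary.put("pomme", TokenType.NOM);
        dictionary.put("émails", TokenType.NOM);
    }

    TokenType classify(String word) {
        return dictionary.getOrDefault(word, TokenType.UNKNOWN);
    }

    public static void main(String[] args) {
        Lexicon lexicon = new Lexicon();
        // The "DFA" part: one generic rule matching a run of letters, plus the comma.
        Matcher m = Pattern.compile("[\\p{L}]+|,").matcher("la bonne pomme , et vraiment");
        while (m.find()) {
            String w = m.group();
            TokenType t = w.equals(",") ? TokenType.VIRG : lexicon.classify(w);
            System.out.println(w + " -> " + t);
        }
    }
}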

Related

Parsing expressions - unary, binary and incrementing/decrementing operators

I am trying to write a parser for my language (for learning and fun). The problem is that I don't know how to parse expressions like a--b or a----b, where there are multiple operators made from the - character: unary minus (-x), binary minus (x-y), pre-decrement (--x, like in C) and post-decrement (x--). Both a--b and a----b should be valid and produce:
a--b  ->     sub             a----b  ->      sub
           +--+--+                        +----+----+
           a     neg                    decr       neg
                  |                       |         |
                  b                       a         b
When the lexer tokenizes a--b it does not know whether it is the decrement operator or the minus sign repeated two times, so the parser must find out which one it is.
How could I determine if - is part of decrement operator or just minus sign?
The problem is not really parsing so much as deciding the rules.
Why should a----b be a-- - -b and not a - -(--b)? For that matter, should a---b be a-- - b or a - --b?
And what about a---3 or 3---a? Neither 3-- nor --3 make any sense, so if the criteria were "choose (one of) the sensible interpretations", you'd end up with a-- - 3 and 3 - --a. But even if that were implementable without excess effort, it would place a huge cognitive load on coders and code readers.
Once upon a time, submitting a program for execution was a laborious and sometimes bureaucratic process, and having a run cancelled because the compiler couldn't find the correct interpretation was enormously frustrating. I still carry the memories of my student days, waiting in a queue to hand my programs to a computer operator and then in another queue to receive the printed results.
So it became momentarily popular to create programming languages which went to heroic lengths to find a valid interpretation of what they were given. But that effort also meant that some bugs passed without error, because the programmer and the programming language had different understandings of what the "natural interpretation" might be.
If you program in C/C++, you may well have at some time written a & 3 == 0 instead of (a & 3) == 0. Fortunately, modern compilers will warn you about this bug, if warnings are enabled. But it's at least reasonable to ask whether the construct should even be permitted. Although it's a little annoying to have to add parentheses and recompile, it's not nearly as frustrating as trying to debug the obscure behaviour which results. Or to have accepted the code in a code review without noticing the subtle error.
These days, the compile / test / edit cycle is much quicker, so there's little excuse for not insisting on clarity. If I were writing a compiler today, I'd probably flag as an error any potentially ambiguous sequence of operator characters. But that might be going too far.
In most languages, a relatively simple rule is used: at each point in the program, the lexical analysis chooses the longest possible token, whether or not it "makes sense". That's what C and C++ do (mostly) and it has the advantage of being easy to implement and also easy to verify. (Even so, in a code review I would insist that a---b be written as a-- -b.)
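As a sketch of how that longest-match rule plays out on a run of '-' characters (the token names here are invented for illustration):

import java.util.*;

public class MaximalMunch {
    public static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int j = i;
                while (j < s.length() && Character.isLetterOrDigit(s.charAt(j))) j++;
                tokens.add("IDENT(" + s.substring(i, j) + ")");
                i = j;
            } else if (c == '-') {
                // Longest match: take "--" if it is available, otherwise "-".
                if (i + 1 < s.length() && s.charAt(i + 1) == '-') {
                    tokens.add("DECR");
                    i += 2;
                } else {
                    tokens.add("MINUS");
                    i += 1;
                }
            } else {
                i++; // skip spaces and anything else in this sketch
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a---b"));  // [IDENT(a), DECR, MINUS, IDENT(b)]
        System.out.println(tokenize("a----b")); // [IDENT(a), DECR, DECR, IDENT(b)] - whether or not it "makes sense"
    }
}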
You could slightly modify this rule so that only the first pair of -- is taken as a token, which would capture some of your desired parses without placing too much load on the code reader.
You could use even more complicated rules, but remember that whatever you implement, you have to document. If it's too hard to document clearly, it's probably unsuitable.
Once you have articulated your list of rules, you can work on the implementation. In many cases, the simplest approach is to just try the possibilities in order, with backtracking, or in parallel. Or you might be able to precompute the possible parses.
Alternatively, you could use a GLR or GLL parser generator capable of finding all parses in an ambiguous grammar, and then select the "best" parse based on whatever criteria you prefer.

Should a Lexer be able to distinguish between Syntax Tokens contained in a "variable" and actual Syntax Tokens

I am writing a lexer for a simple language (Gherkin).
While some of the lexer is done, I am struggling with a design decision.
Currently, the lexer has an examples and a step mode.
That means it has to track context, which I would rather not do.
I want to make the lexer as dumb as possible, so that most of the work is done by the parser.
My problem with the current approach is that I don't know if the lexer should distinguish Syntax and Literals in certain cases.
For a better understanding, here is a brief overview of the language.
The language has syntax tokens like: : < > | #.
The language can have variables, written as <Name>.
The language has an examples section, where syntax tokens differ from the rest of the test case
An example table looks like this:
Examples:
| Name | Last Name |
| John | Doe |
A full test (with unneeded information stripped out) written in Gherkin looks like this:
#Fancy-Test
Scenario Outline: User logs in
Given user is on login_view
And user enters <Username> in username_field
And user enters <Password> in password_field
And user answers <Qu|estion>
When user clicks on login_button
Then user is logged in
Examples:
|Username|Password|Qu\|estion|
|JohnDoe11|Test<Pass>##Word|Who am I|
Note how I escaped | in the first Examples column.
Also take note of all the syntax characters in the password example.
By escaping the | character, I can use it in the examples part of the test without it getting detected as a Syntax Token.
But for the variable in line And user answers <Qu|estion> I don't need or want to escape it.
By language specification, the example entries can contain any character, except |, unless escaped, as it marks the end of a column.
That means no other syntax character should be detected as a Syntax Token.
Without two modes, all the syntax characters in the password example would be detected as such tokens.
The opposite is the case for the other part of the tests.
Unless at the start of a new line (where # and : are Syntax Tokens), only < and > should be considered part of the syntax.
The current implementation prevents this by having the two modes mentioned, which is not the best solution.
My question therefore is:
Should the lexer just detect these as Syntax Tokens, which then get picked up by the parser, which figures out that they are actually part of the literal?
Or is having context the preferable way?
Thank you for answering.
If you have two different lexical environments, then you have two different lexical environments. They need to be handled differently. Almost all real-world programming languages feature this kind of complication, and most lexer generators have mechanisms designed to help maintain a moderate amount of lexical state.
The problem is figuring out how to do the transitions between the different lexical contexts. As you note, that can be a lot of work, which is ugly. If it's really ugly, you might want to revisit your language design, because it is not just your parser which has to be able to predict which lexical context applies where: any human being reading the code also needs to understand that, and all of the subtleties built in to the algorithm. If you can't describe the algorithm in a couple of clear sentences, you'll be putting quite a burden on code readers.
In the case of Gherkin, it looks to me like the tables are fairly easy to recognise: they start with a line whose first token is | and presumably continue until you reach a line whose first token is not a |. So it should be pretty straight-forward to switch lexical contexts, particularly as your lexer probably already needs to be aware of line-endings.
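A minimal sketch of that switch (the helper names are invented; inside a table row only '|' and the escape \| are treated specially):

import java.util.*;

public class GherkinRowLexer {
    // A line whose first non-blank character is '|' is lexed in "table mode".
    static boolean isTableRow(String line) {
        return line.stripLeading().startsWith("|");
    }

    // Split a table row into cell texts, honouring the \| escape.
    static List<String> cells(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        String body = line.stripLeading().substring(1); // drop the leading '|'
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length() && body.charAt(i + 1) == '|') {
                cell.append('|'); // escaped pipe is literal cell content
                i++;
            } else if (c == '|') {
                out.add(cell.toString().trim()); // unescaped pipe ends the column
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String row = "|JohnDoe11|Test<Pass>##Word|Who am I|";
        if (isTableRow(row)) {
            System.out.println(cells(row)); // [JohnDoe11, Test<Pass>##Word, Who am I]
        }
    }
}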

Searching/predicting next terminal/non-terminal by CFG/Tree?

I'm looking for an algorithm to help me predict the next token given a string/prefix and a context-free grammar.
First question: what is the exact structure representing a CFG? It seems to be a tree, but what type of tree? I'm asking because the leaves are always ordered; is there an ordered tree?
Maybe if I know the correct structure I can find an algorithm for bottom-up search!
If it is not exactly a search problem, then the closest thing it looks like is parsing the prefix string and then generating the next token. How do I do that?
Any ideas?
My current generated grammar is simple: it has no OR rules (except when I decide to reuse the grammar for new sequences, which I will). It is generated by the Sequitur algorithm and is a so-called SLG (single line grammar). But if I generate it using many sequences, the top rule will be, for example:
S : S1 z S3 | u S2 .. S5 S1 | S4 S2 .. |... | Sn
S1 : a b
S2 : h u y
...
i.e. a top-heavy SLG: except for the top rule, no rule has an OR (|).
As a side note, I'm thinking of ways to convert it to a Prolog and/or DCG program, where maybe there is an easier way to do what I want. What do you think?
TL;DR: In abstract, this is a hard problem. But it can be pretty simple for given grammars. Everything depends on the nature of the grammar.
The basic algorithm indeed starts by using some parsing algorithm on the prefix. A rough prediction can then be made by attempting to continue the parse with each possible token, retaining only those which do not produce immediate errors.
That will certainly give you a list which includes all of the possible continuations. But the list may also include tokens which cannot appear in a correct input. Indeed, it is possible that the correct list is empty (because the given prefix is not the prefix of any correct input); this will happen if the parsing algorithm is unable to correctly verify whether a token sequence is a possible prefix.
In part, this will depend on the grammar itself. If the grammar is LR(1), for example, then the LR(1) parsing algorithm can precisely identify the continuation set. If the grammar is LR(k) for some k>1, then it is theoretically possible to produce an LR(1) grammar for the same language, but the resulting grammar might be impractically large. Otherwise, you might have to settle for "false positives". That might be acceptable if your goal is to provide tab-completion, but in other circumstances it might not be so useful.
The precise data structure used to perform the internal parse and exploration of alternatives will depend on the parsing algorithm used. Many parsing algorithms, including the standard LR parsing algorithm whose internal data structure is a simple stack, feature a mutable internal state which is not really suitable for the exploration step; you could adapt such an algorithm by making a copy of the entire internal data structure (that is, the stack) before proceeding with each trial token. Alternatively, you could implement a copy-on-write stack. But the parser stack is not usually very big, so copying it each time is generally feasible. (That's what Bison does to produce expanded error messages with an "expected token" list, and it doesn't seem to trigger unacceptable runtime overhead in practice.)
Alternatively, you could use some variant of CYK chart parsing (or a GLR algorithm like the Earley algorithm), whose internal data structures can be implemented in a way which doesn't involve destructive modification. Such algorithms are generally used for grammars which are not LR(1), since they can cope with any CFG although highly ambiguous grammars can take a long time to parse (proportional to the cube of the input length). As mentioned above, though, you will get false positives from such algorithms.
If false positives are unacceptable, then you could use some kind of heuristic search to attempt to find an input sequence which completes the trial prefix. This can in theory take quite a long time, but for many grammars a breadth-first search can find a completion within a reasonable time, so you could terminate the search after a given maximum time. This will not produce false positives, but the time limit might prevent it from finding the complete set of possible continuations.
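As a toy illustration of the basic trial-token algorithm above (not a general CFG engine; the grammar, class and method names are all invented for the sketch), here is a tiny LL(1) expression grammar where the prefix check is exact, so there are no false positives:

import java.util.*;

public class NextTokenPredictor {
    // Grammar: E -> T (('+'|'-') T)* ; T -> NUM | '(' E ')'
    static class ParseError extends RuntimeException {}
    static class PrefixExhausted extends RuntimeException {}

    private List<String> toks;
    private int pos;

    private String peek() {
        if (pos >= toks.size()) throw new PrefixExhausted();
        return toks.get(pos);
    }
    private void expect(String t) {
        if (!peek().equals(t)) throw new ParseError();
        pos++;
    }
    private void parseE() {
        parseT();
        while (pos < toks.size() && (toks.get(pos).equals("+") || toks.get(pos).equals("-"))) {
            pos++;
            parseT();
        }
    }
    private void parseT() {
        String t = peek();
        if (t.equals("NUM")) { pos++; }
        else if (t.equals("(")) { pos++; parseE(); expect(")"); }
        else throw new ParseError();
    }

    // A prefix is viable if the parse either consumes it all or simply runs out of input.
    boolean viablePrefix(List<String> tokens) {
        toks = tokens; pos = 0;
        try {
            parseE();
            return pos == toks.size();  // a complete expression is also a viable prefix
        } catch (PrefixExhausted e) {
            return true;                // ran out of input without an error
        } catch (ParseError e) {
            return false;
        }
    }

    // Try each candidate token and keep the ones that do not produce an immediate error.
    List<String> predict(List<String> prefix, List<String> vocabulary) {
        List<String> result = new ArrayList<>();
        for (String t : vocabulary) {
            List<String> trial = new ArrayList<>(prefix);
            trial.add(t);
            if (viablePrefix(trial)) result.add(t);
        }
        return result;
    }

    public static void main(String[] args) {
        NextTokenPredictor p = new NextTokenPredictor();
        List<String> vocab = Arrays.asList("NUM", "+", "-", "(", ")");
        System.out.println(p.predict(Arrays.asList("(", "NUM", "+"), vocab)); // [NUM, (]
    }
}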

Writing a Z80 assembler - lexing ASM and building a parse tree using composition?

I'm very new to the concept of writing an assembler and even after reading a great deal of material, I'm still having difficulties wrapping my head around a couple of concepts.
What is the process to actually break up a source file into tokens? I believe this process is called lexing, and I've searched high and low for real code examples that make sense, but I can't find a thing, so simple code examples would be very welcome ;)
When parsing, does information ever need to be passed up or down the tree? The reason I ask is as follows, take:
LD BC, nn
It needs to be turned into the following parse tree once tokenized (???):
   ___ LD ___
  |          |
  BC         nn
Now, when this tree is traversed it needs to produce the following machine code:
01 n n
If the instruction had been:
LD DE,nn
Then the output would need to be:
11 n n
Meaning that it raises the question, does the LD node return something different based on the operand or is it the operand that returns something? And how is this achieved? More simple code examples would be excellent if time permits.
I'm most interested in learning some of the raw processes here rather than looking at advanced existing tools so please bear that in mind before sending me to Yacc or Flex.
Well, the structure of the tree you really want for an instruction that operates on a register and a memory addressing mode involving an offset displacement and an index register would look like this:
      INSTRUCTION-----+
      |     |         |
   OPCODE  REG     OPERAND
                   |      |
                OFFSET  INDEXREG
And yes, you want to pass values up and down the tree. A method for formally specifying such value passing is called "attribute grammars": you decorate the grammar for your language (in your case, your assembler syntax) with the value-passing and the computations over those values. For more background, see Wikipedia on attribute grammars.
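A hand-rolled sketch of that up-and-down value passing for the LD example from the question (class and method names are invented; only the LD rr,nn family is shown): the operand node contributes the register field, and the instruction node combines it into the opcode byte.

import java.util.*;

public class LdNode {
    // Register-pair field for "LD rr,nn": BC=00, DE=01, HL=10, SP=11.
    private static final Map<String, Integer> RR =
            Map.of("BC", 0, "DE", 1, "HL", 2, "SP", 3);

    private final String register;   // left child of the parse tree
    private final int immediate;     // right child (nn)

    LdNode(String register, int immediate) {
        this.register = register;
        this.immediate = immediate;
    }

    // The value "passed up" from the operand child decides the emitted opcode.
    int[] emit() {
        int opcode = (RR.get(register) << 4) | 0x01;   // 0x01, 0x11, 0x21, 0x31
        return new int[] { opcode, immediate & 0xFF, (immediate >> 8) & 0xFF };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(new LdNode("BC", 0x1234).emit())); // [1, 52, 18]
        System.out.println(Arrays.toString(new LdNode("DE", 0x1234).emit())); // [17, 52, 18]
    }
}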
In a related question you asked, I discussed a tool, DMS, which handles expression grammars and building trees. As a language manipulation tool, DMS faces exactly these same up-and-down-the-tree information flow issues. It shouldn't surprise you that, as a high-end language manipulation tool, it can handle attribute grammar computations directly.
It is not necessary to build a parse tree. Z80 op codes are very simple. They consist of the op code and 0, 1 or 2 operands, separated by commas. You just need to split the instruction up into its (maximum of 3) components with a very simple parser - no tree is needed.
Actually, the opcodes do not have a byte base, but an octal base. The best description I know is DECODING Z80 OPCODES.
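To make the split-and-lookup approach concrete, here is a sketch (helper names are invented, and only a few LD rr,nn forms are handled) that splits a source line into mnemonic and operands and emits the bytes:

import java.util.*;

public class Z80LineAssembler {
    private static final Map<String, Integer> LD_RR_NN = new HashMap<>();
    static {
        LD_RR_NN.put("BC", 0x01);   // LD BC,nn
        LD_RR_NN.put("DE", 0x11);   // LD DE,nn
        LD_RR_NN.put("HL", 0x21);   // LD HL,nn
        LD_RR_NN.put("SP", 0x31);   // LD SP,nn
    }

    static int[] assemble(String line) {
        // Split "LD BC, 1234" into the mnemonic and the comma-separated operands.
        String[] parts = line.trim().split("\\s+", 2);
        String mnemonic = parts[0].toUpperCase();
        String[] operands = parts.length > 1 ? parts[1].split("\\s*,\\s*") : new String[0];

        if (mnemonic.equals("LD") && operands.length == 2
                && LD_RR_NN.containsKey(operands[0].toUpperCase())) {
            int nn = Integer.decode(operands[1]);
            int opcode = LD_RR_NN.get(operands[0].toUpperCase());
            // The Z80 stores 16-bit immediates low byte first.
            return new int[] { opcode, nn & 0xFF, (nn >> 8) & 0xFF };
        }
        throw new IllegalArgumentException("unsupported instruction: " + line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(assemble("LD BC, 0x1234"))); // [1, 52, 18]
        System.out.println(Arrays.toString(assemble("LD DE, 0x1234"))); // [17, 52, 18]
    }
}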

Parsing basic math equations for children's educational software?

Inspired by a recent TED talk, I want to write a small piece of educational software. The researcher created little miniature computers in the shape of blocks called "Siftables".
[Image: David Merril, inventor, with Siftables in the background. Source: ted.com]
There were many applications he used the blocks in but my favorite was when each block was a number or basic operation symbol. You could then re-arrange the blocks of numbers or operation symbols in a line, and it would display an answer on another siftable block.
So, I've decided I want to implement a software version of "Math Siftables" on a limited scale as my final project for a CS course I'm taking.
What is the generally accepted way for parsing and interpreting a string of math expressions, and if they are valid, perform the operation?
Is this a case where I should implement a full parser/lexer? I would imagine interpreting basic math expressions would be a semi-common problem in computer science so I'm looking for the right way to approach this.
For example, if my Math Siftable blocks were arranged like:
[1] [+] [2]
This would be a valid sequence and I would perform the necessary operation to arrive at "3".
However, if the child were to drag several operation blocks together such as:
[2] [\] [\] [5]
It would obviously be invalid.
Ultimately, I want to be able to parse and interpret any number of chains of operations with the blocks that the user can drag together. Can anyone explain to me or point me to resources for parsing basic math expressions?
I'd prefer as much of a language agnostic answer as possible.
You might look at the Shunting Yard Algorithm. The linked wikipedia page has a ton of info and links to various examples of the algorithm.
Basically, given an expression in infix mathematical notation, it gives back an AST or Reverse Polish Notation, whichever your preference might be.
This page is pretty good. There are also a couple related questions on SO.
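A compact sketch of the algorithm (only +, -, *, / and parentheses, all binary and left-associative; class and method names are just for illustration):

import java.util.*;

public class ShuntingYard {
    private static final Map<String, Integer> PRECEDENCE =
            Map.of("+", 1, "-", 1, "*", 2, "/", 2);

    static List<String> toRpn(List<String> infix) {
        List<String> output = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String tok : infix) {
            if (PRECEDENCE.containsKey(tok)) {
                // Pop operators of equal or higher precedence before pushing this one.
                while (!ops.isEmpty() && PRECEDENCE.containsKey(ops.peek())
                        && PRECEDENCE.get(ops.peek()) >= PRECEDENCE.get(tok)) {
                    output.add(ops.pop());
                }
                ops.push(tok);
            } else if (tok.equals("(")) {
                ops.push(tok);
            } else if (tok.equals(")")) {
                while (!ops.isEmpty() && !ops.peek().equals("(")) output.add(ops.pop());
                if (ops.isEmpty()) throw new IllegalArgumentException("mismatched )");
                ops.pop(); // discard the "("
            } else {
                output.add(tok); // a number goes straight to the output
            }
        }
        while (!ops.isEmpty()) {
            if (ops.peek().equals("(")) throw new IllegalArgumentException("mismatched (");
            output.add(ops.pop());
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(toRpn(Arrays.asList("1", "+", "2", "*", "3"))); // [1, 2, 3, *, +]
    }
}

Invalid block sequences like [2] [/] [/] [5] would surface here as mismatched operators or leftover operands, which is a natural place to report an error to the child.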
In a lot of modern languages there are methods to evaluate arithmetic string expressions. For example in Python
>>> a = '1+3*3'
>>> eval(a)
10
You could use exception handling to catch the invalid syntax.
Alternatively you can build arithmetic expression trees, there are some examples of these here in SO: Expression Trees for Dummies.
As pointed out above, I'd convert the normal string (infix notation) to a postfix expression.
Then, given the postfix expression, it is easy to walk through and evaluate it. For example, push the operands onto a stack and, when you find an operator, pop values off the stack and apply the operator to the operands. If your code to convert to a postfix expression is correct, you shouldn't need to worry about the order of operations or anything like that.
The majority of the work in this case would probably be done in the conversion. You could store the converted form in a list or array for easy access, so you don't really need to parse each value again either.
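A sketch of that evaluation step (assuming the postfix form is held in a list of token strings, as produced by a shunting-yard conversion):

import java.util.*;

public class RpnEvaluator {
    static double evaluate(List<String> rpn) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : rpn) {
            if (tok.equals("+") || tok.equals("-") || tok.equals("*") || tok.equals("/")) {
                // An operator pops its two operands and pushes the result.
                double right = stack.pop();
                double left = stack.pop();
                double result;
                if (tok.equals("+")) result = left + right;
                else if (tok.equals("-")) result = left - right;
                else if (tok.equals("*")) result = left * right;
                else result = left / right;
                stack.push(result);
            } else {
                stack.push(Double.parseDouble(tok)); // an operand is simply pushed
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // [1, 2, 3, *, +] is the postfix form of 1 + 2 * 3
        System.out.println(evaluate(Arrays.asList("1", "2", "3", "*", "+"))); // 7.0
    }
}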
You say that several operators in a row are not valid. But think about:
5 + -2
which is perfectly valid.
The most basic expression grammar is like:
Expression = Term | Expression, AddOp, Term
Term = Factor | Term, MulOp, Factor
Factor = Number | SignOp, Factor | '(', Expression, ')'
AddOp = '+' | '-'
MulOp = '*' | '/'
SignOp = '+' | '-'
Number = Digit | Number, Digit
Digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
I once wrote a simple lightweight expression parser/evaluator (string in, number out) which could handle variables and functions. The code is in Delphi but it shouldn't be that hard to translate. If you are interested I can put the source code online.
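A recursive-descent sketch that follows the grammar above almost rule-for-rule, one method per nonterminal (the class name and the string-based input are just for illustration; in the Siftables setting the input would be the block sequence):

public class TinyCalc {
    private final String src;
    private int pos;

    TinyCalc(String src) { this.src = src.replace(" ", ""); }

    double parseExpression() {             // Expression = Term | Expression AddOp Term
        double value = parseTerm();
        while (peek() == '+' || peek() == '-') {
            char op = next();
            double rhs = parseTerm();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }
    double parseTerm() {                   // Term = Factor | Term MulOp Factor
        double value = parseFactor();
        while (peek() == '*' || peek() == '/') {
            char op = next();
            double rhs = parseFactor();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }
    double parseFactor() {                 // Factor = Number | SignOp Factor | '(' Expression ')'
        if (peek() == '+' || peek() == '-') {
            return (next() == '-') ? -parseFactor() : parseFactor();
        }
        if (peek() == '(') {
            next();
            double value = parseExpression();
            if (next() != ')') throw new IllegalArgumentException("expected )");
            return value;
        }
        return parseNumber();
    }
    double parseNumber() {                 // Number = Digit | Number Digit
        int start = pos;
        while (Character.isDigit(peek())) next();
        if (start == pos) throw new IllegalArgumentException("expected a number at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }
    private char next() { return src.charAt(pos++); }

    public static void main(String[] args) {
        System.out.println(new TinyCalc("1 + 2").parseExpression());  // 3.0
        System.out.println(new TinyCalc("5 + -2").parseExpression()); // 3.0
    }
}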
Another note is that there are many parsing libraries available that a person can use to accomplish this task. It is not trivial to write a good expression parser from scratch, so I would recommend checking out a library.
I work for Singular Systems, which specializes in mathematics components. We offer two math parsers, Jep Java and Jep.Net, which might help you in solving your problem. Good luck!
For this audience you'd want to give error feedback quite different from what you'd give programmers used to messages like "Syntax error: unexpected '/' at position foo." I tried to make something better for education applets here:
http://github.com/darius/expr
The main ideas: go to unusual lengths to find a minimal edit restoring parsability (practical since input expressions aren't pages long), and generate a longer plain-English explanation of what the parser is stuck on.
