Inspired by a recent TED talk, I want to write a small piece of educational software. In the talk, the researcher demonstrated miniature computers in the shape of blocks, called "Siftables".
[Image: David Merrill, inventor, with Siftables in the background. (source: ted.com)]
He demonstrated many applications for the blocks, but my favorite was one where each block displayed a number or a basic operation symbol. You could then rearrange the number and operator blocks in a line, and another Siftable would display the answer.
So, I've decided to implement a software version of "Math Siftables" on a limited scale as my final project for a CS course I'm taking.
What is the generally accepted way to parse and interpret a string containing a math expression and, if it is valid, perform the operation?
Is this a case where I should implement a full parser/lexer? I would imagine interpreting basic math expressions is a semi-common problem in computer science, so I'm looking for the right way to approach this.
For example, if my Math Siftable blocks were arranged like:
[1] [+] [2]
This would be a valid sequence and I would perform the necessary operation to arrive at "3".
However, if the child were to drag several operation blocks together such as:
[2] [\] [\] [5]
It would obviously be invalid.
Ultimately, I want to be able to parse and interpret any number of chains of operations with the blocks that the user can drag together. Can anyone explain to me or point me to resources for parsing basic math expressions?
I'd prefer as much of a language agnostic answer as possible.
You might look at the Shunting Yard Algorithm. The linked Wikipedia page has a ton of info and links to various examples of the algorithm.
Basically, given an expression in infix mathematical notation, it gives back an AST or Reverse Polish Notation, whichever you prefer.
This page is pretty good, and there are also a couple of related questions on SO.
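A minimal Python sketch of the algorithm, handling only the four left-associative binary operators and parentheses (the precedence table and error messages are my own simplifications, not part of the algorithm's definition):

# Shunting yard: list of infix tokens -> list of RPN tokens.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def to_rpn(tokens):
    output, ops = [], []
    for tok in tokens:
        if tok in PREC:
            # Pop operators of higher or equal precedence first (left-assoc).
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
        elif tok == '(':
            ops.append(tok)
        elif tok == ')':
            while ops and ops[-1] != '(':
                output.append(ops.pop())
            if not ops:
                raise ValueError('mismatched parentheses')
            ops.pop()  # discard the '('
        else:
            output.append(tok)  # a number
    while ops:
        if ops[-1] == '(':
            raise ValueError('mismatched parentheses')
        output.append(ops.pop())
    return output

print(to_rpn('1 + 2 * 3'.split()))  # ['1', '2', '3', '*', '+']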
In a lot of modern languages there are methods to evaluate arithmetic string expressions. For example, in Python:
>>> a = '1+3*3'
>>> eval(a)
10
You could use exception handling to catch the invalid syntax.
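For instance, a quick sketch (keep in mind that eval executes arbitrary Python, so real educational software should validate the input first; note also that some "invalid" block sequences happen to be valid Python):

def try_eval(expr):
    """Evaluate an arithmetic string, or return None if it is invalid."""
    try:
        return eval(expr)
    except (SyntaxError, NameError, ZeroDivisionError):
        return None

print(try_eval('1+2'))   # 3
print(try_eval('2**'))   # None -- the SyntaxError was caught
print(try_eval('2//5'))  # 0 -- beware: '//' is floor division, not an error!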
Alternatively, you can build arithmetic expression trees; there are some examples of these here on SO: Expression Trees for Dummies.
As pointed out above, I'd convert the normal string (infix notation) to a postfix expression.
Then, given the postfix expression, it is easy to walk through and evaluate it: push operands onto a stack, and when you find an operator, pop values off the stack and apply the operator to them. If your code to convert to postfix is correct, you shouldn't need to worry about the order of operations or anything like that.
The majority of the work in this case is done in the conversion. You could store the converted form in a list or array for easy access, so you don't need to parse each value again either.
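Here is a sketch of that evaluation step in Python, assuming the conversion has already produced a postfix token list (the helper names are my own):

import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def eval_rpn(tokens):
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()   # pop order matters for - and /
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError('malformed expression')
    return stack[0]

print(eval_rpn(['1', '2', '3', '*', '+']))  # 7.0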
You say that several operators in a row are not valid. But think about:
5 + -2
Which is perfectly valid.
The most basic expression grammar looks like this:
Expression = Term | Expression, AddOp, Term
Term = Factor | Term, MulOp, Factor
Factor = Number | SignOp, Factor | '(', Expression, ')'
AddOp = '+' | '-'
MulOp = '*' | '/'
SignOp = '+' | '-'
Number = Digit | Number, Digit
Digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
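That grammar translates almost mechanically into a recursive-descent evaluator. Here is a Python sketch of my own (the left-recursive rules become loops, and the tokenizer is just a regex):

import re

def evaluate(text):
    """Recursive-descent evaluator for the grammar above."""
    tokens = re.findall(r'\d+|[-+*/()]', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError('unexpected end of input')
        if expected is not None and tokens[pos] != expected:
            raise SyntaxError(f'expected {expected!r}, got {tokens[pos]!r}')
        pos += 1
        return tokens[pos - 1]

    def expression():                  # Expression = Term {AddOp Term}
        value = term()
        while peek() in ('+', '-'):
            value = value + term() if take() == '+' else value - term()
        return value

    def term():                        # Term = Factor {MulOp Factor}
        value = factor()
        while peek() in ('*', '/'):
            value = value * factor() if take() == '*' else value / factor()
        return value

    def factor():                      # Factor = Number | SignOp Factor | (Expr)
        if peek() == '(':
            take('(')
            value = expression()
            take(')')
            return value
        if peek() in ('+', '-'):
            return factor() if take() == '+' else -factor()
        tok = take()
        if not tok.isdigit():
            raise SyntaxError(f'expected a number, got {tok!r}')
        return int(tok)

    result = expression()
    if pos != len(tokens):
        raise SyntaxError(f'unexpected {tokens[pos]!r}')
    return result

print(evaluate('5 + -2'))           # 3 -- the unary minus is the SignOp branch
print(evaluate('1 + 2 * (3 - 1)'))  # 5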
I once wrote a simple lightweight expression parser/evaluator (string in, number out) which could handle variables and functions. The code is in Delphi, but it shouldn't be that hard to translate. If you are interested I can put the source code online.
On another note: there are many parsing libraries available that you can use to accomplish this task. It is not trivial to write a good expression parser from scratch, so I would recommend checking out a library.
I work for Singular Systems which specializes in mathematics components. We offer two math parsers Jep Java and Jep.Net which might help you in solving your problem. Good luck!
For this audience you'd want to give error feedback quite different from what you'd give programmers, who are used to messages like "Syntax error: unexpected '/' at position foo." I tried to make something better for educational applets here:
http://github.com/darius/expr
The main ideas: go to unusual lengths to find a minimal edit restoring parsability (practical since input expressions aren't pages long), and generate a longer plain-English explanation of what the parser is stuck on.
I have developed a syntax checker for the Gerber format, using TatSu. It works fine, my thanks to the TatSu developers. However, it is not overly fast, and I am now optimizing the grammar.
The Gerber format is a stream of commands, and this is handled by the main loop of the grammar, which is as follows:
start =
    {
    | ['X' integer] ['Y' integer] ((['I' integer 'J' integer] 'D01*')|'D02*'|'D03*')
    | ('G01*'|'G02*'|'G03*'|'G1*'|'G2*'|'G3*')
    ... about 25 rules
    }*
    M02
    $;
with
integer = /[+-]?[0-9]+/;
In big files, where performance is important, the vast majority of the statements are covered by the first rule in the choice. (It actually covers three commands; putting them first and merging them to eliminate common elements made the checker 2-3 times faster.)
Now I am trying to replace the first rule with a regex, assuming the regex is faster because it runs in C.
In the first step I inlined the integer:
| ['X' /[+-]?[0-9]+/] ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
This worked fine and gave a modest speedup.
Then I tried to turn the whole rule into a regex. Failure. As a test, I modified only the first element of the sequence:
| /(X[+-]?[0-9]+)?/ ['Y' /[+-]?[0-9]+/] ((['I' /[+-]?[0-9]+/ 'J' /[+-]?[0-9]+/] 'D01*')|'D02*'|'D03*')
This fails to recognize the following command:
X81479571Y-38450761D01*
I cannot see the difference between ['X' /[+-]?[0-9]+/] and /(X[+-]?[0-9]+)?/.
What am I missing?
The difference is that an optional expression with [] will advance over whitespace and comments, while a pattern expression with // will not. It's in the documentation. A trick for this case is to place the pattern in its own, initial-lower-case rule, so whitespace and comments are tokenized before the pattern is applied, though I don't think adding that indirection will help with performance.
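For example (a sketch; the rule name xfield is made up):

xfield = /X[+-]?[0-9]+/ ;

and the first alternative of the main loop then becomes:

| [xfield] ['Y' integer] ((['I' integer 'J' integer] 'D01*')|'D02*'|'D03*')

Because xfield starts with a lower-case letter, TatSu treats it as an ordinary rule and skips whitespace and comments before trying its pattern.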
As to optimization, a trick I've used in the "...25 more rules" case is to group rules with similar prefixes under a &lookahead, for example &/G0/ in your case.
TatSu is designed to be friendly to grammar writers rather than maximally performant. If you need blazing speed through generation of parsers in C, you may want to take a look at pegen, the predecessor of the new PEG parser in CPython.
With a Backus-Naur form (BNF) grammar, we can specify the syntax of a programming language in order to parse it and produce an abstract syntax tree (AST).
<if> ::= "if" <expression> "then" <action> "end"
But we can also specify the tokens with a BNF grammar, as the first usage of BNF did for ALGOL-60:
<digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit_with_zero> ::= <digit> | "0"
<integer> ::= <digit> | <integer> <digit_with_zero>
However, this usage of the BNF in order to lex (= produce a list of minimal meaningful units aka tokens) has been deprecated in favor of regular expressions (like [1-9][0-9]*).
It seems clear that the regexes are much more concise.
It also seems that keeping the structure of an if statement is useful for the interpreter or compiler that will handle the AST produced by the parser, but keeping the structure of an integer (or a float) is not.
But do you agree that BNF could be used for both lexing and parsing?
And do you agree with the reasons which make regex much more suited for lexing?
Or are there others?
Regular expressions (in the mathematical sense) are equivalent in power to regular grammars and regular grammars can be written in BNF. So in that sense, it is clearly possible to write a full grammar for any context-free language in pure BNF.
Indeed, it is not even necessary to maintain the lexer/parser dichotomy. Some programmers find it convenient to use scannerless parsing (the article is not great but it has some interesting references), although many of these are based on the PEG formalism (which is not context-free) rather than BNF. (These are not the same despite the superficial resemblance.)
That said, it might not be convenient. In general, like most questions related to the structure of parsers, the answer is going to be based less on theory and more on a combination of practicality (with reference to a specific use case) and programmer prejudice.
As is well known, purity is rarely the most practical. Most real-life parser and scanner generators deviate from the pure theoretical models in order to provide mechanisms which are easier to use, easier to implement efficiently, or more powerful. For example, the character class syntax ([a-zA-Z]), which is almost universal in scanner generators, is a clear extension to regular expression syntax which deliberately avoids the need to explicitly list the entire contents of the set. One could say that the listing is implicit and unambiguous in the example I just presented, but most scanner generators also allow the use of classes like [[:alnum:]] ("alphanumeric symbols"), where the precise list of matched symbols is either locale-dependent or, in the Unicode world, extensible in the future. Regardless, this is obviously a useful extension.
While it is true that some aspects of regular expressions are more compact than their equivalent regular grammars -- especially the Kleene star operator, which in BNF requires an additional non-terminal and thus an additional name -- there are also cases where the ability to name subexpressions makes regular grammars more compact. Many scanner generators, starting with Lex, allowed named subpatterns as another regular expression extension. Furthermore, it is possible (with some caveats) to add the Kleene star and other operators to BNF as macros, and many parser generators do so. So there is a certain convergence of notation.
As you say, one difference between scanners and parsers is that the scanner generally makes no attempt to parse the substructure of a lexeme. But it is not true that no lexeme has substructure, and these substructures often do need to be analysed. The most notorious example is probably floating point numbers, which have to be analysed into a multiplier and an exponent, and the multiplier also analysed into an integer part and a fractional part. This analysis is commonly done using primitive functions available in the scanner implementation language (such as strtod for C scanners), but that does mean a second lexical scan. (Using the built-in avoids the considerable inconvenience of writing a mathematically correct string-to-internal converter, which is a much more difficult problem than it first appears. Rolling your own number converter is not recommended.)
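To make the floating-point case concrete, that second lexical scan might look like this in Python (a sketch; the group names are mine and the pattern is deliberately simplified):

import re

# Decompose a float lexeme into its sign/integer/fraction/exponent fields.
FLOAT = re.compile(r'(?P<sign>[+-]?)(?P<int>\d*)(?:\.(?P<frac>\d*))?(?:[eE](?P<exp>[+-]?\d+))?$')

m = FLOAT.match('-12.5e3')
print(m.group('sign', 'int', 'frac', 'exp'))  # ('-', '12', '5', '3')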
Other lexemes with internal structure include string literals (which may contain escape sequences) and a large variety of more complex lexemes available in certain languages (dates and times, IP addresses, HTML tags, etc., etc.). All of these things tend to blur the boundary between scanning and parsing. Which is fine, because, as I said, the boundary is situational and not restrained by any absolute law of nature.
Still, it is certainly the case that many lexemes do not have any interesting internal structure, and furthermore that while it is easy to rewrite a regular expression as a regular grammar, it is considerably harder to rewrite it as an unambiguous, deterministic regular grammar, much less an LALR(1) regular grammar. (This is one of the reasons scannerless parsing is often associated with PEG, but it can also be solved with GLL or GLR parsers, at a slight loss of efficiency.)
I am getting the user to input a function, e.g. y = 2x^2 + 3, as a string. What I am looking to do is to enter that string into TChart and for TChart to graph the function.
As far as I know, TChart/TeeChart will only accept X values that are assigned values, e.g. -10 to 10 for X, so the X value would need to be calculated each time - this isn't an issue.
The issue is getting each part of the inputted function and substituting the X-values into each part. The workaround I have found is to get the user to enter the degree for each part of the function, e.g. 2 for X^2, 3 for X^3, etc. but is there a cleaner way of doing this?
If I could convert the inputted string into a Mathematical formula which TeeChart would accept, that would be the ideal outcome.
Saying that you can't use external units effectively makes your question unanswerable in the SO format, because the topic is far broader (and deeper) than can comfortably be dealt with in SO's Q&A format. So the following is at best an outline:
If you want to, or have to, write a DIY expression evaluator, one way to do it is to proceed as follows:
Write yourself a class that takes a string as input and snips it up into a series of symbols, aka "tokens", which represent the component parts of the expression, e.g. numbers, operators, parentheses, names of functions, names of variables, etc. These tokens might themselves be records or class instances and need to include a mechanism for storing values associated with particular symbols (e.g. the tokens that represent numbers in the input). This step is called "tokenisation" or "lexing". Store the resulting symbols in a list or similar structure. The class needs to implement a mechanism to retrieve the next symbol from the list (usually a method called something like "NextToken") and to indicate whether any symbols are left. It also needs a mechanism to "put back" a symbol (or, equivalently, to "peek" at the symbol following the current one).
Then, write yourself a software machine which takes the tokenised symbols and "evaluates" the list to produce the (mathematical) result you're after. This step is an order of magnitude or two more difficult than the tokenisation step. There are numerous ways to do it. As I said in a comment earlier, a recursive descent parser is probably the most tractable approach if you've never done anything like this before. There are countless examples in textbooks, but here's a link to an article about a Delphi implementation that should be understandable as an intro:
http://www8.umoncton.ca/umcm-deslierres_michel/Calcs/ParsingMathExpr-1.html
That article begins by noting that there are numerous pre-existing Delphi expression evaluators but makes the point that they are not necessarily the best place to start for someone wanting to learn how to write an evaluator/parser rather than just use one. Instead it goes through the coding of an evaluator to implement this simple expression grammar:
expression : term | term + term | term − term
term : factor | factor * factor | factor / factor
factor : number | ( expression ) | + factor | − factor
(the vertical bar | denotes ‘or’)
The article has a link to a second part which shows how to add exponentiation to the evaluator. This is trickier than it might sound and involves issues of ambiguity: e.g. how do you evaluate, and what does it mean to write, an expression like
x^y^z
? This relates to the issue of "associativity": most operators are "left associative", which means that they bind more tightly to what's on their left than to what's on their right. The exponentiation operator is an example of the reverse, where the operator binds more tightly to what's on its right.
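In the expression/term/factor style, right associativity falls out of recursing on the right operand instead of looping. A stripped-down Python sketch of my own, handling only numbers and ^:

def parse_power(tokens, pos=0):
    value = int(tokens[pos])      # a bare number stands in for 'factor' here
    pos += 1
    if pos < len(tokens) and tokens[pos] == '^':
        right, pos = parse_power(tokens, pos + 1)  # recurse for the RIGHT side
        value = value ** right
    return value, pos

print(parse_power('2 ^ 3 ^ 2'.split())[0])  # 512 = 2^(3^2), not (2^3)^2 = 64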
Have fun!
By the way, you used to see suggestions to implement an evaluator using the "shunting yard algorithm"
http://en.wikipedia.org/wiki/Shunting-yard_algorithm
to convert an "infix" expression, where the operators are between the operands as in 1 + 3 * 4, to RPN (reverse Polish notation), as used on older HP calculators. The reason to do that was that RPN makes for much more efficient evaluation than the infix equivalent. YMMV, but personally I found that implementing the SY algorithm properly was actually trickier than learning how to write an evaluator in the expression/term/factor style.
Fwiw, RPN is the basis of the Forth programming language, http://en.wikipedia.org/wiki/Forth_%28programming_language%29, so you could write a Forth implementation in Delphi if you wanted!
I'm making my own JavaScript-based programming language (yeah, it is crazy, but it's for learning only... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code to tokens, like:
if(x > 5)
return true;
Tokenized, that becomes:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. My parser then goes one better (or not?) and translates that to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
    T_IDENTIFIER "x"
    T_GT ">"
    T_NUMBER "5"
T_CLOSURE ...
    T_IDENTIFIER "return"
    T_TRUE "true"
I have some questions:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
After I tokenize, what do I need to do exactly? I'm really lost at this pass!
Are there any good tutorials to learn how to do this?
Well, that's it. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible, but for a beginner it is much easier to do it the same way most other tools/frameworks do: keep the steps separate.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages, keywords cannot be used as method names, variable names and so on. This is already reflected at the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
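A common way to get that effect is to scan a word first and then look it up in a keyword table. A Python sketch (token names taken from the list above):

KEYWORDS = {'if': 'T_KEYWORD_IF', 'return': 'T_KEYWORD_RET', 'true': 'T_KEYWORD_TRUE'}

def classify(word):
    # Keywords and identifiers look the same to the scanner;
    # the table decides which token type to emit.
    return (KEYWORDS.get(word, 'T_IDENTIFIER'), word)

print(classify('return'))  # ('T_KEYWORD_RET', 'return')
print(classify('x'))       # ('T_IDENTIFIER', 'x')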
The next level would take this stream and - by applying a formal grammar - build some data structure (often called an AST, or Abstract Syntax Tree) which might look like this:
IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression:
                    5
    IfBlock:
        ReturnStatement:
            ReturnExpression:
                LiteralExpression:
                    "true"
    ElseBlock: (empty)
Implementing the parser is usually done with some framework. Implementing something like that by hand, and efficiently, is usually done at a university in the better part of a semester. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But fun.
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is fine, actually; I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language; it's quite common.
After I tokenize, what do I need to do exactly? I'm really lost at this pass!
After tokenizing, you need to parse it. Use some good lexer/parser framework, such as Boost.Spirit, or Coco, or whatever. There are hundreds of them. Or you can implement your own, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy; there's some helpful code here and here.
Are there any good tutorials to learn how to do this?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking people who know about this stuff. @DeadMG, over at the Lounge C++, is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
    expression:
        <:
            0, 1
    then:
        print:
            "Hello"
I don't know if this will help or not, but I hope it does.
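In case it helps further: a complete tokenizer for a snippet like that can be surprisingly small. A Python sketch (the token names mirror the listing above; the patterns are my own):

import re

TOKEN_SPEC = [
    ('keyword', r'\b(?:if|then|print)\b'),
    ('num',     r'\d+'),
    ('string',  r'"[^"]*"'),
    ('op',      r'[<>()]'),
    ('skip',    r'\s+'),
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def lex(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != 'skip':   # drop whitespace
            yield m.lastgroup, m.group()

for kind, text in lex('if (0 < 1) then print("Hello")'):
    print(f'{kind}: {text}')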
I'm very new to the concept of writing an assembler and even after reading a great deal of material, I'm still having difficulties wrapping my head around a couple of concepts.
What is the process for actually breaking a source file up into tokens? I believe this process is called lexing, and I've searched high and low for real code examples that make sense, but I can't find a thing, so simple code examples are very welcome ;)
When parsing, does information ever need to be passed up or down the tree? The reason I ask is as follows. Take:
LD BC, nn
It needs to be turned into the following parse tree once tokenized (???):
  ___ LD ___
 |          |
 BC         nn
Now, when this tree is traversed it needs to produce the following machine code:
01 n n
If the instruction had been:
LD DE,nn
Then the output would need to be:
11 n n
This raises the question: does the LD node return something different based on the operand, or is it the operand that returns something? And how is this achieved? More simple code examples would be excellent, if time permits.
I'm most interested in learning some of the raw processes here rather than looking at advanced existing tools, so please bear that in mind before sending me to Yacc or Flex.
Well, the structure of the tree you really want, for an instruction that operates on a register and a memory addressing mode involving an offset displacement and an index register, would look like this:
  INSTRUCTION-----+
  |      |        |
OPCODE  REG    OPERAND
                |      |
             OFFSET  INDEXREG
And yes, you want to pass values up and down the tree. A method for formally specifying such value passing is called "attribute grammars": you decorate the grammar for your language (in your case, your assembler syntax) with the value-passing and the computations over those values. For more background, see Wikipedia on attribute grammars.
In a related question you asked, I discussed a tool, DMS, which handles expression grammars and builds trees. As a language manipulation tool, DMS faces exactly these same up-and-down-the-tree information flow issues. It shouldn't surprise you that, as a high-end language manipulation tool, it can handle attribute grammar computations directly.
It is not necessary to build a parse tree. Z80 opcodes are very simple. They consist of the opcode and 0, 1 or 2 operands, separated by commas. You just need to split the instruction up into its (at most 3) components with a very simple parser - no tree is needed.
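To make that concrete, here is a minimal Python sketch of the no-tree approach (the two opcode values come from the question; the helper names and the symbol-table argument are made up):

# Assemble a single 'LD rr,nn' line without building any tree.
OPCODES = {('LD', 'BC'): 0x01, ('LD', 'DE'): 0x11}

def assemble_line(line, symbols):
    mnemonic, rest = line.split(None, 1)             # 'LD' and 'BC, nn'
    operands = [op.strip() for op in rest.split(',')]
    opcode = OPCODES[(mnemonic, operands[0])]
    nn = symbols[operands[1]]                        # resolve nn to a 16-bit value
    return [opcode, nn & 0xFF, (nn >> 8) & 0xFF]     # Z80 stores nn little-endian

print(assemble_line('LD BC, nn', {'nn': 0x1234}))    # [1, 52, 18]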
Actually, the opcodes are not byte-based but octal-based. The best description I know is DECODING Z80 OPCODES.