Integer digits in Lex - flex-lexer

I've recently started learning about lex, and for strings we can use the notation:
[a-z]+
I'm curious about the case for integers. Would I be able to write something like:
[0-10]+
I'd appreciate some feedback on this, as I've just started learning lex.

Would i be able to write something like:
[0-10]+
No, but you can write
[0-9]+
which is what you really mean.
This is all documented.
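For intuition: a character class like [0-10] doesn't mean "zero through ten". It is the set of characters 0-1 plus 0, i.e. just the digits 0 and 1. A quick sketch with Python's re module, which uses the same character-class semantics as lex:

```python
import re

# [0-10] is the character class {0, 1}, not the range 0..10.
wrong = re.compile(r"[0-10]+")
right = re.compile(r"[0-9]+")

assert wrong.fullmatch("101")          # only 0s and 1s match
assert wrong.fullmatch("42") is None   # 4 and 2 aren't in the class
assert right.fullmatch("42")           # [0-9]+ matches any integer
```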

Related

Can ANTLR 4 parse a non prefix-free language?

I've recently begun using ANTLR 4. After quite a lot of testing, I wonder if ANTLR 4 can parse a non prefix-free context-free language. That's because its parsers seem to only make the parse tree out of the longest-from-the-start valid substring of input string.
For example, in the parsing stage of this grammar
grammar Hello;
r : 'hello' ID ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
and this input string
hello sir hello sir
, the parser seems to only parse one hello sir. So, the input string seems to be valid in the grammar, but actually it's not.
Aware that I might not know enough about ANTLR, I've googled the problem quite a lot but got no results. Although I might not have searched for it in the right way, I now think it's better to just ask the pros.
Thanks in advance.

Parsing an equation erlang

I am learning Erlang recently and I have a question.
I have an equation like this: (~(2+1)).
I want to parse it into Polish (prefix) notation, e.g. {unaryMin,{add,2,1}}.
How do I start?
If you want to parse something, from a simple formula to a programming language, you should start by learning about grammars, languages, and compiler-compilers. Learning how to parse and translate/interpret something into another format is a very common task for any programmer (pretty much everything has a compiler/interpreter, even your image viewer, web browser, etc.), so it's very important to learn about those things.
For Erlang, LYSE has a chapter about making a reverse-Polish-notation calculator here, and for converting an infix equation to a prefix/postfix one, you should read about the Shunting-Yard algorithm.
Erlang also has its own versions of yacc & lex: yecc and leex.
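As a language-neutral sketch of the Shunting-Yard step (Python here; the operator set and precedences are illustrative assumptions, with ~ treated as the unary minus from the question):

```python
# Minimal Shunting-Yard sketch: converts tokenized infix to
# postfix (RPN). Assumes single-character tokens; '~' stands
# for unary minus and is handled like a high-precedence operator.
PREC = {"~": 3, "*": 2, "/": 2, "+": 1, "-": 1}

def to_postfix(tokens):
    out, ops = [], []
    for t in tokens:
        if t.isdigit():
            out.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops and ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()  # discard the matching '('
        else:  # an operator
            while ops and ops[-1] != "(" and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
    while ops:
        out.append(ops.pop())
    return out
```

For the question's input, to_postfix(list("(~(2+1))")) yields ['2', '1', '+', '~'], which maps directly onto the nested {unaryMin,{add,2,1}} structure.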

Building a parser (Part I)

I'm making my own JavaScript-based programming language (yeah, it's crazy, but it's for learning only... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code to tokens, like:
if(x > 5)
return true;
Tokenizer to:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. In my parser it goes even further (or not?) and translates to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some doubts:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP) instead of interpreted all the time.
After I tokenize, what exactly do I need to do? I'm really lost at this pass!
Are there some good tutorials to learn how I can do it?
Well, that's it. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
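To make the accept/expect pair from the pseudocode concrete, here is one minimal way they might be implemented over a plain token list (a sketch under assumed token representation, not the only design):

```python
# Hypothetical helpers assumed by the statement() pseudocode
# above: a cursor over a token list with accept/expect.
class TokenStream:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def accept(self, tok):
        # Consume the next token only if it matches; report success.
        if self.pos < len(self.tokens) and self.tokens[self.pos] == tok:
            self.pos += 1
            return True
        return False

    def expect(self, tok):
        # Like accept, but a mismatch is a syntax error.
        if not self.accept(tok):
            raise SyntaxError(f"expected {tok!r}")
```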
Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is possible too, but for a beginner it is much easier to keep the steps separate, the way most other tools/frameworks do.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
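A rough regex-based tokenizer along these lines might look like the following (a sketch; the token names mirror the list above, and the code is an illustrative assumption, not a real framework):

```python
import re

# Keyword patterns come before T_IDENTIFIER so that "if",
# "return", and "true" are classified as keywords, not names.
TOKEN_SPEC = [
    ("T_KEYWORD_IF",   r"\bif\b"),
    ("T_KEYWORD_RET",  r"\breturn\b"),
    ("T_KEYWORD_TRUE", r"\btrue\b"),
    ("T_IDENTIFIER",   r"[A-Za-z_]\w*"),
    ("T_LITERAL",      r"\d+"),
    ("T_LPAREN",       r"\("),
    ("T_RPAREN",       r"\)"),
    ("T_GT",           r">"),
    ("T_TERMINATOR",   r";"),
    ("SKIP",           r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    # Note: characters matching no pattern are silently skipped
    # here; a real lexer should report them as errors.
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
```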
The next level would take this stream and - by applying a formal grammar - would build some datastructure (often called AST - Abstract Syntax Tree) which might look like this:
IfStatement:
  Expression:
    BinaryOperator:
      Operator: T_GT
      LeftOperand:
        IdentifierExpression:
          "x"
      RightOperand:
        LiteralExpression:
          5
  IfBlock:
    ReturnStatement:
      ReturnExpression:
        LiteralExpression:
          "true"
  ElseBlock: (empty)
Implementing the parser is usually done with some framework. Implementing something like that by hand, and efficiently, is usually done at a university in the better part of a semester. So you really should use some kind of framework.
The input for a grammar-parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But it's fun.
Is my way better or worse that the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is fine, actually. I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language; it's quite common.
After I tokenizer, what I need do exactly? I'm really lost on this pass!
After tokenizing, you need to parse it. Use some good lexer/parser framework, such as Boost.Spirit, or Coco, or whatever; there are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy; there's some helpful code here and here.
There are some good tutorial to learn how I can do it?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. #DeadMG, over at the C++ Lounge, is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
  expression:
    <:
      0, 1
  then:
    print:
      "Hello"
I don't know if this will help or not, but I hope it does.

A grammar for one variable functions in ANTLR

Hey!
I am looking for an ANTLR grammar for parsing one-variable function expressions. It should support +, -, /, ^, special functions (e.g. cos, sin), constants (pi, e), and parentheses. I tried writing it myself, but I get left-recursion warnings. Does anyone have an example I can get started with?
I would like to write something like
x+sin(5x + pi^3)/(15e cos(x))
for example.
ANTLR grammars are preferred but other (E)BNF examples will be appreciated.
Eventually I would like to use it with AST output.
Thanks!
Ok, that was fast.
I found a great article on code project.
It has everything I wanted and more!

Parsing basic math equations for children's educational software?

Inspired by a recent TED talk, I want to write a small piece of educational software. The researcher created little miniature computers in the shape of blocks called "Siftables".
(source: ted.com)
[David Merrill, inventor - with Siftables in the background.]
There were many applications he used the blocks in but my favorite was when each block was a number or basic operation symbol. You could then re-arrange the blocks of numbers or operation symbols in a line, and it would display an answer on another siftable block.
So, I've decided I want to implement a software version of "Math Siftables" on a limited scale as my final project for a CS course I'm taking.
What is the generally accepted way for parsing and interpreting a string of math expressions, and if they are valid, perform the operation?
Is this a case where I should implement a full parser/lexer? I would imagine interpreting basic math expressions would be a semi-common problem in computer science so I'm looking for the right way to approach this.
For example, if my Math Siftable blocks were arranged like:
[1] [+] [2]
This would be a valid sequence and I would perform the necessary operation to arrive at "3".
However, if the child were to drag several operation blocks together such as:
[2] [\] [\] [5]
It would obviously be invalid.
Ultimately, I want to be able to parse and interpret any number of chains of operations with the blocks that the user can drag together. Can anyone explain to me or point me to resources for parsing basic math expressions?
I'd prefer as much of a language agnostic answer as possible.
You might look at the Shunting Yard Algorithm. The linked wikipedia page has a ton of info and links to various examples of the algorithm.
Basically, given an expression in infix mathematical notation, it gives back an AST or Reverse Polish Notation, whichever your preference might be.
This page is pretty good. There are also a couple related questions on SO.
In a lot of modern languages there are methods to evaluate arithmetic string expressions. For example in Python
>>> a = '1+3*3'
>>> eval(a)
10
You could use exception handling to catch the invalid syntax.
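Concretely, the exception-handling idea might be sketched like this (safe_eval is a hypothetical helper name; note that eval on untrusted input is risky even with builtins stripped, so this is for illustration only):

```python
# Sketch: evaluate an arithmetic string, returning None on
# invalid syntax instead of raising.
def safe_eval(expr):
    try:
        # Strip builtins to limit what the expression can reach.
        return eval(expr, {"__builtins__": {}})
    except SyntaxError:
        return None
```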
Alternatively you can build arithmetic expression trees, there are some examples of these here in SO: Expression Trees for Dummies.
As pointed out above, I'd convert the normal string (infix notation) to a postfix expression.
Then, given the postfix expression, it is easy to walk through and evaluate it. For example, push the operands onto a stack, and when you find an operator, pop values off the stack, apply the operator to the operands, and push the result back. If your code to convert to a postfix expression is correct, you shouldn't need to worry about the order of operations or anything like that.
The majority of the work in this case would probably be done in the conversion. You could store the converted form in a list or array for easy access, so you don't really need to parse each value again either.
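The stack-based evaluation described above can be sketched in a few lines (assuming tokens are already split into numbers and binary operators):

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_postfix(tokens):
    stack = []
    for t in tokens:
        if t in OPS:
            b = stack.pop()   # right operand is on top
            a = stack.pop()
            stack.append(OPS[t](a, b))
        else:
            stack.append(float(t))
    return stack.pop()
```

For instance, the postfix form of (1+2)*3 is 1 2 + 3 *, and eval_postfix(["1", "2", "+", "3", "*"]) returns 9.0.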
You say that several operators in a row are not valid. But think about:
5 + -2
Which is perfectly valid.
The most basic expression grammar is like:
Expression = Term | Expression, AddOp, Term
Term = Factor | Term, MulOp, Factor
Factor = Number | SignOp, Factor | '(', Expression, ')'
AddOp = '+' | '-'
MulOp = '*' | '/'
SignOp = '+' | '-'
Number = Digit | Number, Digit
Digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
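This grammar translates almost mechanically into a recursive-descent evaluator, one function per nonterminal. A sketch assuming well-formed, whitespace-free input (real code needs error handling):

```python
def evaluate(s):
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def take():
        nonlocal pos
        c = s[pos]
        pos += 1
        return c

    def expression():            # Term { AddOp Term }
        v = term()
        while peek() in ("+", "-"):
            v = v + term() if take() == "+" else v - term()
        return v

    def term():                  # Factor { MulOp Factor }
        v = factor()
        while peek() in ("*", "/"):
            v = v * factor() if take() == "*" else v / factor()
        return v

    def factor():                # Number | SignOp Factor | ( Expression )
        if peek() == "(":
            take()
            v = expression()
            take()               # consume the closing ')'
            return v
        if peek() in ("+", "-"):
            return factor() if take() == "+" else -factor()
        digits = ""
        while peek().isdigit():
            digits += take()
        return int(digits)

    return expression()
```

Note how "5+-2" parses cleanly here: the - is consumed by the SignOp branch of factor(), which is exactly the point made above about consecutive operators.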
I once wrote a simple lightweight expression parser/evaluator (string in, number out) which could handle variables and functions. The code is in Delphi, but it shouldn't be that hard to translate. If you are interested, I can put the source code online.
Another note is that there are many parsing libraries available that a person can use to accomplish this task. It is not trivial to write a good expression parser from scratch, so I would recommend checking out a library.
I work for Singular Systems which specializes in mathematics components. We offer two math parsers Jep Java and Jep.Net which might help you in solving your problem. Good luck!
For this audience you'd want to give error feedback quite different from what you'd give programmers used to messages like "Syntax error: unexpected '/' at position foo." I tried to make something better for educational applets here:
http://github.com/darius/expr
The main ideas: go to unusual lengths to find a minimal edit restoring parsability (practical since input expressions aren't pages long), and generate a longer plain-English explanation of what the parser is stuck on.