Approaching Text Parsing in Scala - parsing

I'm making an application that will parse commands in Scala. An example of a command would be:
todo get milk for friday
So the plan is to have a pretty smart parser break the line apart and recognize the command part and the fact that there is a reference to time in the string.
In general I need to make a tokenizer in Scala. So I'm wondering what my options are for this. I'm familiar with regular expressions but I plan on making an SQL like search feature also:
search todo for today with tags shopping
And I feel that regular expressions will be inflexible implementing commands with a lot of variation. This leads me to think of implementing some sort of grammar.
What are my options in this regard in Scala?

You want to search for "parser combinators". I have a blog post using this approach (http://cleverlytitled.blogspot.com/2009/04/shunting-yard-algorithm.html), but I think the best reference is this series of posts by Stefan Zieger (http://szeiger.de/blog/2008/07/27/formal-language-processing-in-scala-part-1/)

Here are slides from a presentation I did in Sept. 2009 on Scala parser combinators. (http://sites.google.com/site/compulsiontocode/files/lambdalounge/ImplementingExternalDSLsUsingScalaParserCombinators.ppt) An implementation of a simple Logo-like language is demonstrated. It might provide some insights.

Scala has a parser library (scala.util.parsing.combinator) which enables one to write a parser directly from its EBNF specification. If you have an EBNF for your language, it should be easy to write the Scala parser. If not, you'd better first try to define your language formally.

Related

How to write a language with Python-like indentation in syntax?

I'm writing a tool with it's own built-in language similar to Python. I want to make indentation meaningful in the syntax (so that tabs and spaces at line beginning would represent nesting of commands).
What is the best way to do this?
I've written recursive-descent and finite automata parsers before.
The current CPython's parser seems to be generated using something called ASDL.
Regarding the indentation you're asking for, it's done using special lexer tokens called INDENT and DEDENT. To replicate that, just implement those tokens in your lexer (that is pretty easy if you use a stack to store the starting columns of previous indented lines), and then plug them into your grammar as usual (like any other keyword or operator token).
Check out the python compiler and in particular compiler.parse.
I'd suggest ANTLR for any lexer/parser generation ( http://www.antlr.org ).
Also, this website ( http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/ ) has some more information, in particular:
Python’s indentation cannot be solved with a DFA. (I’m still perplexed at whether it can even be solved with a context-free grammar).
PyPy produced an interesting post about lexing Python (they intend to solve it using post-processing the lexer output)
CPython’s tokenizer is written in C. It’s ad-hoc, hand-written, and
complex. It is the only official implementation of Python lexing that
I know of.

Do production compilers use parser generators?

I've heard that "real compiler writers" roll their own handmade parser rather than using parser generators. I've also heard that parser generators don't cut it for real-world languages. Supposedly, there are many special cases that are difficult to implement using a parser generator. I have my doubts about this:
Theoretically, a GLR parser generator should be able to handle most programming language designs (except maybe C++...)
I know of at least one production language that uses a parser generator: Ruby [1].
When I took my compilers class in school, we used a parser generator.
So my question: Is it reasonable to write a production compiler using a parser generator, or is using a parser generator considered a poor design decision by the compiler community?
[1] https://github.com/ruby/ruby/blob/trunk/parse.y
For what it's worth, GCC used a parser generator pre-4.0 I believe, then switched to a hand written recursive descent parser because it was easier to maintain and extend.
Parser generators DO "cut it" for "real" languages, but the amount of work to transform your grammar into something workable grows exponentially.
Edit: link to the GCC document detailing the change with reasons and benefits vs cost analysis: http://gcc.gnu.org/wiki/New_C_Parser.
I worked for a company for a few years where we were more or less writing compilers. We weren't concerned much with performance; just reducing the amount of work/maintenance. We used a combination of generated parsers + handwritten code to achieve this. The ideal balance is to automate the easy, repetitive parts with the parser generator and then tackle the hard stuff in custom functions.
Sometimes a combination of both methods, is used, like generating code with a parser, and later, modifying "by hand" that code.
Other way is that some scanner (lexer) and parser tools allow them to add custom code, additional to the grammar rules, called "semantic actions". A good example of this case, is that, a parser detects generic identifiers, and some custom code, transform some specific identifiers into keywords.
EDIT:
add "semantic actions"

VBScript Partial Parser

I am trying to create a VBScript parser. I was wondering what is the best way to go about it. I have researched and researched. The most popular way seems to be going for something like Gold Parser or ANTLR.
The feature I want to implement is to do dynamic checking of Syntax Errors in VBScript. I do not want to compile the entire VBS every time some text changes. How do I go about doing that? I tried to use Gold Parser, but i assume there is no incremental way of doing parsing through it, something like partial parse trees...Any ideas on how to implement a partial parse tree for such a scenario?
I have implemented VBscript Parsing via GOLD Parser. However it is still not a partial parser, parses the entire script after every text change. Is there a way to build such a thing.
thks
If you really want to do incremental parsing, consider this paper by Tim Wagner.
It is brilliant scheme to keep existing parse trees around, shuffling mixtures of string fragments at the points of editing and parse trees representing the parts of the source text that hasn't changed, and reintegrating the strings into the set of parse trees. It is done using an incremental GLR parser.
It isn't easy to implement; I did just the GLR part and never got around to the incremental part.
The GLR part was well worth the trouble.
There are lots of papers on incremental parsing. This is one of the really good ones.
I'd first look for an existing VBScript parser instead of writing your own, which is not a trivial task!
Theres a VBScript grammar in BNF format on this page: http://rosettacode.org/wiki/BNF_Grammar which you can translate into a ANTLR (or some other parser generator) grammar.
Before trying to do fancy things like re-parsing only a part of the source, I recommend you first create a parser that actually works.
Best of luck!

When is better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
More generally speaking, I would say that most regular languages could be parsed more quickly by hand (using a regular expression).
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
Grammars tend to evolve, (as do requirements). Home brew parsers are difficult to maintain and lead to re-inventing the wheel example. If you think you can write a quick parser in java, you should know that it would be quicker to use any of the lex/yacc/compiler-compiler solutions. Lexers are easier to write, then you would want your own rule precedence semantics which are not easy to test or maintain. ANTLR also provides an ide for visualising AST, can you beat that mate. Added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.

Generate Fortran 77 parser from a yacc grammar using Happy (Haskell)

I have stumbled upon the following F77 yacc grammar: http://yaxx.cvs.sourceforge.net/viewvc/yaxx/yaxx/fortran/fortran.y?revision=1.3&view=markup.
How can I make a Fortran 77 parser out of this file using Happy?
Why is there some C?/C++? code in that .y file?
UPDATE: Thank you for your replies!
I've been playing with two fresh approaches for a while now:
extracting and modifiying the parser from the source code package bundled with a paper titled Parametric Fortran,
writing a grammar from scratch with the help of BNFC.
I've got both to parse simple code excerpts already. I'll keep people in the know should something usable come into existence within this century ^__^" hehe.
P/S: Want to see whether I could gather enough momentum on my own to initiate a project for an automatic differentiation engine to replace a binary-only one we depend on for the time being. For entertainment at the initial stages: I'm watching Love Shuffle! It's a very enjoyable J-Drama! Highly recommendable ...
The C is the semantic action for reducing the stack when the syntax is read in. These actions are in C because the definition is intended for Bison/Yacc which produces a C source file.
If you want to use Happy, port the BNF to the Happy definition syntax and write your semantics in Haskell.
Just the tip of the iceberg for getting anything useful however.
If you don't have a copy already, invest in the Dragon Book (Compilers: Principles, Techniques & tools by Aho, Lam, Sethi, Ullman - Pearson)
Why the other answers are true in the general sense, in that you'll need to write your own actions to do anything meaningful the Yacc definition that you linked to actually doesn't have any actions associated with the grammar rules. What it does is that it defines the yyerror function and some code for extracting values from yylval based on the token type.
If you have no clue what yyerror/yylval are about you should read a bison/flex tutorial. The Dragon book is also a good resource if you're more serious about this. There are also some excellent handouts from a Stanford course on compilers floating around the Net, which are based on the book.
You'll need an AST to build that can be constructed in an equivalent way to the C fragments in the Yacc file.
Use BNFC and write your own grammar from scratch! BNFC works wonders and you could do your parsing exactly as you desire.

Resources