How to build a parser to parse Lucene syntax into an AST

I have a requirement, but I don't know much about the implementation details.
I have a query string like:
(title:java or author:john) and date:[20110303 TO 20110308]
Basically, the query string is written in Lucene syntax.
What I need to do is parse the query string into an AST and then convert the AST into a Lucene query.
I'm not familiar with compiler or parser technology, and I ran into the Irony project.
Can someone point me to how and where to start? Either Irony or a hand-written parser would be fine.
Thanks a lot.

Sorry for the late response:
Generally speaking, to create a parser it's best to describe the grammar in the abstract and then generate the parser with a parser generator.
I created the lucene-query-parser.js library using a PEG grammar, which is in the GitHub repo here. That grammar is specific to PEG.js and uses JavaScript to build an AST-style result for the parsed query.
It's not necessary to return an AST-style structure, but I found that to be the most useful for the project I wrote the syntax for. You could re-implement the grammar to return any sort of parser result you wanted.

If your query string is in Lucene syntax, then simply pass it to the parse(String) method of Lucene's QueryParser.
That will return a Query object representing the query string.
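For example, a minimal sketch (assuming a recent Lucene release, where the classic QueryParser takes a default field name and an Analyzer; the exact constructor has varied across versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        // "title" is the default field for terms without an explicit field.
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query query = parser.parse(
                "(title:java OR author:john) AND date:[20110303 TO 20110308]");
        // Query subclasses (BooleanQuery, TermQuery, TermRangeQuery, ...)
        // already form a tree you can walk much like an AST.
        System.out.println(query);
    }
}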
If you need to extend or modify the standard Lucene syntax, then you could start by looking at the JavaCC grammar for QueryParser.
Others have modified it in the past, for example to add support for regexps.

You could also look at the Myna parser which is a JavaScript parsing library that has a sample Lucene grammar. The Myna parser automatically generates an AST that you can easily transform into whatever form you want.

Related

"Batteries" for Parsec in Haskell

I am new to Haskell, and I have been trying to write a JSON parser using Parsec as an exercise. This has mostly been going well: I am able to parse lists and objects with relatively little code which is also readable (great!). However, for JSON I also need to parse primitives like
Integers (possibly signed)
Floats (possibly using scientific notation such as "3.4e-8")
Strings with e.g. escaped quotes
I was hoping to find ready-to-use parsers for things like these as part of Parsec. The closest I get is the Parsec.Token module (which defines integer and friends), but those parsers require a "language definition" that seems way beyond what I should have to make to parse something as simple as JSON; it appears to be designed for programming languages.
So my questions are:
Are the functions in Parsec.Token the right way to go here? If so, how to make a suitable language definition?
Are "primitive" parsers for integers etc defined somewhere else? Maybe in another package?
Am I supposed to write these kinds of low-level parsers myself? I can see myself reusing them frequently... (obscure scientific data formats etc.)
I have noticed that a question on this site says Megaparsec has these primitives included [1], but I suppose those cannot be used with Parsec.
Related questions:
How do I get Parsec to let me call `read` :: Int?
How to parse an Integer with parsec
Are the functions in Parsec.Token the right way to go here?
Yes, they are. If you don't care about the minutiae specified by a language definition (i.e. you don't plan to use the parsers which depend on them, such as identifier or reserved), just use emptyDef as a default:
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)

-- Build a token parser from the minimal language definition.
lexer = P.makeTokenParser emptyDef

-- A lexeme parser for integers (optional sign, skips trailing whitespace).
integer = P.integer lexer
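As a quick check, parse integer "" "-42" should evaluate to Right (-42), since the token-based integer parser handles an optional sign.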
As you noted, this feels unnecessarily clunky for your use case. It is worth mentioning that megaparsec (cf. Alec's suggestion) provides a corresponding integer parser without the ceremony. (The flip side is that megaparsec doesn't try to bake in support for e.g. reserved words, but that isn't difficult to implement in the cases where you actually need it.)

How to modify an XPath parser?

I want to modify an XPath parser, but I don't know where to start.
The source code looks like:
Any help would be appreciated :)
This looks like a "shift reduce" (see the S and Rs) table for an LALR parser. My guess is that it is produced by the GOLD parser which produces application-independent parse tables.
But you won't reasonably be able to modify this without the original grammar and the parser generator.
Why would you want to modify a perfectly working XPath parser anyway? If it isn't perfect, why don't you just use a perfect one?
It is meaningless to try to modify a parser if the set of rules and the codes for the terminal symbols (and the lexer) aren't available.
The provided code looks like the action table for a general, table-driven LR parser. However, you would also need the GOTO table, as sketched below.
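To make the ACTION/GOTO distinction concrete, here is a toy table-driven LR driver in Java for the grammar S -> ( S ) | x. The tables are hand-built for this tiny grammar purely for illustration (they are not derived from your code); generators like GOLD emit tables of exactly this shape:

import java.util.ArrayDeque;
import java.util.Deque;

public class ToyLR {
    // Terminal indices: 0 = '(', 1 = 'x', 2 = ')', 3 = end-of-input.
    // ACTION encoding: "s2" = shift to state 2, "r3" = reduce by rule 3,
    // "acc" = accept, null = syntax error.
    static final String[][] ACTION = {
        {"s2", "s3", null, null},   // state 0
        {null, null, null, "acc"},  // state 1
        {"s2", "s3", null, null},   // state 2
        {null, null, "r3", "r3"},   // state 3
        {null, null, "s5", null},   // state 4
        {null, null, "r2", "r2"},   // state 5
    };
    // GOTO[state] = state to enter after reducing to S (-1 = unused).
    static final int[] GOTO = {1, -1, 4, -1, -1, -1};
    // Rule 2: S -> ( S )  has 3 symbols; rule 3: S -> x  has 1.
    static final int[] RULE_LEN = {0, 0, 3, 1};

    static boolean parse(int[] input) {
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int i = 0;
        while (true) {
            String act = ACTION[states.peek()][input[i]];
            if (act == null) return false;            // syntax error
            if (act.equals("acc")) return true;       // accepted
            if (act.charAt(0) == 's') {               // shift
                states.push(Integer.parseInt(act.substring(1)));
                i++;
            } else {                                  // reduce
                int rule = Integer.parseInt(act.substring(1));
                for (int k = 0; k < RULE_LEN[rule]; k++) states.pop();
                states.push(GOTO[states.peek()]);     // GOTO on S
            }
        }
    }

    public static void main(String[] args) {
        // Tokens for "((x))": ( ( x ) ) $
        System.out.println(parse(new int[]{0, 0, 1, 2, 2, 3})); // true
        System.out.println(parse(new int[]{0, 1, 3}));          // false: "(x"
    }
}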
This whole approach is as unsound as reverse engineering. If you want, just build your own parser from scratch so that you'll have freedom and flexibility.

When is it better to use a parser such as ANTLR vs. writing your own parsing code?

I need to parse a simple DSL which looks like this:
funcA Type1 a (funcB Type1 b) ReturnType c
As I have no experience with grammar parsing tools, I thought it would be quicker to write a basic parser myself (in Java).
Would it be better, even for a simple DSL, for me to use something like ANTLR and construct a proper grammar definition?
Simple answer: when it is easier to write the rules describing your grammar than to write code that accepts the language described by your grammar.
If the only thing you need to parse looks exactly like what you've written above, then I would say you could just write it by hand.
More generally speaking, I would say that most regular languages could be parsed more quickly by hand (using a regular expression).
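Note that the sample line above has nesting (parentheses), so it is not regular; a tiny recursive-descent parser handles it easily, though. Here is a minimal sketch in Java, assuming a guessed grammar (a call is a name followed by arguments, where an argument is either a bare word or a parenthesized nested call); the real DSL may differ:

import java.util.ArrayList;
import java.util.List;

public class TinyDslParser {
    private final String[] tokens;
    private int pos = 0;

    TinyDslParser(String input) {
        // Tokenize: make parentheses their own tokens, then split on spaces.
        tokens = input.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    // call := WORD arg*   (a crude string stands in for an AST node here)
    String parseCall() {
        String name = tokens[pos++];
        List<String> args = new ArrayList<>();
        while (pos < tokens.length && !tokens[pos].equals(")")) {
            args.add(parseArg());
        }
        return name + args;
    }

    // arg := '(' call ')' | WORD
    String parseArg() {
        if (tokens[pos].equals("(")) {
            pos++;                       // consume '('
            String inner = parseCall();
            pos++;                       // consume ')'
            return "(" + inner + ")";
        }
        return tokens[pos++];
    }

    public static void main(String[] args) {
        String line = "funcA Type1 a (funcB Type1 b) ReturnType c";
        // Prints: funcA[Type1, a, (funcB[Type1, b]), ReturnType, c]
        System.out.println(new TinyDslParser(line).parseCall());
    }
}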
If you are parsing a context-free language with lots of rules and productions, ANTLR (or other parser generators) can make life much easier.
Also, if you have a simple language that you expect to grow more complicated in the future, it will be easier to add rule descriptions to an ANTLR grammar than to build them into a hand-coded parser.
Grammars tend to evolve (as do requirements). Home-brew parsers are difficult to maintain and lead to reinventing the wheel. If you think you can write a quick parser in Java, you should know that it would be quicker to use any of the lex/yacc/compiler-compiler solutions. Lexers are easier to write, but then you would need your own rule-precedence semantics, which are not easy to test or maintain. ANTLR also provides an IDE for visualising the AST; can you beat that? An added advantage is the ability to generate intermediate code using string templates, which is a different aspect altogether.
It's better to use an off-the-shelf parser (generator) such as ANTLR when you want to develop and use a custom language. It's better to write your own parser when your objective is to write a parser.
UNLESS you have a lot of experience writing parsers and can get a working parser that way more quickly than using ANTLR. But I surmise from your asking the question that this get-out clause does not apply.

Approaching Text Parsing in Scala

I'm making an application that will parse commands in Scala. An example of a command would be:
todo get milk for friday
So the plan is to have a pretty smart parser break the line apart and recognize the command part and the fact that there is a reference to time in the string.
In general, I need to make a tokenizer in Scala, so I'm wondering what my options are. I'm familiar with regular expressions, but I also plan on making an SQL-like search feature:
search todo for today with tags shopping
And I feel that regular expressions will be too inflexible for implementing commands with a lot of variation. This leads me to think of implementing some sort of grammar.
What are my options in this regard in Scala?
You want to search for "parser combinators". I have a blog post using this approach (http://cleverlytitled.blogspot.com/2009/04/shunting-yard-algorithm.html), but I think the best reference is this series of posts by Stefan Zeiger (http://szeiger.de/blog/2008/07/27/formal-language-processing-in-scala-part-1/).
Here are slides from a presentation I did in Sept. 2009 on Scala parser combinators. (http://sites.google.com/site/compulsiontocode/files/lambdalounge/ImplementingExternalDSLsUsingScalaParserCombinators.ppt) An implementation of a simple Logo-like language is demonstrated. It might provide some insights.
Scala has a parser library (scala.util.parsing.combinator) which enables one to write a parser directly from its EBNF specification. If you have an EBNF for your language, it should be easy to write the Scala parser. If not, you'd better first try to define your language formally.

What is a tree parser in ANTLR and am I forced to write one?

I'm writing a lexer/parser for a small subset of C in ANTLR that will be run in a Java environment. I'm new to the world of language grammars. In many of the ANTLR tutorials, they create an AST (Abstract Syntax Tree). Am I forced to create one, and why?
Creating an AST with ANTLR is incorporated into the grammar. You don't have to do this, but it is a really good tool for more complicated requirements. This is a tutorial on tree construction you can use.
Basically, with ANTLR, when the source is getting parsed you have a few options. You can generate code or an AST using rewrite rules in your grammar. An AST is basically an in-memory representation of your source. From there, there's a lot you can do.
There's a lot to ANTLR. If you haven't already, I would recommend getting the book.
I found this answer to the question on jGuru written by Terence Parr, who created ANTLR. I copied this explanation from the site linked here:
Only simple, so-called syntax directed translations can be done with actions within the parser. These kinds of translations can only spit out constructs that are functions of information already seen at that point in the parse. Tree parsers allow you to walk an intermediate form and manipulate that tree, gradually morphing it over several translation phases to a final form that can be easily printed back out as the new translation.
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
<html>
<head>
<title>There are 3 items</title>
</head>
<body>
<ol>
<li>Dog</li>
<li>Cat</li>
<li>Velociraptor</li>
</ol>
</body>
</html>
from input
Dog
Cat
Velociraptor
So with simple actions in your grammar, how can you compute the title? You can't without reading the whole input. OK, so now we know we need an intermediate form. The best is usually an AST, I've found, since it records the input structure. In this case it's just a list, but it demonstrates my point.
OK, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input 3+4, you would have the tree:
+
|
3 -- 4
and classes
class PlusNode extends CommonAST {
    public String toString() {
        AST left = getFirstChild();
        AST right = left.getNextSibling();
        return left + " + " + right;
    }
}

class IntNode extends CommonAST {
    public String toString() {
        return getText();
    }
}
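As a hypothetical usage sketch (assuming the ANTLR 2-era CommonAST API, setText and addChild, that these classes extend):

PlusNode plus = new PlusNode();
plus.setText("+");
IntNode three = new IntNode();
three.setText("3");
IntNode four = new IntNode();
four.setText("4");
plus.addChild(three);       // first child: 3
plus.addChild(four);        // its sibling: 4
System.out.println(plus);   // prints "3 + 4"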
Given an expression tree, you can translate it back to text with t.toString(). So, what's wrong with this? Seems to work great, right? It appears to work well in this case because it's simple, but I argue that, even for this simple example, tree grammars are more readable and are formalized descriptions of precisely what you coded in PlusNode.toString().
expr returns [String r]
{
    String left=null, right=null;
}
    : #("+" left=expr right=expr) {r = left + " + " + right;}
    | i:INT                       {r = i.getText();}
    ;
Note that the specific class ("heterogeneous AST") approach actually encodes a complete recursive-descent parser for #(+ INT INT) by hand in toString(). As parser generator folks, this should make you cringe. ;)
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke its toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
In summary, adding actions to your input parser works for very straightforward translations where:
the order of output constructs is the same as the input order
all constructs can be generated from information parsed up to the point when you need to spit them out
Beyond this, you will need an intermediate form--the AST is usually the best form. Using a grammar to describe the structure of the AST is analogous to using a grammar to parse your input text. Formalized descriptions in a domain-specific high-level language like ANTLR are better than hand-coded parsers. Actions within a tree grammar have very clear context and can conveniently access information passed from invoking rules. Translations that manipulate the tree for multipass translations are also much easier using a tree grammar.
I think the creation of the AST is optional. The Abstract Syntax Tree is useful for subsequent processing like semantic analysis of the parsed program.
Only you can decide if you need to create one. If your only objective is syntactic validation, then you don't need to generate one. In JavaCC (similar to ANTLR) there is a utility called JJTree that allows the generation of the AST, so I imagine this is optional in ANTLR as well.
