ANTLR 2.7: Get a Stream of Objects from the Parser

I'm using ANTLR 2.7.6 to parse the messy output of another application. Sadly, I do not have the ability to upgrade to ANTLR 3, even though it has been out for quite a while. A log file of the sort I will be parsing is better conceptualized as a list of objects than as a tree of objects, and could be very large (>100 MB), so it is not practical to read it all into one AST. (My application is multithreaded and will process half a dozen to a dozen of these files at once, so memory will fill up quickly.) I want to be able to read out each of these objects as if from a stream, so I can process them one by one. Note that the objects themselves could be conceptualized as small trees. Is there a way to get my ANTLR parser to act like an object stream, an iterator, or something of that nature?
[See Javadoc for ANTLR 2.]
Edit: Here is a conceptual example of what I would like to do with the parser.
import java.io.FileReader;
import antlr.TokenStream;
import antlr.CharBuffer;
//...
FileReader fileReader = new FileReader(filepath);
TokenStream lexer = new MyExampleLexer(new CharBuffer(fileReader));
MyExampleParser parser = new MyExampleParser(lexer);
for (Object obj : parser)
{
    processObject(obj);
}
Am I perhaps working with the wrong paradigm of how to use an ANTLR parser? (I realize that the parser does not implement Iterator, but that is conceptually the sort of behavior I'm looking for.)

AFAIK, ANTLR v2.x buffers the creation of tokens. The parser takes a TokenBuffer, which in turn wraps a TokenStream. This TokenStream is polled through its nextToken() method whenever the parser needs more tokens.
In other words, if you provide the input source as a file, ANTLR does not read the entire file and tokenize it up front; tokens are created (and discarded) only as they are needed.
Note that I have never worked with ANTLR 2.x, so I could be wrong. Have you observed something different? If so, how do you offer the source to ANTLR: as a file, or as one big string? If it's the latter, I recommend providing a file instead.
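You can observe this on-demand behavior directly by polling the lexer yourself; a minimal sketch (reusing the lexer from the question's example; note that nextToken() throws the checked TokenStreamException):
import antlr.Token;
// ...
// pull tokens one at a time; each call to nextToken() produces exactly
// one new token, which is all the parser's TokenBuffer ever asks for
Token tok = lexer.nextToken();
while (tok.getType() != Token.EOF_TYPE) {
    System.out.println(tok);
    tok = lexer.nextToken();
}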
EDIT
Let's say you want to parse a file that consists of lines of numbers, delimited by white space (which you want to ignore). You also want your parser to process the file line by line, because collecting all numbers at once would result in memory problems.
You can do this by letting your main parser rule, parse, return a list of numbers for each line. When the EOF (end of file) is reached, it simply returns null instead of a list.
A demo using ANTLR 2.7.6:
file: My.g
class MyParser extends Parser;
parse returns [java.util.List<Integer> numbers]
{
    numbers = new java.util.ArrayList<Integer>();
}
    :  (n:Number {numbers.add(Integer.valueOf(n.getText()));})+ LineBreak
    |  EOF {numbers = null;}
    ;
class MyLexer extends Lexer;
Number
    :  ('0'..'9')+
    ;

LineBreak
    :  ('\r')? '\n'
    ;

Space
    :  (' ' | '\t') {$setType(Token.SKIP);}
    ;
file: Main.java
import antlr.*;
public class Main {
    public static void main(String[] args) throws Exception {
        MyLexer lexer = new MyLexer(new java.io.StringReader("1 2 3\n4 5 6 7 8\n9 10\n"));
        MyParser parser = new MyParser(new TokenBuffer(lexer));
        int line = 0;
        java.util.List<Integer> numbers = null;
        while ((numbers = parser.parse()) != null) {
            line++;
            System.out.println("line " + line + " = " + numbers);
        }
    }
}
To run the demo on:
*nix
java -cp antlr-2.7.6.jar antlr.Tool My.g
javac -cp antlr-2.7.6.jar *.java
java -cp .:antlr-2.7.6.jar Main
or on:
Windows
java -cp antlr-2.7.6.jar antlr.Tool My.g
javac -cp antlr-2.7.6.jar *.java
java -cp .;antlr-2.7.6.jar Main
which will produce the following output:
line 1 = [1, 2, 3]
line 2 = [4, 5, 6, 7, 8]
line 3 = [9, 10]
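And if you want the parser to look like the iterator from the question, you can put a thin adapter around parse(); a sketch (the wrapper class is made up, not something ANTLR generates):
import java.util.Iterator;
import java.util.List;

// Iterator facade over the parser: each next() triggers the parsing of
// exactly one line, so only one line's objects are in memory at a time.
public class ParserIterator implements Iterator<List<Integer>> {
    private final MyParser parser;
    private List<Integer> lookahead;

    public ParserIterator(MyParser parser) {
        this.parser = parser;
        advance();
    }

    private void advance() {
        try {
            lookahead = parser.parse();
        } catch (Exception e) {
            // Iterator methods cannot throw checked exceptions
            throw new RuntimeException(e);
        }
    }

    public boolean hasNext() { return lookahead != null; }

    public List<Integer> next() {
        List<Integer> current = lookahead;
        advance();
        return current;
    }
}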
Warning
Anyone trying this code, please note that this uses ANTLR 2.7.6. Unless you have a very compelling reason to use this version, it is highly recommended to use the latest stable version of ANTLR (v3.3 at the time of this writing).

Related

How can two Haskell programs exchange an integer value via stdin and stdout without treating the data as text?

I am interested in learning how to send data efficiently between Haskell programs using standard input and output. Suppose I want to pipe two programs together: "P1" outputs the number 5 to stdout, and "P2" takes an integer from stdin, adds 1, and outputs it to stdout again. Right now, the best way I know to do this involves outputting the data as text from P1, parsing that text back to an integer in P2, and proceeding from there. For example:
P1.hs:
module Main where
main = do
  print 5
P2.hs:
module Main where
main = fmap manipulateData getLine >>= print
  where
    manipulateData = (+ 1) . (read :: String -> Int)
Output:
$ (stack exec p1) | (stack exec p2)
6
I'd like to use standard i/o to send an integer without treating it as text, if possible. I'm assuming this still requires some sort of parsing to work, but I'm hoping it's possible to parse the data as binary and get a faster program.
Does Haskell have any way to make this straightforward? Since I am going from one fundamental Haskell datatype (Int) to the same type again with a pass through standard i/o in the middle, I'm wondering if there is an easy solution that doesn't require writing a custom binary parser (which I don't know how to do). Can anyone provide such a method?
Here is the code that I ended up with:
module Main where
import qualified Data.ByteString.Lazy as BS
import qualified Data.Binary as B
main :: IO ()
main = do
  dat <- BS.getContents
  print $ (B.decode dat :: Int) + 1
The other program uses similar imports and outputs 5 with the following line:
BS.putStr $ B.encode (5 :: Int)
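Putting that together, the complete P1 would look something like this (a sketch assembled from the description above):
module Main where

import qualified Data.ByteString.Lazy as BS
import qualified Data.Binary as B

-- write the binary encoding of the Int 5 to stdout
main :: IO ()
main = BS.putStr $ B.encode (5 :: Int)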
The resulting programs can be piped together, and the resulting program behaves as required.

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments, and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the documentation doesn't get me much further either. When I try to define the line used in that post, I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
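With that fix, the rule from the linked post would look something like this (a sketch; I've also added a * so the comment body can be any number of characters):
lexical SingleLineComment = "//" ![\n]* "\n";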
Using a context-free grammar in Rascal takes three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#TopNonTerminal, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[TopNonTerminal], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree, you can use visit and / (deep match) to extract information from it, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms for extracting information from a parse tree:
// produces the source location of each node in the tree:
myParseTree#\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect the locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree). It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t#\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex, only more so.
For the grammar you might be writing, which recognizes source-code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is metaphorically called an "island grammar": you write precise rules for the parts you want to recognize (the comments are the "islands") while leaving the rest as everything else (the "water"). See https://dl.acm.org/citation.cfm?id=837160

Running Antlr4 parser with lexer grammar gets token recognition errors

I'm trying to create a grammar to parse Solr queries (only mildly relevant, and you don't need to know anything about Solr to answer the question -- just know more than I do about ANTLR 4.7). I'm basing it on the QueryParser.jj file from Solr 6. I looked for an existing grammar, but there doesn't seem to be one that isn't old and out of date.
I'm stuck because when I try to run the parser I get "token recognition error"s.
The lexer I created uses lexer modes, which, as I understand it, means I need to have a separate lexer grammar file. So I have a parser file and a lexer file.
I whittled it down to a simple example to show what I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):
grammar Junk;
options {
    language = Java;
    tokenVocab = JLexer;
}
term : TERM '\r\n';
I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab option (as shown in the XML example on GitHub).
Here's the lexer (JLexer.g4):
lexer grammar JLexer;
TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;
If I copy the lexer code into the parser, then things work as expected (e.g., "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).
If I run the parser ("grun Junk term -tokens"), then I get:
line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[#0,4:5='\r\n',<'
'>,1:4]
I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.
I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.
Always trust your intuition! There is some convention internal to grun :-) See TestRig.java, c. lines 125 and 150. It would have been a lot nicer if some additional CLI args were added.
When the lexer and parser are compiled separately, the grammar name (insofar as TestRig is concerned) would, in your case, be "Junk", and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly, the header in the parser file JunkParser.g4 should be modified too:
parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff
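The lexer rules stay exactly as they were; only the file name and the grammar header change. JunkLexer.g4 would be:
lexer grammar JunkLexer;

TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;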
Now you can run your tests
> antlr4 JunkLexer
> antlr4 JunkParser
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[#0,0:3='aeee',<TERM>,1:0]
[#1,6:5='<EOF>',<EOF>,2:0]
>

Append text file to lexicon in Rascal

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster, but we haven't released that yet. For now, I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is used to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap, which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like this to make them unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
A completely different answer would be to dynamically update the grammar and generate a parser from it. This involves working against the internal grammar representation of Rascal, like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newType = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
parse(staticGrammar, "aap");
}
Now, having written all this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.

Haskell/Parsec: how do I use Text.Parsec.Token with Text.Parsec.Indent (from the indents package)

The indents package for Haskell's Parsec provides a way to parse indentation-style languages (like Haskell and Python). It redefines the Parser type, so how do you use the token parser functions exported by Parsec's Text.Parsec.Token module, which are of the normal Parser type?
Background
Parsec is a parser combinator library, whatever that means.
IndentParser 0.2.1 is an old package providing the two modules Text.ParserCombinators.Parsec.IndentParser and Text.ParserCombinators.Parsec.IndentParser.Token
indents 0.3.3 is a new package providing the single module Text.Parsec.Indent
Parsec comes with a load of modules. Most of them export a bunch of useful parsers (e.g. newline from Text.Parsec.Char, which parses a newline) or parser combinators (e.g. count n p from Text.Parsec.Combinator, which runs the parser p n times).
However, the module Text.Parsec.Token would like to export functions which are parametrized by the user with features of the language being parsed, so that, for example, the braces p function will run the parser p after parsing a '{' and before parsing a '}', ignoring things like comments, the syntax of which depends on your language.
The way that Text.Parsec.Token achieves this is that it exports a single function makeTokenParser, which you call, giving it the parameters of your specific language (like what a comment looks like) and it returns a record containing all of the functions in Text.Parsec.Token, adapted to your language as specified.
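Schematically, that pattern looks something like this (a sketch; haskellStyle is one of the ready-made language definitions shipped in Text.Parsec.Language):
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellStyle)

-- one record of token parsers, specialized to the language described
-- by haskellStyle; e.g. P.braces lexer p parses p between '{' and '}'
lexer :: P.TokenParser ()
lexer = P.makeTokenParser haskellStyle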
Of course, in an indentation-style language, these would need to be adapted further (perhaps? here's where I'm not sure – I'll explain in a moment) so I note that the (presumably obsolete) IndentParser package provides a module Text.ParserCombinators.Parsec.IndentParser.Token which looks to be a drop-in replacement for Text.Parsec.Token.
I should mention at some point that all the Parsec parsers are monadic functions, so they do magic things with state, which lets error messages say at what line and column in the source file the error appeared.
My Problem
For a couple of small reasons it appears to me that the indents package is more-or-less the current version of IndentParser, however it does not provide a module that looks like Text.ParserCombinators.Parsec.IndentParser.Token, it only provides Text.Parsec.Indent, so I am wondering how one goes about getting all the token parsers from Text.Parsec.Token (like reserved "something" which parses the reserved keyword "something", or like braces which I mentioned earlier).
It would appear to me that (the new) Text.Parsec.Indent works by some sort of monadic state magic to work out at what column bits of source code are, so that it doesn't need to modify the token parsers like whiteSpace from Text.Parsec.Token, which is probably why it doesn't provide a replacement module. But I am having a problem with types.
You see, without Text.Parsec.Indent, all my parsers are of type Parser Something where Something is the return type and Parser is a type alias defined in Text.Parsec.String as
type Parser = Parsec String ()
but with Text.Parsec.Indent, instead of importing Text.Parsec.String, I use my own definition
type Parser a = IndentParser String () a
which makes all my parsers of type IndentParser String () Something, where IndentParser is defined in Text.Parsec.Indent. But the token parsers that I'm getting from makeTokenParser in Text.Parsec.Token are of the wrong type.
If this isn't making much sense by now, it's because I'm a bit lost. The type issue is discussed a bit here.
The error I'm getting is that I've tried replacing the one definition of Parser above with the other, but then when I try to use one of the token parsers from Text.Parsec.Token, I get the compile error
Couldn't match expected type `Control.Monad.Trans.State.Lazy.State
Text.Parsec.Pos.SourcePos'
with actual type `Data.Functor.Identity.Identity'
Expected type: P.GenTokenParser
String
()
(Control.Monad.Trans.State.Lazy.State Text.Parsec.Pos.SourcePos)
Actual type: P.TokenParser ()
Links
Parsec
IndentParser (old package)
indents, providing Text.Parsec.Indent (new package)
some discussion of Parser types with example code
another example of using Text.Parsec.Indent
Sadly, neither of the examples above uses token parsers like those in Text.Parsec.Token.
What are you trying to do?
It sounds like you want to have your parsers defined everywhere as being of type
Parser Something
(where Something is the return type) and to make this work by hiding and redefining the Parser type which is normally imported from Text.Parsec.String or similar. You still need to import some of Text.Parsec.String, to make Stream an instance of a monad; do this with the line:
import Text.Parsec.String ()
Your definition of Parser is correct. Alternatively and equivalently (for those following the chat in the comments) you can use
import Control.Monad.State
import Text.Parsec.Pos (SourcePos)
type Parser = ParsecT String () (State SourcePos)
and possibly do away with the import Text.Parsec.Indent (IndentParser) in the file in which this definition appears.
Error, error on the wall
Your problem is that you're looking at the wrong part of the compiler error message. You're focusing on
Couldn't match expected type `State SourcePos' with actual type `Identity'
when you should be focusing on
Expected type: P.GenTokenParser ...
Actual type: P.TokenParser ...
It compiles!
Where you "import" parsers from Text.Parsec.Token, what you actually do, of course (as you briefly mentioned) is first to define a record your language parameters and then to pass this to the function makeTokenParser, which returns a record containing the token parsers.
You must therefore have some lines that look something like this:
import qualified Text.Parsec.Token as P
beetleDef :: P.LanguageDef st
beetleDef =
    haskellStyle {
        parameters, parameters etc.
    }

lexer :: P.TokenParser ()
lexer = P.makeTokenParser beetleDef
... but a P.LanguageDef st is just a GenLanguageDef String st Identity, and a P.TokenParser () is really a GenTokenParser String () Identity.
You must change your type declarations to the following:
import Control.Monad.State
import Text.Parsec.Pos (SourcePos)
import qualified Text.Parsec.Token as P
beetleDef :: P.GenLanguageDef String st (State SourcePos)
beetleDef =
    haskellStyle {
        parameters, parameters etc.
    }

lexer :: P.GenTokenParser String () (State SourcePos)
lexer = P.makeTokenParser beetleDef
... and that's it! This will allow your "imported" token parsers to have type ParsecT String () (State SourcePos) Something, instead of Parsec String () Something (which is an alias for ParsecT String () Identity Something) and your code should now compile.
(For maximum generality, I'm assuming that you might be defining the Parser type in a file separate from, and imported by, the file in which you define your actual parser functions. Hence the two repeated import statements.)
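As a quick check that the types now line up, a token parser taken from the lexer record matches the Parser alias defined earlier (a sketch; identifier is just one field of the returned record):
-- assumes: type Parser a = IndentParser String () a, as defined above
identifier :: Parser String
identifier = P.identifier lexer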
Thanks
Many thanks to Daniel Fischer for helping me with this.
