I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far. It kind of works, but it halts when a "segmentor" appears at the beginning of a text. For example, the text "and location" fails to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
@header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples. You can also write such a library by hand with little effort and tune it to whatever you need. I should mention that Scalaz (https://github.com/scalaz/scalaz) is worth using if you're writing a parser monad on your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexible if you just scan your text for the "segmentors" by walking over the input. This also lets you handle text of any size (e.g. by using memory-mapped files), while parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully before parsing starts.
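The scanning approach can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `SegmentSplitter`, its `Node` type, and the regular expressions chosen for each level are hypothetical, and for brevity it works on an in-memory String rather than a memory-mapped file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: splits text into nested levels
// by scanning for configurable "segmentor" patterns, outermost level first.
class SegmentSplitter {
    // A node holds its text and the sub-segments found at the next level.
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    private final List<Pattern> levels = new ArrayList<>();

    SegmentSplitter(List<String> segmentorRegexes) {
        for (String regex : segmentorRegexes) {
            levels.add(Pattern.compile(regex));
        }
    }

    Node split(String input) {
        return split(input, 0);
    }

    private Node split(String text, int level) {
        Node node = new Node(text);
        if (level == levels.size()) {
            return node; // innermost level: nothing left to split on
        }
        for (String piece : levels.get(level).split(text)) {
            String trimmed = piece.trim();
            // A segmentor at the very start only produces an empty piece, which is dropped.
            if (!trimmed.isEmpty()) {
                node.children.add(split(trimmed, level + 1));
            }
        }
        return node;
    }

    static void print(Node node, int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");
        System.out.println(indent + "[" + node.text + "]");
        for (Node child : node.children) print(child, depth + 1);
    }

    public static void main(String[] args) {
        SegmentSplitter splitter = new SegmentSplitter(Arrays.asList(
                "\\r?\\n",          // level 1: line breaks
                "\\.",              // level 2: full stops
                ",",                // level 3: commas
                "\\b(?:and|or)\\b"  // level 4: whole words only
        ));
        print(splitter.split("and location, name or city. next"), 0);
    }
}
```

Because empty pieces are simply skipped, a segmentor at the start of the input (as in "and location") needs no special handling.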
Is there a suggested practice on whether to label large alternate rules or not?
I thought that it would basically be a nice thing that you could get "for free", but in some very basic tests with two simple grammars that parse expressions (one with and one without labels), the performance is a toss-up, and the output size is almost 50% larger.
Here are the two grammars I used for testing:
grammar YesLabels;
root: (expr ';')* EOF;
expr
: '(' expr ')' # ParenExpr
| '-' expr # UMinusExpr
| '+' expr # UPlusExpr
| expr '*' expr # TimesExpr
| expr '/' expr # DivExpr
| expr '-' expr # MinusExpr
| expr '+' expr # PlusExpr
| Atom # AtomExpr
;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
grammar NoLabels;
root: (expr ';')* EOF;
expr
: '(' expr ')'
| '-' expr
| '+' expr
| expr '*' expr
| expr '/' expr
| expr '-' expr
| expr '+' expr
| Atom
;
Atom:
[a-z]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
Testing this on the following expression (repeated 100k times, ~ 2MB file):
1+1-2--(2+3/2-5)+4;
I get the following timings and size:
$ ant YesLabels root ../tests/expr.txt
real 0m1.096s # varies quite a bit, sometimes larger than the other
116K ./out
$ ant NoLabels root ../tests/expr.txt
real 0m0.821s # varies quite a bit, sometimes smaller than the other
80K ./out
So when you use labelled alternatives, you'll get more classes. This is (IMHO) the advantage, it makes listeners easier to target and each class is simpler with properties that apply only to that alternative.
While that means the executable will be larger, with a meaningfully sized parse tree the memory savings of the targeted class instances will probably make up for the difference in executable size. (In your example, the difference is 46K; that's nothing compared to the memory used by the parse tree, token stream, etc. of your actual running program.)
Your sample input is probably not big enough to really show any difference in performance, but, as mentioned elsewhere, you should first focus on a usable parse tree and then address size/performance should you determine that it's actually an issue.
Not everyone agrees, but in my opinion, the benefits of labelled alternatives are substantial, and the space/performance costs are negligible (if even measurable).
I'm trying to write a grammar for a Prolog interpreter. When I run grun from the command line on input like "father(john,mary).", I get a message saying "no viable input at 'father(john,'" and I don't know why. I've tried rearranging rules in my grammar, using different entry points, etc., but I still get the same error. I'm not even sure whether it's caused by my grammar or by something else, like ANTLR itself. Can someone point out what is wrong with my grammar, or suggest what the cause could be if it is not the grammar?
The commands I ran are:
antlr4 -no-listener -visitor Expr.g4
javac *.java
grun antlr.Expr start tests/test.txt -gui
And this is the resulting parse tree:
Here is my grammar:
grammar Expr;
@header{
package antlr;
}
//start rule
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound
| compound ':-' conjunction
;
conjunction : compound
| compound ',' conjunction
;
compound : Atom '(' elements ')'
| '.(' elements ')'
;
list : '[]'
| '[' element ']'
| '[' elements ']'
;
element : Term
| list
| compound
;
elements : element
| element ',' elements
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z]([a-z]|[A-Z]|[0-9]|'_')*
| '0'
;
Var : [A-Z]([a-z]|[A-Z]|[0-9]|'_')*
;
Term : Atom
| Var
;
The lexer will always produce the same tokens for any input. The lexer does not "listen" to what the parser is trying to match. The rules the lexer applies are quite simple:
1. try to match as many characters as possible
2. when 2 or more lexer rules match the same number of characters, let the rule defined first "win"
Because of the 2nd rule, the rule Term will never be matched. And moving the Term rule above Var and Atom will cause the latter rules to be never matched. The solution: "promote" the Term rule to a parser rule:
start : (program | query) EOF
;
program : (rule_ '.')*
;
query : conjunction '?'
;
rule_ : compound (':-' conjunction)?
;
conjunction : compound (',' conjunction)?
;
compound : Atom '(' elements ')'
| '.' '(' elements ')'
;
list : '[' elements? ']'
;
element : term
| list
| compound
;
elements : element (',' element)*
;
term : Atom
| Var
;
WS : [ \t\r\n]+ -> skip ;
Atom : [a-z] [a-zA-Z0-9_]*
| '0'
;
Var : [A-Z] [a-zA-Z0-9_]*
;
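The two lexer rules described above (longest match, then first-defined rule on a tie) can be illustrated with a toy tokenizer. This is a sketch, not ANTLR's implementation: the class `ToyLexer`, the `Punct` rule (standing in for the literal tokens of the grammar), and the regular expression used to mimic the old lexer rule Term are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer (not ANTLR's implementation) that applies the two rules:
// longest match wins, and on a tie the rule defined first wins.
class ToyLexer {
    // LinkedHashMap keeps the rules in definition order.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    ToyLexer rule(String name, String regex) {
        rules.put(name, Pattern.compile(regex));
        return this;
    }

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: an equally long later rule never replaces an earlier one.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestLength = m.end() - pos;
                    bestName = rule.getKey();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // mimic "-> skip"
                tokens.add(bestName + ":" + input.substring(pos, pos + bestLength));
            }
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyLexer lexer = new ToyLexer()
                .rule("WS", "[ \\t\\r\\n]+")
                .rule("Atom", "[a-z][a-zA-Z0-9_]*|0")
                .rule("Var", "[A-Z][a-zA-Z0-9_]*")
                .rule("Term", "[a-zA-Z][a-zA-Z0-9_]*|0") // same text as Atom or Var, defined last
                .rule("Punct", "[().,]");
        System.out.println(lexer.tokenize("father(john, Mary)."));
    }
}
```

With the rules in this order, no token is ever labelled Term: every identifier ties with Atom or Var, and the earlier rule wins. A parser that waits for a Term token therefore never sees one.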
I'm following the example given here-
https://datapsyche.wordpress.com/2014/10/23/back-to-learning-grammar-with-antlr/
which basically has following grammar-
grammar Simpleql;
statement : expr command* ;
expr : expr ('AND' | 'OR' | 'NOT') expr # expopexp
| expr expr # expexp
| predicate # predicexpr
| text # textexpr
| '(' expr ')' # exprgroup
;
predicate : text ('=' | '!=' | '>=' | '<=' | '>' | '<') text ;
command : '| show' text* # showcmd
| '| show' text (',' text)* # showcsv
;
text : NUMBER # numbertxt
| QTEXT # quotedtxt
| UQTEXT # unquotedtxt
;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'NOT' ;
EQUALS : '=' ;
NOTEQUALS : '!=' ;
GREQUALS : '>=' ;
LSEQUALS : '<=' ;
GREATERTHAN : '>' ;
LESSTHAN : '<' ;
NUMBER : DIGIT+
| DIGIT+ '.' DIGIT+
| '.' DIGIT+
;
QTEXT : '"' (ESC|.)*? '"' ;
UQTEXT : ~[ ()=,<>!\r\n]+ ;
fragment
DIGIT : [0-9] ;
fragment
ESC : '\\"' | '\\\\' ;
WS : [ \t\r\n]+ -> skip ;
When I pass input like this-
Abishek AND (country=India OR city=NY) LOGIN 404 | show name city
I get error- line 1:65 no viable alternative at input '<EOF>'
I went through a couple of SO posts related to the error but can't seem to be able to figure out what is wrong with the grammar.
I tried running your example and got a number of errors in ANTLRWorks 2. However, I was able to run it without any errors in the test rig, getting the following output:
(statement (expr (expr (expr (text Abishek)) AND (expr ( (expr (expr (predicate (text country) = (text India))) OR (expr (predicate (text city) = (text NY)))) ))) (expr (expr (text LOGIN)) (expr (text 404)))) (command | show (text name) (text city)))
The tree also matches the one shown on the website.
My guess is that the problem is your actual input. I've had problems in the past with ANTLR reading text from a file that was not encoded as ASCII/ANSI/UTF-8 or whatever works for the OS you are using. I ran into this when I saved a file on Linux from a Linux text editor and tried to run it on Windows with the same generated parser. So my recommendation is to re-save your text input, 'Abishek AND (country=India OR city=NY) LOGIN 404 | show name city', using a different encoding each time, in case this is the cause.
Note that you can also specify the encoding explicitly, like this or in a similar way:
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
This matters because, with the wrong encoding, the input is still parsed but its characters are decoded incorrectly, so no matches are found.
Let me know if it works after saving the file in a few different encodings and I'll try to help further. Hope this helps.
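To see why a mismatched encoding could produce this kind of failure, here is a small self-contained sketch. The class name `EncodingDemo` and the choice of UTF-16LE as the "wrong" file encoding are assumptions for illustration; the actual encoding of the asker's file is unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: the same bytes decoded with two different charsets.
class EncodingDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String query = "city=NY | show name";
        // Pretend the editor saved the file as UTF-16LE (assumed for this example).
        byte[] savedAsUtf16 = query.getBytes(StandardCharsets.UTF_16LE);

        String right = decode(savedAsUtf16, StandardCharsets.UTF_16LE);
        String wrong = decode(savedAsUtf16, StandardCharsets.UTF_8);

        System.out.println("decoded with matching charset equals input: " + right.equals(query));
        System.out.println("decoded as UTF-8 equals input: " + wrong.equals(query));
        // Every ASCII character is followed by a NUL character when misread as UTF-8.
        System.out.println(wrong.length() + " chars instead of " + query.length());
    }
}
```

The misread string contains extra characters between the visible ones, so a lexer would see different input than the text shown in an editor.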
I have a lexer with a token that is enabled or disabled depending on a boolean property.
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
@lexer::members {
public boolean phrases = true;
}
@parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in LBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after I set phrases to false, and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that as a guideline I should define my grammars to be unambiguous, but this is basically the way it has to end up looking: tokenizing a string with quotes in 2 different modes, depending on the situation.
I'm going to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. The fix was to keep two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.
I am trying to parse CSP (Communicating Sequential Processes) as described in the CSP Reference Manual. I have defined the following grammar rules.
assignment
: IDENT '=' processExpression
;
processExpression
: ( STOP
| SKIP
| chaos
| prefix
| prefixWithValue
| seqComposition
| interleaving
| externalChoice
....
seqComposition
: processExpression ';' processExpression
;
interleaving
: processExpression '|||' processExpression
;
externalChoice
: processExpression '[]' processExpression
;
Now ANTLR reports that
seqComposition
interleaving
externalChoice
are left recursive. Is there any way to remove this, or would I be better off using Bison/Flex for this type of grammar? (There are many such rules.)
Define a processTerm. Then write rules looking like
assignment
: IDENT '=' processExpression
;
processTerm
: ( STOP
| SKIP
| chaos
| prefix
...
processExpression
: ( processTerm
| processTerm ';' processExpression
| processTerm '|||' processExpression
| processTerm '[]' processExpression
....
If you want to keep things like seqComposition defined, I think that would be OK as well. But you need to make sure that the parsing of processExpression always consumes more text as you proceed through your rules.
Read the guide to removing left recursion on the ANTLR wiki. It helped me a lot.
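To make the "always consume more text" point concrete, here is a hand-written recursive-descent sketch of the rewritten rule shape. It is not code generated by ANTLR: the class `ProcessParser` is hypothetical, processTerm is reduced to the STOP and SKIP alternatives, and the input is assumed to be already split into tokens.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written sketch (not generated by ANTLR) of
//   processExpression : processTerm ((';' | '|||' | '[]') processExpression)? ;
class ProcessParser {
    private static final List<String> OPERATORS = Arrays.asList(";", "|||", "[]");
    private final List<String> tokens;
    private int pos = 0;

    ProcessParser(List<String> tokens) { this.tokens = tokens; }

    // Always consumes a processTerm before it may recurse, so every
    // recursive call starts further along in the input and must terminate.
    String processExpression() {
        String left = processTerm();
        if (pos < tokens.size() && OPERATORS.contains(tokens.get(pos))) {
            String op = tokens.get(pos++);
            String right = processExpression();
            return "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    // Simplified: only the STOP and SKIP alternatives of the original rule.
    String processTerm() {
        if (pos < tokens.size()) {
            String token = tokens.get(pos);
            if (token.equals("STOP") || token.equals("SKIP")) {
                pos++;
                return token;
            }
        }
        throw new IllegalStateException("expected STOP or SKIP at token " + pos);
    }

    static String parse(String... tokens) {
        ProcessParser parser = new ProcessParser(Arrays.asList(tokens));
        String tree = parser.processExpression();
        if (parser.pos != tokens.length) {
            throw new IllegalStateException("unexpected token at " + parser.pos);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("STOP", ";", "SKIP", "|||", "STOP")); // prints "(STOP ; (SKIP ||| STOP))"
    }
}
```

A left-recursive version would call processExpression() as its very first step without consuming any token, and so would recurse forever. Note that this right-recursive shape makes every operator right-associative with equal precedence; if CSP operators need different precedences, the rule would have to be layered further.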