Lark : how to pick only some patterns - parsing

I would like to extract only some structured patterns from a text file.
For example, in the text below:
blablabla
foo FUNC1 ; blabliblo blu
I would like to isolate only 'foo FUNC1 ;'.
I tried to use the Lark parser with the following grammar:
from lark import Lark
foo = Lark('''
start: statement*
statement: foo
         | anything
anything : /.+/
foo : "foo" ID ";"
ID : /_?[a-z][_a-z0-9]*/i
%import common.WS
%import common.NEWLINE
%ignore WS
%ignore NEWLINE
''',
    parser="lalr",
    propagate_positions=True)
But 'anything' captures everything. Is there a way to make it non-greedy, so that 'foo' can capture the given pattern?

You could solve this with priorities.
For parser="lalr", Lark supports priorities on terminals. So you could move "foo" into its own terminal and then assign that terminal a higher priority than the terminal used by anything (which keeps the default priority):
foo : FOO ID ";"
FOO.2: "foo"
Parsing your example string then results in:
start
  statement
    anything blablabla
  statement
    foo
      foo
      FUNC1
  statement
    anything blabliblo blu
For parser="earley", Lark supports priorities on rules, so you could use:
foo.2 : "foo" ID ";"
Parsing your example string then results in:
start
  statement
    anything blablabla
  statement
    foo FUNC1
  statement
    anything blabliblo blu
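If only the 'foo ID ;' pattern is needed and no parse tree is required, a plain regular expression may be enough. The following is a minimal sketch of that different technique, using Python's standard re module rather than Lark; the function name extract_foo is made up for illustration, and the ID pattern is copied from the grammar in the question.

```python
import re

# Same shape as the grammar rule  foo : "foo" ID ";"  with
# ID : /_?[a-z][_a-z0-9]*/i  (whitespace between tokens is optional,
# mirroring %ignore WS). This is a regex sketch, not a Lark grammar.
FOO_STATEMENT = re.compile(r"\bfoo\s*_?[a-z][_a-z0-9]*\s*;", re.IGNORECASE)


def extract_foo(text):
    """Return every 'foo ID ;' occurrence found in text, in order."""
    return [match.group(0) for match in FOO_STATEMENT.finditer(text)]


sample = "blablabla\nfoo FUNC1 ; blabliblo blu"
print(extract_foo(sample))
```

Unlike the priority-based grammars above, this discards the surrounding text instead of keeping it as 'anything' nodes.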


How can I define an ANTLR4 indentation block based grammar?

I am trying to define a language using ANTLR4 to generate its parser. While the language is actually a bit more complex, this is a tiny valid example of a file I want the parser to read, which triggers the problem I am trying to fix:
features \\ Keyword which initializes the "features" block
Server
mandatory \\ Relation word
FileSystem
OperatingSystem
optional \\ Relation word
Logging
The features keyword simply starts the block, while mandatory and optional are relation words. The remaining words are just simple words (called features in this context). What I want is to make Server a child of the features block, then mandatory and optional both children of Server, and finally FileSystem and OperatingSystem children of mandatory, and Logging a child of optional.
The following grammar is my attempt to achieve this structure:
grammar MyGrammar;
tokens {
INDENT,
DEDENT
}
@lexer::header {
from antlr_denter.DenterHelper import DenterHelper
from UVLParser import UVLParser
}
@lexer::members {
class UVLDenter(DenterHelper):
    def __init__(self, lexer, nl_token, indent_token, dedent_token, ignore_eof):
        super().__init__(nl_token, indent_token, dedent_token, ignore_eof)
        self.lexer: UVLLexer = lexer
    def pull_token(self):
        return super(UVLLexer, self.lexer).nextToken()
denter = None
def nextToken(self):
    if not self.denter:
        self.denter = self.UVLDenter(self, self.NL, UVLParser.INDENT, UVLParser.DEDENT, True)
    return self.denter.next_token()
}
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
//lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
NL: ('\r'? '\n' '\t');
I am using antlr-denter in order to manage indent and dedent.
Then, I am defining RELATION_WORD and WORD separately in the lexer.
Finally, the parser rules attempt to construct the structure I described before. I want the features word to be followed by a single child. Then, any child is a feature spec followed by any number of relations between an INDENT and a DEDENT. The same applies to relations: each is a relation spec followed by a similar set of children, and this pattern can nest indefinitely.
However, I can't manage to make the parser read this structure correctly. With the previous example as input, I am getting mandatory as child of Server, but not optional. Changing the example to this one:
features
Server
mandatory
optional
Logging
Both mandatory and optional are interpreted as children of mandatory. It must have something to do with INDENT and DEDENT interpretation to correctly find blocks, but I have been unable to find a solution so far.
Any ideas to fix this would be very welcome. Thanks in advance!
Try changing your child and relation rules as follows:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Explanation:
As @Kaby76 mentions, it's quite helpful to print out the token stream to understand how your parser sees the stream of tokens.
I've not used antlr-denter before, but given the way it plugs in, it appears that you won't get a token stream just by using the grun tool.
As a substitute, I made up explicit INDENT and DEDENT tokens (I used -> and <-, respectively).
Revised grammar:
grammar MyGrammar;
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
//lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
// Temporary
//NL: ('\r'? '\n' '\t');
NL: ('\r'? '\n' '\t') -> skip;
INDENT: '->';
DEDENT: '<-';
And I revised the input file to use the explicit tokens:
features
->Server
->mandatory
optional
->Logging
Just making this change, you'll notice that there are no <- tokens in your sample.
But, now I can dump the token stream:
➜ grun MyGrammar tokens -tokens < MGIn.txt
[#0,0:7='features',<'features'>,1:0]
[#1,12:13='->',<'->'>,2:3]
[#2,14:19='Server',<WORD>,2:5]
[#3,28:29='->',<'->'>,3:7]
[#4,30:38='mandatory',<RELATION_WORD>,3:9]
[#5,47:48='->',<'->'>,4:7]
[#6,49:56='optional',<RELATION_WORD>,4:9]
[#7,69:70='->',<'->'>,5:11]
[#8,71:77='Logging',<WORD>,5:13]
[#9,78:77='<EOF>',<EOF>,5:20]
Now let's try parsing:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 4:9 mismatched input 'optional' expecting {WORD, '<-'}
line 5:20 mismatched input '<EOF>' expecting {'.', '->'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory) ->) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
So, your grammar calls for 'mandatory' (as a RELATION_WORD) to be followed by an INDENT as well as a DEDENT, neither of which is present here. Their absence makes sense, as 'mandatory' doesn't have any children in this input. So it seems that the INDENT/DEDENT pair needs to be tied to whether there are any children.
Let's change that:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Try again:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 5:20 extraneous input '<EOF>' expecting {WORD, '<-'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory)) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
Now we're missing a <- (DEDENT) at EOF. The solution to this depends on whether antlr-denter closes all the INDENTs at <EOF>.
Assuming it does, my fake input should look something like:
features
->Server
->mandatory
optional
->Logging
<-
<-
<-
and, we try again:
grun MyGrammar feature_model -gui < MGIn.txt
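To see which INDENT/DEDENT tokens an indentation-aware lexer would be expected to feed the parser, the bookkeeping can be sketched in plain Python. This is only an illustration of the general indent-stack technique, not antlr-denter's actual implementation; the function name indent_tokens and the four-space indentation of the sample are assumptions, and it closes every open block at end of input, as assumed in the answer above.

```python
def indent_tokens(text):
    """Turn indented lines into a flat list of words plus INDENT/DEDENT markers.

    Sketch only: assumes spaces for indentation and that a dedent always
    returns to a previously seen indentation width.
    """
    tokens = []
    stack = [0]  # indentation widths of the currently open blocks
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines do not affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            tokens.append("INDENT")
        while width < stack[-1]:
            stack.pop()
            tokens.append("DEDENT")
        tokens.append(line.strip())
    # Close every block that is still open at end of input.
    while len(stack) > 1:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


sample = (
    "features\n"
    "    Server\n"
    "        mandatory\n"
    "        optional\n"
    "            Logging\n"
)
print(indent_tokens(sample))
```

The three trailing DEDENT markers correspond to the three <- lines added to the fake input, and 'mandatory' is followed directly by 'optional' with no INDENT/DEDENT pair, which is why the pair has to be optional in the child and relation rules.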

How to specify beginning-of-line keywords in ANTLR grammars (which also works for the first input line)

This is a question about the remaining problem of the solution proposed for another Stackoverflow question about beginning-of-line keywords.
I am writing an ANTLR4 lexer and parser for a programming language where something is a keyword in case it is the first non-whitespace token of a line. Let me explain this with an example. Suppose "bla" is a keyword then in the following example:
foo bla
bla foo foo
foo bla bla
the second "bla" should be recognized as a keyword but the others shouldn't.
In order to achieve this I have defined the following simple ANTLR4 grammar:
grammar foobla;
// PARSER
main
: line* EOF
;
line
: indicator text*
;
indicator
: foo
| bla
;
foo: FOO ;
bla: BLA ;
text: TEXT ;
// LEXER
WHITESPACE: [ \t] -> skip ;
fragment NL: [\n\r\f]+[ \t]* ;
fragment NONNL: ~[\n\r\f] ;
// Indicators
FOO: NL 'foo' ;
BLA: NL 'bla' ;
TEXT: NONNL+ ;
This is similar to the answer given in How to detect beginning of line, or: "The name 'getCharPositionInLine' does not exist in the current context".
Now my question. This works fine, except in case the "bla" or "foo" keyword is used in the first line of the input program. I can think of 2 ways to solve this but I don't know how this can be achieved:
1. Use something like a BOF (beginning of file) token. However, I can't find such a concept in the manual.
2. Use a hook to dynamically add a new line at the beginning of the input file before the parsing starts, preferably by specifying something in the g4 file itself. I couldn't find this in the manual either.
I don't want to write an extra application/wrapper to add a new line to the input file just because of this.
Here's another idea:
In your BLA lexer rule, add a predicate which checks the end of the token stream (where the BLA token has not yet been added) to see which line the last non-whitespace token was on. If that line differs from the current token's line, you know the BLA token really is a BLA token; otherwise set its type to IDENTIFIER.
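The line comparison described here can be sketched outside ANTLR. The following Python snippet only illustrates the decision logic; the names tag_tokens and KEYWORDS are hypothetical, and a real ANTLR lexer would perform the same check in a predicate or action in the target language.

```python
import re

KEYWORDS = {"foo", "bla"}  # words that are keywords only at the start of a line


def tag_tokens(source):
    """Tag each word as KEYWORD or IDENTIFIER using the previous token's line."""
    tagged = []
    last_line = 0  # no token emitted yet, so line 1 counts as a new line
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in re.finditer(r"\S+", line):
            word = match.group()
            starts_line = line_no != last_line
            kind = "KEYWORD" if starts_line and word in KEYWORDS else "IDENTIFIER"
            tagged.append((kind, word))
            last_line = line_no
    return tagged


for kind, word in tag_tokens("foo bla\nbla foo foo\nfoo bla bla"):
    print(kind, word)
```

Because the comparison starts from "no previous token", the first line of the input needs no special BOF token or injected newline.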

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis in which the student did the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing my work on this thesis to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea of how I can map the "1" to the correct token VALUE, or a suggestion of how I can alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary for the parser to correctly recognize the line of code; otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. In case two tokens match the same number of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
I.e., I removed VALUE, added the parser rule value, and removed fragment from REAL.
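The longest-match rule with its first-defined tie-break can be illustrated with a small Python sketch. This is not how ANTLR is implemented internally; it only mimics the selection rule described in this answer, and the rule table and function names are made up for the example (the REAL pattern is a regex transcription of the DIG/FLT/EXP fragments).

```python
import re

# Rules in definition order; on a tie in match length the earlier rule wins.
RULES = [
    ("RES", re.compile(r"[Rr]\d+")),
    ("NODE", re.compile(r"\d+")),
    ("REAL", re.compile(r"(?:\d+\.\d+|\.\d+|\d+\.|\d+)(?:[Ee][+-]?\d+)?")),
]


def tokenize(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match equally far.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(tokenize("R1 5 0 1"))
print(tokenize("R1 5 0 1.1"))
```

In the first input the final "1" comes out as NODE, in the second "1.1" comes out as REAL, which is why the parser rule value has to accept both token types.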
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
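The effect of excluding the line break from the comment can be checked with an equivalent regular expression. This is a small Python sketch with a hypothetical helper name, split_comment; the character class [^\r\n] plays the role of ~('\r' | '\n').

```python
import re

# Regex counterpart of  COMMENT : '%' ~('\r' | '\n')*;
COMMENT = re.compile(r"%[^\r\n]*")


def split_comment(line):
    """Return (comment, remainder) for the first comment in line, or (None, line)."""
    match = COMMENT.search(line)
    if match is None:
        return None, line
    return match.group(), line[match.end():]


print(split_comment("R1 5 0 1 % load resistor\n"))
```

The remainder still contains the line break, so a single NEWLINE is left over to terminate the resistor rule.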

Unable to parse a complex language with regex and Scala parser combinators

I'm trying to write a parser for a certain language as part of my research. Currently I have problems getting the following code to work in a way I want:
private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}
private def _headword[String] = "\".*\"".r | "[^(),]*".r
private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
case Some(mods) => mods
case None => List[UWModifier]()
}
private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}
private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
case Nil => List[UWAttribute]()
case attrs => attrs
}
private def _attribute: Parser[UWAttribute] = ".#" ~> "[^>.]*".r ^^ (new UWAttribute(_))
The above code contains just one part of the language, and to spare time and space, I won't go much into details about the whole language. _uw method is supposed to parse a string that consists of three parts, although just the first part must exist in the string.
_uw should be able to parse these test strings correctly:
test0
test1.#attr
"test2"
"test3".#attr
test4..
test5..#attr
"test6..".#attr
"test7.#attr".#attr
test8(urel>uw)
test9(urel>uw).#attr
"test10..().#"(urel>uw).#attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).#attr1.#attr2
So if the headword starts and ends with ", everything inside the double quotes is considered to be part of the headword. All words starting with .#, if they are not inside the double quotes, are attributes of the headword.
E.g. in test5, the parser should parse test5. as headword, and attr as an attribute. Just .# is omitted, and all dots before that should be contained in the headword.
So, after headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .# is considered as part of the headword.
The main problem is "[^#(]*".r. I've tried all kinds of creative alternatives, such as "(^[\\w\\.]*)((\\.\\#)|$)".r, but nothing seems to work. How do lookahead and lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!
I don't think "[^#(]*".r has anything to do with your problem. I see this:
private def _headword[String] = "\".*\"".r | "[^(),]*".r
which is the first thing in _uw (and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..#attr, the second regexp will match all of it!
scala> "[^(),]*".r findFirstIn "test5..#attr"
res0: Option[String] = Some(test5..#attr)
So there will be nothing left for the remaining parsers. The first regex in _headword is also problematic, because .* will accept quotes, which means that something like this becomes valid:
"test6 with a " inside of it..".#attr
As for look-ahead and look-behind, they don't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
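The two observations about _headword can be reproduced with Python's re module, whose match function, like the regex parsers here, only looks for a match at the start of the remaining input. The sketch below also tries two narrower patterns; STRICT_QUOTED and BOUNDED_HEADWORD are suggestions that are not part of the original answer, and they would still need to be checked against the full set of test strings.

```python
import re

GREEDY_HEADWORD = re.compile(r"[^(),]*")   # second alternative of _headword
GREEDY_QUOTED = re.compile(r'".*"')        # first alternative of _headword
# Assumed alternatives, for illustration only:
STRICT_QUOTED = re.compile(r'"[^"]*"')     # stop at the first closing quote
BOUNDED_HEADWORD = re.compile(r"(?:(?!\.#)[^(),])*")  # stop before ".#"


def prefix(pattern, text):
    """Return the prefix of text matched by pattern, or None if it does not match."""
    match = pattern.match(text)
    return match.group() if match else None


print(prefix(GREEDY_HEADWORD, "test5..#attr"))
print(prefix(BOUNDED_HEADWORD, "test5..#attr"))
print(prefix(GREEDY_QUOTED, '"test6 with a " inside of it..".#attr'))
print(prefix(STRICT_QUOTED, '"test6 with a " inside of it..".#attr'))
```

The negative look-ahead in BOUNDED_HEADWORD simply changes where the regex stops, which leaves the ".#attr" part in the input for the attribute parser; the combinators themselves are unaware of it.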

string format check

Suppose I have string variables like following:
s1="10$"
s2="10$ I am a student"
s3="10$Good"
s4="10$ Nice weekend!"
As you can see above, s2 and s4 have white space(s) after 10$.
Generally, I would like a way to check whether a string starts with 10$ and has white space(s) after 10$. For example, the rule should find s2 and s4 in the case above. How can I define such a rule?
What I mean is that something like s2.RULE? should return true or false to tell whether the string matches.
---------- update -------------------
Please also give the solution if 10# is used instead of 10$.
You can do this using Regular Expressions (Ruby has Perl-style regular expressions, to be exact).
# For ease of demonstration, I've moved your strings into an array
strings = [
"10$",
"10$ I am a student",
"10$Good",
"10$ Nice weekend!"
]
p strings.find_all { |s| s =~ /\A10\$[ \t]+/ }
The regular expression breaks down like this:
The / at the beginning and the end tell Ruby that everything in between is part of the regular expression
\A matches the beginning of a string
The 10 is matched verbatim
\$ means to match a $ verbatim. We need to escape it since $ has a special meaning in regular expressions.
[ \t]+ means "match at least one blank and/or tab"
So this regular expression says "Match every string that starts with 10$ followed by at least one blank or tab character". Using =~ you can test strings in Ruby against this expression. On a match, =~ returns a non-nil value, which evaluates to true if used in a conditional like if.
Edit: Updated white space matching as per Asmageddon's suggestion.
this works:
"10$ " =~ /^10\$ +/
and returns nil when there is no match or 0 (the position of the match) when there is one. Since only nil and false are falsy in Ruby, you can use the result directly in a conditional.
Use a regular expression like this one:
/\A10\$\s+/
EDIT
If you use =~ for matching, note that
The =~ operator returns the character position in the string of the
start of the match
So it might return 0 to denote a match. Only a return of nil means no match.
See for example http://www.regular-expressions.info/ruby.html for a regular expression tutorial for Ruby.
If you want to proceed to cases with $ and # then try this regular expression:
/^10[\$#] +/
