ANTLR4 not able to parse normal text

I want to parse a LaTeX file so that I can convert it to HTML. I started with a minimal text file, but the tree produced by "grun Tex_grammar start -gui" shows errors. I think there is an error in the grammar that I am not able to figure out.
Below is my grammar:
///////////////////////////////////////////////////////
grammar Tex_grammar;
r : START BEGINDOC body_text ENDDOC EOF ; //
body_text: body_text body_parts
| body_parts
;
body_parts: section
| subsection
;
section: section (SECTION'{' name'}') text ('\r\n')*
|(SECTION'{' name'}') text ('\r\n')*
;
subsection: subsection (SUBSECTION'{' name'}') text ('\r\n')*
|(SUBSECTION'{' name'}') text ('\r\n')*
;
name : WORD ' ' WORD*;
text: text name* '\r|\n'
|name '\r|\n'
|' '+
;
\\Lexer
START: '\\documentclass{article}';
BEGINDOC: '\\begin{document}';
ENDDOC: '\\end{document}';
SECTION : '\\section';
SUBSECTION : '\\subsection';
INTEGER : [0-9]+ ;
WORD : (((PUNCTUATION)|[a-zA-Z0-9]|(GREEK_LETTERS))+|'('|')');
GREEK_LETTERS : '\\alpha'
|'\\beta'|'\\gamma'|'\\delta'|'\\epsilon';
fragment LETTERS : [a-zA-Z];
fragment PUNCTUATION : ('.'|'\'|!|'?'|:|;|');

Related

How to highlight QScintilla using ANTLR4?

I'm trying to learn ANTLR4 and I'm already having some issues with my first experiment.
The goal here is to learn how to use ANTLR to syntax highlight a QScintilla component. To practice a little bit I've decided I'd like to learn how to properly highlight *.ini files.
First things first, in order to run the mcve you'll need:
Download antlr4 and make sure it works, read the instructions on the main site
Install python antlr runtime, just do: pip install antlr4-python3-runtime
Generate the lexer/parser of ini.g4:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : STRING '=' STRING;
COMMENT : ';' ~[\r\n]*;
STRING : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
by running antlr ini.g4 -Dlanguage=Python3 -o ini
Finally, save main.py:
import textwrap
from PyQt5.Qt import *
from PyQt5.Qsci import QsciScintilla, QsciLexerCustom
from antlr4 import *
from ini.iniLexer import iniLexer
from ini.iniParser import iniParser
class QsciIniLexer(QsciLexerCustom):

    def __init__(self, parent=None):
        super().__init__(parent=parent)
        lst = [
            {'bold': False, 'foreground': '#f92472', 'italic': False},  # 0 - deeppink
            {'bold': False, 'foreground': '#e7db74', 'italic': False},  # 1 - khaki (yellowish)
            {'bold': False, 'foreground': '#74705d', 'italic': False},  # 2 - dimgray
            {'bold': False, 'foreground': '#f8f8f2', 'italic': False},  # 3 - whitesmoke
        ]
        style = {
            "T__0": lst[3],
            "T__1": lst[3],
            "T__2": lst[3],
            "COMMENT": lst[2],
            "STRING": lst[0],
            "WS": lst[3],
        }
        for token in iniLexer.ruleNames:
            token_style = style[token]
            foreground = token_style.get("foreground", None)
            background = token_style.get("background", None)
            bold = token_style.get("bold", None)
            italic = token_style.get("italic", None)
            underline = token_style.get("underline", None)
            index = getattr(iniLexer, token)
            if foreground:
                self.setColor(QColor(foreground), index)
            if background:
                self.setPaper(QColor(background), index)

    def defaultPaper(self, style):
        return QColor("#272822")

    def language(self):
        return self.lexer.grammarFileName

    def styleText(self, start, end):
        view = self.editor()
        code = view.text()
        lexer = iniLexer(InputStream(code))
        stream = CommonTokenStream(lexer)
        parser = iniParser(stream)
        tree = parser.start()
        print('parsing'.center(80, '-'))
        print(tree.toStringTree(recog=parser))
        lexer.reset()
        self.startStyling(0)
        print('lexing'.center(80, '-'))
        while True:
            t = lexer.nextToken()
            print(lexer.ruleNames[t.type-1], repr(t.text))
            if t.type != -1:
                len_value = len(t.text)
                self.setStyling(len_value, t.type)
            else:
                break

    def description(self, style_nr):
        return str(style_nr)


if __name__ == '__main__':
    app = QApplication([])
    v = QsciScintilla()
    lexer = QsciIniLexer(v)
    v.setLexer(lexer)
    v.setText(textwrap.dedent("""\
        ; Comment outside
        [section s1]
        ; Comment inside
        a = 1
        b = 2
        [section s2]
        c = 3 ; Comment right side
        d = e
    """))
    v.show()
    app.exec_()
and run it, if everything went well you should get this outcome:
Here are my questions:
As you can see, the outcome of the demo is far from usable; it is really distracting. Instead, I'd like behaviour similar to what IDEs provide. Unfortunately, I don't know how to achieve that. How would you modify the snippet to get such behaviour?
Right now I'm trying to mimic the highlighting in the snapshot below:
In that screenshot the highlighting differs for variable assignments (variable = deeppink, value = yellowish), but I don't know how to achieve that. I've tried this slightly modified grammar:
grammar ini;
start : section (option)*;
section : '[' STRING ']';
option : VARIABLE '=' VALUE;
COMMENT : ';' ~[\r\n]*;
VARIABLE : [a-zA-Z0-9]+;
VALUE : [a-zA-Z0-9]+;
WS : [ \t\n\r]+;
and then changing the styles to:
style = {
"T__0": lst[3],
"T__1": lst[3],
"T__2": lst[3],
"COMMENT": lst[2],
"VARIABLE": lst[0],
"VALUE": lst[1],
"WS": lst[3],
}
but if you look at the lexing output, you'll see there is no distinction between VARIABLE and VALUE, because of rule precedence in the ANTLR grammar: both rules match the same input, so the one defined first always wins. So my question is: how would you modify the grammar/snippet to achieve this visual appearance?
The problem is that the lexer needs to be context sensitive: everything on the left hand side of the = needs to be a variable, and to the right of it a value. You can do this by using ANTLR's lexical modes. You start off by classifying successive non-spaces as being a variable, and when encountering a =, you move into your value-mode. When inside the value-mode, you pop out of this mode whenever you encounter a line break.
Note that lexical modes only work in a lexer grammar, not the combined grammar you now have. Also, for syntax highlighting, you probably only need the lexer.
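The mode-switching idea described above can be illustrated without ANTLR at all. What follows is only a hand-written sketch in plain Python, not ANTLR's implementation: the function name `tokenize_ini` and the rule tables are made up for illustration, rules are tried in order rather than by longest match, and KEY excludes `=` so that `a=1` splits into three tokens.

```python
import re

# Rule tables per mode, tried in order (illustrative names, not ANTLR API).
DEFAULT_RULES = [
    ("SECTION", re.compile(r"\[[^\]]+\]")),
    ("COMMENT", re.compile(r";[^\r\n]*")),
    ("ASSIGN", re.compile(r"=")),
    ("SPACES", re.compile(r"[ \t\r\n]+")),
    ("KEY", re.compile(r"[^ \t\r\n=]+")),
]
VALUE_RULES = [
    ("SPACES", re.compile(r"[ \t]+")),
    ("COMMENT", re.compile(r";[^\r\n]*")),
    ("NEWLINE", re.compile(r"[\r\n]+")),
    ("VALUE", re.compile(r"[^ \t\r\n]+")),
]


def tokenize_ini(text):
    """Return (token_name, lexeme) pairs, dropping whitespace tokens."""
    tokens = []
    mode = "default"
    pos = 0
    while pos < len(text):
        rules = VALUE_RULES if mode == "value" else DEFAULT_RULES
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                break
        else:
            # Defensive fallback: emit a single unmatched character.
            tokens.append(("UNRECOGNIZED", text[pos]))
            pos += 1
            continue
        if name == "ASSIGN":
            mode = "value"      # after '=', lex the rest of the line as a value
        elif name == "NEWLINE":
            mode = "default"    # a line break ends the value
        if name not in ("SPACES", "NEWLINE"):
            tokens.append((name, match.group()))
        pos = match.end()
    return tokens
```

The same text (`[a-zA-Z0-9]+`, say) is classified as KEY or VALUE purely by which mode the lexer is in when it reaches it, which is exactly what a single flat rule list cannot express.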
Here's a quick demo of how this could work (stick it in a file called IniLexer.g4):
lexer grammar IniLexer;
SECTION
: '[' ~[\]]+ ']'
;
COMMENT
: ';' ~[\r\n]*
;
ASSIGN
: '=' -> pushMode(VALUE_MODE)
;
KEY
: ~[ \t\r\n]+
;
SPACES
: [ \t\r\n]+ -> skip
;
UNRECOGNIZED
: .
;
mode VALUE_MODE;
VALUE_MODE_SPACES
: [ \t]+ -> skip
;
VALUE
: ~[ \t\r\n]+
;
VALUE_MODE_COMMENT
: ';' ~[\r\n]* -> type(COMMENT)
;
VALUE_MODE_NL
: [\r\n]+ -> skip, popMode
;
If you now run the following script:
source = """
; Comment outside
[section s1]
; Comment inside
a = 1
b = 2
[section s2]
c = 3 ; Comment right side
d = e
"""
lexer = IniLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()
for token in stream.tokens[:-1]:
print("{0:<25} '{1}'".format(IniLexer.symbolicNames[token.type], token.text))
you will see the following output:
COMMENT '; Comment outside'
SECTION '[section s1]'
COMMENT '; Comment inside'
KEY 'a'
ASSIGN '='
VALUE '1'
KEY 'b'
ASSIGN '='
VALUE '2'
SECTION '[section s2]'
KEY 'c'
ASSIGN '='
VALUE '3'
COMMENT '; Comment right side'
KEY 'd'
ASSIGN '='
VALUE 'e'
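One practical wrinkle when feeding such a token stream to QScintilla: this lexer skips whitespace, whereas the styleText loop in the question advances the styling position by len(t.text) per token, so the colours would drift after the first skipped space. The sketch below is one way to compensate, under two assumptions that should be checked against your versions: that token offsets are inclusive character indexes (as the start/stop attributes of ANTLR tokens appear to be), and that setStyling counts lengths in UTF-8 bytes. The helper name `styling_runs` is hypothetical.

```python
def styling_runs(text, tokens, default_style=0):
    """Turn (start, stop, style) token spans into (byte_length, style) runs.

    start and stop are inclusive character indexes into text (assumption).
    Gaps between tokens, e.g. skipped whitespace, receive default_style.
    """
    runs = []
    cursor = 0  # first character index not yet covered by a run

    def add(chunk, style):
        # Lengths are measured in UTF-8 bytes (assumption about the editor).
        length = len(chunk.encode("utf-8"))
        if length:
            runs.append((length, style))

    for start, stop, style in sorted(tokens):
        add(text[cursor:start], default_style)  # gap before this token
        add(text[start:stop + 1], style)        # the token itself
        cursor = stop + 1
    add(text[cursor:], default_style)           # trailing gap
    return runs
```

Inside styleText, each resulting pair would be passed to self.setStyling(length, style) after self.startStyling(0), with the EOF token left out of the input list.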
And an accompanying parser grammar could look like this:
parser grammar IniParser;
options {
tokenVocab=IniLexer;
}
sections
: section* EOF
;
section
: COMMENT
| SECTION section_atom*
;
section_atom
: COMMENT
| KEY ASSIGN VALUE
;
which would parse your example input in the following parse tree:
I already implemented something like this in C++:
https://github.com/tora-tool/tora/blob/master/src/editor/tosqltext.cpp
I subclassed the QScintilla class and implemented a custom lexer based on ANTLR-generated source.
You could even use the ANTLR parser (I did not). QScintilla allows more than one analyzer (each with a different weight), so you can periodically run a semantic check on the text. What cannot be done easily in QScintilla is associating a token with additional data.
Syntax highlighting in Scintilla is done by dedicated highlighter classes, which are lexers. A parser is not well suited to this kind of work, because syntax highlighting must keep working even when the input contains errors. A parser is a tool for verifying the correctness of the input; these are two entirely different tasks.
So I recommend you stop thinking about using ANTLR4 for this, take one of the existing Lex classes, and create a new one for the language you want to highlight.

changing text of rule in antlr4 using setText

I want to change every entry in a CSV file to 'BlahBlah'.
For that I have this ANTLR grammar:
grammar CSV;
file : hdr row* row1;
hdr : row;
row : field (',' value1=field)* '\r'? '\n'; // '\r' is optional at the end of a row of CSV file ..
row1 : field (',' field)* '\r'? '\n'?;
field
: TEXT
{
$setText("BlahBlah");
}
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""' | ~'"')* '"' ;
But when I run ANTLR4 on it, I get:
error(63): CSV.g4:13:3: unknown attribute reference setText in $setText
make: *** [run] Error 1
Why is setText not supported in ANTLR4, and is there an alternative way to replace the text?
A couple of problems here:
First, you have to identify the receiver of the setText method. You probably want
field : TEXT { $TEXT.setText("BlahBlah"); }
| STRING
;
Second, setText is not defined in the Token class.
Typically, you create your own token class extending CommonToken, plus a corresponding token factory class, and set TokenLabelType (in the options block) to your token class name. The setText method in CommonToken will then be visible.
tl;dr:
Given the following grammar (derived from original CSV.g4 sample and grammar attempt of OP (cf. question)):
grammar CSVBlindText;
@header {
import java.util.*;
}
/** Derived from rule "file : hdr row+ ;" */
file
locals [int i=0]
: hdr ( rows+=row[$hdr.text.split(",")] {$i++;} )+
{
System.out.println($i+" rows");
for (RowContext r : $rows) {
System.out.println("row token interval: "+r.getSourceInterval());
}
}
;
hdr : row[null] {System.out.println("header: '"+$text.trim()+"'");} ;
/** Derived from rule "row : field (',' field)* '\r'? '\n' ;" */
row[String[] columns] returns [Map<String,String> values]
locals [int col=0]
@init {
$values = new HashMap<String,String>();
}
@after {
if ($values!=null && $values.size()>0) {
System.out.println("values = "+$values);
}
}
// rule row cont'd...
: field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
( ',' field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
)* '\r'? '\n'
;
field
: TEXT
| STRING
|
;
TEXT : ~[',\n\r"]+ {setText( "BlahBlah" );} ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
One has:
$> antlr4 -no-listener CSVBlindText.g4
$> grep setText CSVBlindText*java
CSVBlindTextLexer.java: setText( "BlahBlah" );
Compiling it works flawlessly:
$> javac CSVBlindText*.java
Testdata (the users.csv file just renamed):
$> cat blinded_by_grammar.csv
User, Name, Dept
parrt, Terence, 101
tombu, Tom, 020
bke, Kevin, 008
Yields in test:
$> grun CSVBlindText file blinded_by_grammar.csv
header: 'BlahBlah,BlahBlah,BlahBlah'
values = {BlahBlah=BlahBlah}
values = {BlahBlah=BlahBlah}
values = {BlahBlah=BlahBlah}
3 rows
row token interval: 6..11
row token interval: 12..17
row token interval: 18..23
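So it looks as if setText() belongs in a lexer rule action, just before the rule's closing semicolon (here in TEXT, where it resolves to the generated lexer's own setText method), and not between the alternatives of a parser rule (wild guessing here ;-)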
Previous iterations below:
Just guessing, as I 1) have no working ANTLR4 available at the moment and 2) have not written ANTLR4 grammars for quite some time: maybe it works without the dollar sign ($)?
grammar CSV;
file : hdr row* row1;
hdr : row;
row : field (',' value1=field)* '\r'? '\n'; // '\r' is optional at the end of a row of CSV file ..
row1 : field (',' field)* '\r'? '\n'?;
field
: TEXT
{
setText("BlahBlah");
}
| STRING
|
;
TEXT : ~[,\n\r"]+ ;
STRING : '"' ('""' | ~'"')* '"' ;
Update: Now that ANTLR 4.5.2 (at least via brew, rather than 4.5.3) is available to me, I dug into this. In answer to a comment from the OP below: setText() will be generated into the lexer's Java source if the grammar is well defined. Unfortunately, debugging ANTLR4 grammars for a dilettante like me is ... but it is nevertheless a very nice language construction kit, IMO.
Sample session:
$> antlr4 -no-listener CSV.g4
$> grep setText CSVLexer.java
setText( String.valueOf(getText().charAt(1)) );
The grammar used:
(hacked up from example code retrieved via:
curl -O http://media.pragprog.com/titles/tpantlr2/code/tpantlr2-code.tgz )
grammar CSV;
@header {
import java.util.*;
}
/** Derived from rule "file : hdr row+ ;" */
file
locals [int i=0]
: hdr ( rows+=row[$hdr.text.split(",")] {$i++;} )+
{
System.out.println($i+" rows");
for (RowContext r : $rows) {
System.out.println("row token interval: "+r.getSourceInterval());
}
}
;
hdr : row[null] {System.out.println("header: '"+$text.trim()+"'");} ;
/** Derived from rule "row : field (',' field)* '\r'? '\n' ;" */
row[String[] columns] returns [Map<String,String> values]
locals [int col=0]
@init {
$values = new HashMap<String,String>();
}
@after {
if ($values!=null && $values.size()>0) {
System.out.println("values = "+$values);
}
}
// rule row cont'd...
: field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
( ',' field
{
if ($columns!=null) {
$values.put($columns[$col++].trim(), $field.text.trim());
}
}
)* '\r'? '\n'
;
field
: TEXT
| STRING
| CHAR
|
;
TEXT : ~[',\n\r"]+ ;
STRING : '"' ('""'|~'"')* '"' ; // quote-quote is an escaped quote
/** Convert 3-char 'x' input sequence to string x */
CHAR: '\'' . '\'' {setText( String.valueOf(getText().charAt(1)) );} ;
Compiling works:
$> javac CSV*.java
Now test with a matching weird csv file:
a,b
"y",'4'
As:
$> grun CSV file foo.csv
line 1:0 no viable alternative at input 'a'
line 1:2 no viable alternative at input 'b'
header: 'a,b'
values = {a="y", b=4}
1 rows
row token interval: 4..7
So in conclusion, I suggest to rework the logic of the grammar (I presume inserting "BlahBlahBlah" was not essential but a mere debugging hack).
And citing http://www.antlr.org/support.html :
ANTLR Discussions
Please do not start discussions at stackoverflow. They have asked us to
steer discussions (i.e., non-questions/answers) away from Stackoverflow; we
have a discussion forum at Google specifically for that:
https://groups.google.com/forum/#!forum/antlr-discussion
We can discuss ANTLR project features, direction, and generally argue about
whatever we want at the google discussion forum.
I hope this helps.

Is it possible to keep track of precedent Tokens to resolve ambiguities in ANTLR4?

I'm getting started with ANTLR4. What I want is to recognize this format while performing some action according to the token read.
What I'm trying to produce:
IDENTIFIER:Test1 ([a-zA-Z09]{10})
{insert 'Test1' in personId column}
CODE: F0101F
FULL_NAME: FIRST_NAME ( [A-Z]+)LAST_NAME ( [A-Z]+ )
{insert FIRST_NAME.value in firstName column and insert LAST_NAME.value in
lastName column}
ADRESS: DIGIT+ STREET_NAME ([A-Z]+)
{insert STREET_NAME.value in streetName column }
OTHER_INFORMATION: ([A-Z]+)
{insert OTHER_INFORMATION.value in other column}
What I did:
prod
:
read_information+
;
read_information
:
{getCurrentToken().getType()== ID }?
idElement
|
{getCurrentToken().getType()== CODE }?
codeElement
|
{getCurrentToken().getType()== FULLNAME}?
fullNameElement
|
{getCurrentToken().getType()== STREET}?
streetElement
|
{getCurrentToken().getType()== OTHER}?
otherElement
;
codeElement
:
CODE
{getCurrentToken().getText().matches("[A-F0-9]{6}")}?
codeInformation
|
{/*throw someException*/}
;
codeInformation
:
HEXCODE
;
HEXCODE
:
[a-fA-F0-9]+
;
CODE
:
'CODE:'
;
otherElement
:
OTHER otherInformation
;
otherInformation
:
STR
;
OTHER
:
'OTHER:'
;
streetElement
:
STREET streetInformation
;
STREET
:
'STREET:'
;
streetInformation
:
STR
;
STR
:
[a-zA-Z0-9]+
;
WORD
:
[a-zA-Z]+
;
fullNameElement
:
FULLNAME firstNameInformation lastNameInformation
;
FULLNAME
:
'FULL_NAME:'
;
firstNameInformation
:
WORD
;
lastNameInformation
:
WORD
;
idElement
:
ID idInformation
;
ID
:
'ID:'
;
idInformation
:
{getCurrentToken().getText().length()<=10}?
STR
;
I'm not sure if this is the right approach, since I have problems reading the WORD token.
Since all the tokens basically have the same format, I'm trying to find a way to keep track of the preceding token or context to resolve the ambiguity, and to check the format at the same time (for example, throw an exception if a value is longer than 10 characters).
To find out which rules the generated parser enters (i.e. which contexts are visited), you could have ANTLR generate visitors. There is a great explanation of this elsewhere (see Bart Kiers' response).
Generally, if two rules are identical, you can merge them into one and label each use. For example, these rules:
firstNameInformation
:
WORD
;
lastNameInformation
:
WORD
;
there is no reason to actually have them. Instead, you could write the grammar for the full name this way:
fullNameElement
:
FULLNAME firstname=WORD lastname=WORD
;
In that case, you only use the WORD token, but you label each occurrence so you can distinguish between them when doing a tree walk.

Search a section of text, which occurs regularly in tcl

I have a file called log_file with following text:
....some text....
line wire (1)
mode : 2pair , annex : a
coding : abcd
rate : 1024
status : up
....some text....
line wire (2)
mode : 4pair , annex : b
coding : xyz
rate : 1024
status : down
....some text....
The values may differ but the attributes are always the same. Is there a way to find each line wire and display their attributes? The number of line wires also may differ.
EDIT: The file doesn't have any blank lines. There are more attributes, but only these are needed. Can I just take the first "n" lines instead of matching every line? I.e., if there is a line wire (1), copy that line plus the next 4 lines.
I am also copying the matched lines to an output file $fout, which I used earlier in the script with the same $line.
Given your sample:
set fh [open log_file r]
while {[gets $fh line] != -1} {
    switch -glob -- $line {
        {line wire*} {puts $line}
        {mode : *} -
        {coding : *} -
        {rate : *} -
        {status : *} {puts " $line"}
    }
}
close $fh
outputs
line wire (1)
mode : 2pair , annex : a
coding : abcd
rate : 1024
status : up
line wire (2)
mode : 4pair , annex : b
coding : xyz
rate : 1024
status : down
Edit: print the next "n" lines following the "line wire" line to a file
set in [open log_file r]
set out [open log_file_filtered w]
set n 4
while {[gets $in line] != -1} {
    if {[string match {line wire*} $line]} {
        # write the header line, then the next $n lines, to the output file
        puts $out $line
        for {set i 1} {$i <= $n} {incr i} {
            if {[gets $in line] != -1} {
                puts $out " $line"
            }
        }
    }
}
close $in
close $out

Can anyone help me convert this ANTLR 2.0 grammar file to ANTLR 3.0 syntax?

I've converted the 'easy' parts (fragment, @header and @members
declarations, etc.), but since I'm new to ANTLR I have a really hard
time converting the tree statements and the like.
I use the following migration guide.
The grammar file can be found here....
Below are some examples where I run into problems.
For instance, I have problems with:
n3Directive0!:
d:AT_PREFIX ns:nsprefix u:uriref
{directive(#d, #ns, #u);}
;
or
propertyList![AST subj]
: NAME_OP! anonnode[subj] propertyList[subj]
| propValue[subj] (SEMI propertyList[subj])?
| // void : allows for [ :a :b ] and empty list "; .".
;
propValue [AST subj]
: v1:verb objectList[subj, #v1]
// Reverse the subject and object
| v2:verbReverse subjectList[subj, #v2]
;
subjectList![AST oldSub, AST prop]
: obj:item { emitQuad(#obj, prop, oldSub) ; }
(COMMA subjectList[oldSub, prop])? ;
objectList! [AST subj, AST prop]
: obj:item { emitQuad(subj,prop,#obj) ; }
(COMMA objectList[subj, prop])?
| // Allows for empty list ", ."
;
n3Directive0!:
d=AT_PREFIX ns=nsprefix u=uriref
{directive($d, $ns, $u);}
;
You have to use '=' for assignments.
Tokens can then be used as '$tokenname.getText()', ...
Rule results can then be used in your code as 'rulename.result'.
If your rules declare named results, you have to use those names instead of 'result'.
