antlr html pcdata - html-parsing

I'm trying to write a very simple HTML parser with ANTLR, and I'm facing a problem: the ~ operator, which should match everything up to a specified character, is not working.
My lexer grammar:
lexer grammar HtmlParserLexer;
HTML: OHTML PCDATA CHTML;
PCDATA :(~'<') ; //match all until <
OHTML: '<html>';
CHTML: '</html>';
I'm trying to match:
<html>foo bar</html>
Error from Eclipse ANTLR plugin Interpreter:
MismatchedTokenException: line 1:7 mismatched input UNKNOW expecting '<'
This means that my grammar ignores the PCDATA rule, and I don't know why.
Thanks in advance for your help.

The rule PCDATA :(~'<') ; matches a single character other than '<'. You'll need to repeat it once or more: PCDATA :(~'<')+ ; (notice the +).
You may also want to allow <html></html> (nothing in between <html> and </html>). In that case, you shouldn't change PCDATA :(~'<')+ ; into PCDATA :(~'<')* ;, but do this instead:
HTML: OHTML PCDATA? CHTML;
PCDATA : (~'<')+ ;
because you shouldn't create lexer rules that could potentially match an empty string.
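
The difference between the two forms of the rule can be seen with an ordinary regular expression. This is only an analogy, assuming Python's re module as a stand-in for the lexer rule; it is not ANTLR itself, and the sample text is made up for illustration.

```python
import re

# Sample input: what follows the opening <html> tag (illustrative only).
text = "foo bar</html>"

# Analogue of  PCDATA : (~'<') ;   -> exactly one character that is not '<'
single = re.match(r"[^<]", text)

# Analogue of  PCDATA : (~'<')+ ;  -> one or more characters that are not '<'
repeated = re.match(r"[^<]+", text)

print(repr(single.group()))
print(repr(repeated.group()))
```

With the single-character form only the first letter is consumed, so the next thing the rule sees is not the closing tag; the repeated form consumes everything up to the '<'.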

Related

How to differentiate between a parser and lexer rule in the following?

Sometimes I get a bit confused between a lexing rule vs. a parsing rule, and there's been a nice thread on it here. For example in the following:
value
: string CAST_OPERATOR type
;
string
: S_QUOTE STRING_VALUE S_QUOTE
;
# <-- what is this?
type
: 'date' | 'string'
;
STRING_VALUE
: [a-zA-Z0-9-]+
;
CAST_OPERATOR
: '::'
;
For the type -- this is either the string (or character stream) date or string. Should that be a lexing rule or a parsing rule? I suppose I could break it down even more into:
type
: DATE_TYPE | STRING_TYPE
;
DATE_TYPE
: 'date'
;
STRING_TYPE
: 'string'
;
But still I'm not quite sure which of the above is preferable, and why it would be so. The first two rules -- value and string seem clear to me to be parsing rules -- and the last two rules -- STRING_VALUE and CAST_OPERATOR seem clear to me to be lexing rules (only by intuition though, I could not give a proper explanation). So why would the type be one way or the other?
Literally the only practical difference I've found is that a lexing rule can include a character class and a parsing rule cannot.
Update: I suppose another thing is a lexing rule is terminal, it won't provide any subdivision of parts. For example in the following we can break down $55 into $ and 55:
But if we set the cost as a lexing rule, it will not break it down any further:
So basically a lexing rule is atomic and terminal, whereas a parsing rule is more like a molecule that consists of various parts (atoms) that can be seen within it. Is that a good description/understanding of it?

Your "Update" is on the right track. That's a definite distinction.
You also need to understand the ANTLR pipeline, i.e. that the stream of characters is processed by the Lexer rules to produce a stream of tokens (atoms, in your analogy). It does not do that with recursive descent rule matching, but rather attempts to match your input against all of the Lexer rules, where:
The rule that matches the longest sequence of input characters will "win"
In the event that multiple Lexer rules match the same length character sequence, then the rule that occurs first will "win"
Once you've got your stream of "atoms" (aka Tokens), ANTLR uses the parser rules (recursively from the start rule) to try to match sequences of tokens.
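
The two "win" rules above can be sketched as a toy tokenizer. This is a minimal Python model under stated assumptions, not how ANTLR is implemented: the RULES table and the tokenize function are hypothetical, and the rule names are borrowed from the question's grammar purely for illustration.

```python
import re

# Hypothetical rule table, listed in "grammar order" (earlier = higher priority).
RULES = [
    ("DATE_TYPE", r"date"),
    ("STRING_TYPE", r"string"),
    ("CAST_OPERATOR", r"::"),
    ("S_QUOTE", r"'"),
    ("STRING_VALUE", r"[a-zA-Z0-9-]+"),
]

def tokenize(text):
    """Try every rule at each position; longest match wins, ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the rule listed first is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("'dates'::date"))
```

Here "dates" becomes a STRING_VALUE because that rule matches five characters while DATE_TYPE matches only four, whereas the final "date" becomes DATE_TYPE because both rules match four characters and DATE_TYPE is listed first.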

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered in between.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"§$%&/()= etc...]+ but I really don't want to restrict it to parsing only characters available on a British keyboard (German umlauts, French accents and all the special characters other languages have must work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything in between '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?

Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .*? non-greedy wildcard is followed in the rule by the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contexts, called "lexical modes" in Antlr. There's a publicly viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.
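
The idea behind lexical modes can be sketched with a toy tokenizer in Python. This is only a conceptual model under stated assumptions: the MODES table, the mode names TEXT and TAG, the COLON and WORD token names and the tokenize_with_modes function are all hypothetical, and for simplicity it takes the first matching rule rather than the longest. In an ANTLR 4 lexer grammar the switching itself would be expressed with the mode, pushMode and popMode lexer commands.

```python
import re

# Each mode has its own rule list: (token name, pattern, mode to switch to or None).
MODES = {
    "TEXT": [
        ("BRACKET_OPEN", r"\{", "TAG"),   # '{' enters the tag mode
        ("STRING", r"[^{]+", None),       # everything else is plain text
    ],
    "TAG": [
        ("BRACKET_CLOSE", r"\}", "TEXT"), # '}' returns to plain text
        ("COLON", r":", None),
        ("WORD", r"[A-Za-z_]+", None),
    ],
}

def tokenize_with_modes(text):
    """Tokenize using only the rules of the currently active mode."""
    mode, pos, tokens = "TEXT", 0, []
    while pos < len(text):
        for name, pattern, next_mode in MODES[mode]:
            m = re.compile(pattern).match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                if next_mode:
                    mode = next_mode
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} in mode {mode}")
    return tokens

print(tokenize_with_modes("{player:name} says hi!"))
```

Because the catch-all STRING rule exists only in the TEXT mode, it cannot swallow the contents of a tag, and the tag rules cannot fire inside plain text.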

Need keywords to be recognized as such only in the correct places

I am new to Antlr and parsing, so this is a learning exercise for me.
I am trying to parse a language that allows free-format text in some locations. The free-format text may therefore be ANY word or words, including the keywords in the language - their location in the language's sentences defines them as keywords or free text.
In the following example, the first instance of "JOB" is a keyword; the second "JOB" is free-form text:
JOB=(JOB)
I have tried the following grammar, which avoids defining the language's keywords in lexer rules.
grammar Test;
test1 : 'JOB' EQ OPAREN (utext) CPAREN ;
utext : UNQUOTEDTEXT ;
COMMA : ',' ;
OPAREN : '(' ;
CPAREN : ')' ;
EQ : '=' ;
UNQUOTEDTEXT : ~[a-z,()\'\" \r\n\t]*? ;
SPC : [ \t]+ -> skip ;
I was hoping that by defining the keywords as string literals in the parser rules, as above, they would apply only in the location in which they were defined. This appears not to be the case. On testing the "test1" rule (with the Antlr4 plug-in in IDEA), using the example phrase shown above - "JOB=(JOB)" (without quotes) - as input, I get the following error message:
line 1:5 mismatched input 'JOB' expecting UNQUOTEDTEXT
So after creating an implicit token for 'JOB', it looks like Antlr uses that token in other points in the parser grammar, too, i.e. whenever it sees the 'JOB' string. To test this, I added another parser rule:
test2 : 'DATA' EQ OPAREN (utext) CPAREN ;
and tested with "DATA=(JOB)"
I got the following error (similar to before):
line 1:6 mismatched input 'JOB' expecting UNQUOTEDTEXT
Is there any way to ask Antlr to enforce the token recognition in the locations only where it is defined/introduced?
Thanks!

What you have is essentially a lake grammar, the opposite of an island grammar. A lake grammar is one in which you mostly have structured text and then lakes of stuff you don't care about. Generally the key is having some lexical sentinel that says "enter unstructured text area" and then "reenter structured text area". In your case it seems to be (...). ANTLR has the notion of a lexical mode, which is what you want in order to handle areas with different lexical structures. When you see a '(' you want to switch modes to some free-form area. When you see a ')' in that area you want to switch back to the default mode. Anyway, "mode" is your keyword here.

I had a similar problem with keywords that are sometimes only identifiers. I did it this way:
OnlySometimesAKeyword : 'value' ;
identifier
: Identifier // defined as usual
| maybeKeywords
;
maybeKeywords
: OnlySometimesAKeyword
// ...
;
In your parser rules simply use identifier instead of Identifier and you'll also be able to match the "maybe keywords". This will of course also match them in places where they will be keywords, but you could check this in the parser if necessary.

Antlr4 lexer predicate needed?

I am trying to parse this piece of text
:20: test :254:
aapje
:21: rest
...
:20: and :21: are special tags, because they start the line. :254: should be 'normal' text, as it does not start on a newline.
I would like the result to be
(20, 'test :254: \naapje')
(21, 'rest')
Lines are terminated using either \r\n or \n.
I started out trying to ignore the whitespace, but then I match the ':254:' tag as well. So I have to create something that uses the whitespace information.
What I would like to be able to do is something like this:
lexer grammar MT9740_lexer;
InTagNewLine : '\r\n' ~':';
ReadNewLine :'\r\n' ;
But the first would consume the ':'. How can I still generate these tokens? Or is there a smarter approach?

What I understand is that you're looking for some lexer rules that match the start of a line. This lexer rule should tokenize your :20: or :21: only when it appears at the start of a line:
SOL : {getCharPositionInLine() == 0}? ':' [0-9]+ ':' ;
Your parser rules can then look for this SOL token before parsing the rest of the line.
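
The effect of the column-0 check can be illustrated outside ANTLR with a short Python sketch. It is an analogy under stated assumptions, not the generated lexer: the '^' anchor with re.MULTILINE stands in for the getCharPositionInLine() == 0 predicate, the split_fields function is hypothetical, and the sample text is adapted from the question.

```python
import re

# Sample input adapted from the question (trailing whitespace is not significant here).
text = ":20: test :254:\naapje\n:21: rest\n"

# '^' with re.MULTILINE only matches at the start of a line,
# playing the role of the "column 0" predicate on the SOL rule.
SOL = re.compile(r"^:([0-9]+):", re.MULTILINE)

def split_fields(text):
    """Return (tag number, body) pairs, where a tag counts only at the start of a line."""
    matches = list(SOL.finditer(text))
    fields = []
    for i, m in enumerate(matches):
        # The body runs up to the next start-of-line tag, or to the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        fields.append((int(m.group(1)), body))
    return fields

print(split_fields(text))
```

The ':254:' in the first line is not treated as a tag because it does not start a line, so it stays inside the body of field 20.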

ANTLR charVocabulary error

I'm trying to write a Unicode grammar in ANTLR, but this always causes an error (snippet of the grammar):
grammar Expression;
options {
charVocabulary='\u000'..'\uFFFE';
}
parse
: exp EOF
;
exp
: 'a'
;
It always ends with the error: '\uFFFE' not expected ';'. How do I write a correct Unicode grammar, and what is the correct charVocabulary definition?
I'm using ANTLR 3.2, but newer versions produce the same error.

charVocabulary is an ANTLR v2 option, not available in ANTLR v3 grammars. All lexers generated from ANTLR v3 grammars accept characters in the range \u0000..\uFFFF (be sure to use the proper encoding while creating an ANTLRInputStream!).
When using ANTLRWorks, you can see this by defining a rule, Any, that matches any character:
Any : . ;
and you will see the following diagram being displayed in the lower part of ANTLRWorks:
