Understanding ABNF syntax "0<pchar>" - parsing

In RFC 3986, they defined the rule:
path-empty = 0<pchar>
For simplicity, let's assume pchar is defined:
pchar = 'a' / 'b' / 'c'
What does path-empty match and how is it matched?
I've read the Wikipedia page on ABNF. My guess is that matches the empty string (regex ^(?![\s\S])). If that is the case, why even reference pchar? Is there not a simpler way to match the empty string in ABNF syntax without referencing another rule?
How could this be translated to ANTLR4?

Yes, you are correct. path-empty derives the empty string.
In ABNF, the right-hand side of a rule must contain an element, which will be anything other than spaces, newlines, and comments. See rfc5234, page 10. Given this syntax, there are several ways to define the empty string. path-empty = 0<pchar> is one way. It means "exactly zero of <pchar>". But, path-empty = "" and path-empty = 0pchar would also work. ABNF does not define whether one is preferred over the other.
Note, the rfc3986 spec uses a prose-val, i.e., <pchar> instead of the "rulename" pchar (or even "", 0pchar, or just <empty string>). It's unclear why, but it has the same effect--path-empty derives the empty string. But, <pchar> is not the same as pchar. A prose value is a "last resort" way to add a non-formal description of the rule.
In Antlr4, the rule would just be path_empty : ;. Note, Antlr has a different naming convention that defines a strict boundary between lexer and parser. ABNF does not have this distinction. In fact, this grammar could be converted to a single Antlr lexer grammar, an exercise in understanding the power of Antlr lexers.

Related

how to tokenize input based on a set of ABNF grammar rules

I have read the RFC on the ABNF specification and am having difficulty understanding how a set of ABNF rules could be used to reliably extract tokens from some input string that matches the grammar. It seems that the specification doesn't ever mention tokens or ASTs, so it may not concern itself with that, but I believe that would be the ultimate goal of applying any BNF grammar, unless I am mistaken.
In the specification, they list example rules for parsing a postal-address:
postal-address = name-part street zip-part
name-part = *(personal-part SP) last-name [SP suffix] CRLF
name-part =/ personal-part CRLF
personal-part = first-name / (initial ".")
first-name = *ALPHA
initial = ALPHA
last-name = *ALPHA
suffix = ("Jr." / "Sr." / 1*("I" / "V" / "X"))
street = [apt SP] house-num SP street-name CRLF
apt = 1*4DIGIT
house-num = 1*8(DIGIT / ALPHA)
street-name = 1*VCHAR
zip-part = town-name "," SP state 1*2SP zip-code CRLF
town-name = 1*(ALPHA / SP)
state = 2ALPHA
zip-code = 5DIGIT ["-" 4DIGIT]
There is also a list of core rules that I won't post here describing expected common-usage rules.
Ultimately, what I would like to do is figure out the rules necessary for taking the input
John H. Doe
12345 Fakestreet
Springfield, IL 55555
and generating what I believe would be the correct token sequence which is:
["John", " ", "H", ".", "Doe", "\r\n",
"12345", " ", "Fakestreet", "\r\n",
"Springfield", ",", " ", "IL", " ", "55555", "\r\n"]
(I believe the spaces and CRLFs need to be returned as "tokens" because they are specified as requirements in certain rules)
Some problems I am considering:
It makes sense that "Fakestreet" should be its own token, but according to the definition it is a variable repetition of the visible-character core rule. Ideally I would not like to read out each letter as its own token ("F", "a", "k", and so on), so (assuming core-rules can be treated as terminals?) any potential token string would need to be checked against the entire, theoretically infinite, rule definition 1*VCHAR to see if it is a match. And some rules are more complicated than that, like zip-code's 5DIGIT ["-" 4DIGIT], but any potential token needs to be checked against this rule as well ("12345" and "12345-6789" are both valid tokens). So it seems like entire rule element concatenations need to be checked completely as well, unless "12345-6789" should rather be tokenized as ["12345", "-", "6789"] which... may be correct?
I'd assume we would not want to completely check rules that reference other rules, otherwise we may end up tokenizing the entire postal-address as a single token of type "postal-address". Maybe rules that reference other rules shouldn't be checked? Maybe there is such a thing as a "terminal-rule" that includes no rule refs (excluding core rules)?
Occasionally in the rules, terminal values are combined with rule references, for instance in the definition of "personal-part", the literal "." is defined. So, while we may not want to match any potential token string against the entire "personal-part" rule definition, it seems we do want to try to match it against the literal "." because it is a required token for parsing a personal-part. Maybe in non-terminal rules, terminal values listed there should be considered?
I realize this is a lengthy question, but it seems BNF supersets like EBNF and ABNF are being used for this kind of thing but I cannot find a standard specification for how to tokenize from ABNF grammar.

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered inbetween.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"ยง$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the british keyboard (German umlaute or French accents and all other special characters other languages have must to work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything inbetween '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .* non-greedy wildcard is followed in the rule with the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contents, called "lexical modes" in Antlr. There's a publically viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
using this these match correctly
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
Whereas I have marked TO|USING as optional in the rule.
From generated Java code I see...
When lexer notices T of TWO, it goes to match('TO')
but since does not find O after T
then generates failure.... and returns all the way from the rule -- hence not matching it.
How do I get my lexer rule to match, when input has word with chars starting with suffixed optional part of the rule
Basically I want my rule to match this also (beside what it already matches - as lised at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I approach/resolve this situation.
NOTE:
Such rules are recommended in the parser - But I have this in lexer - because I do not want to parse the entire input by the parser, and want to parse only content of interest. So using such rules in lexer, I locate sections which I really want to parse by the parser.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior - use syntactic predicate
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
Tokens match a complete char sequence or none. It cannot match partially and the grammar rule determines which exactly. You cannot expect a rule for TO to match TWO. If you want TWO to match too you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A
syntactic predicate is a kinda lookahead to guide the parser in case
of ambiquities. There are no ambiquities involved here.
Writing
the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and
not flexible. A lexer rule should not be used for entire sentences,
but only for a small set of characters to find tokens to assign them
a type (usually elementary structures of a language like string,
number, comment etc.).
ANTLR3 is totally outdated and I wonder why this is still used in your class. ANTLR4 is out since 5 years and should be the choice for any new project.

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer is "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that it would probably help to know why you want to skip individual input letters in your parser. Looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTlr is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc - depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/'
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTlr 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15) that specifically deals with Unicode.

Can any path segments of a URI have a query component?

According to the Section 3.3, Path Component of RFC2396 - Uniform Resource Identifiers,
The path may consist of a sequence of path segments separated by a single slash "/" character. Within a path segment, the characters "/", ";", "=", and "?" are reserved. Each path segment may include a sequence of parameters, indicated by the semicolon ";" character. The parameters are not significant to the parsing of relative references.
However, I have never seen a URL with a query parameters in any segment other than the final one. So, I am not sure if I am reading this correctly.
Is http://www.url.com/segment1?seg1param1=val1/page.html?pageparam1=val2 a valid URL?
What the RFC is referring to is something like this:
http://www.example.com/foo/bar;param=value/baz.html
That could be interpreted as the path /foo/bar/baz.html with the parameter param=value to the bar segment. No question marks are used.
Note that RFC 2396 has been obsoleted by RFC 3986, which omits specification of segment-specific parameters in favor of a general note that implementations can (and do) do different things to embed segment-specific parameters:
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.
When you look at the grammar which is just below, it is written:
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "#" | "&" | "=" | "+" | "$" | ","
A segment is composed of pchar and param, param being itself a pchar.
When we continue to read, there is absolutely no "?" character in the pchar character components. So the parameters cannot have any "?", and there cannot be a "?" in segments.
So I agree with the answer of Edward Thomson, who says that "?" only delimit the query segment, and cannot be used inside a path.
According to my reading of RFC 2396, no. The ? is a reserved character and serves only to delimit the query segment. The ? is not allowed in either the path or the query segment.
In your example, the first ? marks the beginning of the query segment. The second ? is inside the query segment, and is disallowed.
I believe you could do a get with that and most web servers would process it but I don't believe you would get the results you are expecting. That is the pageparam1=val2 would not evaluate.
If you want parameters like that you could always use the # symbol (as a lot of javascript based GUIs do now).

Resources