how to write a PLY grammar to parse paths? - parsing

I'm trying to write a grammar with PLY that will parse paths in a file. I'm running into shift/reduce conflicts and I'm not sure how to change the grammar to fix them.
Here's an example of the file I'm trying to parse. The path/filename can be any acceptable Linux path.
file : ../../dir/filename.txt
file : filename.txt
file : filename
So here is the grammar that I wrote.
header : ID COLON path
path : pathexpr filename
pathexpr : PERIOD PERIOD DIVIDE pathexpr
| PERIOD DIVIDE pathexpr
| ID DIVIDE pathexpr
|
filename : ID PERIOD ID
| ID
Here are my tokens. I am using the ctokens library included with PLY, just to save the effort of writing my own.
t_ID = r'[A-Za-z_][A-Za-z0-9_]*'
t_PERIOD = r'\.'
t_DIVIDE = r'/'
t_COLON = r':'
So I believe there is a shift/reduce conflict in the "filename" rule, because the parser doesn't know whether to reduce a lone ID or to shift, expecting ID PERIOD ID. I think there is another issue in the no-path case ("filename"), where pathexpr will consume the ID token instead of reducing to empty.
How can I fix my grammar to handle these cases? Maybe I need to change my tokens?

The simple solution: Use left-recursion instead of right-recursion.
LR parsers (like PLY and yacc) prefer left-recursion, because it avoids having to expand the parser stack. It is also usually closer to the semantics of the expression -- which is useful when you want to actually interpret the language, and not just recognize it -- and it often, as in this case, avoids the need to left-factor.
In this case, for example, each path segment needs to be applied to the preceding pathexpr, by looking up the segment directory inside the directory found so far. The parser action is clear: look up $2 in $1. How would you write the action for the right-recursive version?
So, a simple transformation:
header : ID COLON path
path : pathexpr filename
pathexpr : pathexpr PERIOD PERIOD DIVIDE
| pathexpr PERIOD DIVIDE
| pathexpr ID DIVIDE
|
filename : ID PERIOD ID
| ID
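To see why the left-recursive version's actions fall out naturally, here is a small plain-Python sketch (not PLY): each reduction of a left-recursive pathexpr rule is one step of a left fold, resolving the new segment ($2) against the path built so far ($1). The starting directory and the resolve logic are illustrative assumptions.

```python
from functools import reduce

# Each reduction of the left-recursive rule "pathexpr : pathexpr ID DIVIDE"
# is one step of a left fold: resolve the new segment ($2) against the
# path built so far ($1). The starting directory below is hypothetical.
def resolve(dirs, segment):
    if segment == '..':          # pathexpr PERIOD PERIOD DIVIDE
        return dirs[:-1]         # go up one level
    if segment == '.':           # pathexpr PERIOD DIVIDE
        return dirs              # stay in place
    return dirs + [segment]     # pathexpr ID DIVIDE: descend

segments = ['..', '..', 'dir']   # from "../../dir/filename.txt"
path = reduce(resolve, segments, ['home', 'user', 'proj'])
print('/'.join(path))            # home/dir
```

With the right-recursive grammar, the innermost reduction happens first, so there is no "path so far" to look the segment up in; the left-recursive version hands each action exactly the context it needs.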

Looking at those "t_xxx" names, I think you are using PLY, not pyparsing. But here is a pyparsing solution to your problem, with helpful comments:
"""
header : ID COLON path
path : pathexpr filename
pathexpr : PERIOD PERIOD DIVIDE pathexpr
| PERIOD DIVIDE pathexpr
| ID DIVIDE pathexpr
|
filename : ID PERIOD ID
| ID
"""
from pyparsing import *
ID = Word(alphas + '_', alphanums + '_')
PERIOD = Literal('.')
DIVIDE = Literal('/')
COLON = Literal(':')
# move this to the top, so we can reference it in a negative
# lookahead while parsing the path
file_name = ID + Optional(PERIOD + ID)
# simple path_element - not sufficient, as it will consume
# trailing ID that should really be part of the filename
path_element = PERIOD+PERIOD | PERIOD | ID
# more complex path_element - adds lookahead to avoid consuming
# filename as a part of the path
path_element = (~(file_name + WordEnd())) + (PERIOD+PERIOD | PERIOD | ID)
# use repetition for this kind of expression, not recursion
path_expr = path_element + ZeroOrMore(DIVIDE + path_element)
# use Combine so that all the tokens will get returned as a
# contiguous string, not as separate path_elements and slashes
path = Combine(Optional(path_expr + DIVIDE) + file_name)
# define header - note the use of results names, which will allow
# you to access the separate fields by name instead of by position
# (similar to using named groups in regexp's)
header = ID("id") + COLON + path("path")
tests = """\
file: ../../dir/filename.txt
file: filename.txt
file: filename""".splitlines()
for t in tests:
    print(t)
    print(header.parseString(t).dump())
    print()
prints
file: ../../dir/filename.txt
['file', ':', '../../dir/filename.txt']
- id: file
- path: ../../dir/filename.txt
file: filename.txt
['file', ':', 'filename.txt']
- id: file
- path: filename.txt
file: filename
['file', ':', 'filename']
- id: file
- path: filename

I believe this grammar should work, and it has the added advantage of recognizing the parts of the path, like the extension, directories, drive, etc.
I've not made the parser yet, only this grammar.
fullfilepath : path SLASH filename
path : root
| root SLASH directories
root : DRIVE
| PERCENT WIN_DEF_DIR PERCENT
directories : directory
| directory SLASH directories
directory : VALIDNAME
filename : VALIDNAME
| VALIDNAME DOT EXTENSION
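As a rough sanity check, the grammar above can be transcribed into a single regular expression. Assumptions on my part: DRIVE is a letter plus colon (e.g. C:), SLASH is /, and VALIDNAME, WIN_DEF_DIR, and EXTENSION are plain identifiers.

```python
import re

# A rough regex transcription of the grammar above, for sanity checking.
# Assumed token shapes: DRIVE = letter plus colon, SLASH = "/",
# VALIDNAME/WIN_DEF_DIR/EXTENSION = identifiers.
VALIDNAME = r'[A-Za-z_][A-Za-z0-9_]*'
ROOT      = rf'(?:[A-Za-z]:|%{VALIDNAME}%)'    # root : DRIVE | PERCENT WIN_DEF_DIR PERCENT
FILENAME  = rf'{VALIDNAME}(?:\.{VALIDNAME})?'  # filename : VALIDNAME (DOT EXTENSION)?
FULLPATH  = re.compile(rf'^{ROOT}(?:/{VALIDNAME})*/{FILENAME}$')

print(bool(FULLPATH.match('C:/dir1/dir2/file.txt')))  # True
print(bool(FULLPATH.match('%WINDIR%/file')))          # True
print(bool(FULLPATH.match('dir/file.txt')))           # False - no root
```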

Related

What is the best way to handle overlapping lexer patterns that are sensitive to context?

I'm attempting to write an ANTLR grammar for parsing the C4 DSL. However, the DSL has a number of places where the grammar is very open-ended, resulting in overlapping lexer rules (in the sense that multiple token rules match).
For example, the workspace rule can have a child properties element defining <name> <value> pairs. This is a valid file:
workspace "Name" "Description" {
properties {
xyz "a string property"
nonstring nodoublequotes
}
}
The issue I'm running into is that the rules for the <name> and <value> have to be very broad, basically anything except whitespace. Also, property values containing spaces use double quotes and so will match my STRING token.
My current solution is the grammar below, using property_element: BLOB | STRING; to match values and BLOB to match names. Is there a better way here? If I could make context-sensitive lexer tokens, I would make NAME and VALUE tokens instead. In the actual grammar I define case-insensitive name tokens for things like workspace and properties. This allows me to easily match the existing DSL semantics, but raises the wrinkle that a property name or value of workspace will tokenize to K_WORKSPACE.
grammar c4mce;
workspace : 'workspace' (STRING (STRING)?)? '{' NL workspace_body '}';
workspace_body : (workspace_element NL)* ;
workspace_element: 'properties' '{' NL (property_element NL)* '}';
property_element: BLOB property_value;
property_value : BLOB | STRING;
BLOB: [\p{Alpha}]+;
STRING: '"' (~('\n' | '\r' | '"' | '\\') | '\\\\' | '\\"')* '"';
NL: '\r'? '\n';
WS: [ \t]+ -> skip;
This tokenizes to
[#0,0:8='workspace',<'workspace'>,1:0]
[#1,10:15='"Name"',<STRING>,1:10]
[#2,17:29='"Description"',<STRING>,1:17]
[#3,31:31='{',<'{'>,1:31]
[#4,32:32='\n',<NL>,1:32]
[#5,37:46='properties',<'properties'>,2:4]
[#6,48:48='{',<'{'>,2:15]
[#7,49:49='\n',<NL>,2:16]
[#8,58:60='xyz',<BLOB>,3:8]
[#9,62:80='"a string property"',<STRING>,3:12]
[#10,81:81='\n',<NL>,3:31]
[#11,90:98='nonstring',<BLOB>,4:8]
[#12,100:113='nodoublequotes',<BLOB>,4:18]
[#13,114:114='\n',<NL>,4:32]
[#14,119:119='}',<'}'>,5:4]
[#15,120:120='\n',<NL>,5:5]
[#16,121:121='}',<'}'>,6:0]
[#17,122:122='\n',<NL>,6:1]
[#18,123:122='<EOF>',<EOF>,7:0]
This is all fine, and I suppose it's as much as the DSL grammar gives me. Is there a better way to handle situations like this?
As I expand the grammar I expect to have a lot of BLOB tokens simply because creating a narrower token in the lexer would be pointless because BLOB would match instead.
This is the classic keywords-as-identifiers problem. If a specific character combination that is lexed as a keyword should also be usable as a normal identifier in certain places, then you have to list that keyword as a possible alternative. For example:
property_element: (BLOB | K_WORKSPACE) property_value;
property_value : BLOB | STRING | K_WORKSPACE;
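The effect can be sketched outside ANTLR in a few lines of plain Python (the token names mirror the grammar above; the two-token property shape is a simplifying assumption):

```python
# A minimal sketch in plain Python (not ANTLR) of the same idea: the
# lexer unconditionally turns "workspace" into K_WORKSPACE, so the
# parser must accept K_WORKSPACE wherever a plain name is legal.
KEYWORDS = {'workspace': 'K_WORKSPACE', 'properties': 'K_PROPERTIES'}

def lex(text):
    # every non-keyword word becomes a BLOB, as in the grammar above
    return [(KEYWORDS.get(w, 'BLOB'), w) for w in text.split()]

# property_element: (BLOB | K_WORKSPACE) property_value ;
NAME_TOKENS = {'BLOB', 'K_WORKSPACE'}   # keyword listed as an alternative

def parse_property(tokens):
    (kind, name), (_, value) = tokens
    if kind not in NAME_TOKENS:
        raise SyntaxError(f'{name!r} cannot be a property name')
    return name, value

print(parse_property(lex('workspace something')))  # keyword accepted as a name
```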

Cannot resolve the following reduce-reduce error (LALR parser)

I am currently implementing part of the grammar of the Decaf programming language. Here is the relevant snippet of bison code:
type:
INT
| ID
| type LS RS
;
local_var_decl:
type ID SEMICOLON
;
name:
THIS
| ID
| name DOT ID
| name LS expression RS
;
Nevertheless, as soon as I started working on the name production rule, my parser gives a reduce-reduce warning.
Here what it's inside the .output file (generated by bison):
State 84
23 type: ID .
61 name: ID .
ID reduce using rule 23 (type)
LS reduce using rule 23 (type)
LS [reduce using rule 61 (name)]
$default reduce using rule 61 (name)
So, if we give the following input, { abc[1] = abc; }, it reports syntax error, unexpected NUMBER, expecting RS. The NUMBER should come from the expression rule (that is how it ought to be parsed), but the parser instead tries to parse it through the local_var_decl rule.
What do you think should be changed in order to solve this problem? Spent around 2 hours, tried different stuff, did not work.
Thank you!!
PS. Here is the link to the full .y source code
This is a specific instance of a common problem where the parser is being forced to make a decision before it has enough information. In some cases, such as this one, the information needed is not far away, and it would be sufficient to increase the lookahead, if that were possible. (Unfortunately, few parser generators produce LR(k) parsers with k > 1, and bison is no exception.) The usual solution is to simply allow the parse to continue without having to decide. Another solution with bison (but only in C mode) is to ask for a %glr-parser, which is much more flexible about when reductions need to be resolved, at the cost of additional processing time.
In this case, the context allows either a type or a name, both of which can start with an ID followed by a [ (LS). In the case of a name, the [ must be followed by an expression (here, a number); in the case of a type, the [ must be followed by a ]. So if we could see the second token after the ID, we could immediately decide.
But we can only see one token ahead, which is the [. And the grammar insists that we be able to make an immediate decision, because in one case we must reduce the ID to a name and in the other case to a type. So we have a reduce-reduce conflict, which bison resolves by always using whichever reduction comes first in the grammar file.
One solution is to avoid forcing this choice, at the cost of duplicating productions. For example:
type_other:
INT
| ID LS RS
| type_other LS RS
;
type: ID
| type_other
;
name_other:
THIS
| ID LS expression RS
| name_other DOT ID
| name_other LS expression RS
;
name: ID
| name_other
;

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. When a parameter in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
My current approach for getting this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, in the token stream - no QUOTE_SEPARATOR tokens show up and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS, and since it comes first in the grammar, it would be selected; but it behaves as if WS matched, and the characters were skipped and not sent to the token stream.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters); this way just seemed easier, and it should be easy enough for the program that will process the parse tree to deal with, as other manipulations to the data will be done there anyway (my first grammar - teach me ;) ).
No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string.
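The selection policy itself can be illustrated with a toy lexer in plain Python (an illustrative sketch, not ANTLR's actual implementation): the rule matching the longest input wins, ties go to the rule listed first, and skipping only discards the token after that choice is made.

```python
import re

# A toy lexer mimicking ANTLR's lexing policy: the rule matching the
# longest input wins, and ties go to the rule listed first. "-> skip"
# only discards the chosen token; it never changes which rule is chosen.
RULES = [
    ('QUOTED_PART',     r'"[^"]*"'),
    ('QUOTE_SEPARATOR', r'\\r\\n\\t'),   # the literal five characters \r\n\t
    ('WS',              r'[ \t\r\n]'),
]

def lex(text):
    pos, out = 0, []
    while pos < len(text):
        candidates = []
        for name, pattern in RULES:
            m = re.match(pattern, text[pos:])
            if m:
                candidates.append((name, m.group()))
        # longest match wins; max() keeps the first entry on ties,
        # i.e. the rule listed first
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        if name != 'WS':                 # skip discards, after selection
            out.append((name, lexeme))
        pos += len(lexeme)
    return out

print(lex(r'"stuff"\r\n\t"more"'))       # QUOTE_SEPARATOR now appears
```

With the corrected five-character pattern, QUOTE_SEPARATOR matches more text than WS ever could at that position, so it wins regardless of skip.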

How to resolve Xtext variables' names and keywords statically?

I have a grammar describing an assembler dialect. In the code section, a programmer can refer to registers from a certain list and to defined variables. I also have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Splitting the rules, I've noticed that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, Xtext writes a number of errors saying that registers from the list could be processed ambiguously. So:
1) Am I right that the rule fragment v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name, including an empty one, but the rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that when I try to compile the separate rules together, Xtext thinks that [IntegerVariableDeclaration] can conflict with Register?
3) How to resolve this ambiguity?
edit: definitions
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standard terminal which parses a single word, i.e. an identifier
No, (Register|[IntegerVariableDeclaration]) can't match an empty string. Actually, [IntegerVariableDeclaration] is the same as [IntegerVariableDeclaration|ID]; it matches the ID rule.
Yes, I think you can't split your rules.
I can't reproduce your problem (I'd need the full grammar), but in order to solve it you should look at this article about Xtext grammar debugging:
Compile grammar in debug mode by adding the following line into your workflow.mwe2
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open the generated ANTLR debug grammar with ANTLRWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that a catch-all rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One then needs to check dynamically whether AnyId.name is a Register, a Variable, or something else like a Constant.

Can . (period) be part of the path part of an URL?

Is the following URL valid?
http://www.example.com/module.php/lib/lib.php
According to https://www.rfc-editor.org/rfc/rfc1738, the hpath element of a URL cannot contain a '.' (period). There is in the above case a '.' after "module", which would not be allowed according to RFC 1738.
Am I reading the RFC wrong, or has this RFC been superseded by another? Some other RFCs allow '.' in URLs (https://www.rfc-editor.org/rfc/rfc1808).
I don't see where RFC1738 disallows periods (.) in URLs. Here are some excerpts from there:
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "#" | "&" | "=" ]
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
safe = "$" | "-" | "_" | "." | "+"
So the answer to your question is: Yes, http://www.example.com/module.php/lib/lib.php is a valid URL.
As others have noted, periods are allowed in URLs, but be careful. If a single or double period is used as a complete segment of a URL's path, it will be treated as a change in the path, and you may not get the behavior you want.
For example:
www.example.com/foo/./ redirects to www.example.com/foo/
www.example.com/foo/../ redirects to www.example.com/
Whereas the following will not redirect:
www.example.com/foo/bar.biz/
www.example.com/foo/..biz/
www.example.com/foo/biz../
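These resolution rules are what RFC 3986 calls remove_dot_segments; Python's urllib.parse applies them during reference resolution, so the behavior above can be checked directly:

```python
from urllib.parse import urlparse, urljoin

# A period inside a path segment is kept verbatim:
p = urlparse('http://www.example.com/module.php/lib/lib.php')
print(p.path)   # /module.php/lib/lib.php

# But "." and ".." *segments* are resolved away when a reference is
# resolved against a base URL (RFC 3986 remove_dot_segments):
print(urljoin('http://www.example.com/foo/x', './'))    # http://www.example.com/foo/
print(urljoin('http://www.example.com/foo/x', '../'))   # http://www.example.com/

# Dots that are merely part of a segment name are untouched:
print(urljoin('http://www.example.com/', 'foo/bar.biz/'))
```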
Periods are allowed. See section "2.3 Unreserved Characters" in this document:
https://www.rfc-editor.org/rfc/rfc3986
"Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde".
Nothing wrong with a period in a URL. If you look at the grammar in the link you provided, a period is permitted via the 'safe' group, which is included via uchar.
Ignore my answer, Adam's is better.
