Code taken from ply.lex documentation: http://www.dabeaz.com/ply/ply.html#ply_nn6
reserved = {
    'if'    : 'IF',
    'then'  : 'THEN',
    'else'  : 'ELSE',
    'while' : 'WHILE',
    ...
}

tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
For the reserved words, we need to change the token type. Passing t.value to reserved.get() is understandable: it should return the token name from the second column of the reserved specification. But why are we also passing 'ID' to it? What does it mean, and what purpose does it serve?
The second parameter specifies the value to return should the key not exist in the dictionary. So in this case, if the value of t.value does not exist as a key in the reserved dictionary, the string 'ID' will be returned instead.
In other words, a.get(b, c) when a is a dict is roughly equivalent to a[b] if b in a else c (except it is presumably more efficient, as it would only look up the key once in the success case).
See the Python documentation for dict.get().
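A quick standalone illustration of that behaviour (plain Python, nothing ply-specific):

```python
# Reserved-word table in the style of the ply documentation.
reserved = {
    'if': 'IF',
    'then': 'THEN',
    'else': 'ELSE',
    'while': 'WHILE',
}

# Known key: the mapped token type is returned.
print(reserved.get('while', 'ID'))    # WHILE

# Unknown key: the default 'ID' is returned instead of raising KeyError.
print(reserved.get('counter', 'ID'))  # ID

# Roughly equivalent long-hand form:
word = 'counter'
token_type = reserved[word] if word in reserved else 'ID'
print(token_type)                     # ID
```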
Related
I am using ply (a popular python implementation of Lex and Yacc) to create a simple compiler for a custom language.
Currently my lexer looks as follows:
reserved = {
    'begin': 'BEGIN',
    'end': 'END',
    'DECLARE': 'DECL',
    'IMPORT': 'IMP',
    'Dow': 'DOW',
    'Enddo': 'ENDW',
    'For': 'FOR',
    'FEnd': 'ENDF',
    'CASE': 'CASE',
    'WHEN': 'WHN',
    'Call': 'CALL',
    'THEN': 'THN',
    'ENDC': 'ENDC',
    'Object': 'OBJ',
    'Move': 'MOV',
    'INCLUDE': 'INC',
    'Dec': 'DEC',
    'Vibration': 'VIB',
    'Inclination': 'INCLI',
    'Temperature': 'TEMP',
    'Brightness': 'BRI',
    'Sound': 'SOU',
    'Time': 'TIM',
    'Procedure': 'PROC'
}
tokens = ["INT", "COM", "SEMI", "PARO", "PARC", "EQ", "NAME"] + list(reserved.values())
t_COM = r'//'
t_SEMI = r";"
t_PARO = r'\('
t_PARC = r'\)'
t_EQ = r'='
t_NAME = r'[a-z][a-zA-Z_&!0-9]{0,9}'
def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print("Syntax error: Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)
Per the documentation, I am creating a dictionary of reserved keywords and then adding them to the tokens list, rather than writing an individual rule for each one. The documentation also states that precedence is determined by these two rules:
All tokens defined by functions are added in the same order as they appear in the lexer file.
Tokens defined by strings are added next by sorting them in order of decreasing regular expression length (longer expressions are added first).
The problem I'm having is that when I test the lexer using this test string
testInput = "// ; begin end DECLARE IMPORT Dow Enddo For FEnd CASE WHEN Call THEN ENDC (asdf) = Object Move INCLUDE Dec Vibration Inclination Temperature Brightness Sound Time Procedure 985568asdfLYBasdf ; Alol"
The lexer returns the following error:
LexToken(COM,'//',1,0)
LexToken(SEMI,';',1,2)
LexToken(NAME,'begin',1,3)
Syntax error: Illegal character ' '
LexToken(NAME,'end',1,9)
Syntax error: Illegal character ' '
Syntax error: Illegal character 'D'
Syntax error: Illegal character 'E'
Syntax error: Illegal character 'C'
Syntax error: Illegal character 'L'
Syntax error: Illegal character 'A'
Syntax error: Illegal character 'R'
Syntax error: Illegal character 'E'
(That's not the whole error, but it's enough to see what's happening.)
For some reason, Lex is matching NAME tokens before the keywords. Even after it's done matching NAME tokens, it doesn't recognize the DECLARE reserved keyword. I have also tried adding the reserved keywords alongside the rest of the tokens, using regular expressions, but I get the same result (the documentation also advises against doing so).
Does anyone know how to fix this problem? I want the Lexer to identify reserved keywords first and then to attempt to tokenize the rest of the input.
Thanks!
EDIT:
I get the same result even when using the t_ID function exemplified in the documentation:
def t_NAME(t):
    r'[a-z][a-zA-Z_&!0-9]{0,9}'
    t.type = reserved.get(t.value, 'NAME')
    return t
The main problem here is that you are not ignoring whitespace; all the errors are a consequence. Adding a t_ignore definition to your grammar will eliminate those errors.
But the grammar won't work as expected even if you fix the whitespace issue, because you seem to be missing an important aspect of the documentation, which tells you how to actually use the dictionary reserved:
To handle reserved words, you should write a single rule to match an identifier and do a special name lookup in a function like this:
reserved = {
    'if'    : 'IF',
    'then'  : 'THEN',
    'else'  : 'ELSE',
    'while' : 'WHILE',
    ...
}

tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    t.type = reserved.get(t.value,'ID')    # Check for reserved words
    return t
(In your case, it would be NAME and not ID.)
Ply knows nothing about the dictionary reserved and it also has no idea how you produce the token names enumerated in tokens. The only point of tokens is to let Ply know which symbols in the grammar represent tokens and which ones represent non-terminals. The mere fact that some word is in tokens does not serve to define the pattern for that token.
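To make the two fixes concrete, here is a ply-free sketch of the same idea in plain Python: one identifier-shaped rule, an ignored-whitespace rule, and the dictionary lookup afterwards. The token names come from the question; note also that the identifier pattern has to be widened to allow an upper-case first letter, because the question's [a-z]-anchored pattern can never match DECLARE at all.

```python
import re

# Reserved-word table from the question (subset for brevity).
reserved = {
    'begin': 'BEGIN', 'end': 'END', 'DECLARE': 'DECL', 'IMPORT': 'IMP',
}

# One identifier-shaped pattern; reserved words are resolved by dictionary
# lookup afterwards, exactly as the documentation's t_ID rule does.
token_spec = [
    ('NAME', r'[a-zA-Z][a-zA-Z_&!0-9]*'),
    ('SEMI', r';'),
    ('SKIP', r'[ \t]+'),   # the ply equivalent is: t_ignore = ' \t'
]
master = re.compile('|'.join('(?P<%s>%s)' % p for p in token_spec))

def tokenize(text):
    for m in master.finditer(text):
        kind, value = m.lastgroup, m.group()
        if kind == 'SKIP':
            continue               # ignored, like t_ignore
        if kind == 'NAME':
            kind = reserved.get(value, 'NAME')  # check for reserved words
        yield (kind, value)

print(list(tokenize('begin DECLARE xs ; end')))
# [('BEGIN', 'begin'), ('DECL', 'DECLARE'), ('NAME', 'xs'), ('SEMI', ';'), ('END', 'end')]
```

This is only an illustration of the mechanism, not the ply API itself; in the real lexer the same effect comes from adding `t_ignore = ' \t'` and moving the `reserved.get` lookup into `t_NAME`.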
Through an online Dart course, I've found some values bracketed with "less than" and "greater than" marks such as "List< E >".
e.g.
List<int> fixedLengthList = new List(5);
I couldn't find a direct answer online, probably because that question was too basic. Could someone explain what those marks exactly indicate? Or any links if possible.
These are generic type parameters. They allow specialization of classes.
List is a list that can contain values of any type (if no type parameter is passed, dynamic is used by default).
List<int> is a list that only allows int values and `null`.
You can add such Type parameters to your custom classes as well.
Usually single upper-case letters are used for type parameter names, like T, U, or K, but they can be longer names like TKey ...
class MyClass<T> {
  T value;
  MyClass(this.value);
}

main() {
  var mcInt = MyClass<int>(5);
  var mcString = MyClass<String>('foo');
  var mcStringError = MyClass<String>(5); // causes error because `5` is an invalid value when `T` is `String`
}
See also https://www.dartlang.org/guides/language/language-tour#generics
For example, if you intend for a list to contain only strings, you can declare it as List<String> (read that as “list of String”).
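For readers coming from Python, the typing module spells the same idea; note that unlike Dart, Python does not enforce these annotations at runtime (a static checker such as mypy would flag the mismatch instead):

```python
from typing import Generic, List, TypeVar

T = TypeVar('T')  # type parameter, analogous to Dart's <T>

class MyClass(Generic[T]):
    def __init__(self, value: T):
        self.value = value

ints: List[int] = [1, 2, 3]        # analogous to Dart's List<int>
mc_int = MyClass[int](5)
mc_str = MyClass[str]('foo')
print(mc_int.value, mc_str.value)  # 5 foo
```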
Say I have a scope like this:
scope :by_templates, ->(t) { joins(:template).where('templates.label ~* ?', t) }
How can I retrieve multiple templates with t like so?
Document.first.by_templates(%w[email facebook])
This code returns this error.
PG::DatatypeMismatch: ERROR: argument of AND must be type boolean, not type record
LINE 1: ...template_id" WHERE "documents"."user_id" = $1 AND (templates...
PostgreSQL allows you to apply a boolean valued operator to an entire array of values using the op any(array_expr) construct:
9.23.3. ANY/SOME (array)
expression operator ANY (array expression)
expression operator SOME (array expression)
The right-hand side is a parenthesized expression, which must yield an array value. The left-hand expression is evaluated and compared to each element of the array using the given operator, which must yield a Boolean result. The result of ANY is “true” if any true result is obtained. The result is “false” if no true result is found (including the case where the array has zero elements).
PostgreSQL also supports the array constructor syntax for creating arrays:
array[value, value, ...]
Conveniently, ActiveRecord will expand a placeholder as a comma-delimited list when the value is an array.
Putting these together gives us:
scope :by_templates, ->(templates) { joins(:template).where('templates.label ~* any(array[?])', templates) }
As an aside, if you're using the case-insensitive regex operator (~*) as a case-insensitive comparison (i.e. no real regex pattern matching going on) then you might want to use upper instead:
# Yes, this class method is still a scope.
def self.by_templates(templates)
  joins(:template).where('upper(templates.label) = any(array[?])', templates.map(&:upcase))
end
Then you could add an index to templates on upper(label) to speed things up and avoid possible issues with stray regex metacharacters in the templates. I tend to use upper case for this sort of thing because of oddities like 'ß'.upcase being 'SS' but 'SS'.downcase being 'ss'.
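To see what op any(array) is doing, here is a plain-Python model of the two comparisons (illustration only; the real matching happens inside PostgreSQL):

```python
import re

labels = ['Email', 'Facebook-Feed', 'Twitter']
patterns = ['email', 'facebook']

# label ~* any(array[...]) is true when the case-insensitive regex match
# succeeds for at least one element of the array:
def matches_any(label, pats):
    return any(re.search(p, label, re.IGNORECASE) for p in pats)

print([l for l in labels if matches_any(l, patterns)])
# ['Email', 'Facebook-Feed']

# The upper()-based variant compares exact strings instead of regexes,
# so the partial match on 'Facebook-Feed' no longer succeeds:
upcased = [p.upper() for p in patterns]
print([l for l in labels if l.upper() in upcased])
# ['Email']
```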
Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc::Parser String
proc = do {
;str<-parserMandatoryPart
;str0<-optional(parserOptionalPart0) --(1)
;str1<-optional(parserOptionalPart1) --(2)
;return str++str0++str1
}
I want to get str0/str1 if optional parts are present, otherwise, str0/str1 would be "".
But (1) and (2) won't work since optional() doesn't allow extracting result from its parameters, in this case, parserOptionalPart0/parserOptionalPart1.
Now What would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is option, which tries a parser and substitutes a default value if it fails. (Its close relative optionMaybe returns Nothing if the parser failed, and the content wrapped in Just if it succeeded.)
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
  str  <- parserMandatoryPart
  str0 <- option "" parserOptionalPart0
  str1 <- option "" parserOptionalPart1
  return (str ++ str0 ++ str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.
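The behaviour of option x p is easy to model outside Haskell. Here is a minimal Python sketch with hypothetical combinators (not a Parsec port; in particular, these parsers never consume input on failure, so the try caveat above doesn't arise here):

```python
# Each parser takes (text, pos) and returns (value, new_pos),
# or raises ParseError on failure.
class ParseError(Exception):
    pass

def literal(s):
    """Parser that matches the exact string s."""
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        raise ParseError('expected %r at position %d' % (s, pos))
    return parse

def option(default, parser):
    """Like Parsec's `option x p`: try p; if it fails, return the default."""
    def parse(text, pos):
        try:
            return parser(text, pos)
        except ParseError:
            return default, pos
    return parse

def proc(text, pos=0):
    """Mandatory part, then two optional parts, concatenated."""
    s,  pos = literal('mandatory')(text, pos)
    s0, pos = option('', literal('-opt0'))(text, pos)
    s1, pos = option('', literal('-opt1'))(text, pos)
    return s + s0 + s1, pos

print(proc('mandatory-opt1')[0])  # mandatory-opt1
print(proc('mandatory')[0])       # mandatory
```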
I am new at language processing and I want to create a parser with Irony for a following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an xml element and value is the value of the element which can also include spaces.
I have tried to modify included samples like this:
public TestGrammar()
{
    var name = CreateTerm("name");
    var value = new IdentifierTerminal("value");
    var queries = new NonTerminal("queries");
    var query = new NonTerminal("query");

    queries.Rule = MakePlusRule(queries, null, query);
    query.Rule = name + ":" + value;

    Root = queries;
}

private IdentifierTerminal CreateTerm(string name)
{
    IdentifierTerminal term = new IdentifierTerminal(name, "!##$%^*_'.?-", "!##$%^*_'.?0123456789");
    term.CharCategories.AddRange(new[]
    {
        UnicodeCategory.UppercaseLetter,      //Ul
        UnicodeCategory.LowercaseLetter,      //Ll
        UnicodeCategory.TitlecaseLetter,      //Lt
        UnicodeCategory.ModifierLetter,       //Lm
        UnicodeCategory.OtherLetter,          //Lo
        UnicodeCategory.LetterNumber,         //Nl
        UnicodeCategory.DecimalDigitNumber,   //Nd
        UnicodeCategory.ConnectorPunctuation, //Pc
        UnicodeCategory.SpacingCombiningMark, //Mc
        UnicodeCategory.NonSpacingMark,       //Mn
        UnicodeCategory.Format                //Cf
    });
    //StartCharCategories are the same
    term.StartCharCategories.AddRange(term.CharCategories);
    return term;
}
but this doesn't work if the values include spaces. Can this be done (using Irony) without modifying the syntax (like adding quotes around values)?
Many thanks!
If newlines were included between key-value pairs, it would be easily achievable. I have no knowledge of "Irony", but my initial feeling is that almost no parser/lexer generator is going to deal with this given only a naive grammar description. This requires essentially unbounded lookahead.
Conceptually (because I know nothing about this product), here's how I would do it:
Tokenise based on spaces and colons (i.e. every contiguous sequence of characters that isn't a space or a colon is an "identifier" token of some sort).
You then need to make it such that every "sentence" is described from colon-to-colon:
sentence = identifier_list
| : identifier_list identifier : sentence
That's not enough to make it work, but you get the idea at least, I hope. You would need to be very careful to distinguish an identifier_list from a single identifier such that they could be parsed unambiguously. Similarly, if your tool allows you to define precedence and associativity, you might be able to get away with making ":" bind very tightly to the left, such that your grammar is simply:
sentence = identifier : identifier_list
And the behaviour of that needs to be (identifier :) identifier_list.
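The colon-lookahead idea can be sketched in a few lines of Python (an illustration of the strategy, not Irony code; a value that itself contained a word ending in a colon would still confuse it):

```python
import re

def parse_pairs(text):
    """Split 'name1:value one name2:value two' into a dict.
    A new key starts at any non-space run immediately followed by ':';
    everything up to the next such key belongs to the current value."""
    result = {}
    keys = list(re.finditer(r'(\S+):', text))
    for i, m in enumerate(keys):
        start = m.end()
        # The value runs until the next key begins (or end of input).
        end = keys[i + 1].start() if i + 1 < len(keys) else len(text)
        result[m.group(1)] = text[start:end].strip()
    return result

print(parse_pairs('title:war and peace author:leo tolstoy'))
# {'title': 'war and peace', 'author': 'leo tolstoy'}
```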