In a tree-sitter grammar, how do I match strings except for reserved keywords in identifiers?

This might be related to me not understanding the Keyword Extraction feature, which from the docs seems to be about avoiding an issue where no space exists between a keyword and the following expression. But say I have a fairly standard identifier regex for variable names, function names, etc.:
/\w*[A-Za-z]\w*/
How do I keep this from matching a reserved keyword like IF or ELSE or something like that? So this expression would produce an error:
int IF = 5;
while this would not:
int x = 5;

There is a pull request, pending since 2019, to add an EXCLUDE feature, but it is not implemented as of the time of writing (April 2021 - if some time has passed and you're reading this, please do re-check!). And since tree-sitter also does not support negative lookbehind in its regular expressions, this has to be handled at the semantic level. One thing you can do to make this check easier is to enumerate all your reserved words and add them as an alternative to your identifier regex:
keyword: $ => choice('IF', 'THEN', 'ELSE'),
name: $ => /\w*[A-Za-z]\w*/,
identifier: $ => choice($.keyword, $.name)
According to rule 4 of tree-sitter's match rules, in the expression int IF = 5; the IF token would match (identifier keyword) rather than (identifier name), since it is a more specific match. This means you can do an easy query for illegal (identifier keyword) nodes and surface the error to the user in your language server, or from wherever it is you're using the tree-sitter grammar.
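For example, a query along these lines (a sketch on my part; the capture name @reserved is arbitrary) would flag those nodes:
(identifier (keyword) @reserved)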
Note that this approach does run the risk of creating many conflicts between your (identifier keyword) match and the actual language constructs that use those keywords. If so, you'll have to handle the whole thing at the semantic level: scan all identifiers to check whether they're a reserved word.

Related

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC (physical lines of code). This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
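Using that, the rule from the post would look something like this (my adaptation; note the added * so the comment body may be longer than one character):
lexical SingleLineComment = "//" ![\n]* "\n";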
Using a context-free grammar in Rascal takes three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful: it will not accept spaces and newlines before and after the top nonterminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly on an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit statements and / (deep match) to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect the locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree); it uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t@\loc ? |unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex does, only more so.
For the grammar you might be writing, which finds source-code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar", metaphorically: you write precise rules for the parts you want to recognize (the comments are the "islands") while leaving the rest as everything else (the rest is "water"). See https://dl.acm.org/citation.cfm?id=837160

Why do we need both a lookahead symbol and a read-ahead symbol in a compiler?

Well, I was reading some common concepts regarding parsing in compilers. I came across the lookahead and read-ahead symbols; I searched and read about them, but I am stuck on why we need both. I would be grateful for any suggestion.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters ahead before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
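To make that concrete, here is a hedged sketch (hand-written for illustration, not taken from any real compiler) of a scanner fragment that decides among <, <=, << and <<= by reading ahead and pushing back the character it did not consume:

#include <stdio.h>

/* Decide among <, <=, << and <<= after an initial '<' has been read.
 * ungetc() plays the role of "unreading" a character that we looked
 * at but that belongs to the next token. */
const char *scan_after_lt(FILE *in) {
    int c = fgetc(in);
    if (c == '=') return "<=";
    if (c == '<') {
        int d = fgetc(in);
        if (d == '=') return "<<=";
        if (d != EOF) ungetc(d, in); /* read one character too far */
        return "<<";
    }
    if (c != EOF) ungetc(c, in);     /* the '<' stands alone */
    return "<";
}

int main(void) {
    if (fgetc(stdin) == '<')         /* consume the initial '<' */
        puts(scan_after_lt(stdin));
    return 0;
}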
A similar thing can happen at a higher level:
{
ident1 ident2;
ident3;
ident4:;
}
Here ident1, ident3 and ident4 can each begin a declaration, an expression or a label. You can't tell which one immediately. You can consult your existing declarations to see whether ident1 or ident3 is already known (as a type or a variable/function/enumeration), but it's still ambiguous, because a colon may follow; if it does, it's a label, because the same identifier is permitted to name both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
typedef int ident1;
ident1 ident2; // same as int ident2
int ident3 = 0;
ident3; // unused expression of value 0
ident1:; // unused label
ident2:; // unused label
ident3:; // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
;
Using this, these match correctly:
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However, this does not match:
SELECT VAR1 ASSIGN TWO
even though I have marked TO|USING as optional in the rule.
From the generated Java code I see that when the lexer notices the T of TWO, it goes into match('TO'),
but since it does not find O after T,
it generates a failure and returns all the way out of the rule - hence not matching the input.
How do I get my lexer rule to match when the input has a word that merely starts with the same characters as the optional suffix part of the rule?
Basically I want my rule to also match this (besides what it already matches, as listed at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I should approach/resolve this situation.
NOTE:
Such rules are normally recommended for the parser, but I have this in the lexer because I do not want the parser to process the entire input; I want to parse only the content of interest. So, using such rules in the lexer, I locate the sections which I actually want the parser to parse.
UPDATE 1
I could circumvent this problem by making two rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
;
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
;
But is it possible to do the desired in one lexer rule?
An approach to get this in one rule, suggested by my senior, is to use a syntactic predicate:
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
;
A token matches a complete character sequence or none at all; it cannot match partially, and the grammar rule determines which sequence exactly. You cannot expect a rule for TO to match TWO. If you want TWO to match too, you have to add it to your lexer rule.
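For instance, one direction (an untested sketch on my part, not from the original answer) is to allow an identifier in the suffix position as well, so the subrule prediction can pick it for TWO:
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(WS+ ('TO' | 'USING' | IDENTIFIER))?
;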
A few notes here:
The solution your "senior" gave you makes no sense at all. A syntactic predicate is a kind of lookahead to guide the parser in case of ambiguities. There are no ambiguities involved here.
Writing the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and not flexible. A lexer rule should not be used for entire sentences, but only for a small set of characters, to find tokens and assign them a type (usually elementary structures of a language like strings, numbers, comments etc.).
ANTLR3 is totally outdated and I wonder why it is still used in your class. ANTLR4 has been out for five years and should be the choice for any new project.

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means: match anything that is not one of the rules inside the parentheses. Now, I know that in flex I can negate character rules (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I can use character rules for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that. [Diagram: state 0 is the accepting start state; a " moves from state 0 to state 1, and another " from state 1 to state 2; any other character returns to state 0; a third consecutive " (completing """) has no transition.]
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that diagram all the states would be accepting, while in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is as follows. [Diagram: state 0 is accepting; E moves to state 1; from state 1, E stays in state 1 and N moves to state 2; from state 2, E moves back to state 1; any character that can no longer continue toward END returns to state 0.]
One way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E]: stays in state 0
E: goes to state 1; then:
(E|NE)*: stays in state 1
[^EN]: back to state 0
N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
BEGIN( INITIAL ); /* leave the string and resume normal scanning */
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
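Spelled out, that optimized rule would replace the .|\n rule like so (same action, just longer matches):
<TRIPLEQ>[^"]+|\"|\n { yymore(); }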
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
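Keeping the start-condition name from above, that optimized rule would then match runs of characters other than ], something like:
<TRIPLEQ>[^\]]+|\]|\n { yymore(); }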

Matching function in erlang based on string format

I have user information coming in from an outside source and I need to check if that user is active. Sometimes I have a User and a Server and other times I have User@Server. The former case is no problem, I just have:
active(User, Server) ->
do whatever.
What I would like to do with the User@Server case is something like:
active([User, "@", Server]) ->
active(User, Server).
This doesn't seem to work. When calling active in the Erlang shell with a@b, for example, I get an error that there is no match. Any help would be appreciated!
You can tokenize the string to get the result:
active(UserString) ->
[User,Server] = string:tokens(UserString,"@"),
active(User,Server).
If you need something more elaborate, or with better handling of something like email addresses, it might then be time to delve into using regular expressions with the re module.
active(UserString) ->
RegEx = "^([\\w\\.-]+)@([\\w\\.-]+)$",
{match, [User,Server]} = re:run(UserString,RegEx,[{capture,all_but_first,list}]),
active(User,Server).
Note: the supplied regex is hardly sufficient for email-address validation; it's just an example that allows all alphanumeric characters plus underscores (\\w), dots (\\.), and dashes (-), separated by an at symbol (@). And it will fail unless the match spans the whole length of the string (^ to $).
A note on the pattern matching; for the real solution to your problem I think @chops' suggestions should be used.
When matching patterns against strings, it's useful to keep in mind that Erlang strings are really lists of integers. So the string "@" is actually the same as [64] (64 being the ASCII code for @).
This means that your match pattern [User, "@", Server] will match lists like [97,[64],98], but not "a@b" (which in list form is [97,64,98]).
To match the string you need to write [User,$@,Server]. The $ prefix gives you the ASCII value of the character.
However, this match pattern limits the matching string to exactly one character, followed by @, followed by one more character...
It can be improved to [User, $@ | Server], which allows the server part to have arbitrary length, but the User variable will still only match one single character (and I don't see a way around that with a single match pattern).
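If you did want to avoid string:tokens/2, a small recursive helper (rather than a single match pattern) gets around that limitation; a minimal sketch, with split_at/1 being a made-up name:
%% Split "User@Server" into {User, Server}.
%% Assumes at least one @ is present; crashes otherwise.
split_at(String) -> split_at(String, []).

split_at([$@ | Server], Acc) -> {lists:reverse(Acc), Server};
split_at([Ch | Rest], Acc) -> split_at(Rest, [Ch | Acc]).
Then active/1 could be written as: {User, Server} = split_at(UserString), active(User, Server).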
