I am trying to get operator precedence correct in a Tree-sitter grammar. Tree-sitter is an LR(1) parser generator.
I have a straightforward arithmetic grammar, part of which looks like this:
multiply_expression: $ => prec.left(2, seq(
  $._expression,
  '*',
  $._expression,
)),

addition_expression: $ => prec.left(1, seq(
  $._expression,
  '+',
  $._expression,
)),
This works correctly. multiply_expression indeed gets a higher precedence than addition_expression.
However, the precedence changes when I add an intermediate rule:
_partial_multi: $ => seq(
  $._expression,
  '*',
),

multiply_expression: $ => prec.left(2, seq(
  $._partial_multi,
  $._expression,
)),
I moved $._expression, '*' into its own rule. To me this seems to be an equivalent grammar, so I expected no changes. However, with this change the precedence is no longer correct: addition_expression, which remained unchanged, now seems to have a higher precedence than multiply_expression.
Why does introducing an extra step change the precedence? Is there a name for this problem, or where can I find more information about it? When writing a grammar or fixing precedence problems, are there rules to follow or ways to think about this?
Here is your full grammar, for reproducibility:
module.exports = grammar({
  name: 'github_example',

  conflicts: $ => [],

  rules: {
    source_file: $ => $._expression,

    _expression: $ => choice(
      $.number,
      $.multiply_expression,
      $.addition_expression
    ),

    number: $ => /\d+/,

    _partial_multi: $ => seq(
      $._expression,
      '*',
    ),

    multiply_expression: $ => prec.left(2, seq(
      $._partial_multi,
      $._expression,
    )),

    addition_expression: $ => prec.left(1, seq(
      $._expression,
      '+',
      $._expression,
    )),
  }
});
You can fix this issue by adding precedence to the _partial_multi rule and removing left-associative precedence from the multiply_expression rule:
_partial_multi: $ => prec(2, seq(
  $._expression,
  '*',
)),

multiply_expression: $ => seq(
  $._partial_multi,
  $._expression,
),
What you've done here is make multiplication a right-associative operator of precedence 2, so 1*2*3 now parses as 1*(2*3). This is how you define left- or right-associativity in grammars that don't expose it as a primitive. You can make multiplication left-associative by writing it as follows:
_partial_multi: $ => prec(2, seq(
  '*',
  $._expression,
)),

multiply_expression: $ => seq(
  $._expression,
  $._partial_multi,
),
You've actually stumbled across something quite interesting: you don't need explicit language constructs to define precedence and associativity in a grammar at all! They're just "syntactic sugar" that makes the grammar easier to read and write. See this page for more information on how to specify precedence and associativity by decomposing rules. As you can see, specifying them purely through grammar constructs is confusing, and almost reads backwards from what you expect unless you think it through. As you also discovered, mixing the two approaches (language constructs and grammar constructs) can lead to confusing behavior, so it is best to stick with one or the other.
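For comparison, here is a minimal sketch of that purely structural encoding, with one rule per precedence level and no prec annotations at all. It is untested and its rule bodies are illustrative, not a drop-in replacement for the grammar above:

// Precedence by layering: '+' binds less tightly because addition sits
// above multiplication in the rule hierarchy.
_expression: $ => $.addition_expression,

addition_expression: $ => choice(
  // Left recursion within a level makes the operator left-associative.
  seq($.addition_expression, '+', $.multiply_expression),
  $.multiply_expression,
),

multiply_expression: $ => choice(
  seq($.multiply_expression, '*', $.number),
  $.number,
),

Here 1+2*3 can only parse as 1+(2*3), because the right operand of '+' is drawn from the multiplication level down.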
Related
I am trying to define a language using ANTLR4 to generate its parser. While the language is actually a bit more complex, this is a tiny valid example of a file I want the parser to read, which triggers the problem I am trying to fix:
features \\ Keyword which initializes the "features" block
Server
mandatory \\ Relation word
FileSystem
OperatingSystem
optional \\ Relation word
Logging
The features word simply starts the block, while mandatory and optional are relation words. The remaining words are just plain words (called features in this context). What I want is to make Server a child of the features block; then mandatory and optional both children of Server; and finally FileSystem and OperatingSystem children of mandatory, and Logging a child of optional.
The following grammar is my attempt to achieve this structure:
grammar MyGrammar;

tokens {
    INDENT,
    DEDENT
}

@lexer::header {
from antlr_denter.DenterHelper import DenterHelper
from UVLParser import UVLParser
}

@lexer::members {
class UVLDenter(DenterHelper):
    def __init__(self, lexer, nl_token, indent_token, dedent_token, ignore_eof):
        super().__init__(nl_token, indent_token, dedent_token, ignore_eof)
        self.lexer: UVLLexer = lexer

    def pull_token(self):
        return super(UVLLexer, self.lexer).nextToken()

denter = None

def nextToken(self):
    if not self.denter:
        self.denter = self.UVLDenter(self, self.NL, UVLParser.INDENT, UVLParser.DEDENT, True)
    return self.denter.next_token()
}
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
// lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
NL: ('\r'? '\n' '\t');
I am using antlr-denter in order to manage indent and dedent.
Then, I am defining RELATION_WORD and WORD separately in the lexer.
Finally, the parser rules attempt to construct the structure I described before. I want the features word to be followed by a single child. Then any child is a feature spec followed by any number of relations between an INDENT and a DEDENT. The same goes for relations: a relation spec followed by a similar set of children, with this pattern repeating indefinitely.
However, I can't manage to make the parser read this structure correctly. With the previous example as input, I get mandatory as a child of Server, but not optional. Changing the example to this one:
features
Server
mandatory
optional
Logging
Both mandatory and optional are interpreted as children of mandatory. It must have something to do with how INDENT and DEDENT are interpreted to delimit blocks, but I have been unable to find a solution so far.
Any ideas to fix this would be very welcome. Thanks in advance!
Try changing your child and feature rules as follows:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Explanation:
As @Kaby76 mentions, it's quite helpful to print out the token stream to understand how your parser sees the stream of tokens.
I've not used antlr-denter before, but given the way it plugs in, it would appear that you're not going to get a token stream just by using the grun tool.
As a substitute, I tried just making up INDENT and DEDENT tokens (I used -> and <-, respectively).
revised grammar:
grammar MyGrammar;
// parser rules
feature_model: features?;
features: 'features' INDENT child;
child: feature_spec INDENT relation* DEDENT;
relation: relation_spec INDENT child* DEDENT;
feature_spec: WORD ('.' WORD)*;
relation_spec: RELATION_WORD;
// lexer rules
RELATION_WORD: ('alternative' | 'or' | 'optional' | 'mandatory');
WORD: [a-zA-Z][a-zA-Z0-9_]*;
WS: [ \n\r]+ -> skip;
// Temporary
//NL: ('\r'? '\n' '\t');
NL: ('\r'? '\n' '\t') -> skip;
INDENT: '->';
DEDENT: '<-';
And revised the input file to use the explicit tokens:
features
->Server
->mandatory
optional
->Logging
Just making this change, you'll notice that there are no <- tokens in your sample.
But, now I can dump the token stream:
➜ grun MyGrammar tokens -tokens < MGIn.txt
[#0,0:7='features',<'features'>,1:0]
[#1,12:13='->',<'->'>,2:3]
[#2,14:19='Server',<WORD>,2:5]
[#3,28:29='->',<'->'>,3:7]
[#4,30:38='mandatory',<RELATION_WORD>,3:9]
[#5,47:48='->',<'->'>,4:7]
[#6,49:56='optional',<RELATION_WORD>,4:9]
[#7,69:70='->',<'->'>,5:11]
[#8,71:77='Logging',<WORD>,5:13]
[#9,78:77='<EOF>',<EOF>,5:20]
Now let's try parsing:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 4:9 mismatched input 'optional' expecting {WORD, '<-'}
line 5:20 mismatched input '<EOF>' expecting {'.', '->'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory) ->) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
So, your grammar calls for 'mandatory' (as a RELATION_WORD) to be followed by an INDENT as well as a DEDENT (which isn't present). This makes sense, as they don't have any children. So it seems that the INDENT/DEDENT pair needs to be conditional on whether there are any children.
Let's change that:
child: feature_spec (INDENT relation* DEDENT)?;
relation: relation_spec (INDENT child* DEDENT)?;
Try again:
➜ grun MyGrammar feature_model -tree < MGIn.txt
line 5:20 extraneous input '<EOF>' expecting {WORD, '<-'}
(feature_model (features features -> (child (feature_spec Server) -> (relation (relation_spec mandatory)) (relation (relation_spec optional) -> (child (feature_spec Logging))) <missing '<-'>)))
Now we're missing a <- (DEDENT) at EOF. The solution to this depends on whether antlr-denter closes all open INDENTs at <EOF>.
Assuming it does, my fake input should look something like this:
features
->Server
->mandatory
optional
->Logging
<-
<-
<-
and we try again:
grun MyGrammar feature_model -gui < MGIn.txt
What do I need to insert into TextField(inputFormatters: ...)?
I want to disallow \ and / in one TextField and only allow a to Z in another.
Formatters
In the services library you will find the TextInputFormatter abstract class (this means that you have to import package:flutter/services.dart).
It already has implementations, which are FilteringTextInputFormatter (formerly BlacklistingTextInputFormatter and WhitelistingTextInputFormatter) and LengthLimitingTextInputFormatter.
If you want to implement your own formatter, you can do so by extending TextInputFormatter itself and implementing formatEditUpdate in there.
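As a sketch of that (the class name is made up for illustration, and it assumes upper-casing does not change the string length, so the selection stays valid):

class UpperCaseTextFormatter extends TextInputFormatter {
  @override
  TextEditingValue formatEditUpdate(
      TextEditingValue oldValue, TextEditingValue newValue) {
    // Keep the new selection and composing state, but upper-case the text.
    return newValue.copyWith(text: newValue.text.toUpperCase());
  }
}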
I will show how to apply the premade FilteringTextInputFormatter with given context.
Examples
disallow \ and /
For this we are going to use the FilteringTextInputFormatter.deny constructor:
TextField(
  inputFormatters: [
    FilteringTextInputFormatter.deny(RegExp(r'[/\\]')),
  ],
)
For the Pattern, which needs to be supplied to the formatter, I will be using RegExp, i.e. regular expressions. You can find out more about that here, which also links you to the features I will be using in my examples.
Pay attention to the double backslash (\\) and the raw string (r'') in this example; together they represent just a single backslash. The reason is that the backslash is an escape character in RegExp, so we need two backslashes to match the literal \ character. Without the raw string (r'…') we would even need four backslashes (\\\\), because Dart also uses the backslash as an escape character. Using a raw string ensures that Dart does not unescape the characters itself.
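To make the escaping concrete, here is a small sketch showing that both spellings hand the same pattern to RegExp:

// Both of these reach the RegExp engine as [/\\], i.e. a character class
// matching / or a literal backslash:
final rawPattern = RegExp(r'[/\\]');    // raw string: Dart passes \\ through untouched
final plainPattern = RegExp('[/\\\\]'); // ordinary string: Dart first collapses \\\\ to \\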
If we wanted to block a, b, F, !, and ., we would likewise put them in a character class ([…]) like this:
FilteringTextInputFormatter.deny(RegExp('[abF!.]'))
This translates to "deny/blacklist all 'a', 'b', 'F', '!' and '.'".
only allow a to Z
This time we use the FilteringTextInputFormatter.allow constructor:
TextField(
  inputFormatters: [
    FilteringTextInputFormatter.allow(RegExp('[a-zA-Z]')),
  ],
)
For this, we specify two character ranges, a-z and A-Z, which also accept all the characters (here: all the letters) between their bounds. This works for 0-9 as well, and you can append further characters to the class, e.g. a-zA-Z0-9!. will also take ! and . into account.
We can combine these:
TextField(
  inputFormatters: [
    FilteringTextInputFormatter.allow(RegExp('[a-zA-Z]')),
    FilteringTextInputFormatter.deny(RegExp('[abFeG]')),
  ],
)
This is only to show that inputFormatters takes a List<TextInputFormatter> and that multiple formatters can be combined. In reality, you can solve this with a single allow/whitelist formatter and a regular expression, but it does work as well.
digitsOnly
There are also some static properties included in the FilteringTextInputFormatter class; one of these is FilteringTextInputFormatter.digitsOnly.
It will only accept/allow digits and is equivalent to an .allow(RegExp('[0-9]')) formatter.
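The LengthLimitingTextInputFormatter mentioned at the beginning combines naturally with it; e.g. a digits-only field capped at 4 characters could look like this (a sketch):

TextField(
  inputFormatters: [
    FilteringTextInputFormatter.digitsOnly, // allow only 0-9
    LengthLimitingTextInputFormatter(4),    // then cap the length at 4 characters
  ],
)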
Other options:
lowercase letters: a-z
capital letters: A-Z
accented lowercase vowels: á-ú
accented capital vowels: Á-Ú
numbers: 0-9
space: (a space)
Note: the spaces inside the brackets below make the pattern easier to read, and they also whitelist the space character itself.
inputFormatters: [
  WhitelistingTextInputFormatter(RegExp("[a-z A-Z á-ú Á-Ú 0-9]")),
]
BlacklistingTextInputFormatter and WhitelistingTextInputFormatter are deprecated since Flutter 1.20.0.
Use FilteringTextInputFormatter instead to apply input formatters to a TextField or TextFormField:
inputFormatters: [FilteringTextInputFormatter.allow(RegExp(r'^ ?\d*')),]
inputFormatters: [FilteringTextInputFormatter.deny(' ')]
inputFormatters: [FilteringTextInputFormatter.digitsOnly]
For example:
TextFormField(
  keyboardType: TextInputType.number,
  inputFormatters: [
    FilteringTextInputFormatter.digitsOnly,
  ],
),
First of all, check that you have imported the following package:
import 'package:flutter/services.dart';
Then you can use it like this:
TextFormField(
  inputFormatters: [
    FilteringTextInputFormatter(RegExp("[a-zA-Z]"), allow: true),
  ],
);
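The same constructor covers the deny case via allow: false; this sketch is equivalent to using the FilteringTextInputFormatter.deny named constructor:

TextFormField(
  inputFormatters: [
    // Blocks digits; same effect as FilteringTextInputFormatter.deny(RegExp("[0-9]")).
    FilteringTextInputFormatter(RegExp("[0-9]"), allow: false),
  ],
);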
import 'package:flutter/services.dart';
Use these parameters in the TextFormField.
maxLength: 5,
keyboardType: TextInputType.number,
inputFormatters: [WhitelistingTextInputFormatter.digitsOnly,],
I've written a lexer and parser in ml-ulex and ml-antlr (running SML/NJ).
I have the following rule in my lexer:
<TRUSTED> [\"] => ( YYBEGIN INQUOTE ; Tokens.ValidText yytext );
It's identified correctly, but yytext has several additional '\' (backslashes) in the string: I get \\\\\\\" instead of \".
To make things more perplexing, I replaced the rule in my lexer with this one:
<TRUSTED> [\"] => ( YYBEGIN INQUOTE ; Tokens.DBLQUOTE );
And in my grammar I had the token:
prod: DBLQUOTE => ( "\"" );
With the same results as before.
If I replace the rule with
prod: DBLQUOTE => ( "*test*" );
Then I get exactly one test as my string.
I've tried using String.fromString on the resulting string, and it doesn't do the trick. I'm not sure what's happening or why.
I want to create a very simple parser to convert:
"I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!" to:
(
    'I',
    ' ',
    'wan',
    '\'',
    't',
    ' ',
    'this',
    ' ',
    'to',
    ' ',
    'be',
    ' ',
    'ready',
    ' ',
    'by',
    ' ',
    '10',
    ':',
    '15',
    ' ',
    'p',
    '.',
    'm',
    '.',
    ' ',
    'today',
    ' ',
    'Mr',
    '.',
    ' ',
    'Gönzalés',
    '.',
    '!',
)
So basically I want consecutive letters and numbers to be grouped into a single string. I'm using Python 3 and I don't want to install external libs. I'd also like the solution to be as efficient as possible, as I will be processing a book.
What approaches would you recommend for solving this problem? Any examples?
The only way I can think of now is to step through the text, character by character, in a for loop. But I'm guessing there's a better, more elegant approach.
Thanks,
Barry
You are looking for a procedure called tokenization. That means splitting raw text into discrete "tokens", in our case just words. For programming languages this is fairly easy, but unfortunately it is not so for natural language.
You need to do two things: split the text into sentences, and split the sentences into words. Usually we do this with regular expressions. Naïvely you could split sentences on the pattern ". ", i.e. a period followed by a space, and then split the sentences into words on spaces. This won't work very well, however, because abbreviations also often end in periods. As it turns out, tokenization and sentence segmentation are actually fairly tricky to get right. You could experiment with several regexps, but it would be better to use a ready-made tokenizer. I know you didn't want to install any external libs, but I'm sure this will spare you pain later on. NLTK has good tokenizers.
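For example, here is a minimal sketch using NLTK (it assumes the package is installed and the 'punkt' tokenizer data has been downloaded; note that word_tokenize drops whitespace tokens, so the output is close to, but not exactly, the tuple shown in the question):

import nltk

# nltk.download('punkt')  # one-time download of the tokenizer model

text = "I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!"
for sentence in nltk.sent_tokenize(text):
    print(nltk.word_tokenize(sentence))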
I believe this is a solution:
import regex

text = "123 2 can't, 4 Å, é, and 中ABC _ sh_t"
print(regex.findall(r'\d+|\P{alpha}|\p{alpha}+', text))
Can it be improved?
Thanks!
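Since the question asked to avoid external libraries (regex is a third-party module, and the standard re module does not support \p{alpha}), here is a standard-library sketch of the same grouping using itertools.groupby. str.isalpha and str.isdigit are Unicode-aware, so accented letters group correctly:

from itertools import groupby

def char_kind(ch):
    # Runs of letters group together, runs of digits group together;
    # everything else (punctuation, spaces) is emitted one character at a time.
    if ch.isalpha():
        return 'alpha'
    if ch.isdigit():
        return 'digit'
    return None

def tokenize(text):
    tokens = []
    for kind, group in groupby(text, key=char_kind):
        run = ''.join(group)
        if kind is None:
            tokens.extend(run)  # one token per punctuation/space character
        else:
            tokens.append(run)  # keep the whole run of letters or digits
    return tokens

print(tokenize("I wan't this to be ready by 10:15 p.m. today Mr. Gönzalés.!"))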
How can I write a (La)TeX command that replaces all [ with { and all ] with }, assuming that every [ has a matching ], and that any braces between the [ and ] are balanced? It needs to be able to deal with nested brackets.
For example, I want to be able to write a command \mynewcommand so that \mynewcommand{{[[{1}{2}][{3}{4}]]}} is the same as \mycommand{{{{{1}{2}}{{3}{4}}}}}.
Probably the easiest way is to use e-TeX and \scantokens:
\newcommand*\mycommand[1]{%
  \begingroup
  \everyeof{\noexpand}%
  \endlinechar=-1\relax
  \catcode`\[=1\relax
  \catcode`\]=2\relax
  \edef\temp{\scantokens{#1}}%
  \expandafter\endgroup
  \expandafter\def\expandafter\temp\expandafter{\temp}%
}
This will define \temp with the material in #1 but with every "[" ... "]" pair turned into a TeX brace group ("{" ... "}"). You can then use \temp to do whatever you want. As I say, this requires e-TeX, which is available in all modern TeX systems.
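For example (a sketch; \show just displays the converted result on the terminal and in the log so you can verify it, and it pauses an interactive run):

\mycommand{{[[{1}{2}][{3}{4}]]}}
\show\temp % should display \temp with every [...] pair turned into {...}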