I want to extract all names from arbitrary text. Names are formatted like this:
Lastname F.
where F is the first letter of the first name. So I created this grammar:
grammar org.xtext.example.mydsl.Article with org.eclipse.xtext.common.Terminals
generate article "http://www.xtext.org/example/mydsl/Article"
Model : {Model}((per += Person)|(words += NON_WS))*;
Person : lastName = NAME firstName = IN;
terminal NAME : ('A'..'Z')('a'..'z')+;
terminal IN : ('A'..'Z')'.';
terminal NON_WS : !(' '|'\t'|'\r'|'\n')+;
It works on this example:
Lastname F. some text. Lastname F.
But it crashes on this one:
Lastname F. some text. New sentence. Lastname F.
^^^^^^^^^ missing RULE_IN at 'sentence.'
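How can I check the upcoming tokens before the Person object is created, or before the parser enters the Person rule?
Lexing is context-free: once a piece of text has been lexed as a NAME, it is always a NAME, regardless of what the parser expects at that point. So allow NAME wherever a plain word is allowed: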
Model : {Model}((per += Person)|(words += (NON_WS|NAME)))*;
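To see why the extra NAME alternative is needed, here is a rough Python sketch of the same idea. It is not Xtext or its generated lexer; the helper names `tokenize` and `parse` are made up for illustration, and the longest-match rule with ties broken by declaration order is an assumption about how the lexer behaves. Unlike the grammar, this sketch also lets a stray IN token through as a plain word.

```python
import re

# Token patterns mirroring the terminals IN, NAME and NON_WS from the grammar.
# Assumption: the longest match wins, and ties go to the earlier entry.
TOKEN_SPEC = [
    ("IN", re.compile(r"[A-Z]\.")),
    ("NAME", re.compile(r"[A-Z][a-z]+")),
    ("NON_WS", re.compile(r"\S+")),
]


def tokenize(text):
    """Lex without any knowledge of what the parser expects next."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for kind, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        kind, m = best  # NON_WS always matches at a non-space character
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens


def parse(tokens):
    """A NAME followed by an IN is a person; any other token is a word."""
    persons, words = [], []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if kind == "NAME" and nxt is not None and nxt[0] == "IN":
            persons.append((text, nxt[1]))
            i += 2
        else:
            words.append(text)
            i += 1
    return persons, words
```

In the failing input, "New" is lexed as a NAME because the lexer cannot know that no initial follows; the parser only copes once a lone NAME is also accepted as a word.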
ASP-NET MVC PROJECT
I tried to apply a regular expression with a lookaround to a password field of a form. As a simple example, I tried to check that the user typed a capital letter, using System.ComponentModel.DataAnnotations.
Examples:
[RegularExpression("(?=.*[A-Z])", ErrorMessage = "You have to write a capital letter")]
[RegularExpression(@"^(?=.*[A-Z])", ErrorMessage = "You have to write a capital letter")]
I tested this regular expression on this web page: http://regexstorm.net/tester
and I think the pattern is correct.
In the login view I am using the attribute asp-validation-for=@Model.password to check that the model is valid; if it is not, the server prints the error message in the view, and that is what always happens.
MICROSOFT GUIDE:
https://learn.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference#lookarounds-at-a-glance
You must provide a pattern that consumes the entire string when using a regex with RegularExpressionAttribute (see RegularExpression Validation attribute not working correctly). Your ^(?=.*[A-Z]) pattern matches only the position at the start of the string: ^ is an anchor asserting the start of the string, and (?=.*[A-Z]) is a positive lookahead, a non-consuming pattern that merely succeeds if, after zero or more chars other than line break chars (as many as possible), there is an ASCII uppercase letter. The resulting match is empty, so it never covers the whole input and validation fails for any non-empty value.
You can use
[RegularExpression(@"^.*[A-Z].*", ErrorMessage = "You have to write a capital letter")]
Or, a more efficient
[RegularExpression(@"^[^A-Z]*[A-Z].*", ErrorMessage = "You have to write a capital letter")]
In case there can be line breaks in the input text, replace . with [\w\W]:
[RegularExpression(@"^[^A-Z]*[A-Z][\w\W]*", ErrorMessage = "You have to write a capital letter")]
Details:
^ - start of string
[^A-Z]* - zero or more chars other than ASCII uppercase letters
[A-Z] - an uppercase letter
[\w\W]* - zero or more chars as many as possible.
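The whole-string requirement can be illustrated with a small Python sketch. This uses Python's `re` module rather than the .NET engine, and `is_valid` is a hypothetical helper that stands in for the attribute's check by demanding that the match span the entire value; the patterns themselves are the ones discussed above.

```python
import re

LOOKAHEAD_ONLY = r"^(?=.*[A-Z])"
CONSUMING = r"^[^A-Z]*[A-Z].*"
CONSUMING_MULTILINE = r"^[^A-Z]*[A-Z][\w\W]*"


def is_valid(pattern, value):
    """Accept the value only if the pattern matches all of it."""
    return re.fullmatch(pattern, value) is not None


# The lookahead succeeds but consumes nothing, so the match is empty.
print(is_valid(LOOKAHEAD_ONLY, "Password"))           # False
# The consuming pattern covers the whole single-line value.
print(is_valid(CONSUMING, "passWord"))                # True
print(is_valid(CONSUMING, "password"))                # False
# "." stops at a line break, "[\w\W]" does not.
print(is_valid(CONSUMING, "Pass\nword"))              # False
print(is_valid(CONSUMING_MULTILINE, "Pass\nword"))    # True
```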
I have a section of an ANTLR grammar which goes like this:
mainfilter: mandatoryfilter (optionalfilter1)? (optionalfilter2)? (optionalfilter3)? ;
mandatoryfilter: 'NAME' '=' ID;
optionalfilter1: 'VALUE1' EQ ID;
optionalfilter2: 'VALUE2' EQ ID;
optionalfilter3: 'VALUE3' EQ ID;
EQ: '=' ;
ID: [A-Za-z0-9]+ ;
//Also I will skip spaces and whitespace
My requirement is that the "optionalfilter" rules can occur in any order.
One approach I can think of is to rewrite the rule as below and then validate using a Listener:
mainfilter: mandatoryfilter (optionalfilter1|optionalfilter2|optionalfilter3)*;
Another way to achieve this is to spell out every permutation in its own parser rule, but that does not scale as the number of optionalfilter rules increases.
Sample input:
NAME = BOB VALUE1=X VALUE2=Y VALUE3 = Z
NAME = BILL VALUE3=X VALUE1=Y VALUE2 = Z
My grammar successfully parses the first input but not the second one.
So is there an elegant way to handle this in my grammar itself ?
No.
Usually, the grammar matches zero or more filters in any order, and a check after parsing verifies that each filter occurs at most once.
Take for example the Java Language Specification, which defines that a class declaration can have zero or more class modifiers (the {ClassModifier} part):
NormalClassDeclaration:
{ClassModifier} class Identifier [TypeParameters] [Superclass] [Superinterfaces] ClassBody
ClassModifier:
(one of)
Annotation public protected private abstract static final strictfp
which would match public public class Foo {}. This is rejected at the stage after parsing.
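The parse-loosely-then-validate approach can be sketched in Python as follows. This is not ANTLR or a Listener; `parse_filters` and the regex are illustrative stand-ins, with the key names taken from the grammar in the question. The first loop plays the role of the relaxed rule (any filters, any order), and the second part is the check that would otherwise live in a listener.

```python
import re

# One "KEY = ID" pair, with optional whitespace around the equals sign.
PAIR = re.compile(r"\s*([A-Za-z0-9]+)\s*=\s*([A-Za-z0-9]+)")
OPTIONAL_KEYS = {"VALUE1", "VALUE2", "VALUE3"}


def parse_filters(text):
    """Return (name, optional_filters) or raise ValueError."""
    text = text.strip()
    pairs = []
    pos = 0
    # Step 1: accept any sequence of pairs, in any order.
    while pos < len(text):
        m = PAIR.match(text, pos)
        if m is None:
            raise ValueError(f"cannot parse input at offset {pos}")
        pairs.append((m.group(1), m.group(2)))
        pos = m.end()

    # Step 2: validate after parsing.
    if not pairs or pairs[0][0] != "NAME":
        raise ValueError("the NAME filter must come first")
    optional = {}
    for key, value in pairs[1:]:
        if key not in OPTIONAL_KEYS:
            raise ValueError(f"unknown filter {key}")
        if key in optional:
            raise ValueError(f"filter {key} occurs more than once")
        optional[key] = value
    return pairs[0][1], optional
```

Both sample inputs from the question pass, while an input that repeats VALUE1 is accepted by the first step and rejected by the second.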
I am using ANTLR4 to try and parse the following text:
ex1, ex2: examples
var1,var2,var3: variables
Since the second line does not have whitespace after the commas, it doesn't parse correctly. If I add in the whitespace, then it works. The rules I am currently using to parse this:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : [a-zA-Z0-9]+;
name : [a-zA-Z0-9]+;
This works perfectly for lines like line 1 but fails on lines like line 2. If there are parentheses or pretty much any other punctuation, the parser wants some whitespace after the punctuation, and I can't always guarantee that about the input.
Does anyone know how to fix this?
First add explicit lexer rules (starting with a capital letter). Then add a lexer rule for whitespace and ignore the whitespace:
line : list ':' name;
list : listitem (',' listitem)*;
listitem : Identifier;
name : Identifier;
Identifier : [a-zA-Z0-9]+; // a single lexer rule serves both name and listitem, since an Identifier is a name or a listitem depending only on its position
WhiteSpace : (' '|'\t') -> skip;
NewLine : ('\r'?'\n'|'\r') -> skip; // or don't skip if you need it as a statement terminator
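The effect of separating lexing from parsing and skipping whitespace in the lexer can be sketched in Python. This is only an illustration of the idea, not ANTLR output; `tokenize` and `parse_line` are hypothetical helpers that mirror the Identifier and WhiteSpace rules and the line/list parser rules above.

```python
import re

# Lexer: identifiers, the two punctuation tokens, and whitespace to skip.
TOKEN = re.compile(r"(?P<WS>\s+)|(?P<ID>[A-Za-z0-9]+)|(?P<PUNCT>[,:])")


def tokenize(line):
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r}")
        if m.lastgroup != "WS":  # the equivalent of "-> skip"
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_line(line):
    """line : list ':' name ;  list : listitem (',' listitem)* ;"""
    tokens = tokenize(line)
    items = []
    i = 0
    while True:
        if i >= len(tokens) or tokens[i][0] != "ID":
            raise ValueError("expected a list item")
        items.append(tokens[i][1])
        i += 1
        if i < len(tokens) and tokens[i][1] == ",":
            i += 1
            continue
        break
    if i >= len(tokens) or tokens[i][1] != ":":
        raise ValueError("expected ':'")
    i += 1
    if i >= len(tokens) or tokens[i][0] != "ID":
        raise ValueError("expected a name")
    name = tokens[i][1]
    if i + 1 != len(tokens):
        raise ValueError("unexpected trailing input")
    return items, name
```

Because whitespace never reaches the parser, "ex1, ex2: examples" and "var1,var2,var3: variables" produce the same kind of token stream.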
I want to write a parser using Bison/Yacc + Lex which can parse statements like:
VARIABLE_ID = 'STRING'
where:
ID [a-zA-Z_][a-zA-Z0-9_]*
and:
STRING [a-zA-Z0-9_]+
So, var1 = '123abc' is a valid statement while 1var = '123abc' isn't.
Therefore, a VARIABLE_ID is always a STRING, but a STRING is not always a VARIABLE_ID.
What I would like to know is if the only way to distinguish between the two is writing a checking procedure at a higher level (i.e. inside Bison code) or if I can work it out in the Lex code.
Your abstract statement syntax is actually:
VARIABLE = STRING
and not
VARIABLE = 'STRING'
because the quote delimiters are a lexical detail that we generally want to keep out of the syntax. And so, the token patterns are actually this:
ID [a-zA-Z_][a-zA-Z0-9_]*
STRING '[a-zA-Z_0-9]*'
An ID is a letter or underscore, followed by any combination (including empty) of letters, digits and underscores.
A STRING is a single quote, followed by a sequence (possibly empty) of letters, digits and underscores, followed by another single quote.
So the ambiguity you are concerned about does not exist. An ID is not in fact a STRING, nor vice versa.
Somewhere inside your Bison parser, or possibly in the lexer, you might want to massage the yytext of a STRING match to remove the quotes and just retain the text in between them as a string. This could be a Bison rule, possibly similar to:
string : STRING
{
$$ = strip_quotes($1);
}
;
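To make the point concrete, here is a small Python sketch of the two token patterns and the quote stripping. It is not Lex or Bison code; `strip_quotes` here is only a Python stand-in for the helper named in the rule above, and `parse_statement` is a hypothetical function for the single statement form VARIABLE = STRING.

```python
import re

# The quotes are part of the STRING token, so ID and STRING cannot overlap.
TOKEN = re.compile(r"""
    (?P<WS>\s+)
  | (?P<ID>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<STRING>'[A-Za-z0-9_]*')
  | (?P<EQ>=)
""", re.VERBOSE)


def strip_quotes(text):
    """Drop the surrounding single quotes of a STRING token."""
    return text[1:-1]


def tokenize(src):
    tokens = []
    pos = 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise ValueError(f"lexical error at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_statement(src):
    """statement : ID EQ STRING"""
    tokens = tokenize(src)
    kinds = [kind for kind, _ in tokens]
    if kinds != ["ID", "EQ", "STRING"]:
        raise ValueError("expected: ID = 'STRING'")
    return tokens[0][1], strip_quotes(tokens[2][1])
```

With these patterns, var1 = '123abc' is accepted, while 1var = '123abc' already fails in the lexer because no token can start with a digit outside quotes.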
I am new to language processing and I want to create a parser with Irony for the following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an xml element and value is the value of the element which can also include spaces.
I have tried to modify included samples like this:
public TestGrammar()
{
var name = CreateTerm("name");
var value = new IdentifierTerminal("value");
var queries = new NonTerminal("queries");
var query = new NonTerminal("query");
queries.Rule = MakePlusRule(queries, null, query);
query.Rule = name + ":" + value;
Root = queries;
}
private IdentifierTerminal CreateTerm(string name)
{
IdentifierTerminal term = new IdentifierTerminal(name, "!@#$%^*_'.?-", "!@#$%^*_'.?0123456789");
term.CharCategories.AddRange(new[]
{
UnicodeCategory.UppercaseLetter, //Ul
UnicodeCategory.LowercaseLetter, //Ll
UnicodeCategory.TitlecaseLetter, //Lt
UnicodeCategory.ModifierLetter, //Lm
UnicodeCategory.OtherLetter, //Lo
UnicodeCategory.LetterNumber, //Nl
UnicodeCategory.DecimalDigitNumber, //Nd
UnicodeCategory.ConnectorPunctuation, //Pc
UnicodeCategory.SpacingCombiningMark, //Mc
UnicodeCategory.NonSpacingMark, //Mn
UnicodeCategory.Format //Cf
});
//StartCharCategories are the same
term.StartCharCategories.AddRange(term.CharCategories);
return term;
}
but this doesn't work if the values include spaces. Can this be done (using Irony) without modifying the syntax (like adding quotes around values)?
Many thanks!
If newlines were included between key-value pairs, it would be easily achievable. I have no knowledge of "Irony", but my initial feeling is that almost no parser/lexer generator is going to deal with this given only a naive grammar description. This requires essentially unbounded lookahead.
Conceptually (because I know nothing about this product), here's how I would do it:
Tokenise based on spaces and colons (i.e. every contiguous sequence of characters that isn't a space or a colon is an "identifier" token of some sort).
You then need to make it such that every "sentence" is described from colon-to-colon:
sentence = identifier_list
| : identifier_list identifier : sentence
That's not enough to make it work, but you get the idea at least, I hope. You would need to be very careful to distinguish an identifier_list from a single identifier such that they could be parsed unambiguously. Similarly, if your tool allows you to define precedence and associativity, you might be able to get away with making ":" bind very tightly to the left, such that your grammar is simply:
sentence = identifier : identifier_list
And the behaviour of that needs to be (identifier :) identifier_list.
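The "(identifier :) identifier_list" behaviour can be sketched in plain Python. This is not Irony code, and `parse_pairs` is a hypothetical helper; it assumes that values never contain a colon and that runs of whitespace inside a value may be collapsed to a single space. One token of lookahead is enough here because an identifier is a name exactly when a colon follows it.

```python
import re


def parse_pairs(text):
    """Split 'name1:some value name2:other' into (name, value) pairs."""
    # Tokenise on spaces and colons; the colon itself is kept as a token.
    tokens = re.findall(r"[^\s:]+|:", text)
    pairs = []
    name = None
    value_words = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok != ":" and nxt == ":":
            # An identifier directly followed by a colon starts a new pair.
            if name is not None:
                pairs.append((name, " ".join(value_words)))
            name, value_words = tok, []
            i += 2
        elif tok == ":":
            raise ValueError("colon without a name in front of it")
        else:
            if name is None:
                raise ValueError("value found before any name")
            value_words.append(tok)
            i += 1
    if name is not None:
        pairs.append((name, " ".join(value_words)))
    return pairs
```

For example, parse_pairs("name1:value one name2:value2") groups "value one" under name1 because "name2" is the only word in that stretch that is followed by a colon.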