I accidentally hit the spacebar and wrote this:
lTTEvent .CustUpdateStatus := usUnchanged;
and was surprised to see that the compiler accepted the space in front of the dot (actually, any number of spaces).
Is the dot such a special character that the parser can interpret it correctly? How would that work in Pascal?
The parser first translates text to tokens. So the text:
lTTEvent .CustUpdateStatus := usUnchanged;
Is translated to the tokens:
identifier
period
identifier
becomes
identifier
semicolon
The space is a whitespace and it can have three functions:
separator between tokens (for example between an identifier and a keyword).
a literal space (in that case it is included in a string).
cosmetic.
Spaces serving the first and last functions are lost in the translation to tokens.
An identifier and a period have no characters in common, so they cannot be confused. A space between them is therefore not required, but it is still allowed.
short answer
'lTTEvent' and '.' are tokens. Tokens can (sometimes) be separated by whitespace.
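To make the tokenizing step concrete, here is a minimal sketch in Python rather than Pascal. The token names and the tiny token set are assumptions for illustration only; a real Pascal lexer recognizes many more token kinds. It shows that the statement produces the same token sequence with or without the space before the dot.

```python
import re

# Hypothetical, deliberately tiny token set (not a full Pascal lexer).
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("becomes",    r":="),
    ("period",     r"\."),
    ("semicolon",  r";"),
    ("whitespace", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return the token kinds found in text, dropping whitespace between tokens."""
    kinds = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "whitespace":
            kinds.append(match.lastgroup)
        pos = match.end()
    return kinds

with_space = tokenize("lTTEvent .CustUpdateStatus := usUnchanged;")
without_space = tokenize("lTTEvent.CustUpdateStatus := usUnchanged;")
print(with_space)
print(with_space == without_space)  # True
```

The whitespace token is matched and then discarded, which is why any number of spaces before the dot makes no difference to what the parser sees.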
I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered in between.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"§$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the British keyboard (German umlauts or French accents and all the other special characters other languages have must work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything in between '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .*? non-greedy wildcard is followed in the rule by the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contexts, called "lexical modes" in Antlr. There's a publicly viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.
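The idea behind lexical modes can be sketched outside of Antlr. The following Python function is only an illustration of the two-mode approach, not Antlr grammar syntax; the function name and the TEXT/TAG token names are made up for this example. The lexer starts in a "text" mode, switches to a "tag" mode at '{' and switches back at '}', so the catch-all text rule is never active inside the braces.

```python
def lex_template(text):
    """Split text into ("TEXT", ...) and ("TAG", ...) tokens using two modes."""
    tokens = []
    mode = "text"   # "text" outside braces, "tag" inside them
    buffer = []
    for ch in text:
        if mode == "text" and ch == "{":
            # Leaving text mode: emit any pending plain text first.
            if buffer:
                tokens.append(("TEXT", "".join(buffer)))
            buffer = []
            mode = "tag"
        elif mode == "tag" and ch == "}":
            # Leaving tag mode: emit the tag body without its braces.
            tokens.append(("TAG", "".join(buffer)))
            buffer = []
            mode = "text"
        else:
            buffer.append(ch)
    if mode == "tag":
        raise ValueError("unterminated tag")
    if buffer:
        tokens.append(("TEXT", "".join(buffer)))
    return tokens

print(lex_template("{player:name} says hi!"))
# [('TAG', 'player:name'), ('TEXT', ' says hi!')]
```

Because plain text is simply "everything seen in text mode", no character whitelist is needed and umlauts or accented characters pass through unchanged. In this sketch the tag body is kept as one string; a real grammar would tokenize it further inside the tag mode.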
Trying to parse operators (+, -, =, <<, !=), using states like
%{
%}
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]
DOUBOP [":="|".."|"<<"|">>"|"<>"|"<="|">="|"=>"|"**"|"!="|"{:"|"}:"|"\-"]
and later on
{DOUBOP} { printf("%s (operator)\n", yytext); }
{OP} { printf("%s (operator)\n", yytext); }
but Lex is identifying operators like "<<" as "<" and "<". I thought since it was in double quotes this would work, but I see that's not the case.
Is there any way I can give a regular expression precedence, i.e. have lex check for a double operator first, and then a single operator?
Thanks in advance.
[...] is a character class, not an eccentric type of parenthesis. If you want to parenthesize a sub-expression in a pattern, use ordinary parentheses. In this case, however, parentheses are not necessary. (Indeed, most of the quotes aren't necessary either, but they don't hurt and some of them would be useful.)
"==" recognises the two character-sequence consisting of two equal signs. "=="|"++" recognizes either two equal signs or two plus signs.
By contrast, ["=="] recognises a single character, which could be either a quote or an equals sign. Since a character class is a set, the fact that each of those appears twice is irrelevant (although I think it would save a lot of grief if flex issued a warning). Similarly, ["=="|"<<"] recognises a single character if it is a quote, an equals sign, a vertical bar or a less than sign.
Flex pattern syntax is documented in the flex manual. It differs in a few ways from regexes in other systems, so it's worth reading the short document. However, character classes are mostly the same in all regex syntaxes in common use, especially the use of square brackets to delimit the set.
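The difference between a character class and an alternation can be tried out quickly in Python, whose re module treats square brackets the same way. This is only an illustration with a reduced, made-up operator list, and one difference from flex is worth noting: Python's re takes the first alternative that matches, so the longer operators are listed first here, whereas flex always prefers the longest match.

```python
import re

# A character class matches exactly ONE character from the set; the quotes
# and the vertical bar inside the brackets are just more members of the set.
char_class = re.compile(r'["<<"|">>"]')
print(char_class.fullmatch("<") is not None)    # True: one character from the set
print(char_class.fullmatch("<<") is not None)   # False: two characters
print(char_class.fullmatch("|") is not None)    # True: the bar is a set member too

# An alternation of whole operators, longer ones first.
operator = re.compile(r"<<|>>|<=|>=|!=|<|>|=|!")
found = operator.findall("a << b != c < d")
print(found)  # ['<<', '!=', '<']
```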
An easier way is to put all the single characters together in one character class and apply the * operator to the end of it, after the closing bracket.
i.e.
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]*
There is one thing which I don't understand about reference modification in Cobol.
The example goes like this:
MOVE VARIABLE(VARIABLE2 +4:2) TO VARIABLE3
Now I do not quite understand what the "+4:2" refers to. Does it mean that the first two characters, 4 characters after the target, are moved? Meaning if for example VARIABLE (the 1st) is filled with "123456789" and VARIABLE2 contains the 2nd and 3rd position within that variable (so "23"), the target is "23 +4" meaning "789". Then the first two positions in the target (indicated by the ":2") are moved to VARIABLE3. So in the end VARIABLE3 would contain "78".
Am I understanding this right or am I making a false assumption about that instruction?
(VARIABLE2 +4:2) is a syntax error, because the starting position must be an arithmetic expression. There must be a space after the + for this reference modification to be valid. And, VARIABLE2 must be numeric and the expression shall evaluate to an integer.
Once corrected, 4 is added to the content of VARIABLE2. The result is the left-most (or starting) position within VARIABLE for the move. 2 characters are moved to VARIABLE3. If VARIABLE3 is longer than two characters, the remaining positions are filled with spaces.
From the 2002 COBOL standard:
8.7.1 Arithmetic operators
There are five binary arithmetic operators and two unary arithmetic operators that may be used in arithmetic expressions. They are represented by specific COBOL characters that shall be preceded by a space and followed by a space except that no space is required between a left parenthesis and a unary operator or between a unary operator and a left parenthesis.
Emphasis added.
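As a cross-check of the behaviour described above, the reference modification can be emulated in Python. This is only a sketch: the helper name reference_modify is made up, VARIABLE2 is assumed to hold the number 2, and VARIABLE3 is assumed to be five characters long so that the space padding is visible.

```python
def reference_modify(data, start, length):
    """Emulate COBOL data(start:length); start is 1-based, as in COBOL."""
    if start < 1 or length < 1 or start + length - 1 > len(data):
        raise ValueError("reference modification out of range")
    return data[start - 1:start - 1 + length]

variable = "123456789"
variable2 = 2                     # assumed numeric content of VARIABLE2

# MOVE VARIABLE(VARIABLE2 + 4:2) TO VARIABLE3
moved = reference_modify(variable, variable2 + 4, 2)
print(moved)                      # 67

# An alphanumeric receiving field longer than the source is padded with spaces.
variable3 = moved.ljust(5)        # assumed five-character VARIABLE3
print(repr(variable3))            # '67   '
```

With VARIABLE2 equal to 2 the starting position is 6, so the two characters moved are "67", not "78".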
I am writing a small program which needs to preprocess some data files that are inputs to another program. Because of this I can't change the format of the input files and I have run into a problem.
I am working in a language that doesn't have libraries for this sort of thing and I wouldn't mind the exercise so I am planning on implementing the lexer and parser by hand. I would like to implement a Lexer based roughly on this which is a fairly simple design.
The input file I need to interpret has a section which contains chemical reactions. The different chemical species on each side of the reaction are separated by '+' signs, but the names of the species can also have + characters in them (symbolizing electric charge). For example:
N2+O2=>NO+NO
N2++O2-=>NO+NO
N2+ + O2- => NO + NO
are all valid and the tokens output by the lexer should be
'N2' '+' 'O2' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
'N2+' '+' 'O2-' '=>' 'NO' '+' 'NO'
(note that the last two are identical). I would like to avoid lookahead in the lexer for simplicity. The problem is that the lexer would start reading any of the above inputs, but when it got to the 3rd character (the first '+'), it wouldn't have any way to know whether it was part of the species name or a separator between reactants.
To fix this I thought I would just split it off so the second and third examples above would output:
'N2' '+' '+' 'O2-' '=>' 'NO' '+' 'NO'
The parser then would simply use the context, realize that two '+' tokens in a row means the first is part of the previous species name, and would correctly handle all three of the above cases. The problem with this is that now imagine I try to lex/parse
N2 + + O2- => NO + NO
(note the space between 'N2' and the first '+'). This is invalid syntax, however the lexer I just described would output exactly the same token outputs as the second and third examples and my parser wouldn't be able to catch the error.
So possible solutions as I see it:
implement a lexer with at least one character of lookahead
include tokens for whitespace
include leading white space in the '+' token
create a "combined" token that includes both the species name and any trailing '+' without white space between, then letting the parser sort out whether the '+' is actually part of the name or not.
Since I am very new to this kind of programming I am hoping someone can comment on my proposed solutions (or suggest another). My main reservation about the first solution is I simply do not know how much more complicated implementing a lexer with look ahead is.
You don't mention your implementation language, but with an input syntax as relatively simple as the one you outline, I don't think having logic along the lines of the following pseudo-code would be unreasonable.
string GetToken()
{
    string token = GetAlphaNumeric(); // assumed to ignore (eat) white-space
    var ch = GetChar(); // assumed to ignore (eat) white-space
    if (ch == '+')
    {
        var ch2 = GetChar();
        if (ch2 == '+')
            token += '+';
        else
            PutChar(ch2);
    }
    PutChar(ch);
    return token;
}
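For comparison, here is a runnable Python sketch of the one-character lookahead option. It is not a translation of the pseudo-code above, and the rule it uses is an assumption about the input format: a '+' or '-' directly after a species name (no whitespace) is a charge sign, except that a '+' immediately followed by a letter or digit is treated as a separator.

```python
def tokenize_reaction(line):
    """Tokenize one reaction line into species names, '+' separators and '=>'."""
    tokens = []
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum():
            start = i
            while i < n and line[i].isalnum():
                i += 1
            # Charge signs directly after the name. Peek one character ahead:
            # a '+' followed by a letter or digit separates two species.
            while i < n and line[i] in "+-":
                following = line[i + 1] if i + 1 < n else ""
                if line[i] == "+" and following.isalnum():
                    break
                i += 1
            tokens.append(line[start:i])
        elif line.startswith("=>", i):
            tokens.append("=>")
            i += 2
        elif ch == "+":
            tokens.append("+")
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens

for example in ["N2+O2=>NO+NO", "N2++O2-=>NO+NO", "N2+ + O2- => NO + NO"]:
    print(tokenize_reaction(example))
```

With this rule the invalid input "N2 + + O2- => NO + NO" yields two consecutive '+' tokens after 'N2', so the parser can reject it instead of confusing it with the valid forms.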
I have seen the following on StackOverflow about URL characters:
There are two sets of characters you need to watch out for - Reserved and Unsafe.
The reserved characters are:
ampersand ("&")
dollar ("$")
plus sign ("+")
comma (",")
forward slash ("/")
colon (":")
semi-colon (";")
equals ("=")
question mark ("?")
'At' symbol ("@").
The characters generally considered unsafe are:
space,
question mark ("?")
less than and greater than ("<>")
open and close brackets ("[]")
open and close braces ("{}")
pipe ("|")
backslash ("\")
caret ("^")
tilde ("~")
percent ("%")
pound ("#").
I'm trying to code a URL so I can parse it using delimiters. They can't be numbers or letters though. Does anyone have a list of characters that are NOT Reserved but ARE safe to use?
Thanks for any help you can provide.
Don't bother trying to use safe/unreserved characters. Just use whatever delimiters you want and URLencode the whole thing. Then URL decode it on the other end and parse normally.
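A minimal sketch of that round trip, using Python's standard urllib.parse module. The field values and the choice of ";" as delimiter are made up for illustration, and the sketch assumes the delimiter does not occur inside a field.

```python
from urllib.parse import quote, unquote

fields = ["a|b", "c&d", "e f"]    # hypothetical values
delimiter = ";"                   # hypothetical delimiter choice

# Sender: join with the delimiter, then percent-encode the whole payload.
payload = delimiter.join(fields)
encoded = quote(payload, safe="")
print(encoded)

# Receiver: decode first, then split on the delimiter as usual.
decoded_fields = unquote(encoded).split(delimiter)
print(decoded_fields == fields)   # True
```

Passing safe="" makes quote encode every reserved character, including "/", so the encoded payload contains only letters, digits, "%" and the unreserved characters "-", ".", "_" and "~".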
Is there a reason you can't just use the standard delimiter for URL parameters (&)? That is the most straightforward way to do it instead of trying to roll your own.
For example, the standard URL syntax already allows for multi-valued parameters natively. This is perfectly legal and doesn't require any trickery.
Somepage.aspx?parameterName=A&parameterName=B
The result is that the page would be passed "A,B" in the parameterName attribute.
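The "A,B" form is how the answer describes the receiving page seeing the value. For illustration, the same repeated-parameter query string can be parsed with Python's standard urllib.parse, which returns the values as a list instead; joining them with a comma reproduces the form described above.

```python
from urllib.parse import urlsplit, parse_qs

url = "Somepage.aspx?parameterName=A&parameterName=B"

# Extract the query string and collect every value of the repeated parameter.
query = urlsplit(url).query
values = parse_qs(query)["parameterName"]
print(values)             # ['A', 'B']
print(",".join(values))   # A,B
```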