Capture names containing --but not ending-- in dashes - parsing

I am trying to capture names (not starting with a number) which could contain dashes, such as hello-world. My problem is that I also have rules for single dashes and symbols which conflict with it:
[A-Za-z][A-Za-z0-9-]+ { /* capture "hello-world" */ }
"-" { return '-'; }
">" { return '>'; }
When the lexer reads hello-world-> the previous rules yield hello-world- and >, whereas I expected hello-world, - and > to be captured individually. To solve it I fixed it this way:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
That works, except for single-letter words, so finally I implemented this:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+ { /* ensure final dash is never included at the end */ }
[A-Za-z][A-Za-z0-9]* { /* capture possible single letter words */ }
Question: Is there a more elegant way to do it?

[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z][A-Za-z0-9]*
Note that, as you said, the first rule already covers everything that's not a single letter. So the second rule only has to match single letters and can be shortened to just [A-Za-z]:
[A-Za-z][A-Za-z0-9-]*[A-Za-z0-9]+
[A-Za-z]
Now the second rule is a mere prefix of the first, so we can combine this into a single rule by making the part after the first letter optional:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9]+)?
The + on the last part is unnecessary, because everything except the last character can just as well be matched by the middle part. So the simplest version is:
[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?
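One way to check the final pattern is to replay the three rules outside flex. The following is a minimal Python sketch, not a flex scanner: tokenize is a hypothetical helper, and it only knows the name, - and > rules from the question. Python's backtracking matcher gives the same answer as flex's longest match here, because at any position only one of the three rules can match.

```python
import re

# The single combined rule from the answer above.
NAME = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def tokenize(text):
    """Split text into names, '-' and '>' tokens (hypothetical helper)."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = NAME.match(text, pos)
        if match:
            # A name: starts with a letter, never ends with a dash.
            tokens.append(match.group())
            pos = match.end()
        elif text[pos] in "->":
            # The single-character symbol rules.
            tokens.append(text[pos])
            pos += 1
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


print(tokenize("hello-world->"))  # ['hello-world', '-', '>']
print(tokenize("x-"))             # ['x', '-']
```

The second call shows that single-letter names are covered by the same rule, so no separate fallback pattern is needed.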

Related

Flex confusing to transform string character by character

I want to use flex to transform a string based on simple rules. I have rules like the first character stays the same and the second and third characters might change. Like if the second character was a letter, it becomes the number listed in the rules below. If the third is a digit, it becomes a certain letter.
%%
/*^[a-z] {char *yycopy = strdup( yytext ); unput(yycopy[0]);}*/
[ajs] {putchar('1');}
[bkt] {putchar('2');}
[clu] {putchar('3');}
[dmv] {putchar('4');}
[1] {putchar('j');}
[2] {putchar('k');}
[3] {putchar('l');}
[4] {putchar('m');} /*more number rules till 9*/
%%
int yywrap(void){return 1;}
int main( int argc, char **argv )
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
while (yylex());
}
If there are different rules for characters in different positions within the string, how can I use start conditions to change a particular character (i.e. the rules for the second and third character are different)?
You switch start condition by using the BEGIN action. Flex never automatically changes start condition, so when you need to return to the initial start condition (called INITIAL), you have to do so explicitly (BEGIN(INITIAL)).
You need to declare start condition names in the (f)lex prologue, usually with the %x command. (%s is also possible but with different semantics. See the Flex manual for details.)
You indicate that a start condition applies to a rule by starting the rule with a start condition name in angle brackets. You can put more than one start condition inside the angle brackets; separate them with commas and don't use spaces. Don't put a space after the angle brackets either; they are part of the pattern and (f)lex patterns cannot include unquoted space characters.
BEGIN is a macro and it does not require parentheses around the start condition name, but I suggest always using them anyway, so you don't have to worry about what the macro expands to. Start condition names are small integers (either enum constants or preprocessor macros) but nothing guarantees their value, so don't make assumptions.
That's about it. So you could implement your astro numerological codifier with:
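%x SECOND THIRD REST
%%
[a-z] ECHO; BEGIN(SECOND);
[^a-z\n] ECHO; BEGIN(REST);
<SECOND>[ajs] putchar('1'); BEGIN(THIRD);
 /* More SECOND rules */
<SECOND>. ECHO; BEGIN(THIRD);
<THIRD>1 putchar('j'); BEGIN(REST);
 /* More THIRD rules */
<THIRD>. ECHO; BEGIN(REST);
<REST>.+ ECHO;
<*>\n ECHO; BEGIN(INITIAL);
(Each start condition has a one-character fallback that echoes the character unchanged; it has to come after the specific rules, because flex picks the first rule on a tie. A line-swallowing catch-all such as <*>.*\n? would not work here: flex prefers the longest match, so it would beat every single-character rule, including the first one. I deliberately left out the anchor in the first rule because the rules guarantee that the INITIAL start condition is only in force at the beginning of a line; only the newline rule switches back to it. A last line without a newline is handled too, which occasionally happens although it's technically invalid.)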
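The position-dependent idea can also be sketched outside flex as a small state machine. This is a Python illustration under stated assumptions, not generated scanner code: transform_line is a hypothetical helper, the two tables only contain the rules listed in the question, and characters with no rule are assumed to be echoed unchanged.

```python
# Per-position replacement tables, taken from the rules in the question.
SECOND = {
    **dict.fromkeys("ajs", "1"),
    **dict.fromkeys("bkt", "2"),
    **dict.fromkeys("clu", "3"),
    **dict.fromkeys("dmv", "4"),
}
THIRD = {"1": "j", "2": "k", "3": "l", "4": "m"}


def transform_line(line):
    """Apply position-dependent rules to one line (without its newline)."""
    if not line or not ("a" <= line[0] <= "z"):
        return line  # assumption: lines not starting with a-z are left alone
    out = [line[0]]  # INITIAL: the first letter is echoed as-is
    state = "SECOND"
    for ch in line[1:]:
        if state == "SECOND":
            out.append(SECOND.get(ch, ch))
            state = "THIRD"  # plays the role of BEGIN(THIRD)
        elif state == "THIRD":
            out.append(THIRD.get(ch, ch))
            state = "REST"  # plays the role of BEGIN(REST)
        else:
            out.append(ch)  # REST: echo everything else unchanged
    return "".join(out)


print(transform_line("xa1yz"))  # x1jyz
```

The state variable does the job of the start condition: it records how far into the line the scanner is, and each state has its own rule table.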

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble with flex to scan lines that looks something like this
DESCRIPTION    This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted if they are followed by something that is not white space. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the lines chopped up into individual words, but I cannot get the rules to differentiate between one space and more than one space. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE      This is the device
MODE        This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
/* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+ { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
/* Handle other possible line beginnings */
<WHITE>\n { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+ { BEGIN(WORDS); }
<WHITE>. { /* Something not correct in this line */; ... }
<WORDS>.+ { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+(" "[[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
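To see how that keyword pattern behaves on the four sample lines, here is a rough Python sketch. It is only an approximation of the flex rules: split_line is a hypothetical helper, [A-Za-z] stands in for [[:alpha:]] (an ASCII-only assumption), and the start conditions are folded into one regular expression.

```python
import re

# One or more alphabetic words separated by exactly one space,
# then optional padding, then the rest of the line.
LINE = re.compile(r"([A-Za-z]+(?: [A-Za-z]+)*)[ \t]*(.*)")


def split_line(line):
    """Return (keyword, description) for one input line (hypothetical helper)."""
    match = LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"line does not start with a word: {line!r}")
    return match.group(1), match.group(2)


print(split_line("DEVICE      This is the device"))
# ('DEVICE', 'This is the device')
print(split_line("UNDOCUMENTED FIELD"))
# ('UNDOCUMENTED FIELD', '')
```

A line where the keyword and the text are separated by a single space ends up entirely in the keyword, which is exactly the corner case to decide on before choosing a pattern.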
My opinion is that you are using the wrong horse here. lex (flex) should only be used for lexical analysis and yacc (or bison) for syntactic analysis. Deciding that a single space is not a separator but several spaces are is not a job for a lexer.
My opinion is that lex should only report words and padding, and that yacc should later recombine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
Actions are omitted here because they depend too much on your actual processing.
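The division of labour described above (the lexer reports words and padding, a later stage glues words back together) can be sketched in Python. This is an illustration only: fields is a hypothetical helper, the SPACE alternative is an added assumption so that single blanks are consumed silently, and any other character is simply skipped.

```python
import re

# PADDING must be tried before SPACE so that runs of blanks win over one blank.
TOKEN = re.compile(
    r"(?P<WORD>[A-Za-z0-9_]+)|(?P<PADDING>[ \t]{2,})|(?P<SPACE>[ \t])"
)


def fields(line):
    """Group WORD tokens into identifiers, splitting at PADDING tokens."""
    result, current = [], []
    for match in TOKEN.finditer(line):
        kind = match.lastgroup
        if kind == "WORD":
            current.append(match.group())
        elif kind == "PADDING" and current:
            # Padding closes the identifier collected so far.
            result.append(" ".join(current))
            current = []
        # A single blank (SPACE) separates words but produces no token.
    if current:
        result.append(" ".join(current))
    return result


print(fields("DESCRIPTION    This is the device description"))
# ['DESCRIPTION', 'This is the device description']
```

The loop plays the part of the ident rule in the grammar: consecutive words accumulate until a padding token arrives.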

Finding error in JavaCC parser/lexer code

I am writing a JavaCC parser/lexer which is meant to recognise all input strings in the following language L:
A string from L consists of several blocks separated by space characters.
At least one block must be present (i.e., no input consisting only of some number of white spaces is allowed).
A block is an odd-length sequence of lowercase letters (a-z).
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
The I/O specifications include the following specification:
If the input does represent a string from L, then the word YES must be printed out to System.out, ending with the EOL character.
If the input is not in L, then only a single line with the word NO needs to be printed out to System.out, also ending with the EOL character.
In addition, a brief error message should be printed out on System.err explaining the reason why the input is not in L.
Issue:
This is my current code:
PARSER_BEGIN(Assignment)
/** A parser which determines if user's input belongs to the language L. */
public class Assignment {
public static void main(String[] args) {
try {
Assignment parser = new Assignment(System.in);
parser.Input();
if(parser.Input()) {
System.out.println("YES"); // If the user's input belongs to L, print YES.
} else if(!(parser.Input())) {
System.out.println("NO");
System.out.println("Empty input");
}
} catch (ParseException e) {
System.out.println("NO"); // If the user's input does not belong to L, print NO.
}
}
}
PARSER_END(Assignment)
//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
< ID: (["a"-"z"]) >
}
//* A token which matches a single white space. */
TOKEN :
{
<WHITESPACE: " ">
}
/** This production is the basis for the construction of strings which belong to language L. */
boolean Input() :
{}
{
<ID>(<ID><ID>)* ((<WHITESPACE>(<WHITESPACE><WHITESPACE>)*)<ID>(<ID><ID>)*)* ("\n"|"\r") <EOF>
{
System.out.println("ABOUT TO RETURN TRUE");
return true;
}
|
{
System.out.println("ABOUT TO RETURN FALSE");
return false;
}
}
The issue that I am having is as follows:
I am trying to write code which will ensure that:
If the user's input is empty, then the text NO Empty input will be printed out.
If there is a parsing error because the input does not follow the description of L above, then only the text NO will be printed out.
At the moment, when I input the string "jjj jjj jjj", which, by definition, is in L (and I follow this with a carriage return and an EOF [CTRL + D]), the text NO Empty input is printed out.
I did not expect this to happen.
In an attempt to resolve the issue I wrote the ...TRUE and ...FALSE print statements in my production (see code above).
Interestingly enough, I found that when I inputted the same string of js, the terminal printed out the ...TRUE statement once, immediately followed by two occurrences of the ...FALSE statement.
Then the text NO Empty input was printed out, as before.
I have also used Google to try to find out if I am incorrectly using the OR symbol | in my production Input(), or if I am not using the return keyword properly, either. However, this has not helped.
Could I please have hint(s) for resolving this issue?
You're calling the Input method three times. The first time it will read from stdin until it reaches the end of the stream. This will successfully parse the input and return true. The other two times, the stream will be empty, so it will fail and return false.
You shouldn't call a rule multiple times unless you actually want it to be applied multiple times (which only makes sense if the rule only consumes part of the input rather than going until the end of the stream). Instead when you need the result in multiple places, just call the method once and store the result in a variable.
Or in your case you could just call it once in the if and no variable would even be needed:
Assignment parser = new Assignment(System.in);
if(parser.Input()) {
System.out.println("YES"); // If the user's input belongs to L, print YES.
} else {
System.out.println("NO");
System.out.println("Empty input");
}
When the input is jjj jjj jjj followed by a newline or carriage return (but not both), your main method invokes parser.Input() three times.
The first time, your parser consumes all the input and returns true.
The second and third times, all the input having already been consumed, the parser returns false.
Once the input is consumed, the lexer will just keep returning <EOF> tokens.
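The same effect is easy to reproduce outside JavaCC. The sketch below is Python, not the generated parser: parse_input is a hypothetical stand-in for Input(), the regular expression is my own encoding of the description of L, and as a simplification the single line terminator before EOF is treated as optional.

```python
import io
import re

# A block is an odd-length run of a-z; blocks are separated by an odd
# number of spaces.
BLOCK = r"[a-z](?:[a-z]{2})*"
IN_L = re.compile(rf"{BLOCK}(?: (?:  )*{BLOCK})*")


def parse_input(stream):
    """Consume the whole stream and report whether it is in L."""
    text = stream.read()
    if text.endswith(("\n", "\r")):
        text = text[:-1]  # drop the one line terminator before EOF
    return IN_L.fullmatch(text) is not None


stream = io.StringIO("jjj jjj jjj\n")
first = parse_input(stream)   # consumes everything
second = parse_input(stream)  # nothing left to read
print(first, second)          # True False

# Call once and keep the result if it is needed in several places.
accepted = parse_input(io.StringIO("jjj jjj jjj\n"))
print("YES" if accepted else "NO")  # YES
```

The second call fails only because the stream is already exhausted, which mirrors the repeated Input() calls in the question.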

Flex regular expression for comments

I'm trying to learn flex and having trouble with a regular expression to catch comments.
Assuming a comment begins with // and runs to the end of the line, I would like the program to recognize the entire comment and set yytext equal to it.
So far ["//".*$] is not cutting the mustard.
Thank you
Putting your text in square brackets creates a character class matching any one character from among those between the brackets. Quotation marks are not special inside a character class either, although outside one flex treats a double-quoted string as literal text. You want something along these lines:
/* definitions (for more readable rules) */
/* The \057 are octal escapes for the '/' character, for clarity: */
CMNT_START \057\057
%%
 /* rules */
{CMNT_START}.*$ /* yytext automatically contains the matched text*/;
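The difference between the bracketed attempt and a real comment pattern can be checked quickly in Python, whose regex syntax treats square brackets the same way. This is only an analogy to the flex behaviour, and the names CHAR_CLASS and LINE_COMMENT are made up for the example.

```python
import re

# What the bracketed attempt means: a class of the single characters " / . * $
CHAR_CLASS = re.compile(r'["//".*$]')
# A line comment: two slashes, then everything up to (not including) the newline.
LINE_COMMENT = re.compile(r"//.*")

print(CHAR_CLASS.fullmatch("/") is not None)        # True: one character from the set
print(CHAR_CLASS.fullmatch("// note") is not None)  # False: more than one character
print(LINE_COMMENT.search("x = 1; // set x\ny = 2").group())  # // set x
```

As in flex, the dot does not match a newline by default, so the comment pattern stops at the end of the line without an explicit anchor.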

Start states in Lex / Flex

I'm using Flex and Bison for a parser generator, but having problems with the start states in my scanner.
I'm using exclusive rules to deal with commenting, but this grammar doesn't seem to match quoted tokens:
%x COMMENT
// { BEGIN(COMMENT); }
<COMMENT>[^\n] ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;
In this simple example the line:
// a == b
isn't matched entirely as a comment, unless I include this rule:
<COMMENT>"==" ;
How do I get round this without having to add all these tokens into my exclusive rules?
Matching C-style comments in Lex/Flex or whatever is well documented, both in the documentation and in various variations around the Internet.
Here is a variation on that found in the Flex documentation:
<INITIAL>{
"//" BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
\n BEGIN(INITIAL);
[^\n]+ // eat comment
"/" // eat the lone /
}
Try adding a "+" after the [^\n] pattern. I don't know why '==' is still being picked up even in an exclusive state, but apparently it is. Flex will normally match the rule that matches the most text, and adding the "+" will at least make the two rules tie in length. Putting the COMMENT rule first will cause it to be used in case of a tie.
The clue is:
The problem is this 'eat comment'
rule doesn't seem to match tokens with
more than one character
so add a * to match a whole run of non-newline characters at once. A + would work as well, because an empty comment is still ended by the \n rule.
%x COMMENT
%%
"//" { BEGIN(COMMENT); }
<COMMENT>[^\n]* ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;
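What the exclusive start condition is meant to achieve can be sketched as a two-state loop. The Python below is a hand-written illustration, not flex output: scan is a hypothetical helper, and it knows only the // comment, the == token and the catch-all rule.

```python
def scan(text):
    """Return the tokens found outside // comments (hypothetical sketch)."""
    tokens = []
    state = "INITIAL"
    pos = 0
    while pos < len(text):
        if state == "COMMENT":
            # Only comment rules are active here: skip to the end of the line.
            end = text.find("\n", pos)
            if end == -1:
                break  # the comment runs to the end of the input
            pos = end + 1
            state = "INITIAL"  # plays the role of BEGIN(INITIAL)
        elif text.startswith("//", pos):
            state = "COMMENT"  # plays the role of BEGIN(COMMENT)
            pos += 2
        elif text.startswith("==", pos):
            tokens.append("EQUALEQUAL")
            pos += 2
        else:
            pos += 1  # catch-all rule: ignore the character
    return tokens


print(scan("// a == b\nc == d\n"))  # ['EQUALEQUAL']
```

Only the == on the second line produces a token, because the rules for ordinary tokens are never consulted while the scanner is in the COMMENT state.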

Resources