Modal parser using yacc - parsing

I am writing my first parser in yacc. I would like to parse a file that has 3 "modes":
Statement mode
Table heading mode
Table row mode
I would like my parser to start out in statement mode, then when it sees a line consisting of minus signs, switch to table heading mode. When it sees another line of minus signs, switch to table row mode, and finally when it sees a third set of minus signs switch to statement mode:
statement...
statement...
statement...
----
table heading
----
table row
table row
table row
----
statement
statement
statement
One thing that occures to me, is that I could have 3 separate grammars which I would switch between in my line feed loop. However, I don't know how to create multiple grammars in one .y file.
The other thing that seems like a possibility is using "Lexical Tie-ins" (unfortunately, you'll have to search for that string in the document). However, the author of the yacc tutorial doesn't really tell me anything about these "lexical tie-ins" other than that "This kind of ``backdoor'' approach can be elaborated to a noxious degree. Nevertheless, it represents a way of doing some things that are difficult, if not impossible, to do otherwise." Which is hardly encouraging.

I have solved the problem by creating pseudo-symbols which I insert using the lexer:
line
: TABLE_HEADING sentences ',' table_heading_columns ',' sentences
{
fmt.Println("TABLE_HEADING")
}
| TABLE_BODY table_body_columns
{
fmt.Println("TABLE_BODY")
}
| STATEMENT sentences
{
fmt.Println("STATEMENT")
}
;

Related

How to make a variable non delimited file to be a delimited one

Hello guys I want to convert my non delimited file into a delimited file
Example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on all the data are in this format.
First three lines are metadata and then the data.
How can I make it delimited in proper format using logic.
There are two parts in the problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first one, because (not knowing anything about the metadata format), there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words and space is also used as a separator between the headings (and apparently one cannot use the rule "more than one space means the end of a heading name" because there's only one space between "Address line 2" and "Country" - and they're clearly separate columns. Clearly, finding the correct column widths requires understanding English and this is not something that you can write a program for.
For the second problem, things are much easier - once you have the column positions. If you figure the column positions manually (or programmatically, if you know something about the metadata that I don't - and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
nt=split(cols,tabs,",")
delim=","
ORS=""
}
{ o=1 ;
for (i in tabs) { t=tabs[i] ; f=substr($0,o,t-o); sub(" *$","",f) ; print f
delim ; o=t } ;
print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g.'thing') followed by other language elements (e.g. ID) in braces. Or, there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: tokenizer and lexical. In the first one the text input is divided into tokens, in the second one the tokens are matched against lexical rules. Broadly something like (with some arbitrary language):
1st phase:
text: class X { this ; }
----- --- --- ---- --- ---
tokens: ID ID LB ID SC RB
2nd phase:
Is there a rule that starts with a 'class' string?
YES: Is the next expected token an ID?
YES: Is the next expected token a LB?
...
NO: Is there another rule that starts with 'class'?
...
NO: Is there a rule that starts with an ID token?
...
The lexer implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your termial BLOCK rule is used during the first phase, hence you get
thing {abc} {def}
----- ----- -----
ID BLOCK BLOCK
That is why the error message says if found '{abc}' and not a '{'. The lexer matched the thing and was expecting the next token to be a '{' but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to use different delimiters. Otherwise you may have to look into enabling backtracking.

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble with flex to scan lines that looks something like this
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted is they are followed by something that is not a while space. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the lines to chop up each word but I cannot get the rules to differentiate between 1 space and 1 < spaces. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
/* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+ { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
/* Handle other possible line beginnings */
<WHITE>\n { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+ { BEGIN(WORDS); }
<WHITE>. { /* Something not correct in this line */; ... }
<WORDS>.+ { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
My opinion is that you are using the wrong horse here. lex (flex) should be only used for lexical analysis and yacc (or bison) for syntactic one. Saying that one single character is not a separator but multiple are is not appropriate for a lexer.
My opinion is that lex should only reports words and padding and that yacc should later re-combine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
action are omitted here because they depend too much on your actual processing.

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
On approach I've used for this type of data before is to just split on the semi-colons regardless then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
In pseudo-code, given:
input: A string, first character is input[0]; last
character is input[length]. Further, assume one dummy
character, input[length+1]. It can be anything except
; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
you have a blank field in the beginning; do whatever with it
set start = 2
endif
for each c between 1 and length:
next iteration unless string[c] = ';'
if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
found field consting of half-open range [start,c); do whatever
with it. Note that in the case of empty fields, start≥c, leaving
an empty range
set start = c+1
endif
end foreach
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
List<string> ret = new List<string>();
Regex r = new Regex(#"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
ret.Add(line.Substring(0,m.Index + 1);
line = line.Substring(m.Index + 2);
}
(Sorry about the C#, I don't known VBScript)
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob

Character column parsing in Boost::Spirit

I'm working on a Boost Spirit 2.0 based parser for a small subset of Fortran 77. The issue I'm having is that Fortran 77 is column oriented, and I have been unable to find anything in Spirit that can allow its parsers to be column-aware. Is there any way to do this?
I don't really have to support the full arcane Fortran syntax, but it does need to be able to ignore lines that have a character in the first column (Fortran comments), and recognize lines with a character in the sixth column as continuation lines.
It seems like folks dealing with batch files would at least have the same first-column problem as me. Spirit appears to have an end-of-line parser, but not a start-of-line parser (and certianly not a column(x) parser).
Well, since I now have an answer to this, I guess I should share it.
Fortran 77, like probably all other languages that care about columns, is a line-oriented language. That means your parser has to keep track of the EOL and actually use it in its parsing.
Another important fact is that in my case, I didn't care about parsing the line numbers that Fortran can put in those early control columns. All I need is to know when it is telling me to scan rest of the line differently.
Given those two things, I could entirely handle this issue with a Spirit skip parser. I wrote mine to
skip the entire line if the first (comment) column contains an alphabetic charater.
skip the entire line if there is nothing on it.
ignore the preceeding EOL and everything up to the fifth column if the fifth column contains a '.' (continuation line). This tacks it to the preceeding line.
skip all non-eol whitespace (even spaces don't matter in Fortran. Yes, it's a wierd language.)
Here's the code:
skip =
// Full line comment
(spirit::eol >> spirit::ascii::alpha >> *(spirit::ascii::char_ - spirit::eol))
[boost::bind (&fortran::parse_info::skipping_line, &pi)]
|
// remaining line comment
(spirit::ascii::char_ ('!') >> *(spirit::ascii::char_ - spirit::eol)
[boost::bind (&fortran::parse_info::skipping_line_comment, &pi)])
|
// Continuation
(spirit::eol >> spirit::ascii::blank >>
spirit::qi::repeat(4)[spirit::ascii::char_ - spirit::eol] >> ".")
[boost::bind (&fortran::parse_info::skipping_continue, &pi)]
|
// empty line
(spirit::eol >>
-(spirit::ascii::blank >> spirit::qi::repeat(0, 4)[spirit::ascii::char_ - spirit::eol] >>
*(spirit::ascii::blank) ) >>
&(spirit::eol | spirit::eoi))
[boost::bind (&fortran::parse_info::skipping_empty, &pi)]
|
// whitespace (this needs to be the last alternative).
(spirit::ascii::space - spirit::eol)
[boost::bind (&fortran::parse_info::skipping_space, &pi)]
;
I would advise against blindly using this yourself for line-oriented Fortran, as I ignore line numbers, and different compilers have different rules for valid comment and continuation characters.

Resources