fslex - How to switch between two token sets? - f#

I'm trying to write a small DSL parser using fslex and fsyacc. The input is composed of interleaving chunks of two different languages which require different lexing rules. How do I write my fslex file to support that?
(I guess a similar case would be how to define an fslex file for the c language but with support for inline assembly, which requires different lexing rules?)
What I have currently is something like this:
rule tokenize = parse
| "core" { core lexbuf }
...
and core = parse
| ...
The thing is, once a token is returned by the core rule, the next part of the input gets passed to tokenize instead. However, I want to stay (as it were) in the core state. How do I do that?
Thanks!

I actually managed to find a solution on my own. I defined my own tokenizer function which decides based on the BufferLocalStore state which tokenizer to call.
let mytokenizer (lexbuf : LexBuffer<char>) =
    if lexbuf.BufferLocalStore.["state"].Equals("core") then FCLexer.core lexbuf
    else FCLexer.tokenize lexbuf

let aString (x : string) =
    let lexbuf = LexBuffer<_>.FromString x
    lexbuf.BufferLocalStore.["state"] <- "fc"
    let y = try (FCParser.PROG mytokenizer) lexbuf
    ...
And I modified my fslex input file slightly:
rule tokenize = parse
| "core" { lexbuf.BufferLocalStore.["state"] <- "core"; core lexbuf }
...
Amazing how simply asking the question can lead you to the solution, and I hope this helps someone besides me :)
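For readers coming from other toolchains, the state-switching idea used in the answer can be sketched language-neutrally. Below is a minimal Python illustration (all names are hypothetical, none of this is fslex API): a dispatcher consults a state flag stored next to the buffer, picks the matching sub-lexer, and each rule may flip the state so the lexer stays in "core" mode until something switches it back.

```python
def tokenize_outer(text, pos):
    # outer-language rule: seeing "core" switches the stored state
    if text.startswith("core", pos):
        return ("CORE_KW", pos + 4, "core")   # (token, new pos, new state)
    return ("OUTER_CHAR", pos + 1, "outer")

def tokenize_core(text, pos):
    # core-language rule: seeing "end" switches back to the outer state
    if text.startswith("end", pos):
        return ("END_KW", pos + 3, "outer")
    return ("CORE_CHAR", pos + 1, "core")

def tokens(text):
    # the dispatcher: like the custom tokenizer function in the answer,
    # it decides per call which rule set to apply based on stored state
    state, pos, out = "outer", 0, []
    while pos < len(text):
        rule = tokenize_core if state == "core" else tokenize_outer
        tok, pos, state = rule(text, pos)
        out.append(tok)
    return out
```

The key point mirrors the BufferLocalStore trick: the state lives outside the individual rules, so returning a token does not reset which rule set runs next.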

Related

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments, and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the documentation doesn't get me far either. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
Using a context-free grammar in Rascal takes three steps:
Write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful: it will not accept spaces and newlines before and after the top nonterminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly on an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / (deep match) to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func, but here are some common idioms as well for extracting information from a parse tree:
// produces the source location of each node in the tree:
myParseTree#\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree). It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t#\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex, only even more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
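The island-grammar idea can be illustrated outside Rascal as well. Here is a hand-rolled sketch in Python (purely illustrative, not Rascal API) that treats comments as the "islands" it recognizes precisely, while skipping everything else as "water":

```python
def extract_comments(src):
    # Islands: // line comments and /* */ block comments.
    # Water: any other character, skipped one at a time.
    comments, i, n = [], 0, len(src)
    while i < n:
        if src.startswith("//", i):            # island: line comment
            j = src.find("\n", i)
            j = n if j == -1 else j
            comments.append(src[i:j])
            i = j
        elif src.startswith("/*", i):          # island: block comment
            j = src.find("*/", i + 2)
            j = n if j == -1 else j + 2        # unterminated: run to end
            comments.append(src[i:j])
            i = j
        else:                                  # water: skip
            i += 1
    return comments
```

The longest-match concern from the answer shows up here too: the scanner commits to the whole comment before returning to "water" mode, rather than stopping at the first possible match.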

Append text file to lexicon in Rascal

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster, but we haven't released it yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is used to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like so to make it unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
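The "generic rule plus post-parse filter" recipe is not Rascal-specific. A minimal Python sketch of the same idea (hypothetical names; the real answer above uses Rascal's filter statement): accept anything shaped like a word, then reject derivations whose text is not in the lexicon, which surfaces as a parse error.

```python
import re

LEXICON = {"aap", "noot", "mies"}

def parse_word(s):
    # generic rule: any lowercase run is a candidate Word
    if re.fullmatch(r"[a-z]+", s) is None:
        raise ValueError("parse error")
    # post-parse filter: remove derivations outside the lexicon;
    # with no surviving derivation, the whole parse fails
    if s not in LEXICON:
        raise ValueError("parse error")
    return ("word", s)
```

As in the Rascal version, "hello" fails not at the lexing stage but because the filter leaves no valid derivation, while "aap" parses fine.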
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newTyp = type(sort("Word"), grammar);
This newTyp is a reified grammar + type for the definition of the lexicon, which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newTyp) {
    parse(staticGrammar, "aap");
}
Now, having written all this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.

OCaml: How to test scanner and parser?

We are writing a compiler in OCaml for our own domain-specific language. So far, we have a working scanner, parser, and AST.
What is the best way to test scanner/parser at this point? I know it is possible to pass a sequence of tokens to the parser/scanner and see if it gets accepted/rejected by the scanner/parser. (such as, echo "FLOAT ID" | menhir --interpret --interpret-show-cst parser.mly).
But, is there a way to pass the actual program written in our own language to the scanner/parser and see whether it gets accepted?
I have to add that I am very new to OCaml and I know very little about compilers.
If what you want to do is give a string to your parser and see if it works, you could do this (supposing your starting point in the parser is prog):
main.ml:
let () =
  (* Taking the string given as a parameter of the program *)
  let lb = Lexing.from_string Sys.argv.(1) in
  (* If you want to parse a file instead, you should write:
       let ci = open_in filename in
       let lb = Lexing.from_channel ci in *)
  try
    let _ = Parser.prog Lexer.token lb in
    Printf.printf "OK\n"
  with _ -> Printf.printf "Not OK\n"
Did I help ? ;-)

stemming of a word without Regex

Is it possible to stem words without using Regex in F#?
I want to know how I can write an F# function which takes a string as input and stems it.
eg.
input = "going"
output = "go"
I can't find a way to write the code without using the regex .*ing\b and a replace function, which would be almost like doing it in C#, without any advantage.
Semi-pseudocode of what I am trying to write is:
let stemming word =
match word
|(word-"ing")+ing -> (word-"ing")
A quick bit of googling reveals just how complex stemming is:
http://en.wikipedia.org/wiki/Stemming
The standard seems to be the "Porter Algorithm"; several people have ported it to .NET. I count two C# versions and a VB.NET version on "The Porter Stemming Algorithm" homepage:
http://tartarus.org/martin/PorterStemmer/
I would use one of these libraries from F# to do the stemming.
Here is a function applying the simplest stemming rule:
let (|Suffix|_|) (suffix: string) (s: string) =
if s.EndsWith(suffix) then
Some(s.Substring(0, s.Length - suffix.Length))
else
None
let stem = function
| Suffix "ing" s -> s
| _ -> failwith "Not ending with ing"
Parameterized active patterns make pattern matching more readable and more convenient in this case. If the stemming rules get complicated, you could update the active patterns while keeping the stem function unchanged.
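If you do end up with a table of rules rather than a single suffix, the same shape works in any language. A minimal table-driven sketch (illustrative only, nowhere near the full Porter algorithm; shown in Python for brevity):

```python
# Each rule strips a suffix and optionally appends a replacement,
# but only fires when enough of a stem (>= 2 chars) would remain.
RULES = [("ing", ""), ("ed", ""), ("ies", "y"), ("s", "")]

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word   # no rule applies: leave the word unchanged
```

This mirrors the F# design above: the rule table can grow (like adding more active patterns) while the driving stem function stays the same.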

Recursive list-constructing parser in Opa

I'd like to write a parser for hashtags. I have been reading the blog
entries on parsing on the opa blog, but they didn't cover recursive
parsers and constructions of lists a lot.
Hashtags are used by some social networks (Twitter, Diaspora*)
to tag a post. They consist of a hash sign (#) and an alphanumeric
string such as "interesting" or "funny". One example of a post using
hashtags:
Oh #Opa, you're so #lovely! (Are you a relative of #Haskell?)
Parsing that would result in ["Opa", "lovely", "Haskell"].
I have tried to do it, but it doesn't quite do what I want. (It would either parse only one hashtag and nothing else, fail in an endless loop, or fail because there was input it didn't understand...)
Additionally, here is a Haskell version that implements it.
To begin with a remark: by posing the question in Haskell terms you're effectively looking for somebody who knows both Opa and Haskell, hence decreasing the chances of finding a person to answer the question ;). Ok, I'm saying it half jokingly, as your comments help a lot, but still I'd rather see the question phrased in plain English.
I think a solution keeping the structure of the Haskell one would be something like this:
parse_tags =
hashtag = parser "#" tag=Rule.alphanum_string -> tag
notag = parser (!"#" .)* -> void
Rule.parse_list_sep(true, hashtag, notag)
Probably the main 'trick' is to use the Rule.parse_list_sep function to parse a list. I suggest you take a look at the implementation of some functions in the Rule module to get inspiration and learn more about parsing in Opa.
Of course I suggest testing this function, for instance with the following code:
_ =
test(s) =
res =
match Parser.try_parse(parse_tags, s) with
| {none} -> "FAILURE"
| {some=tags} -> "{tags}"
println("Parsing '{s}' -> {res}")
do test("#123 #test #this-is-not-a-single-tag, #lastone")
do test("#how#about#this?")
void
which will give the following output:
Parsing '#123 #test #this-is-not-a-single-tag, #lastone' -> [123, test, this, lastone]
Parsing '#how#about#this?' -> FAILURE
I suspect that you will need to fine tune this solution to really conform to what you want but it should give you a good head start (I hope).
The following works, using just plain parsers:
hashtag = parser "#" tag=Rule.alphanum_string -> tag
list_tag = parser
| l=((!"#" .)* hashtag -> hashtag)* .* -> l
parsetag(s) = Parser.parse(list_tag, s)
do println("{parsetag("")}")
do println("{parsetag("aaabbb")}")
do println("{parsetag(" #tag1 #tag2")}")
do println("{parsetag("#tag1 #tag2 ")}")
do println("{parsetag("#tag1#tag2 ")}")
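For comparison, if you only need the tags and not a full parse tree, the extraction can be approximated with a single pattern. A Python sketch (note it is more permissive than the first Opa solution above: it also accepts adjacent tags like #how#about#this):

```python
import re

def hashtags(text):
    # a tag is '#' followed by an alphanumeric run; everything else
    # between tags is skipped implicitly, like the "water" rule
    return re.findall(r"#([A-Za-z0-9]+)", text)
```

This matches the behavior of the plain-parser Opa version, where any non-tag characters between hashtags are consumed and discarded.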
