This is a variation of "Parsing single-quoted string with escaped quotes with Nom 5" and "Parse string with escaped single quotes". I want to parse strings like '1 \' 2 \ 3 \\ 4' (a raw sequence of characters) as "1 \\' 2 \\ 3 \\\\ 4" (a Rust string), so I'm not concerned with any escaping other than the possibility of a \' inside the string. Here are my attempts using code from the linked questions:
use nom::{
    branch::alt,
    bytes::complete::{escaped, tag},
    character::complete::none_of,
    combinator::recognize,
    multi::{many0, separated_list0},
    sequence::delimited,
    IResult,
};

fn parse_quoted_1(input: &str) -> IResult<&str, &str> {
    delimited(
        tag("'"),
        alt((escaped(none_of("\\\'"), '\\', tag("'")), tag(""))),
        tag("'"),
    )(input)
}

fn parse_quoted_2(input: &str) -> IResult<&str, &str> {
    delimited(
        tag("'"),
        recognize(separated_list0(tag("\\'"), many0(none_of("'")))),
        tag("'"),
    )(input)
}

fn main() {
    println!("{:?}", parse_quoted_1(r#"'1'"#));
    println!("{:?}", parse_quoted_2(r#"'1'"#));
    println!("{:?}", parse_quoted_1(r#"'1 \' 2'"#));
    println!("{:?}", parse_quoted_2(r#"'1 \' 2'"#));
    println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3'"#));
    println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3'"#));
    println!("{:?}", parse_quoted_1(r#"'1 \' 2 \ 3 \\ 4'"#));
    println!("{:?}", parse_quoted_2(r#"'1 \' 2 \ 3 \\ 4'"#));
}
/*
Ok(("", "1"))
Ok(("", "1"))
Ok(("", "1 \\' 2"))
Ok((" 2'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3'", code: Tag }))
Ok((" 2 \\ 3'", "1 \\"))
Err(Error(Error { input: "1 \\' 2 \\ 3 \\\\ 4'", code: Tag }))
Ok((" 2 \\ 3 \\\\ 4'", "1 \\"))
*/
Only the first three cases work as intended. parse_quoted_1 fails on lone backslashes because escaped insists that every '\\' control character is followed by a valid escapable character (here only a quote). parse_quoted_2 fails because many0(none_of("'")) already consumes backslashes, so the tag("\\'") separator never gets a chance to match.
A working, if inelegant, imperative solution:
use nom::{bytes::complete::take, character::complete::char, sequence::delimited, IResult};

fn parse_quoted(input: &str) -> IResult<&str, &str> {
    fn escaped(input: &str) -> IResult<&str, &str> {
        let mut pc = 0 as char;
        let mut n = 0;
        for (i, c) in input.chars().enumerate() {
            if c == '\'' && pc != '\\' {
                break;
            }
            pc = c;
            n = i + 1;
        }
        take(n)(input)
    }
    delimited(char('\''), escaped, char('\''))(input)
}

fn main() {
    println!("{:?}", parse_quoted(r#"'' ..."#));
    println!("{:?}", parse_quoted(r#"'1' ..."#));
    println!("{:?}", parse_quoted(r#"'1 \' 2' ..."#));
    println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3' ..."#));
    println!("{:?}", parse_quoted(r#"'1 \' 2 \ 3 \\ 4' ..."#));
}
/*
Ok((" ...", ""))
Ok((" ...", "1"))
Ok((" ...", "1 \\' 2"))
Ok((" ...", "1 \\' 2 \\ 3"))
Ok((" ...", "1 \\' 2 \\ 3 \\\\ 4"))
*/
To also allow a trailing escaped backslash, as in '...\\', we can similarly track more of the preceding characters:
let mut pc = 0 as char;
let mut ppc = 0 as char;
let mut pppc = 0 as char;
let mut n = 0;
for (i, c) in input.chars().enumerate() {
    if (c == '\'' && pc != '\\') || (c == '\'' && pc == '\\' && ppc == '\\' && pppc != '\\') {
        break;
    }
    pppc = ppc;
    ppc = pc;
    pc = c;
    n = i + 1;
}
Here is my way to parse a quoted string.
It returns a Cow: a reference into the original string when nothing needs unescaping, or a copy of the string with the escaping backslashes removed.
You might need to adjust is_qdtext and is_quoted_char to your needs.
use std::borrow::Cow;

use nom::{
    bytes::complete::tag,
    character::complete::satisfy,
    sequence::{pair, tuple},
    IResult, InputTakeAtPosition,
};

// a valid character that does not require escaping
// (the ranges exclude the single quote and the backslash, which must be escaped)
fn is_qdtext(chr: char) -> bool {
    match chr {
        '\t' => true,
        ' ' => true,
        '!' => true,
        '#'..='&' => true,
        '('..='[' => true,
        ']'..='~' => true,
        _ => chr as u32 >= 0x80,
    }
}

// check whether a character may appear escaped
fn is_quoted_char(chr: char) -> bool {
    match chr {
        ' '..='~' => true,
        '\t' => true,
        _ => chr as u32 >= 0x80,
    }
}

/// parse a single escaped character
fn parse_quoted_pair(data: &str) -> IResult<&str, char> {
    let (data, (_, chr)) = pair(tag("\\"), satisfy(is_quoted_char))(data)?;
    Ok((data, chr))
}

// parse the content of a quoted string
fn parse_quoted_content(data: &str) -> IResult<&str, Cow<'_, str>> {
    let (mut data, content) = data.split_at_position_complete(|item| !is_qdtext(item))?;
    if data.chars().next() == Some('\\') {
        // we need to unescape some characters
        let mut content = content.to_string();
        while data.chars().next() == Some('\\') {
            // unescape the next char
            let (next_data, chr) = parse_quoted_pair(data)?;
            content.push(chr);
            data = next_data;
            // parse the next plain-text chunk
            let (next_data, extra_content) =
                data.split_at_position_complete(|item| !is_qdtext(item))?;
            content.push_str(extra_content);
            data = next_data;
        }
        Ok((data, Cow::Owned(content)))
    } else {
        // fast path: there are no characters to unescape
        Ok((data, Cow::Borrowed(content)))
    }
}

fn parse_quoted_string(data: &str) -> IResult<&str, Cow<'_, str>> {
    let (data, (_, content, _)) = tuple((tag("'"), parse_quoted_content, tag("'")))(data)?;
    Ok((data, content))
}
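The core trick in this answer, returning Cow::Borrowed when no unescaping is needed and allocating only otherwise, is independent of nom. A stdlib-only sketch of the same idea (the helper name is mine, not from the answer):

```rust
use std::borrow::Cow;

// Borrow when no escapes are present; allocate only when unescaping is needed.
fn unescape_quotes(s: &str) -> Cow<'_, str> {
    if !s.contains('\\') {
        // fast path: the input can be returned as-is, no allocation
        return Cow::Borrowed(s);
    }
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // drop the backslash, keep the character it escapes
            if let Some(next) = chars.next() {
                out.push(next);
            }
        } else {
            out.push(c);
        }
    }
    Cow::Owned(out)
}
```

Callers that only read the result never pay for a copy on the common, escape-free path.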
I have two pretty similar sets of patterns in my Lexer.x, the first for numbers, the second for bytes. Here they are:
$digit=0-9
$byte=[a-f0-9]
$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+e$digit+ { \s -> TNum (readRational s) }
$digit+e$digit+ { \s -> TNum (readRational s) }
$byte$byte { \s -> TByte (encodeUtf8(pack s)) }
And I have this Parser.y:
%token
cnst { TNum $$}
byte { TByte $$}
'[' { TOSB }
']' { TCSB }
%%
Expr:
'[' byte ']' {$1}
| const {$1}
When I run it, I get:
[ 11 ] parse error
11 ok
but when I put the byte pattern before the number patterns in the lexer:
$digit=0-9
$byte=[a-f0-9]
$byte$byte { \s -> TByte (encodeUtf8(pack s)) }
$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+e$digit+ { \s -> TNum (readRational s) }
$digit+e$digit+ { \s -> TNum (readRational s) }
I get:
[ 11 ] ok
11 parse error
I think that happens because the lexer tokenizes the string first and then hands the tokens to the parser.
So when the parser expects a byte token but receives a number token (or vice versa), it has no opportunity to reinterpret the value as another token.
What should I do in this situation?
In that case you should postpone parsing. You can for example make a TNumByte data constructor that stores the value as String:
data Token
  = TByte ByteString
  | TNum Rational
  | TNumByte String
  -- …
For a sequence of two $digits it is not yet clear whether we have to interpret it as a byte or a number, so we construct a TNumByte for it:
$digit=0-9
$byte=[a-f0-9]
$digit$digit { TNumByte }
$byte$byte { \s -> TByte (encodeUtf8(pack s)) }
$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+ { \s -> TNum (readRational s) }
$digit+.$digit+e$digit+ { \s -> TNum (readRational s) }
$digit+e$digit+ { \s -> TNum (readRational s) }
then in the parser we can decide based on the context:
%token
cnst { TNum $$ }
byte { TByte $$ }
numbyte { TNumByte $$ } -- 🖘 can be int or byte
'[' { TOSB }
']' { TCSB }
%%
Expr
: '[' byte ']' { $2 }
| '[' numbyte ']' { encodeUtf8(pack $2) } -- 🖘 interpret as byte
| const { $1 }
| numbyte { readRational $1 } -- 🖘 interpret as int
;
I want to match exactly one alphabetic character (a-zA-Z) with nom.
I know I can match greedily using take_while! with something like this:
// match one or more alphabetical characters
pub fn alpha_many(input: &[u8]) -> IResult<&[u8], &[u8]> {
take_while!(input, |c| {
(c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a)
})
}
But I can't find how to match only one byte. There is one_of!, but I can't use a closure, I have to pass a whole slice:
// match exactly one alphabetical character
pub fn alpha_one(input: &[u8]) -> IResult<&[u8], u8> {
one_of!(
input,
[
0x41, 0x42, 0x43,
// etc until 0x5a and then from 0x61 to 0x7a
// ...
].as_ref()
)
}
I've come up with this. I'll mark this as the accepted answer tomorrow if nobody comes up with a better solution:
#[macro_use]
extern crate nom;

use nom::{ErrorKind, IResult, Needed};

/// Alphabetical characters ([RFC5234 appendix B.1])
///
/// [RFC5234 appendix B.1]: https://tools.ietf.org/html/rfc5234#appendix-B.1
///
/// ```no_rust
/// ALPHA = %x41-5A / %x61-7A ; A-Z / a-z
/// ```
pub struct Alpha;

impl Alpha {
    /// Return true if the given byte represents an alphabetical character
    pub fn is_alpha(c: u8) -> bool {
        (c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a)
    }

    /// Parse one or more alphabetical characters
    pub fn parse_many(input: &[u8]) -> IResult<&[u8], &[u8]> {
        take_while!(input, Self::is_alpha)
    }

    /// Parse one alphabetical character
    pub fn parse_one(input: &[u8]) -> IResult<&[u8], u8> {
        Self::parse_n(input, 1).map(|res| res[0])
    }

    /// Parse n alphabetical characters
    pub fn parse_n(input: &[u8], n: usize) -> IResult<&[u8], &[u8]> {
        Self::parse_m_n(input, n, n)
    }

    /// Parse between m and n alphabetical characters
    pub fn parse_m_n(input: &[u8], m: usize, n: usize) -> IResult<&[u8], &[u8]> {
        if input.len() < m {
            return IResult::Incomplete(Needed::Size(m - input.len()));
        }
        for i in 0..n {
            if !Self::is_alpha(input[i]) {
                // We were supposed to have at least m alphabetical bytes
                if i < m {
                    return IResult::Error(error_position!(ErrorKind::ManyMN, &input[..]));
                } else {
                    return IResult::Done(&input[i..], &input[0..i]);
                }
            }
        }
        IResult::Done(&input[n..], &input[0..n])
    }
}
I want to handle some ambiguities in dypgen. I found something in the manual that looks relevant, and I want to know how I can use it.
In the manual point 5.2 "Pattern matching on Symbols" there is an example:
expr:
| expr OP<"+"> expr { $1 + $2 }
| expr OP<"*"> expr { $1 * $2 }
OP is matched with "+" or "*", as I understand it. I also found this in the manual:
The patterns can be any Caml patterns (but without the keyword when).
For instance this is possible:
expr: expr<(Function([arg1;arg2],f_body)) as f> expr
{ some action }
So I tried to put some other expressions in there, but I don't understand what happens. If I put printf in there, it outputs the value of the matched string. But if I put (fun x -> printf x) in there, which seems to me the same as printf, dypgen complains about a syntax error and points to the end of the expression. If I put Printf.printf in there, it complains: Syntax error: operator expected. And if I put (fun x -> Printf.printf x) in there, it says: Lexing failed with message: lexing: empty token.
What do these different error messages mean?
In the end I would like to look the matched value up in a hash table and act on whether it is in there, but I don't know if that is possible this way. Is it?
EDIT: a minimal example, derived from the forest example in the dypgen demos.
The grammar file forest_parser.dyp contains:
{
open Parse_tree
let dyp_merge = Dyp.keep_all
}
%start main
%layout [' ' '\t']
%%
main : np "." "\n" { $1 }
np:
| sg {Noun($1)}
| pl {Noun($1)}
sg: word <Word("sheep"|"fish")> {Sg($1)}
sg: word <Word("cat"|"dog")> {Sg($1)}
pl: word <Word("sheep"|"fish")> {Pl($1)}
pl: word <Word("cats"|"dogs")> {Pl($1)}
/* OR try:
sg: word <printf> {Sg($1)}
pl: word <printf> {Pl($1)}
*/
word:
| (['A'-'Z' 'a'-'z']+) {Word($1)}
forest.ml now has the following print_forest function:
let print_forest forest =
  let rec aux1 t = match t with
    | Word x -> print_string x
    | Noun (x) -> (
        print_string "N [";
        aux1 x;
        print_string " ]")
    | Sg (x) -> (
        print_string "Sg [";
        aux1 x;
        print_string " ]")
    | Pl (x) -> (
        print_string "Pl [";
        aux1 x;
        print_string " ]")
  in
  let aux2 t = aux1 t; print_newline () in
  List.iter aux2 forest;
  print_newline ()
And the parser_tree.mli contains:
type tree =
| Word of string
| Noun of tree
| Sg of tree
| Pl of tree
And with that you can determine the number (singular or plural) of fish, sheep, cat(s), etc.: sheep and fish can be both singular and plural; cats and dogs cannot.
fish.
N [Sg [fish ] ]
N [Pl [fish ] ]
I knew nothing about dypgen, so I tried to figure it out; here is what I found.
In the .dyp file you can define both the lexer and the parser, or you can use an external lexer. Here's what I did.
My AST looks like this:
parse_prog.mli
type f =
| Print of string
| Function of string list * string * string
type program = f list
prog_parser.dyp
{
open Parse_prog
(* let dyp_merge = Dyp.keep_all *)
let string_buf = Buffer.create 10
}
%start main
%relation pf<pr
%lexer
let newline = '\n'
let space = [' ' '\t' '\r']
let uident = ['A'-'Z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
let lident = ['a'-'z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
rule string = parse
| '"' { () }
| _ { Buffer.add_string string_buf (Dyp.lexeme lexbuf);
string lexbuf }
main lexer =
newline | space + -> { () }
"fun" -> ANONYMFUNCTION { () }
lident -> FUNCTION { Dyp.lexeme lexbuf }
uident -> MODULE { Dyp.lexeme lexbuf }
'"' -> STRING { Buffer.clear string_buf;
string lexbuf;
Buffer.contents string_buf }
%parser
main : function_calls eof
{ $1 }
function_calls:
|
{ [] }
| function_call ";" function_calls
{ $1 :: $3 }
function_call:
| printf STRING
{ Print $2 } pr
| "(" ANONYMFUNCTION lident "->" printf lident ")" STRING
{ Print $6 } pf
| nested_modules "." FUNCTION STRING
{ Function ($1, $3, $4) } pf
| FUNCTION STRING
{ Function ([], $1, $2) } pf
| "(" ANONYMFUNCTION lident "->" FUNCTION lident ")" STRING
{ Function ([], $5, $8) } pf
printf:
| FUNCTION<"printf">
{ () }
| MODULE<"Printf"> "." FUNCTION<"printf">
{ () }
nested_modules:
| MODULE
{ [$1] }
| MODULE "." nested_modules
{ $1 :: $3 }
This file is the most important one. As you can see, with a call like printf "Test" my grammar is ambiguous: it can be reduced to either Print "Test" or Function ([], "printf", "Test"). But, as I realized, I can give priorities to my rules, so the one with the higher priority is chosen for the first parse. (Uncomment let dyp_merge = Dyp.keep_all and you'll see all the possible combinations.)
And in my main:
main.ml
open Parse_prog

let print_stlist fmt sl =
  match sl with
  | [] -> ()
  | _ -> List.iter (Format.fprintf fmt "%s.") sl

let print_program tl =
  let aux1 t = match t with
    | Function (ml, f, p) ->
        Format.printf "I can't do anything with %a%s(\"%s\")@." print_stlist ml f p
    | Print s -> Format.printf "You want to print : %s@." s
  in
  let aux2 t = List.iter (fun (tl, _) ->
    List.iter aux1 tl; Format.eprintf "------------@.") tl in
  List.iter aux2 tl

let input_file = Sys.argv.(1)
let lexbuf = Dyp.from_channel (Forest_parser.pp ()) (Pervasives.open_in input_file)
let result = Parser_prog.main lexbuf
let () = print_program result
And, for example, given the following file:
test
printf "first print";
Printf.printf "nested print";
Format.eprintf "nothing possible";
(fun x -> printf x) "Anonymous print";
If I execute ./myexec test, I get the following output:
You want to print : first print
You want to print : nested print
I can't do anything with Format.eprintf("nothing possible")
You want to print : x
------------
So, TL;DR, the manual example was just there to show that you can play with your defined tokens (I never defined a PRINT token, just FUNCTION) and match on them to get new rules.
I hope it's clear; I learned a lot from your question ;-)
[EDIT] So, I changed the parser to match what you wanted:
{
open Parse_prog
(* let dyp_merge = Dyp.keep_all *)
let string_buf = Buffer.create 10
}
%start main
%relation pf<pp
%lexer
let newline = '\n'
let space = [' ' '\t' '\r']
let uident = ['A'-'Z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
let lident = ['a'-'z']['a'-'z' 'A'-'Z' '0'-'9' '_']*
rule string = parse
| '"' { () }
| _ { Buffer.add_string string_buf (Dyp.lexeme lexbuf);
string lexbuf }
main lexer =
newline | space + -> { () }
"fun" -> ANONYMFUNCTION { () }
lident -> FUNCTION { Dyp.lexeme lexbuf }
uident -> MODULE { Dyp.lexeme lexbuf }
'"' -> STRING { Buffer.clear string_buf;
string lexbuf;
Buffer.contents string_buf }
%parser
main : function_calls eof
{ $1 }
function_calls:
|
{ [] } pf
| function_call <Function((["Printf"] | []), "printf", st)> ";" function_calls
{ (Print st) :: $3 } pp
| function_call ";" function_calls
{ $1 :: $3 } pf
function_call:
| nested_modules "." FUNCTION STRING
{ Function ($1, $3, $4) }
| FUNCTION STRING
{ Function ([], $1, $2) }
| "(" ANONYMFUNCTION lident "->" FUNCTION lident ")" STRING
{ Function ([], $5, $8) }
nested_modules:
| MODULE
{ [$1] }
| MODULE "." nested_modules
{ $1 :: $3 }
Here, as you can see, I don't detect that the function is a print when I parse it, but when I put it into my function list: I match on the algebraic value that my parser built. I hope this example works for you ;-) (but be warned, it is extremely ambiguous! :-D)
I need to parse statements of the form
var1!=var2
var1==var2
and so on. I have the following construct:
lazy val Line: Parser[Any] =
  (Expr ~ "!=" ~ Expr) ^^ { e => SMT("(not (= " + e._1._1 + " " + e._2 + "))") } |
  (Expr ~ "==" ~ Expr) ^^ { e => SMT("(" + e._1._2 + " " + e._1._1 + " " + e._2 + ")") }
The second alternative, for "==", works just fine and returns (== var1 var2), but the first one does not parse at all. Whatever variant I try instead of "!=" (with or without surrounding spaces) is not recognized.
Of course I could replace "!=" in the input before handing it to the parser, but is there a more elegant way?
Have a look at this minimal example (Scala 2.9.2):
import scala.util.parsing.combinator.syntactical._
import scala.util.parsing.combinator._
sealed trait ASTNode
case class Eq(v1: String, v2: String) extends ASTNode
case class Not(n: ASTNode) extends ASTNode
object MyParser extends StandardTokenParsers {
  lexical.delimiters += ("==", "!=")

  lazy val line = (
      (ident ~ ("==" ~> ident)) ^^ { case e1 ~ e2 => Eq(e1, e2) }
    | (ident ~ ("!=" ~> ident)) ^^ { case e1 ~ e2 => Not(Eq(e1, e2)) }
  )

  def main(code: String) = {
    val tokens = new lexical.Scanner(code)
    line(tokens) match {
      case Success(tree, _) => println(tree)
      case e: NoSuccess     => Console.err.println(e)
    }
  }
}
MyParser.main("x == y")
MyParser.main("x != y")
I have a hand-written CSS parser in C# that is getting unmanageable, and I'm trying to redo it in FParsec to make it more maintainable. Here's a snippet that parses a CSS selector element with regexes:
var tagRegex = @"(?<Tag>(?:[a-zA-Z][_\-0-9a-zA-Z]*|\*))";
var idRegex = @"(?:#(?<Id>[a-zA-Z][_\-0-9a-zA-Z]*))";
var classesRegex = @"(?<Classes>(?:\.[a-zA-Z][_\-0-9a-zA-Z]*)+)";
var pseudoClassRegex = @"(?::(?<PseudoClass>link|visited|hover|active|before|after|first-line|first-letter))";
var selectorRegex = new Regex("(?:(?:" + tagRegex + "?" + idRegex + ")|" +
                              "(?:" + tagRegex + "?" + classesRegex + ")|" +
                              tagRegex + ")" +
                              pseudoClassRegex + "?");
var m = selectorRegex.Match(str);
if (m.Length != str.Length) {
cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
return null;
}
string tagName = m.Groups["Tag"].Value;
string pseudoClassString = m.Groups["PseudoClass"].Value;
CssPseudoClass pseudoClass;
if (pseudoClassString.IsEmpty()) {
pseudoClass = CssPseudoClass.None;
} else {
switch (pseudoClassString.ToLower()) {
case "link":
pseudoClass = CssPseudoClass.Link;
break;
case "visited":
pseudoClass = CssPseudoClass.Visited;
break;
case "hover":
pseudoClass = CssPseudoClass.Hover;
break;
case "active":
pseudoClass = CssPseudoClass.Active;
break;
case "before":
pseudoClass = CssPseudoClass.Before;
break;
case "after":
pseudoClass = CssPseudoClass.After;
break;
case "first-line":
pseudoClass = CssPseudoClass.FirstLine;
break;
case "first-letter":
pseudoClass = CssPseudoClass.FirstLetter;
break;
default:
cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
return null;
}
}
string cssClassesString = m.Groups["Classes"].Value;
string[] cssClasses = cssClassesString.IsEmpty() ? EmptyArray<string>.Instance : cssClassesString.Substring(1).Split('.');
allCssClasses.AddRange(cssClasses);
return new CssSelectorElement(
tagName.ToLower(),
cssClasses,
m.Groups["Id"].Value,
pseudoClass);
My first attempt yielded this:
type CssPseudoClass =
    | None = 0
    | Link = 1
    | Visited = 2
    | Hover = 3
    | Active = 4
    | Before = 5
    | After = 6
    | FirstLine = 7
    | FirstLetter = 8

type CssSelectorElement =
    { Tag : string
      Id : string
      Classes : string list
      PseudoClass : CssPseudoClass }
    with
        static member Default =
            { Tag = "";
              Id = "";
              Classes = [];
              PseudoClass = CssPseudoClass.None; }

open FParsec

let ws = spaces
let str = skipString
let strWithResult str result = skipString str >>. preturn result

let identifier =
    let isIdentifierFirstChar c = isLetter c || c = '-'
    let isIdentifierChar c = isLetter c || isDigit c || c = '_' || c = '-'
    optional (str "-") >>. many1Satisfy2L isIdentifierFirstChar isIdentifierChar "identifier"

let stringFromOptional strOption =
    match strOption with
    | Some(str) -> str
    | None -> ""

let pseudoClassFromOptional pseudoClassOption =
    match pseudoClassOption with
    | Some(pseudoClassOption) -> pseudoClassOption
    | None -> CssPseudoClass.None

let parseCssSelectorElement =
    let tag = identifier <?> "tagName"
    let id = str "#" >>. identifier <?> "#id"
    let classes = many1 (str "." >>. identifier) <?> ".className"
    let parseCssPseudoClass =
        choiceL [ strWithResult "link" CssPseudoClass.Link;
                  strWithResult "visited" CssPseudoClass.Visited;
                  strWithResult "hover" CssPseudoClass.Hover;
                  strWithResult "active" CssPseudoClass.Active;
                  strWithResult "before" CssPseudoClass.Before;
                  strWithResult "after" CssPseudoClass.After;
                  strWithResult "first-line" CssPseudoClass.FirstLine;
                  strWithResult "first-letter" CssPseudoClass.FirstLetter ]
                "pseudo-class"
    // (tag?id|tag?classes|tag)pseudoClass?
    pipe2 ((pipe2 (opt tag)
                  id
                  (fun tag id ->
                      { CssSelectorElement.Default with
                          Tag = stringFromOptional tag;
                          Id = id })) |> attempt
           <|>
           (pipe2 (opt tag)
                  classes
                  (fun tag classes ->
                      { CssSelectorElement.Default with
                          Tag = stringFromOptional tag;
                          Classes = classes })) |> attempt
           <|>
           (tag |>> (fun tag -> { CssSelectorElement.Default with Tag = tag })))
          (opt (str ":" >>. parseCssPseudoClass) |> attempt)
          (fun selectorElem pseudoClass -> { selectorElem with PseudoClass = pseudoClassFromOptional pseudoClass })
But I'm not really happy with how it's shaping up. I was expecting something easier to understand, but the part parsing (tag?id|tag?classes|tag)pseudoClass? with its pipe2's and attempt's is really bad.
Can someone with more experience in FParsec educate me on better ways to accomplish this?
I'm considering trying FsLex/FsYacc or Boost.Spirit instead of FParsec to see if I can come up with nicer code.
You could extract some parts of that complex parser to variables, e.g.:
let tagid =
    pipe2 (opt tag)
          id
          (fun tag id ->
              { CssSelectorElement.Default with
                  Tag = stringFromOptional tag
                  Id = id })
You could also try using an applicative interface; personally I find it easier to use and reason about than pipe2:
let tagid =
    (fun tag id ->
        { CssSelectorElement.Default with
            Tag = stringFromOptional tag
            Id = id })
    <!> opt tag
    <*> id
As Mauricio said, if you find yourself repeating code in an FParsec parser, you can always factor the common parts out into a variable or a custom combinator. This is one of the great advantages of combinator libraries.
However, in this case you could also simplify and optimize the parser by reorganizing the grammar a bit. You could, for example, replace the lower half of the parseCssSelectorElement parser with:
let defSel = CssSelectorElement.Default
let pIdSelector = id |>> (fun str -> { defSel with Id = str })
let pClassesSelector = classes |>> (fun strs -> { defSel with Classes = strs })

let pSelectorMain =
    choice [ pIdSelector
             pClassesSelector
             pipe2 tag (pIdSelector <|> pClassesSelector <|>% defSel)
                   (fun tagStr sel -> { sel with Tag = tagStr }) ]

pipe2 pSelectorMain (opt (str ":" >>. parseCssPseudoClass))
      (fun sel optPseudo ->
          match optPseudo with
          | None -> sel
          | Some pseudo -> { sel with PseudoClass = pseudo })
By the way, if you need to recognize a large number of string constants, it's more efficient to use a dictionary-based parser, like:
let pCssPseudoClass : Parser<CssPseudoClass, unit> =
    let pseudoDict = dict [ "link", CssPseudoClass.Link
                            "visited", CssPseudoClass.Visited
                            "hover", CssPseudoClass.Hover
                            "active", CssPseudoClass.Active
                            "before", CssPseudoClass.Before
                            "after", CssPseudoClass.After
                            "first-line", CssPseudoClass.FirstLine
                            "first-letter", CssPseudoClass.FirstLetter ]
    fun stream ->
        let reply = identifier stream
        if reply.Status <> Ok then Reply(reply.Status, reply.Error)
        else
            let mutable pseudo = CssPseudoClass.None
            if pseudoDict.TryGetValue(reply.Result, &pseudo) then Reply(pseudo)
            else
                // skip back to the beginning of the invalid pseudo class
                stream.Skip(-reply.Result.Length)
                Reply(Error, messageError "unknown pseudo class")
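The dictionary trick is not FParsec-specific. Here is a stdlib-only Rust sketch of the same pattern (the names are illustrative): parse an identifier first, then classify it with a single hash lookup instead of trying each keyword alternative in turn.

```rust
use std::collections::HashMap;

// Returns Some((rest_of_input, class_name)) on a known pseudo class, None otherwise.
fn parse_pseudo_class(input: &str) -> Option<(&str, &'static str)> {
    let table: HashMap<&str, &'static str> = [
        ("link", "Link"), ("visited", "Visited"), ("hover", "Hover"),
        ("active", "Active"), ("before", "Before"), ("after", "After"),
        ("first-line", "FirstLine"), ("first-letter", "FirstLetter"),
    ].into_iter().collect();
    // Take the longest identifier-like prefix (letters, digits, '-').
    let end = input
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '-'))
        .unwrap_or(input.len());
    let (ident, rest) = input.split_at(end);
    // One hash lookup replaces a chain of per-keyword alternatives.
    table.get(ident).map(|&v| (rest, v))
}
```

Besides being faster for many keywords, this gives a single point of failure ("unknown pseudo class") instead of one error branch per alternative.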