OCaml Parsing String into String*Int list - parsing

type unit_t = (string * int) list
How would I go about creating an expression to parse a string into this form? Do I need to implement code in a parser.mly format in order to handle this?

There's no real way to give a detailed answer, since you don't say how the original string is supposed to correspond to your list of pairs.
Based just on the simplicity of your output type, I'd say you could do this without invoking the full machinery of ocamlyacc. I.e., I doubt you need a parser.mly file. You could use ocamllex, or you could just apply some regular expressions from the Str module. It depends on what your parsing actually needs to do.
Here's one possibility that uses regular expressions from Str:
let p s =
let fs = Str.full_split (Str.regexp "[0-9]+") s in
let rec go = function
| [] -> []
| Str.Text s :: Str.Delim i :: rest -> (s, int_of_string i) :: go rest
| Str.Text s :: rest -> (s, 0) :: go rest
| Str.Delim i :: rest -> ("", int_of_string i) :: go rest
in
go fs
You can run it like this:
# #load "str.cma";;
# #use "p.ml";;
val p : string -> (string * int) list = <fun>
# p "123abc456def";;
- : (string * int) list = [("", 123); ("abc", 456); ("def", 0)]
# p "ghi789jkl100";;
- : (string * int) list = [("ghi", 789); ("jkl", 100)]

Related

Efficiently truncate string "float" in Erlang

I have a string that can be like this: "100.4", "100.4867", "100.0". I'd like to truncate it so that it has at most two digits after the period. So for example: "100.4", "100.48", "100.0". What's an efficient way to do this in Erlang?
If you sure, that strings contains right representation of float , can do, for example, so:
List = [ "100.4", "100.4867", "100.0"].
[fun()->lists:sublist(X,string:chr(X,$.)+2) end() || X<-List].
and result:
["100.4","100.48","100.0"]
if no - add the processing of these cases.
As rightly noted Lyn Headley in the comments anonymous function here is not required, and you can do so:
[lists:sublist(X,string:chr(X,$.)+2) || X<-List].
If you ask about efficient implementation:
trunc([]) -> []; %% or raise exception because it is not a float
trunc(".") -> []; %% or "." = L) -> L or raise exception or ".0" or what ever you want
trunc([$.,_] = L) -> L;
trunc([$.,_,_] = L) -> L;
trunc([$.,X,Y|_]) -> [$.,X,Y];
trunc([H|T]) -> [H|trunc(T)].

Parsing int or float with FParsec

I'm trying to parse a file, using FParsec, which consists of either float or int values. I'm facing two problems that I can't find a good solution for.
1
Both pint32 and pfloat will successfully parse the same string, but give different answers, e.g pint32 will return 3 when parsing the string "3.0" and pfloat will return 3.0 when parsing the same string. Is it possible to try parsing a floating point value using pint32 and have it fail if the string is "3.0"?
In other words, is there a way to make the following code work:
let parseFloatOrInt lines =
let rec loop intvalues floatvalues lines =
match lines with
| [] -> floatvalues, intvalues
| line::rest ->
match run floatWs line with
| Success (r, _, _) -> loop intvalues (r::floatvalues) rest
| Failure _ ->
match run intWs line with
| Success (r, _, _) -> loop (r::intvalues) floatvalues rest
| Failure _ -> loop intvalues floatvalues rest
loop [] [] lines
This piece of code will correctly place all floating point values in the floatvalues list, but because pfloat returns "3.0" when parsing the string "3", all integer values will also be placed in the floatvalues list.
2
The above code example seems a bit clumsy to me, so I'm guessing there must be a better way to do it. I considered combining them using choice, however both parsers must return the same type for that to work. I guess I could make a discriminated union with one option for float and one for int and convert the output from pint32 and pfloat using the |>> operator. However, I'm wondering if there is a better solution?
You're on the right path thinking about defining domain data and separating definition of parsers and their usage on source data. This seems to be a good approach, because as your real-life project grows further, you would probably need more data types.
Here's how I would write it:
/// The resulting type, or DSL
type MyData =
| IntValue of int
| FloatValue of float
| Error // special case for all parse failures
// Then, let's define individual parsers:
let pMyInt =
pint32
|>> IntValue
// this is an alternative version of float parser.
// it ensures that the value has non-zero fractional part.
// caveat: the naive approach would treat values like 42.0 as integer
let pMyFloat =
pfloat
>>= (fun x -> if x % 1 = 0 then fail "Not a float" else preturn (FloatValue x))
let pError =
// this parser must consume some input,
// otherwise combined with `many` it would hang in a dead loop
skipAnyChar
>>. preturn Error
// Now, the combined parser:
let pCombined =
[ pMyFloat; pMyInt; pError ] // note, future parsers will be added here;
// mind the order as float supersedes the int,
// and Error must be the last
|> List.map (fun p -> p .>> ws) // I'm too lazy to add whitespase skipping
// into each individual parser
|> List.map attempt // each parser is optional
|> choice // on each iteration, one of the parsers must succeed
|> many // a loop
Note, the code above is capable working with any sources: strings, streams, or whatever. Your real app may need to work with files, but unit testing can be simplified by using just string list.
// Now, applying the parser somewhere in the code:
let maybeParseResult =
match run pCombined myStringData with
| Success(result, _, _) -> Some result
| Failure(_, _, _) -> None // or anything that indicates general parse failure
UPD. I have edited the code according to comments. pMyFloat was updated to ensure that the parsed value has non-zero fractional part.
FParsec has the numberLiteral parser that can be used to solve the problem.
As a start you can use the example available at the link above:
open FParsec
open FParsec.Primitives
open FParsec.CharParsers
type Number = Int of int64
| Float of float
// -?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?
let numberFormat = NumberLiteralOptions.AllowMinusSign
||| NumberLiteralOptions.AllowFraction
||| NumberLiteralOptions.AllowExponent
let pnumber : Parser<Number, unit> =
numberLiteral numberFormat "number"
|>> fun nl ->
if nl.IsInteger then Int (int64 nl.String)
else Float (float nl.String)```

F# "Stateful" Computation Expression

I'm currently learning F# and hitting a few stumbling blocks; I think a lot of it is learning to think functionally.
One of the things I'm learning at the moment are computation expressions, and I want to be able to define a computation expression that handles some tracking state, e.g:
let myOptions = optionListBuilder {
let! opt1 = {name="a";value=10}
let! opt2 = {name="b";value=12}
}
I want to be able to have it so that myOptions is a Option<'T> list, so each let! bind operation effectively causes the builder to "track" the defined options as it goes along.
I don't want to have to do it using mutable state - e.g. having a list maintained by the builder and updated with each bind call.
Is there some way of having it so that this is possible?
Update: The resultant Option<'T> list type is just representative, in reality I'll likely have an OptionGroup<'T> type to contain a list as well as some additional information - so as Daniel mentioned below, I could use a list comprehension for a simple list.
I wrote a string builder computation expression here.
open System.Text
type StringBuilderUnion =
| Builder of StringBuilder
| StringItem of string
let build sb =
sb.ToString()
type StringBuilderCE () =
member __.Yield (txt : string) = StringItem(txt)
member __.Yield (c : char) = StringItem(c.ToString())
member __.Combine(f,g) = Builder(match f,g with
| Builder(F), Builder(G) ->F.Append(G.ToString())
| Builder(F), StringItem(G)->F.Append(G)
| StringItem(F),Builder(G) ->G.Append(F)
| StringItem(F),StringItem(G)->StringBuilder(F).Append(G))
member __.Delay f = f()
member __.Zero () = StringItem("")
member __.For (xs : 'a seq, f : 'a -> StringBuilderUnion) =
let sb = StringBuilder()
for item in xs do
match f item with
| StringItem(s)-> sb.Append(s)|>ignore
| Builder(b)-> sb.Append(b.ToString())|>ignore
Builder(sb)
let builder1 = new StringBuilderCE ()
Noticed the underlying type is immutable (the contained StringBuilder is mutable, but it doesn't have to be). Instead of updating the existing data, each yield combines the current state and the incoming input resulting in a new instance of StringBuilderUnion You could do this with an F# list since adding an element to the head of the list is merely the construction of a new value rather than mutating the existing values.
Using the StringBuilderCE looks like this:
//Create a function which builds a string from an list of bytes
let bytes2hex (bytes : byte []) =
string {
for byte in bytes -> sprintf "%02x" byte
} |> build
//builds a string from four strings
string {
yield "one"
yield "two"
yield "three"
yield "four"
} |> build
Noticed the yield instead of let! since I don't actually want to use the value inside the computation expression.
SOLUTION
With the base-line StringBuilder CE builder provided by mydogisbox, I was able to produce the following solution that works a charm:
type Option<'T> = {Name:string;Item:'T}
type OptionBuilderUnion<'T> =
| OptionItems of Option<'T> list
| OptionItem of Option<'T>
type OptionBuilder () =
member this.Yield (opt: Option<'t>) = OptionItem(opt)
member this.Yield (tup: string * 't) = OptionItem({Name=fst tup;Item=snd tup})
member this.Combine (f,g) =
OptionItems(
match f,g with
| OptionItem(F), OptionItem(G) -> [F;G]
| OptionItems(F), OptionItem(G) -> G :: F
| OptionItem(F), OptionItems(G) -> F :: G
| OptionItems(F), OptionItems(G) -> F # G
)
member this.Delay f = f()
member this.Run (f) = match f with |OptionItems items -> items |OptionItem item -> [item]
let options = OptionBuilder()
let opts = options {
yield ("a",12)
yield ("b",10)
yield {Name = "k"; Item = 20}
}
opts |> Dump
F# supports list comprehensions out-of-the-box.
let myOptions =
[
yield computeOptionValue()
yield computeOptionValue()
]

Parse sequence of tokens into hierarchical type in F#

I processed some HTML to extract various information from a website (no proper API exists there), and generated a list of tokens using an F# discriminated union. I have simplified my code to the essence:
type tokens =
| A of string
| B of int
| C of string
let input = [A "1"; B 2; C "2.1"; C "2.2"; B 3; C "3.1"]
// how to transform the input to the following ???
let desiredOutput = [A "1", [[ B 2, [ C "2.1"; C "2.2" ]]; [B 3, [ C "3.1" ]]]]
This roughly corresponds to parsing the grammar: g -> A b* ; b -> B c* ; c-> C
The key thing is my token list is flat, but I want to work with the hierarchy implied by the grammar.
Perhaps there is another representation of my desiredOutput which would be better; what I really want to do is process exactly one A followed by a zero or more sequence of Bs, which happen to contain zero or more Cs.
I've looked at parser combinators articles, e.g. about FParsec, but I couldn't find a good solution that allows me to start from a list of tokens rather than a stream of characters. I'm familiar with imperative techniques for parsing, but I don't know what is idiomatic F#.
Progress made due to Answer
Thanks to the answer from Vandroiy, I was able to write the following to move forward a hobby project I am working on to learn idiomatic F# (and also to scrape quiz websites).
// transform flat data scraped from a Quiz website into a hierarchical data structure
type ScrapedQuiz =
| Title of string
| Description of string
| Blurb of string * picture: string
| QuizId of string
| Question of num:string * text:string * picture : string
| Answer of text:string
| Error of exn
let input =
[Title "Example Quiz Scraped from a website";
Description "What the Quiz is about";
Blurb ("more details","and a URL for a picture");
Question ("#1", "How good is F#", "URL to picture of new F# logo");
Answer ("we likes it");
Answer ("we very likes it");
Question ("#2", "How useful is Stack Overflow", "URL to picture of Stack Overflow logo");
Answer ("very good today");
Answer ("lobsters");
]
type Quiz =
{ Title : string
Description : string
Blurb : string * PictureURL
Questions : Quest list }
and Quest =
{ Number : string
Text : string
Pic : PictureURL
Answers : string list}
and PictureURL = string
let errorMessage = "unexpected input format"
let parseList reader input =
let rec run acc inp =
match reader inp with
| Some(o, inp') -> run (o :: acc) inp'
| None -> List.rev acc, inp
run [] input
let readAnswer = function Answer(a) :: t -> Some(a, t) | _ -> None
let readDescription =
function Description(a) :: t -> (a, t) | _ -> failwith errorMessage
let readBlurb = function Blurb(a,b) :: t -> ((a,b),t) | _ -> failwith errorMessage
let readQuests = function
| Question(n,txt,pic) :: t ->
let answers, input' = parseList readAnswer t
Some( { Number=n; Text=txt; Pic=pic; Answers = answers}, input')
| _ -> None
let readQuiz = function
| Title(s) :: t ->
let d, input' = readDescription t
let b, input'' = readBlurb input'
let qs, input''' = parseList readQuests input''
Some( { Title = s; Description = d; Blurb = b; Questions = qs}, input''')
| _ -> None
match readQuiz input with
| Some(a, []) -> a
| _ -> failwith errorMessage
I could not have written this yesterday; neither the target data type, nor the parsing code. I see room for improvement, but I think I have started to meet my goal of not writing C# in F#.
Indeed, it might help to first find a good representation.
Original output format
I presume the suggested output form, in standard printing, would be:
[(A "1", [(B 2, [C "2.1"; C "2.2"]); (B 3, [C "3.1"])])]
(This differs from the one in the question in the amount of list levels.) The code I used to get there is ugly. In part, this is because it abstracts at an awkward position, constraining input and output types very far without giving them a well-defined type. I'm posting it for the sake of completeness, but I recommend to skip over it.
let rec readBranch checkOne readInner acc = function
| h :: t when checkOne h ->
let dat, inp' = readInner t
readBranch checkOne readInner ((h, dat) :: acc) inp'
| l -> List.rev acc, l
let rec readCs acc = function
| C(s) :: t -> readCs (C(s) :: acc) t
| l -> List.rev acc, l
let readBs = readBranch (function B _ -> true | _ -> false) (readCs []) []
let readAs = readBranch (function A _ -> true | _ -> false) readBs []
input |> readAs |> fst
Surely, other people can do this more sensibly, but I doubt it would tackle the main problem: we're just projecting one weird data structure to the next. If it is difficult to read or formulate a parser's output format, there is probably something going wrong.
Strongly typed output
Rather than focus on how we are parsing, I prefer to first pay attention to what we are parsing into. These A B C things don't mean anything to me. Let's say they represent objects:
type Bravo =
{ ID : int
Charlies : string list }
type Alpha =
{ Name : string
Bravos : Bravo list }
There are two places where sequences of objects of the same type are parsed. Let's create a helper that repeatedly uses a specific parser to read a list of objects:
/// Parses objects into a list. reader takes an input and returns either
/// Some(parsed item, new input state), or None if the list is finished.
/// Returns a list of parsed objects and the remaining input.
let parseList reader input =
let rec run acc inp =
match reader inp with
| Some(o, inp') -> run (o :: acc) inp'
| None -> List.rev acc, inp
run [] input
Note that this is quite generic in the type of input. This helper could be used with strings, sequences, or whatever.
Now, we add concrete parsers. The following functions have the signature used in reader in the helper; they either return the parsed object and the remaining input, or None if parsing wasn't possible.
let readC = function C(s) :: t -> Some(s, t) | _ -> None
let readB = function
| B(i) :: t ->
let charlies, input' = parseList readC t
Some( { ID = i; Charlies = charlies }, input' )
| _ -> None
let readA = function
| A(s) :: t ->
let bravos, input' = parseList readB t
Some( { Name = s; Bravos = bravos }, input' )
| _ -> None
The code for reading Alphas and Bravos is practically a duplicate. If that happens in production code, I would recommend again to check whether the data structure is optimal, and only look at improving the algorithm afterwards.
We request to read one A into one Alpha, which was the goal after all:
match readA input with
| Some(a, []) -> a
| _ -> failwith "Unexpected input format"
There may be many better ways to do the parsing, especially when knowing more about the exact problem. The important fact is not how the parser works, but what the output looks like, which will be the focus when actual work is done in the program. The second version's output should be much easier to navigate in both code and debugger:
val it : Alpha =
{ Name = "1";
Bravos = [ { ID = 2; Charlies = ["2.1"; "2.2"] }
{ ID = 3; Charlies = ["3.1"] } ] }
One could take this a step further and replace the tokenized data structure with DOM (Document Object Model). Then, the first step would be to read HTML into DOM using a standard parsing library. In a second step, the concrete parsers would construct objects, using the DOM representation as input, calling one another top-down.
To work with structured hierarchy, you have to create matching structure of types. Something like
type
RootType = Level1 list
and
Level1 =
| A of string
| B of Level2 list
| C of string
and
Level2 =
{ b: int; c: string list }

Why the tuple type can not be inferred in the list recursion?

I want to refine the raw text by using regular expression, given a list of (patten,replacement) tuple.
I tried to use the patten matching on the list element but failed, the error showed that "This expression was expected to have type string * string list but here has type 'a list".
How can I fix this problem? Thanks a lot.
Codes are as follows:
let rec refine (raw:string) (rules:string*string list) =
match rules with
| (pattern,replacement) :: rest ->
refine <| Regex.Replace(raw,pattern,replacement) rest
| [] -> raw
The problem is that a string * string list is a pair consisting of a string and a list of strings, whereas you want a (string * string) list:
let rec refine (raw:string) (rules:(string*string) list) =
match rules with
| (pattern,replacement) :: rest ->
refine (Regex.Replace(raw,pattern,replacement)) rest
| [] -> raw
Alternatively, the only reason you need that particular annotation is because Regex.Replace is overloaded. This is why your other solution works, but there are other (more minimal) places you can put an annotation that will work:
let rec refine (raw:string) rules =
match rules with
| (pattern,replacement:string) :: rest ->
refine (Regex.Replace(raw,pattern,replacement)) rest
| [] -> raw
Finally it works when I try this:
let rec refine (raw:string) rules =
match rules with
| rule :: rest ->
//get tuple values beyond the patten matching
let (pattern:string,replacement:string) = rule
refine (Regex.Replace(raw,pattern,replacement)) rest
| [] -> raw

Resources