Can this F# byte array -> string slicer be made faster?

I want to slice a byte array that represents null terminated strings and return a string sequence.
Test data:
let a: byte array = [| 65uy;73uy;76uy;74uy;73uy;0uy;73uy;74uy;0uy;72uy;75uy;72uy;0uy;0uy;73uy;75uy; |]
The slicer:
let toTextSlices (x: byte array) (separator: byte) : string seq =
    let mutable last = 0
    let length = x.Length - 1
    let rec findSeparator position : int =
        if position < length && x[position] <> separator then findSeparator (position + 1)
        else position
    seq {
        while (last < length) do
            let l = findSeparator last
            if x[last] <> separator then
                yield Text.Encoding.ASCII.GetString (x[last .. l])
            last <- l + 1
    }
Getting the output:
toTextSlices a 0uy
The output:
[| "AILJI"; "IJ"; "HKH"; "IK" |]
The arrays are quite large, sometimes around 10 MB, so I'd like to avoid memory allocations and get the best performance.
How can this be improved?

This is simpler and perhaps faster:
let toTextSlices (bytes : byte array) (separator : byte) =
    Text.Encoding.ASCII
        .GetString(bytes)
        .Split(char separator, StringSplitOptions.RemoveEmptyEntries)
It does allocate a large string, but only once and 10MB isn't much memory these days (and it works fine if the separator is 0uy).
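If the single large intermediate string is a concern, one possible alternative is to locate each separator with System.Array.IndexOf and decode only the bytes between separators. This is a minimal sketch that assumes ASCII data; the name toTextSlicesIndexed is made up for illustration, and it has not been benchmarked against the versions above.

```fsharp
open System
open System.Text

/// Hypothetical helper: returns the non-empty slices between separators,
/// decoding each slice directly from the byte array.
let toTextSlicesIndexed (bytes: byte[]) (separator: byte) : string list =
    let result = ResizeArray<string>()
    let mutable start = 0
    while start < bytes.Length do
        // IndexOf returns -1 when no further separator exists.
        let found = System.Array.IndexOf(bytes, separator, start)
        let stop = if found < 0 then bytes.Length else found
        if stop > start then
            // GetString(bytes, index, count) decodes only the requested range.
            result.Add(Encoding.ASCII.GetString(bytes, start, stop - start))
        start <- stop + 1
    List.ofSeq result
```

Unlike the original slicer, the slices here exclude the separator byte itself, and empty slices between consecutive separators are skipped.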

Related

F#: Testing whether two strings are anagrams

I am new to programming and F# is my first language.
Here is my code:
let areAnagrams (firstString: string) (secondString: string) =
    let countCharacters (someString: string) =
        someString.ToLower().ToCharArray() |> Array.toSeq
        |> Seq.countBy (fun eachChar -> eachChar)
        |> Seq.sortBy (snd >> (~-))
    countCharacters firstString = countCharacters secondString
let testString1 = "Laity"
let testString2 = "Italy"
printfn "It is %b that %s and %s are anagrams." (areAnagrams testString1 testString2) (testString1) (testString2)
This is the output:
It is false that Laity and Italy are anagrams.
What went wrong? What changes should I make?
Your implementation of countCharacters sorts the tuples just using the second element (the number of occurrences for each character), but if there are multiple characters that appear the same number of times, then the order is not defined.
If you run the countCharacters function on your two samples, you can see the problem:
> countCharacters "Laity";;
val it : seq<char * int> = seq [('l', 1); ('a', 1); ('i', 1); ('t', 1); ...]
> countCharacters "Italy";;
val it : seq<char * int> = seq [('i', 1); ('t', 1); ('a', 1); ('l', 1); ...]
One solution is to just use Seq.sort and sort the tuples using both the letter code and the number of occurrences.
The other problem is that you are comparing two seq<_> values and this does not use structural comparison, so you'll need to turn the result into a list or an array (something that is fully evaluated):
let countCharacters (someString: string) =
    someString.ToLower().ToCharArray()
    |> Seq.countBy (fun eachChar -> eachChar)
    |> Seq.sort
    |> List.ofSeq
Note that you do not actually need Seq.countBy - because if you just sort all the characters, it will work equally well (the repeated characters will just be one after another). So you could use just:
let countCharacters (someString: string) =
    someString.ToLower() |> Seq.sort |> List.ofSeq
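Putting the sort-based idea together, a complete version might look like the sketch below. The helper name sortedChars is an assumption made for this example; the comparison works because F# lists are compared structurally.

```fsharp
/// Hypothetical helper: the lower-cased characters of a string in sorted order.
let sortedChars (s: string) : char list =
    s.ToLower() |> Seq.sort |> List.ofSeq

/// Two strings are anagrams when their sorted characters are identical.
let areAnagrams (first: string) (second: string) : bool =
    sortedChars first = sortedChars second

printfn "It is %b that %s and %s are anagrams." (areAnagrams "Laity" "Italy") "Laity" "Italy"
```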
Sorting the characters of the two strings gives you an easy solution but this could be a good example of recursion.
You can immediately exclude strings of different length.
You can also filter out all the occurrences of a char per iteration, by replacing them with an empty string.
let rec areAnagram (x: string) (y: string) =
    if x.Length <> y.Length then false
    elif x.Length = 0 then true
    else
        // Remove every occurrence of the first character from both strings
        let reply = x.[0].ToString()
        areAnagram (x.Replace(reply, "")) (y.Replace(reply, ""))
The above should be faster than sorting for many use cases.
We can go further and replace it with a counting approach (as in a counting sort of integers) that needs neither recursion nor string replacements:
let inline charToInt c =
    int c - int '0'
let singlePassAnagram (x: string) =
    // Only handles characters from '0' (code 48) up to code 147
    let hash : int array = Array.zeroCreate 100
    x |> Seq.iter (fun c ->
        hash.[charToInt c] <- (hash.[charToInt c] + 1))
    hash // return the counts so that they can be compared
let areAnagramsFast (x: string) (y: string) =
    if x.Length <> y.Length then false
    else singlePassAnagram x = singlePassAnagram y

Combine data into smaller discrete intervals

Suppose we have a pair of input arrays, or a list of (key, value) tuples if you prefer. What's an elegant and performant way to combine values that have indices falling in a certain interval? For example, if the interval (or 'bin') size is 10 then the values of all indices from 0 < x <= 10 would be combined, as would the values of indices from 10 < x <= 20 and so on. I want:
let interval = 10
let index = [| 6; 12; 18; 24 |]
let value = [| a; b; c; d |]
result = [| a; b + c; d |]
The crudest way to do this would be to use a whole lot of if, else if statements (the index range has a defined upper limit). I got close with
for i = 0 to index.Length - 1 do
    let bin = index.[i] / interval
    result.[bin] <- result.[bin] + value.[i]
but this is doing 0 <= x < 10, not 0 < x <= 10.
I also tried assuming the indices are ordered and evenly spaced, with
for i = 1 : ( index.Length - 1 ) / valuesPerBin
valueRange = ((i-1)*valuesPerBin + 1) : i*valuesPerBin )
result(i) = sum(value(valueRange))
which is nice but obviously breaks if there is a non integer number of values per bin.
What's the best way of doing this in F#? Is there a name or an existing function for what I'm trying to do?
let interval = 10
let index = [6;12;18;24]
let value =[101;102;103;104]
let intervals = List.map (fun e -> e/interval) index
let keys = List.map2(fun e1 e2 -> (e1,e2)) intervals value
let skeys = Seq.ofList keys
let result =
    skeys
    |> Seq.groupBy (fun p -> fst p)
    |> Seq.map (fun p -> snd p)
    |> Seq.map (fun s -> Seq.sumBy (fun p -> snd p) s)
result will be [101;205;104] (as a Seq).
If you want to convert to an array, apply Seq.toArray.
Is this what you wanted?
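Note that integer division by the interval puts each index into a bin covering 0 <= x < 10, 10 <= x < 20 and so on. If the half-open bins from the question (0 < x <= 10) are required, one option is to subtract 1 before dividing. The sketch below assumes all indices are positive integers; binIndex and sumByBin are names invented for this example.

```fsharp
/// Hypothetical helper: maps x in (0, interval] to bin 0, (interval, 2*interval] to bin 1, ...
/// Assumes x >= 1.
let binIndex (interval: int) (x: int) : int =
    (x - 1) / interval

/// Sums the values whose indices fall into the same bin.
/// Bins appear in order of first occurrence; empty bins are not represented.
let sumByBin (interval: int) (index: int[]) (value: int[]) : int[] =
    Array.zip index value
    |> Array.groupBy (fun (i, _) -> binIndex interval i)
    |> Array.map (fun (_, pairs) -> pairs |> Array.sumBy snd)

let binned = sumByBin 10 [| 6; 12; 18; 24 |] [| 101; 102; 103; 104 |]
```

With this mapping an index of exactly 10 lands in the first bin, whereas plain division would place it in the second.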
Adapt the surrounding code to use 0 <= x < 10 instead of 0 < x <= 10. In my case this was just a simple definition change in another function, allowing me to use
for i = 0 to index.Length - 1 do
    let bin = index.[i] / interval
    result.[bin] <- result.[bin] + value.[i]
which is much simpler and terser than the alternatives.

f# finding the indexes of a string in a char array

Hi all, I am new to F#.
I am trying to find the starting indexes of all the occurrences of a string in a char array.
e.g.
char array ['a';'b';'b';'a';'b';'b';'b';'b';'b';'a';'b']
would return 0 and 3 and 9 if you were searching for the string "ab"
Here's a solution using recursive functions:
/// Wraps the recursive findMatches function defined inside, so that you don't have to seed it with the "internal" parameters
let findMatches chars str =
    /// Returns whether or not the string matches the beginning of the character list
    let rec isStartMatch chars (str: string) =
        match chars with
        | char :: rest when str.Length > 0 ->
            char = str.[0] && (isStartMatch rest str.[1..(str.Length - 1)])
        | _ -> str.Length = 0
    /// The actual function here
    let rec findMatches matchedIndices i chars str =
        match chars with
        | _ :: rest ->
            if isStartMatch chars str
            then findMatches (i :: matchedIndices) (i + 1) rest str
            else findMatches matchedIndices (i + 1) rest str
        | [] -> List.rev matchedIndices // indices are accumulated in reverse order
    findMatches [] 0 chars str
Not the most efficient as it iterates over characters twice if they're part of a match, but that's not really a big worry.
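Since the question mentions a char array rather than a list, the same idea can also be sketched with plain indexing. This is only an illustrative variant, not the answer's method; startsAt and findMatchIndices are hypothetical names.

```fsharp
/// Hypothetical helper: true when pattern occurs in chars starting at index i.
/// Assumes i + pattern.Length <= chars.Length.
let startsAt (chars: char[]) (pattern: string) (i: int) : bool =
    let mutable j = 0
    while j < pattern.Length && chars.[i + j] = pattern.[j] do
        j <- j + 1
    j = pattern.Length

/// Returns the starting indices of every occurrence, in ascending order.
let findMatchIndices (chars: char[]) (pattern: string) : int list =
    [ for i in 0 .. chars.Length - pattern.Length do
        if startsAt chars pattern i then yield i ]

let sample = [| 'a'; 'b'; 'b'; 'a'; 'b'; 'b'; 'b'; 'b'; 'b'; 'a'; 'b' |]
let hits = findMatchIndices sample "ab"
```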
I don't want to do a complete example here, so here is the hint:
let rec findAll (l: char seq) i =
    match Seq.tryFindIndex ... (*the part you have already done goes here*) with
    | None -> []
    | Some t -> (i + t) :: findAll (Seq.skip (t + 1) l) (i + t + 1)
Basically, we just repeatedly apply Seq.tryFindIndex until it stops matching, resuming one element past each match.

When generating primes in F#, why is the "Sieve of Eratosthenes" so slow in this particular implementation?

I.e.,
What am I doing wrong here? Does it have to do with lists, sequences and arrays and the way the limitations work?
So here is the setup: I'm trying to generate some primes. I see that there are a billion text files of a billion primes. The question isn't why...the question is how are the guys using python calculating all of the primes below 1,000,000 in milliseconds on this post...and what am I doing wrong with the following F# code?
let sieve_primes2 top_number =
    let numbers = [ for i in 2 .. top_number do yield i ]
    let sieve (n:int list) =
        match n with
        | [x] -> x,[]
        | hd :: tl -> hd, List.choose(fun x -> if x%hd = 0 then None else Some(x)) tl
        | _ -> failwith "Pernicious list error."
    let rec sieve_prime (p:int list) (n:int list) =
        match (sieve n) with
        | i,[] -> i::p
        | i,n' -> sieve_prime (i::p) n'
    sieve_prime [1;0] numbers
With the timer on in FSI, I get 4.33 seconds worth of CPU for 100000... after that, it all just blows up.
Your sieve function is slow because you tried to filter out composite numbers up to top_number. With the Sieve of Eratosthenes, you only need to do so until sqrt(top_number); the remaining numbers are inherently prime. Suppose we have top_number = 1,000,000: your function does 78498 rounds of filtering (the number of primes below 1,000,000) while the original sieve only does so 168 times (the number of primes below 1,000).
You can avoid generating even numbers other than 2 from the beginning, since they cannot be prime. Moreover, sieve and sieve_prime can be merged into a single recursive function. And you could use the more lightweight List.filter instead of List.choose.
Incorporating above suggestions:
let sieve_primes top_number =
    let numbers = [ yield 2
                    for i in 3..2..top_number -> i ]
    let rec sieve ns =
        match ns with
        | [] -> []
        | x::xs when x*x > top_number -> ns
        | x::xs -> x::sieve (List.filter(fun y -> y%x <> 0) xs)
    sieve numbers
On my machine, the updated version is very fast and completes within 0.6s for top_number = 1,000,000.
Based on my code here: stackoverflow.com/a/8371684/124259
Gets all the primes below 1 million in 22 milliseconds in fsi - a significant part is probably compiling the code at this point.
#time "on"
let limit = 1000000
//returns an array of all the primes up to limit
let table =
    let table = Array.create limit true //use bools in the table to save on memory
    let tlimit = int (sqrt (float limit)) //max test no for table, ints should be fine
    let mutable curfactor = 1;
    while curfactor < tlimit-2 do
        curfactor <- curfactor+2
        if table.[curfactor] then //simple optimisation
            let mutable v = curfactor*2
            while v < limit do
                table.[v] <- false
                v <- v + curfactor
    let out = Array.create (100000) 0 //this needs to be greater than pi(limit)
    let mutable idx = 1
    out.[0]<-2
    let mutable curx=1
    while curx < limit-2 do
        curx <- curx + 2
        if table.[curx] then
            out.[idx]<-curx
            idx <- idx+1
    out
There have been several good answers, both on the general trial division algorithm using lists (@pad) and on the choice of an array as the sieving data structure for the Sieve of Eratosthenes (SoE) (@John Palmer and @Jon Harrop). However, @pad's list algorithm isn't particularly fast and will "blow up" for larger sieving ranges, and @John Palmer's array solution is somewhat more complex, uses more memory than necessary, and uses external mutable state, so it is no different than if the program were written in an imperative language such as C#.
EDIT_ADD: I've edited the below code (old code with line comments) modifying the sequence expression to avoid some function calls so as to reflect more of an "iterator style" and while it saved 20% of the run time it still doesn't come close to that of a true C# iterator which is about the same speed as the "roll your own enumerator" final F# code. I've modified the timing information below accordingly. END_EDIT
The following true SoE program only uses 64 KBytes of memory to sieve primes up to a million (due to only considering odd numbers and using the packed bit BitArray) and still is almost as fast as @John Palmer's program at about 40 milliseconds to sieve to one million on an i7 2700K (3.5 GHz), with only a few lines of code:
open System.Collections
let primesSoE top_number=
    let BFLMT = int((top_number-3u)/2u) in let buf = BitArray(BFLMT+1,true)
    let SQRTLMT = (int(sqrt (double top_number))-3)/2
    let rec cullp i p = if i <= BFLMT then (buf.[i] <- false; cullp (i+p) p)
    for i = 0 to SQRTLMT do if buf.[i] then let p = i+i+3 in cullp (p*(i+1)+i) p
    seq { for i = -1 to BFLMT do
            if i<0 then yield 2u
            elif buf.[i] then yield uint32(3+i+i) }
//  seq { yield 2u; yield! seq { 0..BFLMT } |> Seq.filter (fun i->buf.[i])
//                                          |> Seq.map (fun i->uint32 (i+i+3)) }
primesSoE 1000000u |> Seq.length;;
Almost all of the elapsed time is spent in the last two lines enumerating the found primes due to the inefficiency of the sequence run time library as well as the cost of enumerating itself at about 28 clock cycles per function call and return with about 16 function calls per iteration. This could be reduced to only a few function calls per iteration by rolling our own iterator, but the code is not as concise; note that in the following code there is no mutable state exposed other than the contents of the sieving array and the reference variable necessary for the iterator implementation using object expressions:
open System
open System.Collections
open System.Collections.Generic
let primesSoE top_number=
    let BFLMT = int((top_number-3u)/2u) in let buf = BitArray(BFLMT+1,true)
    let SQRTLMT = (int(sqrt (double top_number))-3)/2
    let rec cullp i p = if i <= BFLMT then (buf.[i] <- false; cullp (i+p) p)
    for i = 0 to SQRTLMT do if buf.[i] then let p = i+i+3 in cullp (p*(i+1)+i) p
    let nmrtr() =
        let i = ref -2
        let rec nxti() = i:=!i+1;if !i<=BFLMT && not buf.[!i] then nxti() else !i<=BFLMT
        let inline curr() =
            if !i<0 then (if !i= -1 then 2u else failwith "Enumeration not started!!!")
            else let v = uint32 !i in v+v+3u
        { new IEnumerator<_> with
              member this.Current = curr()
          interface IEnumerator with
              member this.Current = box (curr())
              member this.MoveNext() = if !i< -1 then i:=!i+1;true else nxti()
              member this.Reset() = failwith "IEnumerator.Reset() not implemented!!!"
          interface IDisposable with
              member this.Dispose() = () }
    { new IEnumerable<_> with
          member this.GetEnumerator() = nmrtr()
      interface IEnumerable with
          member this.GetEnumerator() = nmrtr() :> IEnumerator }
primesSoE 1000000u |> Seq.length;;
The above code takes about 8.5 milliseconds to sieve the primes to a million on the same machine due to greatly reducing the number of function calls per iteration to about three from about 16. This is about the same speed as C# code written in the same style. It's too bad that F#'s iterator style as I used in the first example doesn't automatically generate the IEnumerable boilerplate code as C# iterators do, but I guess that is the intention of sequences; it is just that they are so inefficient in terms of speed because they are implemented as sequence computation expressions.
Now, less than half of the time is expended in enumerating the prime results for a much better use of CPU time.
What am I doing wrong here?
You've implemented a different algorithm that goes through each possible value and uses % to determine if it needs to be removed. What you're supposed to be doing is stepping through with a fixed increment, removing multiples. That would be asymptotically faster.
You cannot step through lists efficiently because they don't support random access so use arrays.
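As a rough illustration of "stepping through with a fixed increment" on an array, here is a minimal sketch. It is not taken from any of the answers above; primesUpTo is a hypothetical name, and the code favours clarity over the memory and speed optimisations discussed earlier.

```fsharp
/// Hypothetical example: all primes <= n using a plain boolean array.
let primesUpTo (n: int) : int list =
    // isComposite.[k] = true means k has been crossed out as a multiple.
    let isComposite : bool[] = Array.zeroCreate (max (n + 1) 2)
    let mutable p = 2
    while p * p <= n do
        if not isComposite.[p] then
            // Step through the multiples of p with a fixed increment; no % needed.
            let mutable m = p * p
            while m <= n do
                isComposite.[m] <- true
                m <- m + p
        p <- p + 1
    [ for i in 2 .. n do
        if not isComposite.[i] then yield i ]

let smallPrimes = primesUpTo 30
```

The outer loop stops once p * p exceeds n, which is the sqrt(top_number) bound mentioned above.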

How do I write float values to a file in F#?

I tried the following code. What did I do wrong?
// Test IO
// Write a test file
let str : string[,] = Array2D.init 1 ASize (fun i j -> result.[i,j].ToString() )
System.IO.File.WriteAllLines(@"test.txt", str );
Will the first argument to Array2D.init in your code always be 1? If yes, then you can just create one dimensional array and it will work just fine:
let str = Array.init ASize (fun j -> result.[0,j].ToString() )
System.IO.File.WriteAllLines("test.txt", str );
If you really need to write a 2D array to a file, then you can convert 2D array into a one-dimensional array. The simplest way I can think of is this:
let separator = ""
let ar = Array.init (str.GetLength(0)) (fun i ->
    seq { for j in 0 .. str.GetLength(1) - 1 -> str.[i, j] }
    |> String.concat separator )
This generates a one-dimensional array (along the first coordinate) and then aggregates the elements along the second coordinate. It uses String.concat, so you can specify separator between the items on a single line.
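The same approach can be applied directly to a 2D array of floats, skipping the intermediate string[,]. The sketch below is one way to do it; rowsToLines and writeMatrix are hypothetical names, and the invariant culture is an assumption made here so that the decimal separator does not depend on the machine's locale.

```fsharp
open System.Globalization
open System.IO

/// Hypothetical helper: one output line per row, values joined by the separator.
let rowsToLines (separator: string) (data: float[,]) : string[] =
    Array.init (Array2D.length1 data) (fun i ->
        seq { for j in 0 .. Array2D.length2 data - 1 ->
                data.[i, j].ToString(CultureInfo.InvariantCulture) }
        |> String.concat separator)

/// Hypothetical helper: writes the matrix as tab-separated text.
let writeMatrix (path: string) (data: float[,]) : unit =
    File.WriteAllLines(path, rowsToLines "\t" data)

let lines = rowsToLines ";" (array2D [ [ 1.5; 2.25 ]; [ 3.0; 4.5 ] ])
```

Calling writeMatrix "test.txt" result would then write one line per row of the array.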
It fails because there is no overload of File.WriteAllLines that accepts a 2D array of strings. You should convert it either to a 1D array or to a seq<string>.
