I am doing data processing with F#. First I get all the files in a directory, then I process each file to generate some data structure, and finally I store the processed data in SQLite. I know that if I use a Seq to hold the file names and pipe-forward into Seq.map, each file will be processed lazily. But what if there are so many files that holding all of them in memory is impossible? In an imperative language I could read one file, process it, store the result, release the intermediate data, and move on to the next file. Of course F# can do imperative programming too, but I want to know whether there is a way to do this in a functional programming style.
files
|> Seq.map readFile
|> Seq.map processContent
|> Seq.map storeProcessResult
The code above shows my idea. files contains a sequence of file names; I read the content of each file, process it into some structure, and finally store the result in the database. I understand that, because of the lazy behaviour, the files will be read and processed one by one. But when will the final data be released?
Obviously only you know what happens inside your readFile, processContent and storeProcessResult functions. As @FuleSnabel says in his comment, you can map and then use fold (recursion) to process the files.
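For illustration, here is a minimal sketch of that idea (processContent and storeProcessResult are hypothetical stand-ins for your functions): drive the pipeline with Seq.iter, so each file is read, processed and stored before the next one is touched, and its intermediate data becomes garbage as soon as the iteration moves on:

open System.IO

// Hypothetical stand-ins for the processing and storage steps.
let processContent (text: string) = text.Length
let storeProcessResult (result: int) = printfn "result: %d" result

let processAll (files: seq<string>) =
    files
    |> Seq.iter (fun path ->
        // 'use' closes the file at the end of each iteration; the
        // intermediate string is unreachable once this lambda returns.
        use reader = new StreamReader(path)
        reader.ReadToEnd()
        |> processContent
        |> storeProcessResult)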
Here's a simple test you can run to see the difference in memory consumption: create a List of three lists with 10 million elements each and sum them, then do the same with a Seq of such lists. I'm using 64-bit FSI.
This will use about 1GB of memory:
let z = [for i in 1..3 -> List.init 10000000 (fun _ -> 1)]
let w = z |> List.map (fun x -> System.GC.Collect();List.sum x)
This will only use a few MB of memory, much less than even one list with 10 million 1s in it:
let x = seq { for i in 1..3 -> List.init 10000000 (fun _ -> 1) }
let y = x |> Seq.map (fun x -> System.GC.Collect(); List.sum x)
This is just one (and probably the easy) part of the workflow. If you're opening files, you also have to make sure to close them, hence my suggestion of use above. However, I do recognize that accessing the filesystem and processing large amounts of data through a lazy sequence might cause problems; in that case you can always profile it and see where the bottleneck is.
By the way, you don't need to call the GC directly in your code; I only did it so that intermediate results wouldn't pollute the memory count in the test.
Suppose I have a list of elements of type 'a, i.e.;
let mylist: 'a list = ...
and a function f of type 'a -> 'b;
let f: 'a -> 'b = ...
Now I want to use f to transform mylist into a 'b array.
Is the following:
mylist |> List.map f |> Array.ofList
improved upon performance-wise and memory-wise by the following:
mylist |> List.toSeq |> Seq.map f |> Array.ofSeq
?
The answer depends on your use case, i.e. how big mylist is. In general, however, Seq in F# is slow and wasteful with memory. Luckily, most of the time that doesn't matter, as we usually only have to avoid "truly awful performance", while "mediocre performance" is good enough.
I haven't measured it (shame on me), but from my experience, in most cases this would perform better than either of the OP's examples:
mylist |> Array.ofList |> Array.map f
Convert from list to array as soon as possible and then map over the array (for sequential access, arrays are among the more efficient structures in .NET).
In some rare scenarios where storing an intermediate array would cause memory to be swapped to disk, I suppose using Seq could help; but in those scenarios, why would we be using a list at all, given that lists are wasteful with memory compared to arrays (unless we can make use of tail reuse)?
There are libraries in .NET that have Seq-like properties but still provide decent performance, for example: https://github.com/nessos/Streams
Church-encoded lists or transducers are also quite performant while avoiding intermediate array values.
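As a rough illustration of why such approaches avoid intermediate values (a sketch of the push-stream idea, not the actual Nessos API): a stream is just a function that pushes elements into a receiver, so map composes with the final consumer and no intermediate list or array is ever materialised:

// A 'push stream': a function that pushes each element into a receiver.
type Stream<'a> = ('a -> unit) -> unit

let ofList (xs: 'a list) : Stream<'a> =
    fun receive -> List.iter receive xs

// map composes with the receiver, so nothing intermediate is built.
let map (f: 'a -> 'b) (stream: Stream<'a>) : Stream<'b> =
    fun receive -> stream (f >> receive)

let toArray (stream: Stream<'a>) : 'a[] =
    let acc = ResizeArray()
    stream (fun x -> acc.Add x)
    acc.ToArray()

// Usage: mylist |> ofList |> map f |> toArray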
I have the following code to perform the Sieve of Eratosthenes in F#:
let sieveOutPrime p numbers =
    numbers
    |> Seq.filter (fun n -> n % p <> 0)

let primesLessThan n =
    let removeFirstPrime = function
        | s when Seq.isEmpty s -> None
        | s -> Some(Seq.head s, sieveOutPrime (Seq.head s) (Seq.tail s))
    let remainingPrimes =
        seq {3..2..n}
        |> Seq.unfold removeFirstPrime
    seq { yield 2; yield! remainingPrimes }
This is excruciatingly slow when the input to primesLessThan is remotely large: primesLessThan 1000000 |> Seq.skip 1000;; takes nearly a minute for me, though primesLessThan 1000000 by itself is naturally very fast, because it merely constructs a lazy sequence.
I did some playing around, and I think that the culprit has to be that Seq.tail (in my removeFirstPrime) is doing something intensive. According to the docs, it's generating a completely new sequence, which I could imagine being slow.
If this were Python and the sequence object were a generator, it would be trivial to ensure that nothing expensive happens at this point: just yield from the sequence, and we've cheaply dropped its first element.
LazyList in F# doesn't seem to have the unfold method (or, for that matter, the filter method); otherwise I think LazyList would be the thing I wanted.
How can I make this implementation fast by preventing unnecessary duplications/recomputations? Ideally primesLessThan n |> Seq.skip 1000 would take the same amount of time regardless of how large n was.
Recursive solutions and sequences don't go well together (compare the answers here, it's very much the same pattern you are using). You might want to inspect the generated code, but I'd just consider this a rule of thumb.
LazyList (as defined in FSharpX) does of course come with unfold and filter defined, it would have been quite bizarre if it didn't. Typically in F# code this sort of functionality is provided in separate modules rather than as instance members on the type itself, a convention that does seem to confuse most of those documentation systems.
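For example, something like this (a sketch assuming the FSharpx.Collections package; check the exact signatures in its documentation):

open FSharpx.Collections

// unfold and filter live in the LazyList module, not on the type itself.
let odds = LazyList.unfold (fun s -> Some(s, s + 2)) 3
let notDivisibleBy3 = odds |> LazyList.filter (fun n -> n % 3 <> 0)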
As you probably know, Seq is a lazily evaluated collection. The sieve algorithm is about filtering out non-primes from a sequence so that you don't have to consider them again.
However, when you combine the sieve with a lazily evaluated collection, you end up doing the filtering of the same non-primes over and over again.
You'll see much better performance if you switch from Seq to Array or List, because the non-lazy nature of those collections means you only filter out each non-prime once.
One way to improve performance in your code is to introduce caching.
let removeFirstPrime s =
    let s = s |> Seq.cache
    match s with
    | s when Seq.isEmpty s -> None
    | s -> Some(Seq.head s, sieveOutPrime (Seq.head s) (Seq.tail s))
I implemented a LazyList that works a lot like Seq and allows me to count the number of evaluations:
For all primes up to 2000:
Without caching: 14753706 evaluations
With caching: 97260 evaluations
Of course if you really need performance you use a mutable array implementation.
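For reference, a minimal sketch of such a mutable-array implementation (not the exact code used for the benchmark below):

// Classic Sieve of Eratosthenes over a mutable flag array:
// each composite is crossed off exactly once, with no repeated filtering.
let primesBelow n =
    let composite : bool[] = Array.zeroCreate (max n 2)
    let mutable p = 2
    while p * p < n do
        if not composite.[p] then
            // cross off multiples of p starting at p*p; smaller
            // multiples were already crossed off by smaller primes
            let mutable m = p * p
            while m < n do
                composite.[m] <- true
                m <- m + p
        p <- p + 1
    [| for i in 2 .. n - 1 do if not composite.[i] then yield i |]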
PS. Performance metrics
Running 'seq' ...
it took 271 ms with cc (16, 4, 0), the result is: 1013507
Running 'list' ...
it took 14 ms with cc (16, 0, 0), the result is: 1013507
Running 'array' ...
it took 14 ms with cc (10, 0, 0), the result is: 1013507
Running 'mutable' ...
it took 0 ms with cc (0, 0, 0), the result is: 1013507
This is Seq with caching. Seq in F# has rather high overhead; there are interesting lazy alternatives to Seq, such as Nessos Streams.
List and Array perform roughly the same, but because of its more compact memory layout the GC metrics are better for Array (10 cc0 collections for Array vs 16 for List). Seq has worse GC metrics still, in that it also forced 4 cc1 collections.
The mutable implementation of sieve algorithm has better memory and performance metrics by a large margin.
I'm trying to get the prime factors of a large number.
let factors (x:int64) =
    [1L..x]
    |> Seq.filter (fun n -> x % n = 0L)

let isPrime (x:int64) =
    factors x
    |> Seq.length = 2

let primeFactors (x:int64) =
    factors x
    |> Seq.filter isPrime
Why does this work for, say, 13195, but fail with an OutOfMemoryException for 600851475143?
Sorry if I'm missing something obvious; it's only my third day with F#, and I didn't know what a prime factor was until this morning.
The expression [1L..x] creates a list, which in your example gets too large to be stored in memory.
Sequences, in contrast, are lazy, so if used with care you can avoid computing the whole intermediate list. Your code already uses sequences, but as noted it starts from a list; to avoid constructing that list, you can use curly brackets: {1L..x}
Using sequence expressions is another option:
let factors (x:int64) = seq {
for i = 1L to x do
if x%i = 0L then yield i}
Having solved the OutOfMemoryException problem, your prime function is still very slow. You can optimise it, as suggested in the comments, by returning false immediately after finding a divisor between 1 and the number's square root. Further optimisations can be achieved by dividing the number by each prime factor as you find it and by using a sieve for the primes; you can also have a look at some efficient algorithms here.
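A sketch of those two suggestions (the square-root bound, and dividing out factors as they are found); the names mirror the OP's functions, but this is not the original code:

// Trial division up to the square root; stops at the first divisor found.
let isPrime (x: int64) =
    if x < 2L then false
    else
        let mutable d = 2L
        let mutable prime = true
        while prime && d * d <= x do
            if x % d = 0L then prime <- false
            d <- d + 1L
        prime

// Divide each factor out as it is found, so the remaining number shrinks.
let primeFactors (x: int64) =
    let rec loop n d acc =
        if d * d > n then List.rev (if n > 1L then n :: acc else acc)
        elif n % d = 0L then loop (n / d) d (d :: acc)
        else loop n (d + 1L) acc
    loop x 2L []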
The expression [...] creates a list of the specified items. In F#, a list can be defined something like this:
type List<'t> =
    | Empty
    | Item of 't * List<'t>
As an example, [1..5] would become a structure looking like this:
Item(1, Item(2, Item(3, Item(4, Item(5, Empty)))))
As you can see, this is not a problem for small numbers of items, but for larger numbers it will eventually use up all the available memory and cause an OutOfMemoryException. As Gustavo mentioned, you can avoid this by using a sequence, which creates each item on demand rather than all at the beginning. This reduces the number of things in memory at any one time and thus avoids the OutOfMemoryException.
Since you're already using the Seq module instead of the List module (i.e. Seq.filter vs List.filter, etc.), you can simply use a sequence instead of a list, which would look like this: {1L..x}
I need to import a large text file (55 MB, 525,000 rows by 25 columns), manipulate the data, and produce some output. As usual, I started exploring with F# Interactive, and I get some really strange behaviour.
Is this file too large or my code wrong?
The first test was to import the file and simply compute the sum over one column (not the end goal, but a first test):
let calctest =
    let reader = new StreamReader(path)
    let csv = reader.ReadToEnd()
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[11] = "M")
    |> Seq.map (fun values -> float(values.[14]))
As expected, this produces a seq of floats, both in the type check and in F# Interactive. If I now add:
|> Seq.sum
the type check still works and says the function should return a float, but if I run it in F# Interactive I get this error:
System.IndexOutOfRangeException: Index was outside the bounds of the array
Then I removed the last line again and tried writing the seq of floats to a text file:
let writetest =
    let str = calctest |> Seq.map (fun i -> i.ToString())
    System.IO.File.WriteAllLines("test.txt", str)
Again, this passes the type check but throws errors in F# Interactive.
Can the standard StreamReader not handle this amount of data, or am I going wrong somewhere? Should I use a different function than StreamReader?
Thanks.
Seq is lazy, which means the mapping and filtering are only actually performed once you add the Seq.sum; that's why you don't see the error before adding that line. Are you sure you have 15 columns in every row? That's probably the problem.
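One quick way to check (a hypothetical tweak to the pipeline, not a full fix): guard the indexing so that short rows are skipped before a.[11] and values.[14] are touched. A trailing newline, for instance, makes the final "row" an empty string with a single-element split:

// Skip rows that don't have at least 15 columns before indexing into them.
let safeRows (lines: seq<string>) =
    lines
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun fields -> fields.Length >= 15)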
I would advise you to use the CSV Type Provider instead of just doing a String.Split; that way you'll be sure not to get an accidental IndexOutOfRangeException, and you'll handle comma escaping correctly.
Additionally, you're reading the whole CSV file into memory by calling reader.ReadToEnd(); the CsvProvider supports streaming if you set the Cache parameter to false. It's not a problem with a 55 MB file, but if you have something much larger, it might be.
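A minimal sketch of that approach, assuming the FSharp.Data package (the sample file name, the type name, and the column names below are hypothetical; CacheRows is the static parameter that controls the caching mentioned above):

open FSharp.Data

// The provider infers column names and types from a sample at compile time.
type Readings = CsvProvider<"sample.csv", CacheRows = false>

let sumColumn (path: string) =
    Readings.Load(path).Rows
    |> Seq.filter (fun row -> row.Gender = "M")   // hypothetical column
    |> Seq.sumBy (fun row -> float row.Value)     // hypothetical column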
Lazy evaluation is a great boon for things like processing huge files that will not fit in main memory in one go. However, suppose there are some elements in the sequence that I want evaluated immediately, while the rest can be computed lazily: is there any way to specify that?
Specific problem: (in case that helps to answer the question)
Specifically, I am using a series of IEnumerables as iterators over multiple sequences; each sequence is responsible for reading its data from one of several files opened with BinaryReader streams. MoveNext() on these iterators is to be called in a specific order, e.g. iter0, then iter1, then iter5, then iter3, and so on. The order is specified by another sequence, index = {0,1,5,3,....}. However, the sequences being lazy, the evaluation naturally happens only when required, so the file reads (for the sequences at the beginning that read from files on disk) happen as the IEnumerables are advanced. This is causing an illegal file access: a file that is being read by one process is accessed again (according to the error message).
True, the illegal file access could have other causes, and I have tried my best to debug those, but a partially lazy evaluation might be worth trying out.
While I agree with Tomas' comment (you shouldn't need this if file sharing is handled properly), here's one way to eagerly evaluate the first N elements:
let cacheFirst n (items: seq<_>) =
    seq {
        use e = items.GetEnumerator()
        let i = ref 0
        yield! [
            while !i < n && e.MoveNext() do
                yield e.Current
                incr i
        ]
        while e.MoveNext() do
            yield e.Current
    }
Example
let items = Seq.initInfinite (fun i -> printfn "%d" i; i)
items
|> Seq.take 10
|> cacheFirst 5
|> Seq.take 3
|> Seq.toList
Output
0
1
2
3
4
val it : int list = [0; 1; 2]
Daniel's solution is sound, but I don't think we need another operator; Seq.cache covers most cases.
First cache your sequence:
let items = Seq.initInfinite (fun i -> printfn "%d" i; i) |> Seq.cache
Eager evaluation followed by lazy access from the beginning:
let eager = items |> Seq.take 5 |> Seq.toList
let cached = items |> Seq.take 3 |> Seq.toList
This evaluates the first 5 elements once (during eager) and caches them for later access.