I am trying to build a single sequence that contains the contents of multiple files so that it can be sorted and then passed to a graphing component. However, I am stuck trying to fold the contents of each file together. The pseudocode below won't compile, but hopefully it shows the intention of what I am trying to achieve.
Any help, greatly appreciated.
open System.IO
let FileEnumerator filename = seq {
    use sr = System.IO.File.OpenText(filename)
    while not sr.EndOfStream do
        let line = sr.ReadLine()
        yield line
}
let files = Directory.EnumerateFiles(@"D:\test_Data\","*.csv",SearchOption.AllDirectories)
let res =
    files
    |> Seq.fold(fun x item ->
        let lines = FileEnumerator(item)
        let sq = Seq.concat x ; lines
        sq
    ) seq<string>
printfn "%A" res
You are essentially trying to reimplement File.ReadLines, which returns the file contents as seq<string>. The results can then be concatenated with Seq.concat:
let res =
    Directory.EnumerateFiles(@"D:\test_Data","*.csv",SearchOption.AllDirectories)
    |> Seq.map File.ReadLines
    |> Seq.concat
To fix the problem in your original approach, you need to use Seq.append instead of Seq.concat. The initial value for fold should be an empty sequence, which can be written as Seq.empty:
let res =
    files |> Seq.fold(fun x item ->
        let lines = FileEnumerator(item)
        let sq = Seq.append x lines
        sq ) Seq.empty
If you wanted to use Seq.concat, you'd have to write Seq.concat [x; lines], because concat expects a sequence of sequences to be concatenated. On the other hand, append simply takes two sequences, so it is easier to use here.
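As a small illustration of the difference, here is a minimal sketch that uses two in-memory sequences in place of file contents. The names `first`, `second`, `viaAppend`, `viaConcat` and `viaFold` are made up for the example:

```fsharp
// Two small in-memory sequences standing in for the lines of two files.
let first = seq [ "a"; "b" ]
let second = seq [ "c" ]

// Seq.append takes exactly two sequences.
let viaAppend = Seq.append first second |> List.ofSeq

// Seq.concat takes one sequence whose elements are themselves sequences.
let viaConcat = Seq.concat [ first; second ] |> List.ofSeq

// The same fold as above, run against the in-memory data.
let viaFold =
    [ first; second ]
    |> Seq.fold (fun acc lines -> Seq.append acc lines) Seq.empty
    |> List.ofSeq
```

All three values should contain the same elements in the same order.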
Another (simpler) way to concatenate all the lines is to use yield! in sequence expressions:
let res =
    seq { for item in files do
              yield! FileEnumerator(item) }
This creates a sequence by iterating over all the files and adding all lines from the files (in order) to the resulting sequence. The yield! construct adds all elements of the given sequence to the result.
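To try the whole pipeline end to end, including the sorting step mentioned in the question, something like the following sketch may help. It assumes a throwaway temporary directory and two tiny made-up CSV files (`a.csv`, `b.csv`) rather than the real data folder:

```fsharp
open System.IO

// Create a throwaway directory with two small CSV files (made-up data).
let dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
Directory.CreateDirectory(dir) |> ignore
File.WriteAllLines(Path.Combine(dir, "a.csv"), [| "3,c"; "1,a" |])
File.WriteAllLines(Path.Combine(dir, "b.csv"), [| "2,b" |])

// Lazily yield every line of every file, then sort and materialize the result.
let sortedLines =
    seq {
        for file in Directory.EnumerateFiles(dir, "*.csv", SearchOption.AllDirectories) do
            yield! File.ReadLines(file)
    }
    |> Seq.sort
    |> List.ofSeq

// Clean up only after the lines have been read into the list.
Directory.Delete(dir, true)
```

Note that the sequence is forced with List.ofSeq before the directory is deleted, because nothing is read from disk until the sequence is enumerated.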
I'm reading a file and I want to do something with the first line, and something else with all the other lines
let lines = System.IO.File.ReadLines "filename.txt" |> Seq.map (fun r -> r.Trim())
let head = Seq.head lines
let tail = Seq.tail lines
Problem: the call to tail fails because the TextReader is closed.
This means that the Seq is evaluated twice: once to get the head and once to get the tail.
How can I get the first line and the remaining lines while keeping a Seq and without re-evaluating it?
The signature could be, for example:
let fn: ('a -> seq<'a> -> 'b) -> seq<'a> -> 'b
The easiest thing to do is probably just using Seq.cache to wrap your lines sequence:
let lines =
    System.IO.File.ReadLines "filename.txt"
    |> Seq.map (fun r -> r.Trim())
    |> Seq.cache
Of note from the documentation:
This result sequence will have the same elements as the input sequence. The result can be enumerated multiple times. The input sequence is enumerated at most once and only as far as is necessary. Caching a sequence is typically useful when repeatedly evaluating items in the original sequence is computationally expensive or if iterating the sequence causes side-effects that the user does not want to be repeated multiple times.
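The effect described in the documentation can be observed with a small experiment. The sketch below does not touch the file system. Instead it uses a made-up `source` sequence that bumps a counter (`reads`) every time an element is produced:

```fsharp
// Counts how many elements are pulled from the underlying source.
let reads = ref 0

// Stand-in for File.ReadLines: each element produced bumps the counter.
let source =
    seq {
        for i in 1 .. 3 do
            reads.Value <- reads.Value + 1
            yield sprintf "line %d" i
    }

// Without Seq.cache: head and tail each start a fresh enumeration.
let uncachedHead = Seq.head source
let uncachedTail = Seq.tail source |> List.ofSeq
let readsWithoutCache = reads.Value

// With Seq.cache: the source is enumerated at most once.
reads.Value <- 0
let cached = Seq.cache source
let cachedHead = Seq.head cached
let cachedTail = Seq.tail cached |> List.ofSeq
let readsWithCache = reads.Value
```

Comparing `readsWithoutCache` with `readsWithCache` shows how many extra elements the uncached version pulls from the source.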
I generally use a seq expression in which the Stream is scoped inside the expression. That will allow you to enumerate the sequence fully before the stream is disposed. I usually use a function like this:
let readLines file =
    seq {
        use stream = File.OpenText file
        while not stream.EndOfStream do
            yield stream.ReadLine().Trim()
    }
Then you should be able to call Seq.head and get the first line in the file, and Seq.last to get the last line in the file. I think this will technically create two different enumerators though. If you want to read the file exactly once, then materializing the sequence to a list or using a function like Seq.cache will be your best option.
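If a single pass matters, another option is to work with one enumerator directly. The following is only a sketch of a function with the signature asked for in the question. The name `withHeadAndTail` is made up, and it assumes the callback consumes the tail before returning, because the tail shares the enumerator and is only valid inside the callback:

```fsharp
let withHeadAndTail (f: 'a -> seq<'a> -> 'b) (source: seq<'a>) : 'b =
    // One enumerator is shared by the head and the tail, so the source is read once.
    use e = source.GetEnumerator()
    if not (e.MoveNext()) then
        invalidArg "source" "The input sequence was empty."
    let head = e.Current
    // The tail continues from where the head left off. It can be enumerated
    // only once, and only before this function returns and disposes 'e'.
    let tail = seq { while e.MoveNext() do yield e.Current }
    f head tail

// Usage with in-memory data standing in for the lines of a file.
let firstLine, restCount =
    withHeadAndTail (fun head tail -> head, Seq.length tail) (seq [ "header"; "row 1"; "row 2" ])
```

Returning the tail itself from the callback would hand out a sequence over a disposed enumerator, so the callback should return a fully computed value instead.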
I had an important use case for this, where I am using Seq.unfold to read a large number of blocks with REST reads, and sequentially processing each block, with further REST reads.
The reading of the sequence had to be both "lazy" but also cached to avoid duplicate re-evaluation (with every Seq.tail operation).
Hence finding this question and the accepted answer (Seq.cache). Thanks!
I experimented with Seq.cache and discovered that it worked as claimed (i.e., it is lazy and avoids re-evaluation), but with one noteworthy condition: the first five elements of the sequence are always read first (and retained by the cache), so experiments with five or fewer elements won't show lazy evaluation. After the first five, however, lazy evaluation kicks in for each element.
This code can be used to experiment. Try it with 5 and see no lazy evaluation, then with 10 and see each element after the fifth being read lazily, as required. Also remove Seq.cache to see the problem we are addressing (re-evaluation).
// Get a sequence of numbers.
let getNums n = seq { for i in 1..n do printfn "Yield { %d }" i; yield i}
// Unfold a sequence of numbers
let unfoldNums (nums : int seq) =
    nums
    |> Seq.unfold
        (fun (nums : int seq) ->
            printfn "unfold: nums = { %A }" nums
            if Seq.isEmpty nums then
                printfn "Done"
                None
            else
                let num = Seq.head nums // Value to yield
                let tl = Seq.tail nums // Next state. CAUSES RE-EVALUATION!
                printfn "Yield: < %d >, tl = { %A }" num tl
                Some (num,tl))
// Get n numbers as a sequence, then unfold them as a sequence
// Observe that with 'Seq.cache' input is not re-evaluated unnecessarily,
// and also that lazy evaluation kicks in for n > 5
let experiment n =
    getNums n
    |> Seq.cache
    // Without cache, Seq.tail causes the sequence to be re-evaluated
    |> unfoldNums
    |> Seq.iter (fun x -> printfn "Process: %d" x)
I am trying to filter out values from a sequence, that are not in another sequence. I was pretty sure my code worked, but it is taking a long time to run on my computer and because of this I am not sure, so I am here to see what the community thinks.
Code is below:
let statezip =
    StateCsv.GetSample().Rows
    |> Seq.map (fun row -> row.State)
    |> Seq.distinct
type State = State of string
let unwrapstate (State s) = s
let neededstates (row:StateCsv) = Seq.contains (unwrapstate row.State) statezip
I am filtering by the neededstates function. Is there something wrong with the way I am doing this?
let datafilter =
    StateCsv1.GetSample().Rows
    |> Seq.map (fun row -> row.State,row.Income,row.Family)
    |> Seq.filter neededstates
    |> List.ofSeq
I believe that it should filter the sequence by the values that are true, since neededstates function is a bool. StateCsv and StateCsv1 have the same exact structure, although from different years.
Evaluation of contains on sequences and lists can be slow, because each call scans the collection from the start. For a case where you want to check for the existence of an element in a collection, the F# Set type is ideal. You can convert your sequences to sets using Set.ofSeq, and then run the logic over the sets instead. The following example uses the numbers from 0 to 10000 and then uses both sequences and sets to filter the result to only the odd numbers by checking that the values are not in a collection of even numbers.
Using Sequences:
let numberSeq = seq { 0..10000 }
let evenNumberSeq = seq { for n in numberSeq do if (n % 2 = 0) then yield n }
#time
numberSeq |> Seq.filter (fun n -> evenNumberSeq |> Seq.contains n |> not) |> Seq.toList
#time
This runs in about 1.9 seconds for me.
Using sets:
let numberSet = numberSeq |> Set.ofSeq
let evenNumberSet = evenNumberSeq |> Set.ofSeq
#time
numberSet |> Set.filter (fun n -> evenNumberSet |> Set.contains n |> not)
#time
This runs in only 0.005 seconds. Hopefully you can materialize your sequences to sets before performing your contains operation, thereby getting this level of speedup.
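Applied to the shape of data in the question, the idea might look like the sketch below. It assumes the rows have already been mapped to (state, income, family) tuples, as in the question's Seq.map step. The function name `keepNeededStates` and the sample values are made up for illustration:

```fsharp
let keepNeededStates (neededStates: seq<string>) (rows: seq<string * int * int>) =
    // Build the set once, so each membership test avoids a linear scan.
    let needed = Set.ofSeq neededStates
    rows
    |> Seq.filter (fun (state, _, _) -> Set.contains state needed)
    |> List.ofSeq

// Made-up sample rows: (state, income, family size).
let sampleRows =
    [ ("WA", 50000, 3)
      ("TX", 42000, 2)
      ("WA", 61000, 4) ]

let filtered = keepNeededStates [ "WA"; "OR" ] sampleRows
```

Here the filter function pattern-matches on the tuple, since after the Seq.map step the elements are tuples rather than CSV rows.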
Is there a way to have a self-reference in F# sequence expression? For example:
[for i in 1..n do if _f(i)_not_in_this_list_ do yield f(i)]
which prevents inserting duplicate elements.
EDIT: In general case, I would like to know the contents of this_list before applying f(), which is very computationally expensive.
EDIT: I oversimplified in the example above. My specific case is a computationally expensive test T (T: int -> bool) having a property T(i) => T(n*i) so the code snippet is:
[for i in 1..n do if _i_not_in_this_list_ && T(i) then for j in i..i..n do yield j]
The goal is to reduce the number of T() applications and use concise notation. I accomplished the former by using a mutable helper array:
let notYet = Array.create (n + 1) true // indices 1..n are used, index 0 is ignored
[for i in 1..n do if notYet.[i] && T(i) then for j in i..i..n do yield j; notYet.[j] <- false]
You can have recursive sequence expression e.g.
let rec allFiles dir =
    seq { yield! Directory.GetFiles dir
          for d in Directory.GetDirectories dir do
              yield! allFiles d }
but circular reference is not possible.
An alternative is to use Seq.distinct from Seq module:
seq { for i in 1..n -> f i }
|> Seq.distinct
or to convert the sequence to a set using Set.ofSeq before consumption, as per @John's comment.
You may also decide to maintain information about the previously generated elements in an explicit way; for example:
let genSeq n =
    let elems = System.Collections.Generic.HashSet()
    seq {
        for i in 1..n do
            if not (elems.Contains(i)) then
                elems.Add(i) |> ignore
                yield i
    }
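The same explicit-state idea can be adapted to the specific case in the question, where an expensive test T satisfies T(i) => T(k*i). The following is a sketch only: `multiplesWhere` is a made-up name, and `divisibleByFour` is a cheap stand-in for the expensive test, with a counter (`calls`) added to show how often it runs:

```fsharp
open System.Collections.Generic

let multiplesWhere (test: int -> bool) (n: int) =
    let seen = HashSet<int>()
    [
        for i in 1 .. n do
            // Skip i if it was already produced as a multiple of an earlier hit,
            // so the expensive test is not run for it.
            if not (seen.Contains(i)) && test i then
                for j in i .. i .. n do
                    // HashSet.Add returns false when j was already present.
                    if seen.Add(j) then
                        yield j
    ]

// Stand-in test with the required property: T(i) implies T(k * i).
let calls = ref 0
let divisibleByFour i =
    calls.Value <- calls.Value + 1
    i % 4 = 0

let result = multiplesWhere divisibleByFour 12
```

Because && short-circuits, the test is never invoked for numbers that are already in the set.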
There are several considerations here.
First, you can't check if f(i) is in a list or not before actually computing f(i). So I guess you meant that your check function is expensive, not f(i) itself. Correct me if I'm wrong.
Second, if check is indeed very computationally expensive, you may look for a more effective algorithm. There's no guarantee you will find one for every sequence, but they often exist. Then your code will be nothing but a single Seq.unfold.
Third, when there's no such optimization, you may take another approach. Within [for...yield], you only build the current element and you can't access prior ones. Instead of returning an element, building the entire list manually seems to be the way to go:
// a simple algorithm checking if some F(x) exists in a sequence somehow
let check (x:string) xs = Seq.forall (fun el -> not (x.Contains el)) xs
// a converter i -> something else
let f (i: int) = i.ToString()
let generate f xs =
    let rec loop ys = function
        | [] -> List.rev ys
        | x::t ->
            let y = f x
            loop (if check y ys then y::ys else ys) t
    loop [] xs
// usage
[0..3..1000] |> generate f |> List.iter (printf "%O ")
I am looking for a way to create a sequence consisting of every nth element of another sequence, but don't seem to find a way to do that in an elegant way. I can of course hack something, but I wonder if there is a library function that I'm not seeing.
The sequence functions whose names end in -i seem to be quite good for the purpose of figuring out when an element is the nth one or (multiple of n)th one, but I can only see iteri and mapi, neither of which really lends itself to the task.
Example:
let someseq = [1;2;3;4;5;6]
let partial = Seq.magicfunction 3 someseq
Then partial should be [3;6]. Is there anything like it out there?
Edit:
If I am not quite as ambitious and allow for the n to be constant/known, then I've just found that the following should work:
let rec thirds lst =
    match lst with
    | _::_::x::t -> x::thirds t // corrected after Tomas' comment
    | _ -> []
Would there be a way to write this shorter?
Seq.choose works nicely in these situations because it allows you to do the filter work within the mapi lambda.
let everyNth n elements =
    elements
    |> Seq.mapi (fun i e -> if i % n = n - 1 then Some(e) else None)
    |> Seq.choose id
Similar to here.
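As a quick check against the example in the question, the function can be applied to the sample list. The block repeats the definition so that it runs on its own:

```fsharp
// Keep the elements at 0-based positions n-1, 2n-1, ... (every nth element).
let everyNth n elements =
    elements
    |> Seq.mapi (fun i e -> if i % n = n - 1 then Some e else None)
    |> Seq.choose id

let someseq = [ 1; 2; 3; 4; 5; 6 ]

// Materialize the lazy result so it can be inspected.
let partial = everyNth 3 someseq |> List.ofSeq
// partial is [3; 6]
```

The result is a lazy seq, so List.ofSeq is used here only to materialize it for inspection.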
You can get the behavior by composing mapi with other functions:
let everyNth n seq =
    seq |> Seq.mapi (fun i el -> el, i)              // Add index to element
        |> Seq.filter (fun (el, i) -> i % n = n - 1) // Take every nth element
        |> Seq.map fst                               // Drop index from the result
The solution using options and choose as suggested by Annon would use only two functions, but the body of the first one would be slightly more complicated (but the principle is essentially the same).
A more efficient version using the IEnumerator object directly isn't too difficult to write:
let everyNth n (input:seq<_>) =
    seq { use en = input.GetEnumerator()
          // Call MoveNext at most 'n' times (or return false earlier)
          let rec nextN n =
              if n = 0 then true
              else en.MoveNext() && (nextN (n - 1))
          // While we can move n elements forward...
          while nextN n do
              // Return each nth element
              yield en.Current }
EDIT: The snippet is also available here: http://fssnip.net/1R
I am trying to build a list from a sequence by recursively appending the first element of the sequence to the list:
open System
let s = seq[for i in 2..4350 -> i,2*i]
let rec copy s res =
    if (s|>Seq.isEmpty) then
        res
    else
        let (a,b) = s |> Seq.head
        Console.WriteLine(string a)
        let newS = s |> Seq.skip(1)|> Seq.cache
        let newRes = List.append res ([(a,b)])
        copy newS newRes
copy s ([])
Two problems:
. I am getting a stack overflow, which means my tail-recursive ploy sucks
and
. why is the code 100x faster when I put |> Seq.cache here: let newS = s |> Seq.skip(1) |> Seq.cache?
(Note this is just a little exercise, I understand you can do Seq.toList etc.. )
Thanks a lot
One way that works is ( the two points still remain a bit weird to me ):
let toList (s:seq<_>) =
    let rec copyRev res (enum:Collections.Generic.IEnumerator<_*_>) =
        let somethingLeft = enum.MoveNext()
        if not(somethingLeft) then
            res
        else
            let curr = enum.Current
            Console.WriteLine(string curr)
            let newRes = curr::res
            copyRev newRes enum
    let enumerator = s.GetEnumerator()
    (copyRev ([]) (enumerator)) |>List.rev
You say it's just an exercise, but it's useful to point to my answer to
While or Tail Recursion in F#, what to use when?
and reiterate that you should favor more applicative/declarative constructs when possible. E.g.
let copy2 s = [
    for tuple in s do
        System.Console.WriteLine(string(fst tuple))
        yield tuple
]
is a nice and performant way to express your particular function.
That said, I'd feel remiss if I didn't also say "never create a list that big". For huge data, you want either array or seq.
In my short experience with F#, it is not a good idea to use Seq.skip 1 the way you would use tail with lists. Seq.skip creates a new IEnumerable/sequence rather than just skipping n elements. Therefore your function will be a lot slower than Seq.toList. You should probably do it imperatively with
s.GetEnumerator()
and iterate through the sequence, keeping a list onto which you cons every single element.
In this question
Take N elements from sequence with N different indexes in F#
I started to do something similar to what you do but found out it is very slow. See my method for inspiration for how to do it.
Addition: I have written this:
let seqToList (xs : seq<'a>) =
    let e = xs.GetEnumerator()
    let mutable res = []
    while e.MoveNext() do
        res <- e.Current :: res
    List.rev res
And found out that the built-in method actually does something very similar (including the reverse part). It does, however, check whether the sequence you have supplied is in fact a list or an array.
You can also make the code entirely functional (which I have now done too, couldn't resist ;-)):
let seqToList (xs : seq<'a>) =
    Seq.fold (fun state t -> t :: state) [] xs |> List.rev
Your function is properly tail recursive, so the recursive calls themselves are not what is overflowing the stack. Instead, the problem is that Seq.skip is poorly behaved when used recursively, as others have pointed out. For instance, this code overflows the stack on my machine:
let mutable s = seq { 1 .. 20001 }
for i in 1 .. 20000 do
    s <- Seq.skip 1 s
let v = Seq.head s
Perhaps you can see the vague connection to your own code, which also eventually takes the head of a sequence which results from repeatedly applying Seq.skip 1 to your initial sequence.
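The cost of stacking Seq.skip 1 calls can also be made visible at a depth small enough to be safe for the stack. The sketch below uses made-up names (`countSourceReads`, `reads`) and counts how many elements are pulled from the original sequence when the head is taken after each additional skip:

```fsharp
let countSourceReads n =
    let reads = ref 0
    // Source sequence that counts every element it produces.
    let source =
        seq {
            for i in 1 .. n do
                reads.Value <- reads.Value + 1
                yield i
        }
    let mutable s = source
    for _ in 1 .. n do
        // Taking the head of a d-times-skipped sequence re-reads d + 1 source elements.
        Seq.head s |> ignore
        s <- Seq.skip 1 s
    reads.Value

let totalReads = countSourceReads 100
```

For n elements, the total number of reads grows as n * (n + 1) / 2, which is one way to see why the Seq.skip-based recursion gets slow on long inputs, quite apart from the stack depth.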
Try the following code.
Warning: Before running this code you will need to enable tail call generation in Visual Studio. This can be done through the Build tab on the project properties page. If this is not enabled, the code will overflow the stack while processing the continuation.
open System
open System.Collections.Generic
let s = seq[for i in 2..1000000 -> i,2*i]
let rec copy (s : (int * int) seq) =
    use e = s.GetEnumerator()
    let rec inner cont =
        if e.MoveNext() then
            let (a,b) = e.Current
            printfn "%d" b
            inner (fun l -> cont (b :: l))
        else cont []
    inner (fun x -> x)
let res = copy s
printfn "Done"