How to limit memory use when using a stream - stream

I'm working with a data set that is too big to fit into memory and therefore would like to use a stream to process the data. However, I'm finding that the stream isn't limiting the amount of memory used as I expected. The sample code illustrates the problem and will run out of memory if the memory limit is set to 128mb. I know I could up the memory limit, but with the sort of big data set that I want to use this won't be an option. How should I keep the memory use down?
#lang racket
(struct prospects (src pos num max-num)
#:methods gen:stream
[(define (stream-empty? stream)
(equal? (prospects-num stream) (prospects-max-num stream)))
;
(define (stream-first stream)
(list-ref (prospects-src stream) (prospects-pos stream)))
;
(define (stream-rest stream)
(let ([next-pos (add1 (prospects-pos stream))]
[src (prospects-src stream)]
[next-num (add1 (prospects-num stream))]
[max-num (prospects-max-num stream)])
(if (< next-pos (length src))
(prospects src next-pos next-num max-num)
(prospects src 0 next-num max-num))))])
(define (make-prospects src num)
(prospects src 0 0 num))
(define (calc-stats prospects)
(let ([just-a (stream-filter
(λ (p) (equal? "a" (vector-ref p 0)))
prospects)]
[just-b (stream-filter
(λ (p) (equal? "b" (vector-ref p 0)))
prospects)])
;
(let ([num-a (stream-length just-a)]
[num-b (stream-length just-b)]
[sum-ref1-a (for/sum ([p (in-stream just-a)])
(vector-ref p 1))]
[sum-ref1-b (for/sum ([p (in-stream just-b)])
(vector-ref p 1))])
;
#|
; Have also tried with stream-fold instead of for/sum as below:
[sum-ref1-a (stream-fold
(λ (acc p) (+ acc (vector-ref p 1)))
0 just-a)]
[sum-ref1-b (stream-fold
(λ (acc p) (+ acc (vector-ref p 1)))
0 just-b)])
|#
;
(list num-a num-b sum-ref1-a sum-ref1-b))))
;================================
; Main
;================================
(define num-prospects 800000)
(define raw-prospects (list #("a" 2 2 5 4 5 6 2 4 2 45 6 2 4 5 6 3 4 5 2)
#("b" 1 3 5 2 4 3 2 4 5 34 3 4 5 3 2 4 5 6 3)))
(calc-stats (make-prospects raw-prospects num-prospects))
Note: This program was created just to demonstrate the problem; the real stream would access a database to bring in the data.

The main problem in your code was that you were trying to make multiple passes through the stream. (Each call to stream-length is one pass, and each of your calls to for/sum (or stream-fold, for that matter) is another pass.) This means that you had to materialise the whole stream without allowing the earlier stream elements to be garbage-collected.
Here's a modification of your code to make only one pass. Note that I made num-prospects to be 8,000,000 in my version, since even the multi-pass version didn't run out of memory on my system with only 800,000:
#lang racket
(require srfi/41)
(define (make-prospects src num)
(stream-take num (apply stream-constant src)))
(define (calc-stats prospects)
(define default (const '(0 . 0)))
(define ht (for/fold ([ht #hash()])
([p (in-stream prospects)])
(hash-update ht (vector-ref p 0)
(λ (v)
(cons (add1 (car v))
(+ (cdr v) (vector-ref p 1))))
default)))
(define stats-a (hash-ref ht "a" default))
(define stats-b (hash-ref ht "b" default))
(list (car stats-a) (car stats-b) (cdr stats-a) (cdr stats-b)))
;================================
; Main
;================================
(define num-prospects 8000000)
(define raw-prospects '(#("a" 2 2 5 4 5 6 2 4 2 45 6 2 4 5 6 3 4 5 2)
#("b" 1 3 5 2 4 3 2 4 5 34 3 4 5 3 2 4 5 6 3)))
(calc-stats (make-prospects raw-prospects num-prospects))
I should clarify that the use of srfi/41 is simply to enable writing a more-efficient version of make-prospects (though, the reference implementation of stream-constant isn't very efficient, but still more efficient than what your prospects stream generator did); calc-stats doesn't use it.

I upvoted Chris' excellent answer and suggest you pick it to mark as accepted.
However, what would use the least memory for a data set that doesn't fit into RAM? Probably something like the following pseudo code:
(require db)
(define dbc <open a db connection>)
(define just-a (query-value dbc "SELECT Count(*) FROM t WHERE n = $1" "a"))
(define just-b (query-value dbc "SELECT Count(*) FROM t WHERE n = $1" "b"))
Why this is a somewhat smartypants answer:
Ostensibly you asked about using Racket streams to handle things that don't fit in memory.
If you need more complex aggregations than count (or sum/min/max), you'll need to write more complex SQL queries, and probably want to make them stored procedures on the server.
Why it isn't necessarily smartypants:
You did mention your real use case involved a database. ;)
A speciality of a DB server and SQL is large data sets that don't fit in memory. Taking advantage of the server often will beat something DB-ish re-implemented in a general-purpose language (as well as probably being friendlier to other uses/users of the same DB server).

You are running out of memory because you are hanging onto the head of the stream while traversing it. The GC can't collect anything because since you have a pointer to the head, every element of the stream is still reachable.
To demonstrate, with this stream:
(define strm (make-prospects raw-prospects num-prospects))
this blows up:
(define just-a (stream-filter (λ (p) (equal? "a" (vector-ref p 0))) strm))
(stream-length just-a)
while this is fine:
(stream-length (stream-filter (λ (p) (equal? "a" (vector-ref p 0))) strm))

Related

SICP Streams (Scheme) - CDR into the implicit stream of ones

(define ones (cons-stream 1 ones))
(stream-cdr ones)
returns an infinite sequence of evaluated 1s - i.e. I get
;Value: #0={1 1 1 1 1 1 1 and so on... - not the symbolic {1 1 ...} I would expect ...
On the other end if I define ints and cdr into it,
(define ints (cons-stream 1 (stream-map + ones ints)))
(stream-cdr ints)
I obtain the expected {1 2 ...}
Can anyone explain me why? I expect the stream-map definition to be not too far from
(define (mystream-map proc . argstreams)
(if (stream-null? (car argstreams))
the-empty-stream
(cons-stream
(apply proc (map stream-car argstreams))
(apply mystream-map (cons proc (map stream-cdr argstreams))))))
which I defined in ex 3.50 and returns the same result ... but this seems to (via (map stream-cdr argstreams)) stream-cdr into ones, which I would expect to result in the infinite sequence I get above!
Even though, if I understand correctly, due to cons-stream being a macro the second part inside the stream-map will only be lazily evaluated, at some point while I cdr into the ints stream I would expect to have to cdr into ones as well.. which instead doesn't seem to be happening! :/
Any help in understanding this would be very much appreciated!
(I also didn't override any scheme symbol and I'm running everything in MIT Scheme - Release 11.2 on OS X.)
(define ones (cons-stream 1 ones))
(stream-cdr ones)
returns an infinite sequence of evaluated ones.
No it does not. (stream-cdr ones) = (stream-cdr (cons-stream 1 ones)) = ones. It is just one ones, not the sequence of ones.
And what is ones?
Its stream-car is 1, and its stream-cdr is ones.
Whose stream-car is 1, and stream-cdr is ones again.
And so if you (stream-take n ones) (with an obvious implementation of stream-take) you will receive an n-long list of 1s, whatever the (non-negative) n is.
In other words, repeated cdring into ones produces an unbounded sequence of 1s. Or symbolically, {1 1 1 ...} indeed.
After your edit it becomes clear the question is about REPL behavior. What you're seeing most probably has to do with the difference between sharing and recalculation, just like in letrec with the true reuse (achieved through self-referring binding) vs. the Y combinator with the recalculation ((Y g) == g (Y g)).
You probably can see the behavior you're expecting, with the re-calculating definition (define (onesF) (cons-stream 1 (onesF))).
I only have Racket, where
(define ones (cons-stream 1 ones))
(define (onesF) (cons-stream 1 (onesF)))
(define ints (cons-stream 1 (stream-map + ones ints)))
(stream-cdr ones)
(stream-cdr (onesF))
(stream-cdr ints)
prints
(mcons 1 #<promise>)
(mcons 1 #<promise>)
(mcons 2 #<promise>)
Furthermore, having defined
(define (take n s)
(if (<= n 1)
(if (= n 1) ; prevent an excessive `force`
(list (stream-car s))
'())
(cons (stream-car s)
(take (- n 1)
(stream-cdr s)))))
we get
> (display (take 5 ones))
(1 1 1 1 1)
> (display (take 5 (onesF)))
(1 1 1 1 1)
> (display (take 5 ints))
(1 2 3 4 5)

Varadic zip in Racket

I've seen the other answers on zipping functions in Racket but they are first of all not quite right (a zip should only zip up to the shortest sequence provided so that you can zip with infinite streams) and most importantly not varadic so you can only zip two streams at a time.
I have figured out this far
(define (zip a-sequence b-sequence) (for/stream ([a a-sequence]
[b b-sequence])
(list a b)))
which does work correctly
(stream->list (zip '(a b c) (in-naturals)))
=> '((a 0) (b 1) (c 2))
but is not varadic. I know I can define it to be varadic with define (zip . sequences) but I have no idea how to build the for/stream form if I do.
Does this have to be a macro to be doable?
Would this work for you?
#lang racket
(define (my-zip . xs)
(match xs
[(list x) (for/stream ([e x]) (list e))]
[(list x xs ...)
(for/stream ([e x] [e* (apply my-zip xs)])
(cons e e*))]))
(stream->list
(my-zip (in-naturals) '(a b c) '(1 2 3 4 5 6)))
;;=> '((0 a 1) (1 b 2) (2 c 3))
A common implementation of zip is:
(require data/collection) ; for a `map` function that works with streams
(define (zip . xs)
(apply map list xs))
... and if you prefer the point-free style, this can be simply:
(require racket/function)
(define zip (curry map list))
These do have the limitation that they require all input sequences to have the same length (including infinite), but since they are computed lazily, they would work in any case until one of the sequences runs out, at which point an error would be raised.
(zip (cycle '(a b c))
(naturals)
(cycle '(1 2 3))) ; => '((a 0 1) (b 1 2) (c 2 3) (a 3 1) ...
[Answer updated to reflect comments]

Do Racket streams memoize their elements?

Does Racket use memoization when computing large amounts of numbers from an infinite stream? So, for example, if I printed out (aka, computed and displayed) the first 400 numbers on the infinite stream of integers:
(1 2 3 ... 399 400)
And right after I asked to print the first 500 numbers on this infinite stream. Would this second set of computations use memoization? So the first 400 numbers would not be computed again?
Or does this functionality need to be coded by the user/obtained from libraries?
The built-in racket/stream library uses lazy evaluation and memoization to draw elements from a stream:
(require racket/stream)
(define (print-and-return x)
(displayln "drawing element...")
x)
(define (in-range-stream n m)
(if (= n m)
empty-stream
(stream-cons (print-and-return n) (in-range-stream (add1 n) m))))
(define s (in-range-stream 5 10))
(stream-first s)
(stream-first s)
(stream-first (stream-rest s))
The expressions passed to stream-cons are not evaluated until requested either with stream-first or stream-rest. Once evaluated, they are memoized. Notice that despite the four stream operations performed on s, only two `"drawing element..." messages are displayed.
You can use the memoize package.
GitHub source: https://github.com/jbclements/memoize/tree/master
raco pkg install memoize
Using it is as simple as replacing define with define/memo. To quote its example:
(define (fib n)
(if (<= n 1) 1 (+ (fib (- n 1)) (fib (- n 2)))))
> (time (fib 35))
cpu time: 513 real time: 522 gc time: 0
14930352
> (define/memo (fib n)
(if (<= n 1) 1 (+ (fib (- n 1)) (fib (- n 2)))))
> (time (fib 35))
cpu time: 0 real time: 0 gc time: 0
14930352
Also, it is generally quite easy to implement memoization yourself using a Racket hash-table.
For other Scheme implementations that use SRFI 41 streams, those streams also fully memoise all the materialised elements.
In fact, in my Guile port of SRFI 41 (which has been in Guile since 2.0.9), the default printer for streams will print out all the elements so materialised (and nothing that isn't):
scheme#(guile-user)> ,use (srfi srfi-41)
scheme#(guile-user)> (define str (stream-from 0))
scheme#(guile-user)> (stream-ref str 4)
$1 = 4
scheme#(guile-user)> str
$2 = #<stream ? ? ? ? 4 ...>
Any of the elements that aren't being printed out as ? or ... have already been memoised and won't be recomputed. (If you're curious about how to implement such a printer, here's the Guile version.)

How does this implementation of a stream work?

I'm totally new to Scheme, functional programming, and specifically streams. What are the first five integers in the following stream?
(define (mystery x y z)
(cons x
(lambda ()
(mystery y z (+ x y z)))))
(mystery 1 2 3)
How does this work, and how can I use it in Racket?
We can examine the contents of an infinite stream by implementing a procedure that consumes a given number of elements and returns them in a list, for example:
(define (print strm n)
(if (zero? n)
'()
(cons (car strm)
(print ((cdr strm)) (sub1 n)))))
If we apply it to your stream, here's what we obtain:
(print (mystery 1 2 3) 5)
=> '(1 2 3 6 11)
Although in this case, it'll be more useful if we try to understand what's happening under the hood. In the end, this is just a series of procedure invocations, if we take note of the parameters that get passed at each call it's easy to find the answer. Take a look at the first column in the following table, and remember that the stream is being built by consing all the x's:
x y z
--------
1 2 3
2 3 6
3 6 11
6 11 20
11 20 37
This returns a list consisting of a number and a function. The function is - after the first call - the same as you'd get by calling (mystery 2 3 6)
How does it work? cons just makes a list of its two arguments, which in this case are a value and the result of evaluating a lambda function, which is itself a function

Scheme streams with Taylor series

I've been doing some homework, wrote some code and can't actually find the reason why it doesn't work. The main idea of this part of the work is to make a stream that will give me elements of Taylor series of cosine function for a given X (angle i guess). anyways here is my code, I'd be happy if some one could point me to the reasons it doesn't work :)
(define (force exp) exp)
(define (s-car s) (car s))
(define (s-cdr s) (force (cdr s)))
; returns n elements of stream s as a list
(define (stream->list s n)
(if (= n 0)
'()
(cons (s-car s) (stream->list (s-cdr s) (- n 1)))))
; returns the n-th element of stream s
(define stream-ref (lambda (s n)
(if (= n 1)
(s-car s)
(stream-ref (s-cdr s) (- n 1)))))
; well, the name kinda gives it away :) make factorial n!
(define (factorial x)
(cond ((= x 0) 1)
((= x 1) 1)
(else (* x (factorial (- x 1))))))
; this function is actually the equation for the
; n-th element of Taylor series of cosine
(define (tylorElementCosine x)
(lambda (n)
(* (/ (expt -1 n) (factorial (* 2 n))) (expt x (* 2 n)))))
; here i try to make a stream of those Taylor series elements of cosine
(define (cosineStream x)
(define (iter n)
(cons ((tylorElementCosine x) n)
(lambda() ((tylorElementCosine x) (+ n 1)))))
(iter 0))
; this definition should bind cosine
; to the stream of taylor series for cosine 10
(define cosine (cosineStream 10))
(stream->list cosine 10)
; this should printi on screen the list of first 10 elements of the series
However, this doesn't work, and I don't know why.
I'm using Dr.Scheme 4.2.5 with the language set to "Essentials of Programming Languages 3rd ed".
Since I was feeling nice (and nostalgic about scheme) I actually waded through your code to finde the mistakes. From what I can see there are 2 problems which keeps the code from running as it should:
If I understand your code correctly (force exp) should evaluate exp, however you directly return it (unevaluated). So it probably should be defined as (define (force exp) (exp))
The second problem is in your lambda: (lambda() ((tylorElementCosine x) (+ n 1)) ) will evaluate to the next element of the taylor series, while it should evaluate to a stream. You probably want something like this: (lambda() (iter (+ n 1)) )
I haven't checked if the output is correct, but with those modifications it does at least run. So if there are any more problems with the code the should be in the formula used.
However I'd suggest that next time you want help with your homework you at least tell us where exactly the problem manifests and what you tried already (the community does frown on "here is some code, please fix it for me" kind of questions).

Resources