I'm self-taught and not too sure about the terminology, or what specific improvements there are (if any) from splitting object loading into separate batches in a loop.
For example, I use Rails, and recently I encountered an issue where I was loading too many heavy ActiveRecord objects at once. I found this in the Rails API: http://api.rubyonrails.org/classes/ActiveRecord/Batches.html
What find_in_batches does is split the query into many smaller subsets, so instead of making one large query you're making, say, 10 small ones and not loading so many objects at once.
Ex:
def batch_process
  Car.find_in_batches do |batch|
    batch.each(&:start_engine!)
  end # at the end of each iteration, is the memory from the current batch deallocated?
end

def start_all_at_once
  Car.all.each(&:start_engine!)
end
My question is, what exactly is the benefit of doing this? Conceptually I understand that loading fewer objects at once allows memory to be freed up on each loop (is this correct?), but what exactly is improved? I believe it's peak memory consumption, but does that translate to RAM / CPU usage improvements (not really sure what the difference between RAM and CPU is, to be honest)? Or is it something to do with garbage collection or the Ruby heap size?
Just trying to understand the lower level details. Thanks!
Let's say you want to work on 1 million records in your database.
First, your database needs to load and send 1 million records to your Ruby application. Then Rails needs to parse those 1 million records (this uses memory), then instantiate 1 million model objects plus a big array to contain them all. This uses a lot of processing power (CPU) and memory (RAM).
Let's say each record takes 1KB of memory (an arbitrary number). Then 1 million records will take 1GB of memory, and we're not even counting the memory used by the database, the transfer and the transformations.
Now, load the 1 million records in batches of a thousand. Your database loads and transfers only 1000 records at a time. The same goes for Ruby/Rails, which will use about 1MB of memory per batch. Processing the next thousand records reuses that memory. Hence, you're only using a fraction of the RAM compared to the previous example!
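In code, reusing the Car example from the question above, the batch size can be set explicitly (a rough sketch; 1000 is also the default):

Car.find_in_batches(batch_size: 1000) do |batch|
  batch.each(&:start_engine!)
  # once this block returns, nothing references `batch` any more,
  # so the ~1000 objects it held can be garbage collected before
  # the next batch is loaded
end

So the improvement is exactly peak memory: at no point do you hold the full result set, only the current batch.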
I have 93 arrays. Each array has about 18 values on average.
I need to compute the product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I tried:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround for this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64 bits of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibytes on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
Does anyone know a workaround for this? Some way to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
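If you just want to see where numbers like that come from, you can compute the size of the product without building it, simply by multiplying the array lengths (a quick sketch using the DATASET from the question):

# number of tuples in the Cartesian product = product of the array sizes;
# with 93 arrays of ~18 elements each this is on the order of 18**93 (~5.5e116)
combination_count = DATASET.map(&:size).reduce(1, :*)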
Right, so as it is hopefully clear that this should not be attempted, I'll take a risk and attempt to solve your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
Where num is the number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated - but given your dataset size, the chance is comparable to colliding UUIDs and can be safely ignored.
Generating such a subset of possible solutions is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to seed the randomization in your test suite if you want to be able to recreate failures.
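As a rough usage sketch (num is whatever sample size you pick, and satisfies_conditions? is a placeholder for the actual property being tested):

num = 1_000

Array.new(num) { DATASET.map(&:sample) }.each do |combination|
  # each combination is one tuple with a single value sampled from each of the 93 arrays
  raise "property violated for #{combination.inspect}" unless satisfies_conditions?(combination)
end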
I am having issues using Dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. The dataset grows to about 100+ columns, which are mainly float64, after some additional processing (this is quite slow, especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading')).
I think I could live with the delay, even if frustrating. However, when I try to save the data to Parquet with df.to_parquet('my data frame', engine="fastparquet"), it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory shown by free -h goes up from about 40 MB to 40+ GB.
I am confused about how this is possible, given that Dask does not load everything into memory. I use 100 partitions for the dataset in Dask.
Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
* you read a 40GB dataset
* you run grouping / sorting
* you join with other datasets
* you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition, it's hard to say given the information provided.
I have a few million records in the database, and I need to process them from time to time. However, this operation takes all the memory on my server. I'm running this operation using Sidekiq, so while this task is using all the memory, my Rails app becomes very slow.
In general (no logic included) my code looks like:
Model.all.each do |m|
  # do some logic code here
end
How do I make the garbage collector run after some number of records (e.g. 10k records) so I don't run into out-of-memory situations? Will splitting the work into chunks help me?
You should always use find_each when dealing with potentially large tables.
That way, models will be retrieved from the database and loaded in memory batch by batch (the default size is 1000 but you can customize it to your needs).
Just be aware that sorting by arbitrary columns doesn't play well with find_each, as it implicitly sorts records by ID so that it has a way to fetch records in batches.
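A minimal sketch of that, reusing the Model example from the question (the batch_size value is just illustrative):

Model.find_each(batch_size: 10_000) do |m|
  # do some logic code here; only the current batch of records
  # is held in memory at any given time
end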
You can force the garbage collector to run with GC.start, but if you are doing
Model.all.each do |m|
  # process m
end
then garbage collection cannot free the already processed records - they are still referenced by the array that each is iterating over, so running the garbage collector explicitly won't do anything.
Instead, use find_each (or its close relative, find_in_batches), which fetches records and processes them in batches (you can control the batch size - I think it is 1000 by default). This way the entire result set is never in memory, and previously processed batches are not referenced by anything and so can be disposed of.
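If you still want the GC to run at known points, as asked above, a hedged sketch could look like this - though in practice the explicit GC.start is usually unnecessary once nothing references the old batches:

Model.find_in_batches(batch_size: 10_000) do |batch|
  batch.each do |m|
    # do some logic code here
  end
  GC.start  # optional: previous batches are no longer referenced, so they are collectable anyway
end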
In this SO thread, I learned that keeping a reference to a seq on a large collection will prevent the entire collection from being garbage-collected.
First, that thread is from 2009. Is this still true in "modern" Clojure (v1.4.0 or v1.5.0)?
Second, does this issue also apply to lazy sequences? For example, would (def s (drop 999 (seq (range 1000)))) allow the garbage collector to retire the first 999 elements of the sequence?
Lastly, is there a good way around this issue for large collections? In other words, if I had a vector of, say, 10 million elements, could I consume the vector in such a way that the consumed parts could be garbage collected? What about if I had a hashmap with 10 million elements?
The reason I ask is that I'm operating on fairly large data sets, and I am having to be more careful not to retain references to objects, so that the objects I don't need can be garbage collected. As it is, I'm encountering a java.lang.OutOfMemoryError: GC overhead limit exceeded error in some cases.
It is always the case that if you "hold onto the head" of a sequence then Clojure will be forced to keep everything in memory. It doesn't have a choice: you are still keeping a reference to it.
However, the "GC overhead limit exceeded" error isn't the same as an out-of-memory error - it's more likely a sign that you are running a fictitious workload that is creating and discarding objects so fast that it is tricking the GC into thinking that it is overloaded.
See:
GC overhead limit exceeded
If you put an actual workload on the items being processed, I suspect you will see that this error won't happen any more. You can easily process lazy sequences that are larger than available memory in this case.
Concrete collections like vectors and hashmaps are a different matter however: these are not lazy, so must always be held completely in memory. If you have datasets larger than memory then your options include:
Use lazy sequences and don't hold onto the head
Use specialised collections that support lazy loading (Datomic uses some structures like this I believe)
Treat the data as an event stream (using something like Storm)
Write custom code to partition the data into chunks and process them one at a time.
If you hold the head of a sequence in a binding then you are correct and it can't be GC'd (and that's with every version of Clojure). If you are processing a large number of results, why do you need to hold onto the head?
As for a way around it, yes! The lazy-seq implementation can GC parts that have already been 'processed' and are not directly referenced from within the binding. Just ensure that you are not holding onto the head of the sequence.
I am using ActiveRecord to bulk migrate some data from a table in one database to a different table in another database. About 4 million rows.
I am using find_each to fetch in batches. Then I apply a little bit of logic to each record fetched and write it to a different db. I have tried both writing directly one-by-one and using the nice activerecord-import gem to batch write.
However, in either case, my Ruby process's memory usage grows quite a bit throughout the life of the export/import. I would think that, since I'm using find_each and getting batches of 1000, there should only be 1000 of them in memory at a time... but no, each record I fetch seems to consume memory forever, until the process is over.
Any ideas? Is ActiveRecord caching something somewhere that I can turn off?
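For reference, the loop is roughly this shape (simplified; SourceRecord, TargetRecord and transform stand in for my real models and logic):

SourceRecord.find_each(batch_size: 1000) do |record|
  # a little bit of logic per record, then a write to the other database
  TargetRecord.create!(transform(record))
end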
update 17 Jan 2012
I think I'm going to give up on this. I have tried:
* Making sure everything is wrapped in an ActiveRecord::Base.uncached do block
* Adding ActiveRecord::IdentityMap.enabled = false (I think that should turn off the identity map for the current thread, although it's not clearly documented, and I think the identity map isn't on by default in current Rails anyhow)
Neither of those seem to have much effect, memory is still leaking.
I then added a periodic explicit:
GC.start
That seems to slow down the rate of the memory leak, but the leak still happens (eventually exhausting all memory and bombing).
So I think I'm giving up, and deciding it is not currently possible to use AR to read millions of rows from one db and insert them into another. Perhaps there is a memory leak in the MySQL-specific code being used (that's my db), or somewhere else in AR, or who knows.
I would suggest queueing each unit of work into a Resque queue. I have found that Ruby has some quirks when iterating over large arrays like these.
Have one main thread that queues up the work by ID, then have multiple Resque workers hitting that queue to get the work done.
I have used this method on approx 300k records, so it would most likely scale to millions.
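A rough sketch of that shape (MigrateRecordJob and SourceRecord are placeholder names; the real job would carry your per-record logic):

# Each worker job loads and migrates a single record.
class MigrateRecordJob
  @queue = :migration

  def self.perform(id)
    record = SourceRecord.find(id)
    # transform and write to the target database here
  end
end

# Enqueue only IDs, not ActiveRecord objects, so the enqueuing process
# never holds millions of records in memory.
SourceRecord.select(:id).find_each(batch_size: 1000) do |record|
  Resque.enqueue(MigrateRecordJob, record.id)
end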
Change line #86 to bulk_queue = [], since bulk_queue.clear only sets the length of the array to 0, making it impossible for the GC to clear it.