How to reduce the memory usage by a writing transaction? - xodus

It seems like Xodus uses an amount of memory proportional to the size of a writing transaction (correct me if I am wrong, please). So for big transactions this can become a problem, and for my application I have to choose a much larger heap size "just in case" of a large working set for Xodus. Are there ways to reduce the memory use? A config setting? Heuristics?

A first possible approach is to split changes into batches and flush them one by one using jetbrains.exodus.entitystore.StoreTransaction#flush. For example, if you want to insert 100K entities into the database, it's better to do that in batches.
If you use large blobs extensively, it's better to store them in temporary files first.
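For illustration, here is a minimal sketch of the batching idea in Java. The store path, the "Item" entity type, the property name and the batch size of 1,000 are all made up, and conflict handling (flush() returns false on a write conflict) is left out. Each flush() persists the accumulated changes, so the transaction's in-memory change set never has to hold all 100K new entities at once.

import jetbrains.exodus.entitystore.Entity;
import jetbrains.exodus.entitystore.PersistentEntityStore;
import jetbrains.exodus.entitystore.PersistentEntityStores;

public class BatchedInsert {
    public static void main(String[] args) {
        // open (or create) an entity store; the path is just an example
        PersistentEntityStore store = PersistentEntityStores.newInstance("/tmp/exodus-demo");
        try {
            store.executeInTransaction(txn -> {
                for (int i = 0; i < 100_000; i++) {
                    Entity item = txn.newEntity("Item");   // "Item" is an arbitrary type name
                    item.setProperty("index", i);
                    // flush every 1,000 entities so the change set stays small
                    // instead of accumulating all 100K changes in memory
                    if ((i + 1) % 1_000 == 0) {
                        txn.flush(); // a real application should check the return value
                    }
                }
                // executeInTransaction commits whatever remains when the lambda returns
            });
        } finally {
            store.close();
        }
    }
}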

Related

Ruby set to store large number of words and look up

I am analyzing strings to check if they contain names of places. These strings can have letters, numbers and random characters, so we extract contiguous sequences of letters and then check if these sequences exist in a dictionary of places.
This dictionary of places has about 45,000 names; the smallest is 2-3 characters and the largest is 24 characters.
My initial thoughts are to store them in a Ruby Set and use include? to verify if the PLACES_SET has the sequences in them.
This method that checks for place names is called from inside an Active Job that runs quite frequently.
The entire ruby set file is about 908KB.
What are the memory implications of loading such a large set from a job? Are there options for deferred loading? Or will manual garbage collection help?
Are there any other alternatives I should consider, like database storage? (This has query-performance overhead.)
As Sergio has observed, the question is less about memory (1 MB is not that large these days; most smartphones could handle it). It is more about how often you load the set vs. how frequently you perform a lookup against it once it is loaded.
If the list of places is volatile or needs to be maintained without redeploying your application, then some kind of DBMS might be suitable, and if you are worried about performance you can always put a distributed cache like Redis in front of the DB.
The global set looks like a good option, and it will be easily understood by subsequent maintainers.
My advice on performance is to keep it simple, and only optimise for performance when you actually have a performance problem. Otherwise, you risk optimising the wrong thing and making your solution unnecessarily complex.
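To make the "load once, look up many times" point concrete, here is a rough sketch (in Java rather than Ruby, since the pattern is the same in any language). The file name places.txt, the one-name-per-line format and the class name are assumptions; the point is to pay the load cost once per process and make each lookup a cheap membership check.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class Places {
    // loaded once when the class is initialised, then shared by every lookup
    private static final Set<String> PLACES = load();

    private static Set<String> load() {
        try {
            // "places.txt" (one name per line) stands in for the real dictionary
            return new HashSet<>(Files.readAllLines(Path.of("places.txt")));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static boolean isPlace(String word) {
        return PLACES.contains(word); // O(1) on average; ~45K entries is tiny
    }
}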

Backtracking - convenient way to store the resulting DataTree on the filesystem

I have created a backtracking algorithm, but after a while the program runs out of memory, since the number of results is so huge. So I need to find a way to store the resulting data tree on the filesystem rather than in memory/RAM.
So I am looking for a convenient way to do that, such that there are as few I/O actions as possible, but also a moderate usage of RAM (max ≈2GB).
One way could be to store each node in its own file, which would probably lead to billions of small files. Or I could store each level of the tree in a single file, but then those files can grow very large. If they grow too large, the content won't fit into RAM when reading the data, which brings me back to the original problem.
Would it be a good idea to have some files for the nodes and others for the links?
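One common variant of that idea is to keep all nodes in a single append-only file and use byte offsets as the links, so any node can be read back individually without loading the whole tree. Below is a rough sketch in Java with a made-up fixed-size record layout (one long value plus two child offsets, -1 meaning "no child"); in a backtracking setting, children would be written before their parent, or their offsets patched in later. No caching or error handling is shown.

import java.io.IOException;
import java.io.RandomAccessFile;

// Each node is a fixed-size record: a long "value" plus the file offsets
// of its children (-1 when absent). Offsets play the role of pointers,
// so only the nodes on the current path need to be in RAM.
public class TreeFile implements AutoCloseable {
    private final RandomAccessFile file;

    public TreeFile(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    // appends a node and returns its offset, which the caller stores in the parent
    public long writeNode(long value, long leftOffset, long rightOffset) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.writeLong(value);
        file.writeLong(leftOffset);
        file.writeLong(rightOffset);
        return offset;
    }

    // reads back the three fields of the node stored at the given offset
    public long[] readNode(long offset) throws IOException {
        file.seek(offset);
        return new long[] { file.readLong(), file.readLong(), file.readLong() };
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}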

ios: large number of database records - possible?

I may need to process a large number of database records, stored locally on an iPad, using Swift. I've yet to choose the database (but likely SQLite unless suggestions point elsewhere). The records could number up to 700,000 and would need to be processed by adding up totals, working out percentages, etc.
Is this even possible on an iPad with limited processing power? Is there a software storage limit? Is the iPad up to processing such large amounts of data on the fly?
Another option may be to split the data into smaller chunks of around 30,000 records and work with those. Even then I am not sure it's a practical thing to attempt.
Any advice on how, or if, to approach this and what limitations may apply?

How do I get the size of a ruby object in mb in Rails?

I want to query an ActiveRecord model, modify it, and calculate the size of the new object in mb. How do I do this?
Unfortunately, neither the size of data rows in a database nor the in-memory size of Ruby objects is readily available.
While it is a bit easier to get a feeling for the object size in memory, you would still have to find all objects that are part of your ActiveRecord object and thus should be counted (which is not obvious). Even then, you would have to deal with non-obvious things like shared/cached data and class overhead, which may or may not need to be counted.
On the database side, it heavily depends on the storage engine used. From the documentation of your database, you can normally deduce the storage requirements for each of the columns you defined in your table (which might vary in the case of VARCHAR, TEXT, or BLOB columns). On top of this come shared resources like indexes, general table overhead, and so on. To get an estimate, though, the documented size requirements for the various columns in your table should be sufficient.
Generally, it is really hard to get a correct size for complex things like database rows or in-memory objects. The systems are not built to collect or provide this information.
Unless you absolutely, positively need exact numbers, you should err on the side of too much space. Generally, for databases it doesn't hurt to have too much disk space (in which case the database will generally run a little faster) or too much memory (which will reduce memory pressure for Ruby and again make it faster).
Often, the memory usage of Ruby processes will not be obvious. Thus, the best course of action is almost always to write your program and then test it with the desired amount of real data, checking its performance and memory requirements. That way, you get the actual information you need, namely: how much memory does my program need when handling my required dataset?
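As a concrete example of the "measure, don't guess" approach, here is a crude heap probe (shown in Java; in Ruby the equivalent would be something like GC.stat or ObjectSpace.memsize_of, but the idea is the same). The million-record loop is only a stand-in for loading your real dataset, and the numbers are indicative rather than exact.

// a rough "measure, don't guess" sketch: the absolute numbers are only
// approximate (GC timing, allocator overhead), but good enough for sizing
public class MemoryProbe {
    public static void main(String[] args) {
        long before = usedHeap();
        java.util.List<String> data = new java.util.ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            data.add("record-" + i); // stand-in for loading the real dataset
        }
        long after = usedHeap();
        System.out.printf("~%.1f MB for %d records%n", (after - before) / 1e6, data.size());
    }

    private static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // only a hint, but usually enough for a rough before/after diff
        return rt.totalMemory() - rt.freeMemory();
    }
}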
The size of the record will be totally dependent on your database, which is independent of your Ruby on Rails application. It's going to be a challenge to figure out how to get the size, as you need to ask the DATABASE how big it is, and Rails (by design) shields you very much from the actual implementation details of your DB.
If you need to know the storage requirements to estimate how big a hard disk to buy, I'd do some basic math: estimate the size in memory, then multiply by 1.5 to give yourself some room.
If you REALLY need to know how much room it takes, try recording how much free space you have on disk, write a few thousand records, measure again, and then do the math.

Memory efficiency vs Processor efficiency

In general use, should I bet on memory efficiency or processor efficiency?
In the end, I know it must depend on the software/hardware specs, but I think there's a general rule for when there are no constraints.
Example 01 (memory efficiency):
int n=0;
if(n < getRndNumber())
n = getRndNumber();
Example 02 (processor efficiency):
int n=0, aux=0;
aux = getRndNumber();
if(n < aux)
n = aux;
They're just simple examples; I wrote them to show what I mean. Better examples will be well received.
Thanks in advance.
I'm going to wheel out the universal performance question trump card and say "neither, bet on correctness".
Write your code in the clearest possible way, set specific measurable performance goals, measure the performance of your software, profile it to find the bottlenecks, and then if necessary optimise knowing whether processor or memory is your problem.
(As if to make the case in point, your 'simple examples' have different behaviour if getRndNumber() does not return a constant value. If you'd written it in the simplest way, something like n = max(0, getRndNumber()), it may be less efficient but it would be more readable and more likely to be correct.)
Edit:
To answer Dervin's criticism below, I should probably state why I believe there is no general answer to this question.
A good example is taking a random sample from a sequence. For sequences small enough to be copied into another contiguous memory block, a partial Fisher-Yates shuffle which favours computational efficiency is the fastest approach. However, for very large sequences where insufficient memory is available to allocate, something like reservoir sampling that favours memory efficiency must be used; this will be an order of magnitude slower.
So what is the general case here? For sampling a sequence should you favour CPU or memory efficiency? You simply cannot tell without knowing things like the average and maximum sizes of the sequences, the amount of physical and virtual memory in the machine, the likely number of concurrent samples being taken, the CPU and memory requirements of the other code running on the machine, and even things like whether the application itself needs to favour speed or reliability. And even if you do know all that, then you're still only guessing, you don't really know which one to favour.
Therefore the only reasonable thing to do is implement the code in a manner favouring clarity and maintainability (taking factors you know into account, and assuming that clarity is not at the expense of gross inefficiency), measure it in a real-life situation to see whether it is causing a problem and what the problem is, and then if so alter it. Most of the time you will not have to change the code as it will not be a bottleneck. The net result of this approach is that you will have a clear and maintainable codebase overall, with the small parts that particularly need to be CPU and/or memory efficient optimised to be so.
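For reference, here is a minimal reservoir-sampling sketch (Algorithm R) in Java: it keeps only k items in memory no matter how long the input is, which is the memory-over-CPU end of the trade-off described above. The generic signature is just one way to write it.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class Reservoir {
    // uniformly samples k items from a stream of unknown length using O(k) memory
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);                        // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen);  // 0 <= j < seen
                if (j < k) {
                    reservoir.set((int) j, item);           // replace with probability k/seen
                }
            }
        }
        return reservoir;
    }
}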
You think one is unrelated to the other? Why do you think that? Here are two examples of bottlenecks that often go unconsidered.
Example 1
You design a DB-related software system and find that I/O is slowing you down as you read in one of the tables. Instead of allowing multiple queries resulting in multiple I/O operations, you ingest the entire table first. Now all rows of the table are in memory, and the only limitation should be the CPU. Patting yourself on the back, you wonder why your program becomes hideously slow on memory-poor computers. Oh dear, you've forgotten about virtual memory, swapping, and such.
Example 2
You write a program whose methods create many small objects but run in O(1), O(log n), or at worst O(n) time. You've optimized for speed but see that your application takes a long time to run. Curious, you profile to discover what the culprit could be. To your chagrin, you discover that all those small objects add up fast. Your code is being held back by the GC.
You have to decide based on the particular application, usage, etc. In your example above, both memory and processor usage are trivial, so it's not a good example.
A better example might be the use of history tables in chess search. This method caches previously searched positions in the game tree in case they are re-searched in other branches of the game tree or on the next move.
However, it does cost space to store them, and space also requires time. If you use up too much memory you might end up using virtual memory which will be slow.
Another example might be caching in a database server. Clearly it is faster to access a cached result from main memory, but then again it would not be a good idea to keep loading and freeing from memory data that is unlikely to be re-used.
In other words, you can't generalize. You can't even make a decision based on the code - sometimes the decision has to be made in the context of likely data and usage patterns.
In the past 10 years, main memory has hardly increased in speed at all, while processors have continued to race ahead. There is no reason to believe this is going to change.
Edit: Incidentally, in your example, aux will most likely end up in a register and never make it to memory at all.
Without context, I think optimising for anything other than readability and flexibility is a mistake.
So, the only general rule I could agree with is "Optimise for readability, while bearing in mind the possibility that at some point you may have to optimise for either memory or processor efficiency".
Sorry it isn't quite as catchy as you would like...
In your example, version 2 is clearly better, even though version 1 is prettier to me, since, as others have pointed out, calling getRndNumber() multiple times requires more knowledge of getRndNumber() to reason about.
It's also worth considering the scope of the operation you are looking to optimize; if the operation is time sensitive, say part of a web request or GUI update, it might be better to err on the side of completing it faster than saving memory.
Processor efficiency. Memory is egregiously slow compared to your processor. See this link for more details.
Although, in your example, the two would likely be optimized to be equivalent by the compiler.
