Backtracking: a convenient way to store the resulting data tree on the filesystem

I have created a backtracking algorithm, but after a while the program runs out of memory because the number of results is so huge. So I am looking for a way to store the resulting data tree on the filesystem rather than in memory/RAM.
I want to do this in a convenient way, with as few I/O operations as possible but also a moderate use of RAM (max ≈2 GB).
One way could be to store each node in its own file, which would probably lead to billions of small files. Or I could store each level of the tree in a single file, but then those files can grow very large. If they grow too large, their contents won't fit into RAM when I read the data back, which brings me back to the original problem.
Would it be a good idea to have some files for the nodes and others for the links?
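A minimal sketch of one middle ground, assuming the nodes can be serialized to JSON and spread over a fixed number of append-only bucket files (all names, the bucket count, and the record layout here are hypothetical, not taken from the question):

```python
import json
import os
import zlib

# Hypothetical sketch: spread serialized nodes over a fixed number of bucket
# files, so there is neither one file per node nor one huge file per level.
NUM_BUCKETS = 4096          # illustrative; pick so each bucket stays small
STORE_DIR = "tree_store"    # illustrative location

def bucket_path(node_id: str) -> str:
    bucket = zlib.crc32(node_id.encode("utf-8")) % NUM_BUCKETS
    return os.path.join(STORE_DIR, f"bucket_{bucket:04d}.jsonl")

def append_node(node_id: str, parent_id, payload: dict) -> None:
    # Append-only, sequential writes; each line stores one node together with
    # the link to its parent, so nodes and links live in the same record.
    os.makedirs(STORE_DIR, exist_ok=True)
    record = {"id": node_id, "parent": parent_id, "data": payload}
    with open(bucket_path(node_id), "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def load_node(node_id: str):
    # A read only ever scans one bucket file, never the whole tree.
    try:
        with open(bucket_path(node_id), encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["id"] == node_id:
                    return record
    except FileNotFoundError:
        pass
    return None

if __name__ == "__main__":
    append_node("root", None, {"depth": 0})
    append_node("root/a", "root", {"depth": 1})
    print(load_node("root/a"))
```

Lookups only ever touch one bucket file, and RAM usage stays bounded by whatever index of hot buckets you choose to keep in memory, while still avoiding billions of tiny files.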

Related

How to reduce the memory usage of a writing transaction?

It seems like Xodus uses an amount of memory proportional to the size of a writing transaction (correct me if I am wrong, please). So for big transactions this can become a problem, and for my application I have to choose a much larger heap size "just in case" Xodus ends up with a large working set. Are there ways to reduce the memory use? A config setting? Heuristics?
A first possible approach is to split the changes into batches and flush them one by one using jetbrains.exodus.entitystore.StoreTransaction#flush. For example, if you want to insert 100K entities into the database, it's better to do that in batches (see the sketch after this answer).
If you make extensive use of large blobs, it's better to store them in temporary files first.
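Since Xodus is a JVM library, here is only the general batch-and-flush pattern described above, sketched in Python with sqlite3 as a stand-in store; the table, batch size, and commit calls are illustrative, and commit() plays the role that StoreTransaction#flush plays in Xodus:

```python
import sqlite3

BATCH_SIZE = 10_000  # illustrative; tune against your memory budget

def insert_in_batches(rows):
    # Keep the in-flight change set small by persisting every BATCH_SIZE rows
    # instead of accumulating 100K changes in a single transaction.
    conn = sqlite3.connect("store.db")
    conn.execute("CREATE TABLE IF NOT EXISTS entities (id INTEGER, payload TEXT)")
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO entities (id, payload) VALUES (?, ?)", row)
        pending += 1
        if pending >= BATCH_SIZE:
            conn.commit()   # analogous to StoreTransaction#flush
            pending = 0
    conn.commit()           # flush the final partial batch
    conn.close()

if __name__ == "__main__":
    insert_in_batches((i, f"payload-{i}") for i in range(100_000))
```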

How do I get the size of a Ruby object in MB in Rails?

I want to query an ActiveRecord model, modify it, and calculate the size of the new object in MB. How do I do this?
Unfortunately, neither the size of data rows in a database nor the in-memory size of Ruby objects is readily available.
While it is a bit easier to get a feel for the object size in memory, you would still have to find all the objects that make up your ActiveRecord object and thus should be counted (which is not obvious). Even then, you would have to deal with non-obvious things like shared/cached data and class overhead, which may or may not need to be counted.
On the database side, it heavily depends on the storage engine used. From your database's documentation, you can normally deduce the storage requirements for each of the columns you defined in your table (which can vary for VARCHAR, TEXT, or BLOB columns). On top of this come shared resources like indexes, general table overhead, and so on. To get an estimate, though, the documented size requirements for the various columns in your table should be sufficient.
Generally, it is really hard to get a correct size for complex things like database rows or in-memory objects. The systems are not built to collect or provide this information.
Unless you absolutely, positively need exact figures, you should err on the side of too much space. Generally, for databases it doesn't hurt to have too much disk space (in which case the database will generally run a little faster) or too much memory (which will reduce memory pressure for Ruby and again make it faster).
Often, the memory usage of Ruby processes is not obvious. Thus, the best course of action is almost always to write your program and then test it with the desired amount of real data, checking its performance and memory requirements. That way, you get the information you actually need: how much memory does my program need when handling my required dataset?
The size of the record will be totally dependent on your database, which is independent of your Ruby on Rails application. It's going to be a challenge to figure out how to get the size, as you need to ask the DATABASE how big it is, and Rails (by design) shields you very much from the actual implementation details of your DB.
If you need to know the storage requirements to estimate how big a hard disk to buy, I'd do some basic math: estimate the size in memory, then multiply by 1.5 to give yourself some room.
If you REALLY need to know how much room it takes, try recording how much free space you have on disk, writing a few thousand records, measuring again, and then doing the math, as in the sketch below.
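A sketch of that measure-before-and-after idea, written in Python with sqlite3 and a plain file-size check as stand-ins for a Rails app and its database; the schema, payload, and record count are illustrative:

```python
import os
import sqlite3

DB_PATH = "estimate.db"
N_RECORDS = 5000  # "a few thousand records"

conn = sqlite3.connect(DB_PATH)
conn.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER PRIMARY KEY, body TEXT)")
conn.commit()

before = os.path.getsize(DB_PATH)
conn.executemany(
    "INSERT INTO samples (body) VALUES (?)",
    ((f"some representative payload {i}",) for i in range(N_RECORDS)),
)
conn.commit()
after = os.path.getsize(DB_PATH)
conn.close()

# Rough bytes-per-record figure; index and page overhead are amortized in.
print(f"~{(after - before) / N_RECORDS:.1f} bytes per record on disk")
```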

Performance of flat directory structures in iOS

On the iOS filesystem, is there a way to optimize file access performance by using a tiered directory structure vs. a flat directory structure?
Specifically, my app has Objects that each contain a number of images and data files. A user could create thousands of these Objects and I need to optimize access to one image for ~100 arbitrary Objects at a time.
In this situation, how should I organize files on the filesystem? Would a tiered directory structure be faster than a flat one? And if so, how should I structure the tiered system (i.e. how many tiers, and how many subdirectories / files per tier)?
THANKS!
Well, first of all, you might as well try it with a flat structure to see whether it is slow or not. Perhaps Apple has put in code to optimize how files are found and you don't even need to worry about this. You can probably build out the whole app, test how quickly it loads, and see if that meets your requirements.
If you need to speed it up, I would suggest building some sort of structure based on the name of the file. You could have a folder that holds all of the items beginning with the letter 'a', another for 'b', and so on. This would split the files into 26 folders, which should significantly decrease the number of items in each. Depending on how you name the files, you might want a different scheme so that each folder ends up with a similar number of items (see the sketch below).
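A language-agnostic sketch of that bucketing idea (shown in Python only for brevity; the directory layout and the hash-based variant are illustrative, not part of the question):

```python
import hashlib
import os

BASE_DIR = "objects"  # hypothetical root directory for the app's files

def bucket_dir(filename: str) -> str:
    # First-letter scheme from the answer: os.path.join(BASE_DIR, filename[0].lower())
    # Hash-based variant: spreads files evenly regardless of how they are named.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return os.path.join(BASE_DIR, digest[:2])  # 256 buckets: "00" .. "ff"

def path_for(filename: str) -> str:
    directory = bucket_dir(filename)
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)

if __name__ == "__main__":
    print(path_for("IMG_0042.png"))  # e.g. objects/ab/IMG_0042.png
```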
If you are using Core Data, you could always just enable the Allows External Storage option in the attribute of your model and let the system decide where it should go.
That would be a decent first step to see if the performance is ok.

Traversing directory without recursion in iOS

I need to traverse all the files in the folder structure of the directories my app accesses on shared servers. With the included static libraries I'm able to access various servers and the files shared on them. The list of all servers is stored in an NSArray.
I need to traverse all the folders shared by a server and store all the files in a container. I have used recursion, but that has a huge impact on performance as the number of folders and subfolders increases.
Can anyone suggest an algorithm or approach for traversing the directory structure?
Kindly refer to the illustration below to get an idea of the structure.
One possibility could be to use threads, but how should I divide the logic for iterating over all the folders so that the threads can work on them in parallel?
Being a mobile app, I don't have the luxury of ample memory.
Remark: "I don't have a luxury of memory." - ask an engineer who worked in the 70's. He will say that the 1GB of RAM you have in your iPhone is more than enough.
To the point: are you sure that it is indeed the recursion itself that has such a great impact on performance? Of course, there are algorithms for traversing a tree data structure (such as a directory in the filesystem) without recursion, using an explicit stack, but that really is painful.
Instead, make sure you obtain only the necessary information, so do not, for example, get all the attributes and hard link count and birthday and... and... of a file if all you need is its full path.
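For completeness, a minimal sketch of the explicit-stack traversal mentioned above, written in Python with os.scandir since the question's NSFileManager code isn't shown; the same shape carries over to Objective-C or Swift:

```python
import os

def iter_files(root: str):
    """Yield full paths of all files under root, without recursion."""
    stack = [root]  # explicit stack replaces the call stack
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        # Only the path is collected; no extra attributes are fetched.
                        yield entry.path
        except OSError:
            continue  # unreadable directory: skip it and keep going

if __name__ == "__main__":
    for path in iter_files("."):
        print(path)
```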

Removing data from a HDF5 file

I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements; it's actually a time series. The data is first collected offline into the HDF5 file and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300, and 350-400 of a 500-element dataset and dump the rest. But how?
Does anybody have experience with how to accomplish this in HDF5? Apparently it could be done in at least several ways:
(Obvious solution), create a new fresh file and write the necessary data there, element by element. Then delete the old file.
Or, into the old file, create a new fresh dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset. Then maybe run through h5repack to release the free space.
Since the files can be quite big even when the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but it seems to be required to actually release the free space. Any hints from HDF5 experts?
HDF5 (at least the version I am used to, 1.6.9) does not allow deletion. Actually, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something that you can do is to have a lateral dataset containing boolean values, telling you which elements are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
An alternative is to define a hyperslab on your array, copy the relevant data, and then delete the old array; or always access the data through the slab and redefine it as you need (I've never done it, though, so I'm not sure if it's possible, but it should be).
Finally, you can use the HDF5 mounting strategy to keep your datasets in an "attached" HDF5 file that you mount on your root HDF5 file. When you want to delete the stuff, copy the interesting data into another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around), but it allows you to free space and to operate only on subparts of your data tree, instead of using repack.
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format is here, for anyone who is curious), so deleting and shrinking things just leaves holes in an identical-sized file. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
As the other answer has mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that were more amenable to the sort of manipulation you would be doing, but I suspect that you'll still be copying a lot of data and this would definitely add additional complexity and file management overhead.
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.
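For reference, a minimal sketch of the copy-then-unlink-then-repack route using h5py, assuming series.h5 already contains a dataset named timeseries; the file name, dataset names, and slice ranges are illustrative:

```python
import subprocess

import h5py

# Half-open ranges to keep, per the question's 0-100 / 200-300 / 350-400 example.
KEEP = [(0, 101), (200, 301), (350, 401)]

with h5py.File("series.h5", "a") as f:
    old = f["timeseries"]
    total = sum(stop - start for start, stop in KEEP)
    new = f.create_dataset("timeseries_kept", shape=(total,), dtype=old.dtype)

    offset = 0
    for start, stop in KEEP:
        n = stop - start
        new[offset:offset + n] = old[start:stop]   # hyperslab-style copy of one interesting range
        offset += n

    del f["timeseries"]                      # unlinks the old dataset (H5Ldelete under the hood)
    f.move("timeseries_kept", "timeseries")  # keep the original name

# The file does not shrink by itself; h5repack rewrites it and reclaims the free space.
subprocess.run(["h5repack", "series.h5", "series_packed.h5"], check=True)
```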

Resources