I have an HDF5 file with a one-dimensional (N x 1) dataset of compound elements - actually it's a time series. The data is first collected offline into the HDF5 file, and then analyzed. During analysis most of the data turns out to be uninteresting, and only some parts of it are interesting. Since the datasets can be quite big, I would like to get rid of the uninteresting elements while keeping the interesting ones. For instance, keep elements 0-100, 200-300 and 350-400 of a 500-element dataset, and dump the rest. But how?
Does anybody have experience with how to accomplish this in HDF5? Apparently it could be done in several ways, at least:
(Obvious solution), create a new fresh file and write the necessary data there, element by element. Then delete the old file.
Or, into the old file, create a new fresh dataset, write the necessary data there, unlink the old dataset using H5Gunlink(), and get rid of the unclaimed free space by running the file through h5repack.
Or, move the interesting elements within the existing dataset towards the start (e.g. move elements 200-300 to positions 101-201 and elements 350-400 to positions 202-252). Then call H5Dset_extent() to reduce the size of the dataset. Then maybe run through h5repack to release the free space.
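For the third option, what I have in mind is roughly the following sketch (1.8-style API calls, no error checking; the dataset name and sizes are just placeholders, and I'm assuming the dataset was created chunked so that H5Dset_extent() is actually allowed to shrink it):

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    hid_t file  = H5Fopen("data.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t dset  = H5Dopen2(file, "ts", H5P_DEFAULT);
    hid_t dtype = H5Dget_type(dset);                 /* the compound type */
    size_t esize = H5Tget_size(dtype);

    /* move elements 200-300 down to positions 101-201 (first block only) */
    hsize_t src = 200, dst = 101, count = 101;
    void *buf = malloc(count * esize);
    hid_t fspace = H5Dget_space(dset);
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &src, NULL, &count, NULL);
    H5Dread(dset, dtype, mspace, fspace, H5P_DEFAULT, buf);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &dst, NULL, &count, NULL);
    H5Dwrite(dset, dtype, mspace, fspace, H5P_DEFAULT, buf);

    /* ...same for elements 350-400 -> 202-252, then shrink to 253 elements */
    hsize_t new_size = 253;
    H5Dset_extent(dset, &new_size);

    free(buf);
    H5Sclose(mspace); H5Sclose(fspace);
    H5Tclose(dtype); H5Dclose(dset); H5Fclose(file);
    return 0;
}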
Since the files can be quite big even after the uninteresting elements have been removed, I'd rather not rewrite them (it would take a long time), but rewriting seems to be required to actually release the free space. Any hints from HDF5 experts?
HDF5 (at least the version I am used to, 1.6.9) does not allow deletion. Actually, it does, but it does not free the used space, with the result that you still have a huge file. As you said, you can use h5repack, but it's a waste of time and resources.
Something that you can do is to have a lateral dataset containing a boolean value, telling you which values are "alive" and which ones have been removed. This does not make the file smaller, but at least it gives you a fast way to perform deletion.
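A minimal sketch of that idea, assuming the time series lives in a 500-element dataset and the flag dataset is simply created next to it (the names here are made up, and it uses the 1.8-style API):

#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* A parallel "alive" flag dataset, same length as the data.  "Deleting"
       element i just clears flag i; the file does not shrink, but the
       bookkeeping is cheap. */
    hsize_t n = 500;
    hid_t file  = H5Fopen("data.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t alive = H5Dcreate2(file, "ts_alive", H5T_NATIVE_UINT8, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    unsigned char *flags = malloc((size_t)n);
    memset(flags, 1, (size_t)n);                     /* everything alive */
    H5Dwrite(alive, H5T_NATIVE_UINT8, H5S_ALL, H5S_ALL, H5P_DEFAULT, flags);

    /* "delete" elements 101-199 by clearing their flags */
    hsize_t start = 101, count = 99;
    hid_t mspace = H5Screate_simple(1, &count, NULL);
    hid_t fspace = H5Dget_space(alive);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    memset(flags, 0, (size_t)count);
    H5Dwrite(alive, H5T_NATIVE_UINT8, mspace, fspace, H5P_DEFAULT, flags);

    free(flags);
    H5Sclose(mspace); H5Sclose(fspace); H5Sclose(space);
    H5Dclose(alive); H5Fclose(file);
    return 0;
}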
An alternative is to define a hyperslab on your dataset, copy the relevant data and then delete the old dataset, or to always access the data through the slab and redefine it as you need. (I've never done the latter, though, so I'm not sure it's possible, but it should be.)
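Something along these lines should work for the slab-and-copy route (untested sketch using the 1.8 API; names and ranges are placeholders, and the freed space still needs h5repack to disappear from disk):

#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    /* Select the ranges to keep as a union of hyperslabs, copy them into
       a new, smaller dataset and unlink the old one. */
    hid_t file   = H5Fopen("data.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    hid_t old_d  = H5Dopen2(file, "ts", H5P_DEFAULT);
    hid_t dtype  = H5Dget_type(old_d);
    hid_t fspace = H5Dget_space(old_d);

    /* keep elements 0-100, 200-300 and 350-400 (253 elements in total) */
    hsize_t starts[3] = {0, 200, 350};
    hsize_t counts[3] = {101, 101, 51};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &starts[0], NULL, &counts[0], NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_OR,  &starts[1], NULL, &counts[1], NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_OR,  &starts[2], NULL, &counts[2], NULL);

    hsize_t kept = 253;
    void *buf    = malloc((size_t)kept * H5Tget_size(dtype));
    hid_t mspace = H5Screate_simple(1, &kept, NULL);
    H5Dread(old_d, dtype, mspace, fspace, H5P_DEFAULT, buf);

    hid_t new_d = H5Dcreate2(file, "ts_kept", dtype, mspace,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(new_d, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Ldelete(file, "ts", H5P_DEFAULT);   /* or H5Gunlink() on old versions */

    free(buf);
    H5Sclose(mspace); H5Sclose(fspace);
    H5Dclose(new_d); H5Dclose(old_d); H5Tclose(dtype); H5Fclose(file);
    return 0;
}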
Finally, you can use the HDF5 mounting feature to keep your datasets in an "attached" HDF5 file that you mount on your root HDF5 file. When you want to delete the stuff, copy the interesting data into another mounted file, unmount the old file and remove it, then remount the new file in the proper place. This solution can be messy (as you have multiple files around), but it allows you to free space and to operate only on subparts of your data tree, instead of repacking everything.
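The mounting part, in code, looks roughly like this (file and group names are invented for the example; the mount point has to be an existing group in the root file):

#include <hdf5.h>

int main(void)
{
    /* The mount/unmount dance.  root.h5 only holds an (empty) group used
       as a mount point; the bulky datasets live in the child file, which
       can be replaced wholesale. */
    hid_t root  = H5Fopen("root.h5",    H5F_ACC_RDWR,   H5P_DEFAULT);
    hid_t child = H5Fopen("bulk_v1.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    H5Fmount(root, "/timeseries", child, H5P_DEFAULT);
    /* ... read /timeseries/ts through `root` as if it were local ... */
    H5Funmount(root, "/timeseries");
    H5Fclose(child);

    /* After copying only the interesting data into bulk_v2.h5 (e.g. with
       the hyperslab approach above), delete bulk_v1.h5 on disk and mount
       the new file in the same place. */
    hid_t child2 = H5Fopen("bulk_v2.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    H5Fmount(root, "/timeseries", child2, H5P_DEFAULT);

    H5Funmount(root, "/timeseries");
    H5Fclose(child2);
    H5Fclose(root);
    return 0;
}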
Copying the data or using h5repack as you have described are the two usual ways of 'shrinking' the data in an HDF5 file, unfortunately.
The problem, as you may have guessed, is that an HDF5 file has a complicated internal structure (the file format is here, for anyone who is curious), so deleting and shrinking things just leaves holes in an identical-sized file. Recent versions of the HDF5 library can track the freed space and re-use it, but your use case doesn't seem to be able to take advantage of that.
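For reference, this is roughly how that free-space tracking is switched on (if I remember correctly, HDF5 1.10.1 or later); note that it has to be chosen when the file is created, it only lets later writes to the same file reuse the holes, and it never makes the file smaller on disk:

#include <hdf5.h>

int main(void)
{
    /* Create a file whose freed space is tracked persistently and can be
       re-used by later allocations within the same file. */
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_FSM_AGGR,
                               1 /* persist free-space info */,
                               1 /* threshold in bytes */);
    hid_t file = H5Fcreate("tracked.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);

    H5Pclose(fcpl);
    H5Fclose(file);
    return 0;
}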
As the other answer has mentioned, you might be able to use external links or the virtual dataset feature to construct HDF5 files that are more amenable to the sort of manipulation you would be doing, but I suspect that you'll still be copying a lot of data, and this would definitely add complexity and file-management overhead.
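For what it's worth, an external link is a one-call affair; something like this (paths and names made up) keeps the bulky data in a separate file that the master file merely points to, so the bulky file can be rewritten or replaced on its own:

#include <hdf5.h>

int main(void)
{
    /* master.h5 gets a link that resolves to the dataset /ts stored in
       chunk_000.h5.  All file and object names here are invented. */
    hid_t master = H5Fopen("master.h5", H5F_ACC_RDWR, H5P_DEFAULT);
    H5Lcreate_external("chunk_000.h5", "/ts",        /* target file, object */
                       master, "/timeseries_part0",  /* link location, name */
                       H5P_DEFAULT, H5P_DEFAULT);
    H5Fclose(master);
    return 0;
}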
H5Gunlink() has been deprecated, by the way. H5Ldelete() is the preferred replacement.
I have created a backtracking algorithm, but after a while the program runs out of memory, since the number of results is so huge. So I am trying to find a way to store the resulting data tree on the filesystem, rather than in memory/RAM.
I am looking for a convenient way to do that, such that there are as few I/O operations as possible, but also a moderate usage of RAM (max ≈2GB).
One way could be to store each node in a single file, which would probably lead to billions of small files. Or I could store each level of the tree in a single file, but then those files can grow very large. If those files grow too large, their content won't fit into RAM for reading, which brings me back to the original problem.
Would it be a good idea to have some files for the nodes and others for the links?
What is the best way to load huge text file data in Delphi? Is there any component that can load a text file super fast?
Let's say I have a text file that contains a database stored in fixed-length format.
It contains 150 fields, each at least 50 characters long.
1. I need to load it into memory
2. I need to parse it and probably store it in a memdataset for processing
My questions:
1. Is it enough if I use the TStringList.LoadFromFile method?
2. Is there any other, better component for manipulating the text file?
3. Should I use low-level reading from the text file?
Thank you in advance.
TStringList is never the optimal way of working with lots of text, but it's the simplest. If you've got small files on your hands, you can use TStringList without issues. Even if you have large files (not huge files), you might implement a version of your algorithm using TStringList for testing purposes, because it's simple and easy to understand.
If your files are large, as they probably are since you call them "databases", you need to look into alternative technologies that will enable you to read only as much as you need from the database. Look into:
TFileStream
Memory mapped files.
Don't look at the old "file"-based APIs still available in Delphi; they're just plain old.
I'm not going to go into details on how to access text using those methods because we've recently had two similar questions on SO:
How Can I Efficiently Read The First Few Lines of Many Files in Delphi
and
Fast Search to see if a String Exists in Large Files with Delphi
Since you have a fixed length that you're working with, you can build an access class based on TList with a TWriter and TReader that will take your records into account. You'll have none of the overhead of a TStringList (not that it's a bad thing, but if you don't need it, why have it) and you can build in your own access to records into the class.
Ultimately it depends on what you are trying to accomplish with the data once you have it loaded into memory. While TStringList is easy to use, it isn't as efficient as "rolling your own".
However, efficiency in data manipulation may not be that much of an issue, as you are using text files to hold a database. If you just need to read the data in and make decisions based on it, the more flexible TList may be overkill.
I recommend sticking with TStringList if you find it convenient for your problem. Optimization is a separate concern and should be done later.
As for TStringList, the optimization is to declare a descendant class that overrides the TStrings.LoadFromStream method - you can make it practically as fast as possible, taking the structure of your files into account.
It is not entirely clear from your question why you need to load the entire file into memory before going on to create an in-memory data set. Are you conflating the two issues? That is, because you need to create an in-memory data set, do you think you first need to load the source data entirely into memory? Or is there some initial pre-processing of the source file which is only possible with the entire file loaded into memory? (The latter is unlikely, and even if it is the case, it isn't necessary with a navigable stream object such as a TFileStream.)
But I think the answer you are looking for is right there in the question....
If you are loading this file in order to parse it and populate/initialise a further data structure (the data set) for further processing, then using an existing high level data structure is an unnecessary and potentially costly (in terms of time) step.
Use the lowest level means of access that provides the capabilities you need.
In this case a TFileStream will likely provide the best balance of performance and ease of use.
I have a few .dat and .idx files inside a .dbs folder, so presumably these were created using Informix SE. I need a way to view or extract info from these files.
I have Informix SE 7.25 installed on a Windows platform, but I don't know how to read those files. Do I use dbexport.exe that comes with Informix SE, or dbaccess.exe? I tried to play around with both but don't know how to use them properly.
Can anyone help me with this problem?
If the files are in a directory dbase.dbs, then you should be able to use:
dbexport dbase
to generate a directory dbase.exp containing a schema file dbase.sql and a series of .unl files — the unloaded data from each table in the database.
That's by far the easiest way to get the data.
The data format is Informix's UNLOAD format — usually pipe-delimited, backslash-escaped files with one logical line per record. There are tedious wrinkles in the data format, but most of them won't affect you.
It's relatively straightforward (but by no means trivial) to convert the data into CSV format, and other formats would be possible too, with greater or lesser amounts of work.
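As an illustration of what that conversion involves - not an official tool, just a sketch of the pipe-delimited, backslash-escaped format that ignores some of the wrinkles mentioned above (multi-line VARCHAR values, for instance) - something like this handles the common case, reading .unl records on standard input and writing CSV to standard output:

#include <stdio.h>

static void unl_line_to_csv(const char *line, FILE *out)
{
    fputc('"', out);
    for (const char *p = line; *p && *p != '\n'; p++) {
        if (*p == '\\' && p[1] && p[1] != '\n') {
            p++;                                 /* escaped character        */
            if (*p == '"') fputc('"', out);      /* double quotes for CSV    */
            fputc(*p, out);
        } else if (*p == '|') {
            if (p[1] == '\0' || p[1] == '\n')
                break;                           /* trailing '|' ends record */
            fputs("\",\"", out);                 /* field separator          */
        } else if (*p == '"') {
            fputs("\"\"", out);                  /* escape quote for CSV     */
        } else {
            fputc(*p, out);
        }
    }
    fputs("\"\n", out);
}

int main(void)
{
    char line[65536];
    while (fgets(line, sizeof line, stdin))
        unl_line_to_csv(line, stdout);
    return 0;
}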
Using DB-Access would be a labour of love — it would be a lot harder work. You'd probably generate a list of tables from the database (with DB-Access), then munge that output into a series of UNLOAD statements, which you'd then run into DB-Access again to get the actual data. It can certainly be done; you should aim to avoid having to do it that way.
If the database is incomplete — for example, the system catalog is missing — then all is not lost, but the work gets far harder as you have to read the raw .dat files and you have to deduce the schema for the tables. That is hard to do 100% reliably. If you have the schema, then I have tools to get the data out of the database — contact me (see my profile).
On the iOS filesystem, is there a way to optimize file access performance by using a tiered directory structure vs. a flat directory structure?
Specifically, my app has Objects that each contain a number of images and data files. A user could create thousands of these Objects and I need to optimize access to one image for ~100 arbitrary Objects at a time.
In this situation, how should I organize files on the filesystem? Would a tiered directory structure be faster than a flat one? And if so, how should I structure the tiered system (i.e. how many tiers, and how many subdirectories / files per tier)?
THANKS!
Well, first of all, you might as well try it with a flat structure to see if it is slow or not. Perhaps Apple has put in code to optimize how files are found and you don't even need to worry about this. You can probably build out the whole app and just test how quickly it loads and see if that meets your requirements.
If you need to speed it up, I would suggest making some sort of structure based on the name of the file. You could have a folder which holds all of the items beginning with the letter 'a', another for 'b', and so on. This would split things into 26 folders, which should significantly decrease the number of items in each. Depending on how you name the files, you might want a different scheme so that each of the folders holds a similar number of items.
If you are using Core Data, you could always just enable the Allows External Storage option in the attribute of your model and let the system decide where it should go.
That would be a decent first step to see if the performance is ok.