Techniques for writing critical text data - delphi

We collect text/CSV-like data over long periods (~days) from costly experiments, so file corruption is to be avoided at all costs.
Recently, a file was copied using Explorer on XP whilst the experiment was in progress and the data was partially lost, presumably due to a multiple-access conflict.
What are some good techniques to avoid such loss? - We are using Delphi on Windows XP systems.
Some ideas we came up with are listed below - we'd welcome comments as well as your own input.

Use a database as a secondary data-storage mechanism and take advantage of its atomic transaction support.

How about splitting the large file into separate files, one for each day?
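A minimal sketch of the daily split, assuming the directory and file prefix below (both placeholders):

  uses SysUtils;

  // Builds a per-day name such as C:\Data\experiment_2010-06-25.csv
  // (directory and prefix are placeholders for illustration).
  function DailyLogFileName: string;
  begin
    Result := 'C:\Data\experiment_' + FormatDateTime('yyyy-mm-dd', Now) + '.csv';
  end;

Whatever writes the data simply asks for this name each time, so the rollover happens by itself at midnight and yesterday's file is free to copy.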

If these machines are on a network: send an HTTP POST with the logging data to a web server.
(Sending UDP packets would be even simpler.)
Make sure you only copy old data. If you have a timestamp in the filename with a 1-hour resolution, you can safely copy the data older than 1 hour.

If a write fails, cache the data for a later write, so that if the file is opened externally the data is still held internally, or it could even be written to a separate file on disk.
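A rough sketch of that caching idea, assuming one text line per data point and the default {$I+} I/O checking, so a locked file raises EInOutError (file name and helper name are illustrative):

  uses Classes, SysUtils;

  var
    Pending: TStringList;   // data points not yet safely on disk

  procedure CachedAppend(const FileName, Line: string);
  var
    F: TextFile;
    i: Integer;
  begin
    if Pending = nil then
      Pending := TStringList.Create;
    Pending.Add(Line);                    // always buffer first
    try
      AssignFile(F, FileName);
      if FileExists(FileName) then
        Append(F)
      else
        Rewrite(F);
      try
        for i := 0 to Pending.Count - 1 do
          WriteLn(F, Pending[i]);         // flush everything still pending
      finally
        CloseFile(F);
      end;
      Pending.Clear;                      // all of it made it to disk
    except
      on EInOutError do
        ;                                 // file locked (e.g. being copied): keep the data buffered
    end;
  end;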

I think what you're looking for is the Win32 CreateFile API, with these flags:
FILE_FLAG_WRITE_THROUGH : Write operations will not go through any intermediate cache; they will go directly to disk.
FILE_FLAG_NO_BUFFERING : The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.
There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.
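In Delphi, a write-through append might look roughly like this (error handling kept minimal; the share mode and file name are assumptions for illustration). FILE_FLAG_NO_BUFFERING is omitted here because of the sector-alignment requirements mentioned above:

  uses Windows, SysUtils;

  procedure AppendWriteThrough(const FileName, Line: string);
  var
    H: THandle;
    Written: DWORD;
    Data: AnsiString;
  begin
    Data := AnsiString(Line + #13#10);
    H := CreateFile(PChar(FileName),
                    GENERIC_WRITE,
                    FILE_SHARE_READ,          // allow readers, but no other writers
                    nil,
                    OPEN_ALWAYS,              // create the file if it does not exist
                    FILE_ATTRIBUTE_NORMAL or FILE_FLAG_WRITE_THROUGH,
                    0);
    if H = INVALID_HANDLE_VALUE then
      RaiseLastOSError;
    try
      SetFilePointer(H, 0, nil, FILE_END);    // append at the end
      if not WriteFile(H, Data[1], Length(Data), Written, nil) then
        RaiseLastOSError;
    finally
      CloseHandle(H);
    end;
  end;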

Each experiment must use a 'work' file and a 'done' file. The work file is opened exclusively, and the done file is copied to a place on the network. An application on the receiving machine would feed those files into a database. If Explorer tries to move or copy the work file, it will receive an 'Access denied' error.
The 'work' file would become 'done' after a certain period (say, 6/12/24 hours, or whatever period suits). The program then creates another work file (the name must contain the timestamp) and sends the 'done' file over the network (or a human can do that, which is what you are doing already, if I understand your text correctly).
Copying a file while it is in use is asking for corruption.
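A minimal sketch of the exclusive work/done scheme using TFileStream (the file names and the rollover trigger are placeholders):

  uses Classes, SysUtils;

  var
    WorkFile: TFileStream;

  // Open (or create) the work file with an exclusive share mode, so a copy
  // or move attempt from Explorer fails with 'Access denied' instead of
  // producing a torn copy.
  procedure OpenWorkFile(const FileName: string);
  begin
    if FileExists(FileName) then
      WorkFile := TFileStream.Create(FileName, fmOpenReadWrite or fmShareExclusive)
    else
      WorkFile := TFileStream.Create(FileName, fmCreate);   // fmCreate gives exclusive access in classic Delphi
    WorkFile.Position := WorkFile.Size;                     // continue appending
  end;

  // At the end of the period: release the lock, rename 'work' to 'done'
  // (now safe to copy across the network) and start the next work file.
  procedure RolloverToDone(const WorkName, DoneName: string);
  begin
    FreeAndNil(WorkFile);
    if not RenameFile(WorkName, DoneName) then
      RaiseLastOSError;
    OpenWorkFile(WorkName);
  end;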

Write data to a buffer file in an obscure directory and copy the data to the 'public' data file periodically (every 10 points, for instance), thereby reducing writes and also providing a backup.

Write data points discretely, i.e. open and close the file handle for every data-point write - this keeps the file open only for the brief moment of each write, provided the data rate is low.
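For example, a short sketch of the open-write-close pattern, assuming one CSV line per data point:

  uses SysUtils;

  procedure WriteDataPoint(const FileName, CSVLine: string);
  var
    F: TextFile;
  begin
    AssignFile(F, FileName);
    if FileExists(FileName) then
      Append(F)
    else
      Rewrite(F);
    try
      WriteLn(F, CSVLine);   // one data point per line
    finally
      CloseFile(F);          // the handle is held only for the duration of the write
    end;
  end;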

Related

nonatomic append failure outcomes

I've got a file that I want to append to. It's pretty important to me that this is both fast (I'm calling it at 50 Hz on an iPhone 4) and safe.
I've looked at atomic appending. It seems to me like I would have to copy the whole file, append to it, and then use the NSFileManager's replaceItemAtURL to move them over, which sounds rather slow.
On the other hand, I could simply suck up a non-atomic append, assuming that the failure conditions are strictly that some subset of bytes at the end of the data I'm trying to write are not written. My file format writes out the length of each chunk first, so if there's not enough space for the length data or the length data is bigger than the available bytes, I can detect a partial write and discard.
The question is: how feasible would it be to rapidly append small amounts of data (half a kilobyte or so at a time) atomically, and what exactly are the failure outcomes of a non-atomic append?
Edit: I am the only one appending to this file. I am concerned only with external failure conditions, e.g. process termination, device running out of power, disk full, etc. I am currently using a synchronous append.
POSIX gives no guarantees about atomicity of write(2) when writing to a file.
If the platform does not provide any other means of writing that grants additional characteristics (and I'm not aware of any such API in iOS) you basically have to live with the possibility that the write could be partial.
The workaround used by many Cocoa APIs (like -[NSData writeToFile:atomically:]) is the mechanism you mentioned: perform the work on a temporary file and then atomically rename(2) the new file over the old one. This strategy does not apply well to your use case, as it requires a copy of the old contents.
I would suggest the non-atomic approach you already considered. Actually I once used a very similar mechanism in an iOS app where I had to write a transcript of user actions for crash recovery. The recovery code thoroughly tested the transcript for integrity and would bail out on unexpected errors. Yet, I never received a single report of a corrupt file.
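The recovery check itself is language-agnostic; here is a rough sketch in Delphi (assuming each chunk is written as a 4-byte length followed by its payload) of how a reader could skip complete records and discard a truncated tail:

  uses Classes, SysUtils;

  // Returns the number of complete records and truncates any partial tail
  // (the stream must have been opened for read/write access).
  function RepairChunkFile(Stream: TStream): Integer;
  var
    Len: Cardinal;
    GoodEnd: Int64;
  begin
    Result := 0;
    GoodEnd := 0;
    Stream.Position := 0;
    while Stream.Position + SizeOf(Len) <= Stream.Size do
    begin
      Stream.ReadBuffer(Len, SizeOf(Len));
      if Stream.Position + Len > Stream.Size then
        Break;                        // length claims more data than is present: partial write
      Stream.Seek(Len, soCurrent);    // skip over the record body
      GoodEnd := Stream.Position;     // last known-good end of data
      Inc(Result);
    end;
    Stream.Size := GoodEnd;           // discard the truncated tail, if any
  end;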

Solution For Monitoring and Maintaining App's Size on Disc

I'm building an app that makes extensive use of CoreData, and a lot of my models have UIImage and NSData properties (for images and videos). Since it's not a great idea to store that data directly in CoreData, I built a file manager class that writes the files into different buckets in the documents directory depending on the context in which they were created and the media type.
My question now is: how do I manage the documents directory? Is there a way to detect how much space the app has used up out of its total allocated space? Additionally, what is the best way to go about cleaning those directories; do I check every time a file is written, or only on app launch, etc.?
Is there a way to detect how much space the app has used up out of its total allocated space?
Apps don't have a limit on total allocated space, they're limited by the amount of space on the device. You can find out how much space you're using for these files by using NSFileManager to scan the directories. There are several methods that do this in different ways-- check out enumeratorAtPath:, for example. For each file, use a method like attributesOfItemAtPath:error: to get the file size.
Better would be to track the file sizes as you create and delete files. Keep a running total, stored in user defaults. When you create a new file, increase it by the amount of new data. When you remove a file, decrease the running total.
Additionally, what is the best way to go about cleaning those directories; do I check every time a file is written, or only on app launch, etc.?
If these files are local data that's inherently part of the associated Core Data object, the sensible approach is to delete a file when its Core Data object is deleted. The managed object needs the data file, so don't delete the file if you still use the object. That means there must be some way to link the two, but I'm assuming that's already true since you say that these files are used by managed objects somehow.
If the files are something like cached data that's easily re-created or re-downloaded, you should put them in the location returned by NSTemporaryDirectory(). Then iOS can delete them when it thinks the space is needed. You can also clear out old files whenever it seems appropriate, by scanning for older files or ones that haven't been used in a while (the details depend on exactly how you use the files).

Loading leveldb from stream

Is there a way to load a leveldb store from a data stream?
If I were to take the stream of a leveldb instance and tuck it in a DLL as a manifest resource stream, will I have a way to just load that db from that stream later when I retrieve the manifest resource from my DLL? Essentially, I am looking for a way to build, save, and later load a leveldb without ever writing to a physical file on disk.
Thanks in advance for any useful info.
Raja.
You might have already figured this out since it's been a long time since you asked.
leveldb allows you to override the "Environment" such that reads and writes don't need to access a physical file.
You might want to look at this file:
http://code.google.com/p/leveldb/source/browse/helpers/memenv/memenv_test.cc
in particular the DBTest, for an example.

Delphi Search files and directories fastest algorithm

I'm using Delphi 7 and I need a solution to a big problem. Can someone provide me with a faster way of searching through files and folders than using findfirst and findnext? I ask because I also process the data for each file/folder (creation date/author/size/etc.) and it takes a lot of time... I've searched a lot in the WinAPI but probably haven't seen the best function to accomplish this. All the examples I've found made in Delphi use findfirst and findnext...
Also, I don't want to buy components or use some free ones...
Thanks in advance!
I think any component that you'd buy would also use findfirst/findnext. Recursively, of course. I don't think there's a way to look at every directory and file without actually looking at every directory and file.
As a benchmark to see if your code is reasonably fast, compare performance against WinDirStat http://windirstat.info/ (Just to the point where it's gathered data, and is ready to build its graph of the space usage.)
Source code is available, if you want to see what they're doing. It's C, but I expect it's using the same API calls.
The one big thing you can do to really increase your performance is parse the MFT directly, if your volumes are NTFS. By doing this, you can enumerate files very, very quickly -- we're talking at least an order of magnitude faster. If all the metadata you need is part of the MFT record, your searches will complete much faster. Even if you have to do more reads for extra metadata, you'll be able to build up a list of candidate files very quickly.
The downside is that you'll have to parse the MFT yourself: there are no WinAPI functions for doing it that I'm aware of. You also have to worry about things the shell normally handles for you: hard links, junctions, reparse points, symlinks, shell links, and so on.
However, if you want speed, the increase in complexity is the only way to achieve it.
I'm not aware of any available Delphi code that already implements an MFT parser, so you'll probably have to either use a 3rd party library or implement it yourself. I was going to suggest the Open Source (GPL) NTFS Undelete, which was written in Delphi, but it implements the MFT parsing via Python code and has a Delphi-Python bridge built in.
If you want to get really fast search results consider using the Windows Search (API) or the Indexing service.
Other improvements might be to make use of threads and split the search for files and the gathering of file properties or just do a threaded search.
I once ran into a very similar problem where the number of files in the directory, coupled with findfirst/findnext, was taking more time than was reasonable. With a few files it's not an issue, but as you scale upward into the thousands or tens of thousands of files, performance drops considerably.
Our solution was to use a queue file in a separate directory. As files were "added" to the system they were written to a queue file (a fixed-record file). When the system needed to process data, it would check whether the file existed, and if so rename it and open the renamed version (this way adds could continue during the next processing pass). The file was then processed in order. We then archived the queue file and the processed files into a subdirectory based on the date and time (for example: G:\PROCESSED\2010\06\25\1400 contained the files run at 2:00 pm on 6/25/2010).
Using this approach we not only achieved almost "real-time" processing of files (delayed only by the frequency at which we processed the queue file), but we also ensured the files were processed in the order they were added.
If you need to scan a remote drive with that many files, I would strongly suggest doing so with a "client-server" design, so that the actual file scanning is always done locally and only the results are fetched remotely. That would save you a lot of time. Also, all the "servers" could scan in parallel.
If your program is running on Windows 7 or Server 2008 R2, there are some enhancements to the Windows FindFirstFileEx function which will make it run a bit faster. You would have to copy and modify the VCL functions to incorporate the new options.
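As a sketch, calling FindFirstFileEx directly with the Windows 7 options could look like this in Delphi (the two constants are not declared in older Windows.pas units, so they are defined locally; on pre-Windows 7 systems the call fails with ERROR_INVALID_PARAMETER):

  uses Windows, SysUtils;

  const
    FindExInfoBasic           = 1;  // skip the 8.3 short name (Windows 7+)
    FIND_FIRST_EX_LARGE_FETCH = 2;  // use a larger directory query buffer (Windows 7+)

  procedure ListFilesFast(const Dir: string);
  var
    Data: TWin32FindData;
    H: THandle;
  begin
    H := FindFirstFileEx(PChar(Dir + '\*'),
                         TFindexInfoLevels(FindExInfoBasic),
                         @Data,
                         FindExSearchNameMatch,
                         nil,
                         FIND_FIRST_EX_LARGE_FETCH);
    if H = INVALID_HANDLE_VALUE then
      RaiseLastOSError;
    try
      repeat
        if (Data.dwFileAttributes and FILE_ATTRIBUTE_DIRECTORY) = 0 then
          WriteLn(PChar(@Data.cFileName));   // process the file entry here
      until not FindNextFile(H, Data);
    finally
      Windows.FindClose(H);
    end;
  end;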
There isn't much room for optimization with a findfirst / findnext loop, because it's mostly I/O bound: the operating system needs to read this information from your HDD!
The proof: Make a small program that implements a simple findfirst / findnext loop that does NOTHING with the files it finds. Restart your computer and run it over your big directory, note the time it takes to finish. Then run it again, without restarting the computer. You'll notice the second run is significantly faster, because the operating system cached the information!
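A bare-bones version of that test might look like this (non-recursive, with the directory path supplied by the caller):

  uses Windows, SysUtils;

  // Counts directory entries and does nothing else. Run it twice: the second,
  // cache-warm run shows how much of the first run was pure disk I/O.
  procedure TimeBareScan(const Dir: string);
  var
    SR: TSearchRec;
    Count: Integer;
    T0: Cardinal;
  begin
    Count := 0;
    T0 := GetTickCount;
    if FindFirst(Dir + '\*', faAnyFile, SR) = 0 then
    try
      repeat
        Inc(Count);
      until FindNext(SR) <> 0;
    finally
      SysUtils.FindClose(SR);
    end;
    WriteLn(Count, ' entries in ', GetTickCount - T0, ' ms');
  end;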
If you know for sure that the directory you're trying to scan is heavily accessed by the OS because some other application is using the data (this would put the directory-structure information into the OS's cache and make the scan not I/O bound), you can try running several findfirst/findnext loops in parallel using threads. The downside is that if the directory structure is not already in the OS cache, your algorithm is again bound to HDD I/O, and it might be worse than the original because you're now making multiple parallel I/O requests that have to be handled by the same device.
When I had to tackle this same problem I decided against parallel loops, because the SECOND run of the application is always so much faster, proving I'm bound to I/O and no amount of CPU optimisation would fix the I/O bottleneck.
I solved a similar problem by using two threads. This way I could "process" the file(s) at the same time as they were scanned from the disk. In my case the processing was significantly slower than scanning, so I also had to limit the number of files in memory at one time.
TMyScanThread
Scan the file structure; for each "hit" add the path+file to a TList/TStringList or similar using Synchronize(). Remember to Sleep() inside the loop to let the OS have some time too.
PseudoCode for the thread:
type
  TMyScanThread = class(TThread)
  private
    fCount: Cardinal;
    fLastFile: String;
    procedure GetListCount;
    procedure AddToList;
  public
    FileList: TStringList;
    procedure Execute; override;
  end;

procedure TMyScanThread.GetListCount;
begin
  fCount := FileList.Count;
end;

procedure TMyScanThread.AddToList;
begin
  FileList.Add(fLastFile);
end;

procedure TMyScanThread.Execute;
var
  SR: TSearchRec;
begin
  // 'C:\Data\*' is a placeholder; in real code the directory would be a field.
  if FindFirst('C:\Data\*', faAnyFile, SR) = 0 then
  try
    repeat
      if (SR.Name <> '.') and (SR.Name <> '..') then
      begin
        { Get the list size; wait while the consumer is behind }
        Synchronize(GetListCount);
        while (fCount >= 500) and not Terminated do
        begin
          SleepEx(1000, True);
          Synchronize(GetListCount);
        end;
        if Terminated then
          Break;
        { Add a file to the list }
        fLastFile := SR.Name;        // store the filename in a field
        Synchronize(AddToList);      // the list is only touched via Synchronize
        SleepEx(0, True);            // yield so the OS gets some time too
      end;
    until FindNext(SR) <> 0;
  finally
    SysUtils.FindClose(SR);
  end;
end;
TMyProcessFilesThread
Get the oldest entry in the list and process it, then output the results to the DB.
This class is implemented similarly, with synchronized methods that access the list.
One alternative to the Synchronize() calls is to use a TCriticalSection. How you implement synchronization between threads is often a matter of taste and depends on the task at hand...
You can also try BFS vs. DFS. This may affect your performance.
Links:
http://en.wikipedia.org/wiki/Breadth-first_search
http://en.wikipedia.org/wiki/Depth-first_search
When I started to run into performance problems working with lots of small files in the file system, I moved to storing the files as blobs in a database. There is no reason why related information like size, creation date, and author couldn't also be stored in the database. Once the tables are populated, I suspect the database engine could do a much faster job of finding records (files) than any solution we are going to come up with, since database code is highly specialized for efficient searches through large data sets. It will definitely be more flexible, since adding a new search is as simple as writing a new SELECT statement. Example: SELECT * FROM files WHERE author = 'bob' AND size > 10000
I'm not sure that approach will help you. Could you tell us more about what you are doing with these files and the search criteria?

What is the difference between file and random access file?

What is the difference between a file and a random access file?
A random access file is a file where you can "jump" to anywhere within it without having to read sequentially until the position you are interested in.
For example, say you have a 1MB file, and you are interested in 5 bytes that start after 100k of data. A random access file will allow you to "jump" to the 100k-th position in one operation. A non-random access file will require you to read 100k bytes first, and only then read the data you're interested in.
Hope that helps.
Clarification: this description is language-agnostic and does not relate to any specific file wrapper in any specific language/framework.
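For example, a minimal sketch in Delphi using TFileStream, matching the numbers above:

  uses Classes, SysUtils;

  procedure ReadFiveBytesAt100K(const FileName: string);
  var
    FS: TFileStream;
    Buf: array[0..4] of Byte;
  begin
    FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
    try
      FS.Position := 100 * 1024;        // jump straight to the 100k mark...
      FS.ReadBuffer(Buf, SizeOf(Buf));  // ...and read just the 5 bytes we want
    finally
      FS.Free;
    end;
  end;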
Almost nothing these days. There used to be a time in certain operating systems when there were different types of files - some of which could be accessed randomly (at any point in the file) and others which could only be accessed sequentially. This made more sense when you were using a sequential medium such as tape. Any file system worth its salt these days supports random access.
