Delphi: fastest algorithm for searching files and directories

I'm using Delphi 7 and I need a solution to a big problem. Can someone suggest a faster way to search through files and folders than using FindFirst and FindNext? I also process the data for each file/folder (creation date, author, size, etc.) and it takes a lot of time... I've searched the WinAPI a lot, but probably haven't seen the best function for accomplishing this. All the Delphi examples I've found use FindFirst and FindNext...
Also, I don't want to buy components or use some free ones...
Thanks in advance!

I think any component you'd buy would also use FindFirst/FindNext. Recursively, of course. I don't think there's a way to look at every directory and file without actually looking at every directory and file.
As a benchmark to see if your code is reasonably fast, compare its performance against WinDirStat (http://windirstat.info/), just up to the point where it has gathered its data and is ready to build its graph of the space usage.
Source code is available, if you want to see what they're doing. It's C, but I expect it's using the same API calls.

The one big thing you can do to really increase your performance is parse the MFT directly, if your volumes are NTFS. By doing this, you can enumerate files very, very quickly -- we're talking at least an order of magnitude faster. If all the metadata you need is part of the MFT record, your searches will complete much faster. Even if you have to do more reads for extra metadata, you'll be able to build up a list of candidate files very quickly.
The downside is that you'll have to parse the MFT yourself: there are no WinAPI functions for doing it that I'm aware of. You also get to worry about things that the shell normally handles for you: hardlinks, junctions, reparse points, symlinks, shell links, and so on.
However, if you want speed, the increase in complexity is the only way to achieve it.
I'm not aware of any available Delphi code that already implements an MFT parser, so you'll probably have to either use a 3rd party library or implement it yourself. I was going to suggest the Open Source (GPL) NTFS Undelete, which was written in Delphi, but it implements the MFT parsing via Python code and has a Delphi-Python bridge built in.
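For illustration, a minimal sketch of the first step in Delphi, under stated assumptions: an NTFS volume, administrator rights, drive letter 'C:' as a placeholder, and record offsets taken from the published NTFS on-disk layout. It opens the raw volume and locates the start of the $MFT from the boot sector; everything after that (walking and decoding the FILE records) is the hard part warned about above.

var
  hVol: THandle;
  Boot: array[0..511] of Byte;
  BytesRead: DWORD;
  BytesPerSector: Word;
  SectorsPerCluster: Byte;
  MftStartLCN: Int64;
begin
  // '\\.\C:' gives raw, sector-aligned read access to the volume
  hVol := CreateFile('\\.\C:', GENERIC_READ,
    FILE_SHARE_READ or FILE_SHARE_WRITE, nil, OPEN_EXISTING, 0, 0);
  if hVol = INVALID_HANDLE_VALUE then RaiseLastOSError;
  try
    if not ReadFile(hVol, Boot, SizeOf(Boot), BytesRead, nil) then
      RaiseLastOSError;
    Move(Boot[$0B], BytesPerSector, 2);   // BPB: bytes per sector
    SectorsPerCluster := Boot[$0D];       // BPB: sectors per cluster
    Move(Boot[$30], MftStartLCN, 8);      // first cluster of the $MFT
    // Seek to MftStartLCN * SectorsPerCluster * BytesPerSector and walk
    // the fixed-size FILE records from there.
  finally
    CloseHandle(hVol);
  end;
end;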

If you want to get really fast search results, consider using the Windows Search API or the Indexing Service.
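As a hedged sketch of that route (it assumes Windows Search / the Indexing Service is installed and that the folder C:\Data is indexed; both are placeholders), the index can be queried through its OLE DB provider with plain ADO, so no disk walk happens at query time:

uses ADODB; // COM must be initialized, as it is in any VCL application

var
  Q: TADOQuery;
begin
  Q := TADOQuery.Create(nil);
  try
    // OLE DB provider exposed by Windows Search
    Q.ConnectionString :=
      'Provider=Search.CollatorDSO;Extended Properties="Application=Windows"';
    Q.SQL.Text := 'SELECT System.ItemPathDisplay, System.Size ' +
      'FROM SYSTEMINDEX WHERE SCOPE=''file:C:/Data''';
    Q.Open;
    while not Q.Eof do
    begin
      // Q.Fields[0].AsString holds the full path of a match
      Q.Next;
    end;
  finally
    Q.Free;
  end;
end;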
Other improvements might be to use threads and split the search for files from the gathering of file properties, or just to do a threaded search.

I once ran into a very similar problem where the number of files in the directory, coupled with FindFirst/FindNext, was taking more time than was reasonable. With a few files it's not an issue, but as you scale upwards into the thousands or tens of thousands of files, performance drops considerably.
Our solution was to use a queue file in a separate directory. As files were "added" to the system, they were written to the queue file (a fixed-record file). When the system needed to process data, it would check whether the file existed, and if so, rename it and open the renamed version (this way adds could continue during the next process pass). The file was then processed in order. We then archived the queue file and the processed files into a subdirectory based on the date and time (for example: G:\PROCESSED\2010\06\25\1400 contained the files run at 2:00 pm on 6/25/2010).
Using this approach we not only achieved almost "real-time" processing of files (delayed only by the frequency with which we processed the queue file), but we also ensured that files were processed in the order they were added.
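A minimal sketch of that rename-then-process idea (the paths and names below are illustrative, not the original system's): writers keep appending to the queue file; the processing pass renames it first, so new additions accumulate in a fresh file while the renamed one is processed in order.

const
  QueueFile   = 'G:\QUEUE\QUEUE.DAT';
  WorkingFile = 'G:\QUEUE\QUEUE.WRK';
begin
  if FileExists(QueueFile) then
  begin
    if not RenameFile(QueueFile, WorkingFile) then
      Exit; // a writer still holds it open; try again on the next pass
    // ... process WorkingFile record by record, in the order written,
    // then archive it under G:\PROCESSED\yyyy\mm\dd\hhnn ...
  end;
end;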

If you need to scan a remote drive with that many files, I would strongly suggest a "client-server" design, so that the actual file scanning is always done locally and only the results are fetched remotely. That would save you a lot of time. Also, all the "servers" could scan in parallel.

If your program runs on Windows 7 or Server 2008 R2, there are some enhancements to the Windows FindFirstFileEx function which will make it run a bit faster. You would have to copy and modify the VCL functions to incorporate the new options.
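Something along these lines, as a sketch only: the two constants are declared by hand because Delphi 7's Windows.pas predates them, the cast on the info level is needed for the same reason, and the options silently require Windows 7 / Server 2008 R2 or later.

const
  FindExInfoBasic           = 1; // don't fetch the short (8.3) names
  FIND_FIRST_EX_LARGE_FETCH = 2; // use larger directory read buffers

var
  FindData: TWin32FindData;
  H: THandle;
begin
  H := FindFirstFileEx('C:\Data\*', TFindexInfoLevels(FindExInfoBasic),
    @FindData, FindExSearchNameMatch, nil, FIND_FIRST_EX_LARGE_FETCH);
  if H <> INVALID_HANDLE_VALUE then
  try
    repeat
      // FindData.cFileName, FindData.ftCreationTime, file sizes, etc.
      // are all available here without an extra call per file
    until not FindNextFile(H, FindData);
  finally
    Windows.FindClose(H);
  end;
end;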

There isn't much room for optimization in a FindFirst/FindNext loop, because it's mostly I/O bound: the operating system needs to read this information from your HDD!
The proof: make a small program that implements a simple FindFirst/FindNext loop that does NOTHING with the files it finds. Restart your computer, run it over your big directory, and note the time it takes to finish. Then run it again without restarting the computer. You'll notice the second run is significantly faster, because the operating system cached the information!
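That experiment, as a sketch (the directory name is a placeholder, and GetTickCount is coarse but good enough at this scale):

function CountFiles(const Dir: String): Integer;
var
  SR: TSearchRec;
begin
  Result := 0;
  if FindFirst(Dir + '*', faAnyFile, SR) = 0 then
  try
    repeat
      if (SR.Name = '.') or (SR.Name = '..') then Continue;
      Inc(Result);
      if (SR.Attr and faDirectory) <> 0 then
        Inc(Result, CountFiles(Dir + SR.Name + '\'));
    until FindNext(SR) <> 0;
  finally
    SysUtils.FindClose(SR);
  end;
end;

var
  T0: Cardinal;
begin
  T0 := GetTickCount;
  // console app; does NOTHING with the files, just enumerates them
  WriteLn(CountFiles('C:\BigDir\'), ' files in ', GetTickCount - T0, ' ms');
end.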
If you know for sure that the directory you're trying to scan is heavily accessed by the OS because some other application is using the data (this would put the directory structure information into the OS's cache and make scanning not be I/O bound), you can try running several FindFirst/FindNext loops in parallel using threads. The downside is that if the directory structure is not already in the OS cache, your algorithm is again bound to HDD I/O, and it might be worse than the original because you're now making multiple parallel I/O requests that need to be handled by the same device.
When I had to tackle this same problem, I decided against parallel loops, because the SECOND run of the application is always so much faster, proving I'm bound to I/O and no amount of CPU optimization would fix the I/O bottleneck.

I solved a similar problem by using two threads. This way I could "process" the file(s) at the same time as they were scanned from the disk. In my case the processing was significantly slower than scanning, so I also had to limit the number of files in memory at one time.
TMyScanThread
Scans the file structure; for each "hit" it adds path+filename to a TList/TStringList or similar, using Synchronize(). Remember to Sleep() inside the loop to let the OS have some time too.
Pseudocode for the thread:
TMyScanThread = class(TThread)
private
  fCount    : Cardinal;
  fLastFile : String;
  procedure GetListCount;
  procedure AddToList;
public
  FileList : TStringList;
  procedure Execute; override;
end;

procedure TMyScanThread.GetListCount;
begin
  fCount := FileList.Count;
end;

procedure TMyScanThread.AddToList;
begin
  FileList.Add(fLastFile);
end;

procedure TMyScanThread.Execute;
begin
  while not Terminated do
  begin
    { Ask the main thread for the current list size }
    Synchronize(GetListCount);
    if fCount < 500 then
    begin
      // FindFirst/FindNext loop goes here; for each file found:
      fLastFile := SR.Name;      // store the filename in a private field
      Synchronize(AddToList);    // call the method that adds it to the list
      SleepEx(0, True);          // yield, so the OS gets some time too
    end
    else
      SleepEx(1000, True);       // list is full; back off before retrying
  end;
end;
TMyProcessFilesThread
Gets the oldest entry in the list and processes it, then outputs the results to the DB.
This class is implemented similarly, with synchronized methods that access the list; a sketch follows.
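A sketch of what that consumer might look like, under the same assumptions (the class is declared like TMyScanThread, with fCurrentFile as a private field; ProcessFile stands in for your actual per-file work):

procedure TMyProcessFilesThread.GetNextFile;
begin
  // Runs in the main thread via Synchronize, so list access is safe
  if FileList.Count > 0 then
  begin
    fCurrentFile := FileList[0];
    FileList.Delete(0);
  end
  else
    fCurrentFile := '';
end;

procedure TMyProcessFilesThread.Execute;
begin
  while not Terminated do
  begin
    Synchronize(GetNextFile);
    if fCurrentFile <> '' then
      ProcessFile(fCurrentFile)  // gather properties, write results to DB
    else
      SleepEx(100, True);        // queue is empty; wait for the scanner
  end;
end;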
One alternative to the Synchronize() calls is to use a TCriticalSection. Implementing synchronization between threads is often a matter of taste and the task at hand...

You can also try BFS vs. DFS; this may affect your performance (see the sketch after the links).
Links:
http://en.wikipedia.org/wiki/Breadth-first_search
http://en.wikipedia.org/wiki/Depth-first_search
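For what it's worth, a breadth-first traversal needs no recursion; here is a sketch using a TStringList as a FIFO queue (the usual recursive FindFirst/FindNext descent gives you depth-first order instead):

procedure ScanBFS(const Root: String; Results: TStringList);
var
  Queue: TStringList;
  Dir: String;
  SR: TSearchRec;
begin
  Queue := TStringList.Create;
  try
    Queue.Add(IncludeTrailingPathDelimiter(Root));
    while Queue.Count > 0 do
    begin
      Dir := Queue[0];
      Queue.Delete(0);                     // pop the oldest directory
      if FindFirst(Dir + '*', faAnyFile, SR) = 0 then
      try
        repeat
          if (SR.Name = '.') or (SR.Name = '..') then Continue;
          if (SR.Attr and faDirectory) <> 0 then
            Queue.Add(Dir + SR.Name + '\') // visit subdirectories later
          else
            Results.Add(Dir + SR.Name);
        until FindNext(SR) <> 0;
      finally
        SysUtils.FindClose(SR);
      end;
    end;
  finally
    Queue.Free;
  end;
end;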

When I started to run into performance problems working with lots of small files in the file system, I moved to storing the files as blobs in a database. There is no reason why related information like size, creation date, and author couldn't also be stored in the database. Once the tables are populated, I suspect the database engine can do a much faster job of finding records (files) than any solution we are going to come up with, since database code is highly specialized for efficient searches through large data sets. This will definitely be more flexible, since adding a new search would be as simple as creating a new SELECT statement. Example: SELECT * FROM files WHERE author = 'bob' AND size > 10000
I'm not sure that approach will help you, though. Could you tell us more about what you are doing with these files and the search criteria?

Related

swapbuffers minifilter problems

I implemented a minifilter driver using the swapbuffers example. I made two changes:
attach only to \Device\HarddiskVolume3
encryption by XORing with 0xFF
Encryption works, but volume3 (which on my system is E:) is not working: E: is reported as having an unrecognized file system, and chkdsk E: produces an "all boot sectors corrupted" message.
After investigating (using procmon.exe): chkdsk.exe creates a shadow copy of the volume. If the driver attaches to the shadow copy too, chkdsk E: is OK and the filesystem is perfect. But E: remains unrecognized.
Any idea what I should change?
Assuming no simple mistake was made (that is, the volume was unmounted, you added the filter, and you remounted), evidently the mount/filesystem is not using your filter.
I noticed a comment in the example code about "not for kernel mode drivers".
What you want to research is "whole disk encryption". A Google search on: windows whole disk encryption will help.
In particular, TrueCrypt does what you want. Since it is open source, and is available on sourceforge.net, you could download the source and figure out how to hook your stuff in by learning how TrueCrypt does it.
Just one problem: TrueCrypt has security gaps, so the sourceforge.net page is now just migration info pointing to BitLocker. But it still exists, and other pages have been created where you can get it. Notably, VeraCrypt is a fork of TrueCrypt.
Just one of the pages in the search is: http://www.howtogeek.com/203708/3-alternatives-to-the-now-defunct-truecrypt-for-your-encryption-needs/
UPDATE
Note: After I wrote this update, I realized that there may be hope ... So, keep reading.
Minifilter appears to be for filesystems but not the underlying storage. It may work; you just need to find a lower-level hook. What about filter stack altitude? Here's a link: https://msdn.microsoft.com/en-us/library/windows/hardware/ff540402%28v=vs.85%29.aspx It also has documentation on fltmc and the !fltkd debugger extension.
In this [short] blog: http://blogs.msdn.com/b/erick/archive/2006/03/27/562257.aspx it says:
The Filter Manager was meant to create a simple mechanism for drivers to filter file system operations: file system minifilter drivers. File system minifilter drivers are located between the I/O manager and the base filesystem, not between the filesystem and the storage driver(s) like legacy file system filter drivers.
Figuring out what that means will help. Is the hook point between FS and I/O manager [which I don't know about] sufficient? Or, do you need to hook between filesystem and storage drivers [implying legacy filter]?
My suspicion is that a "legacy" driver filter may be what you need, if the minifilter does not have something that can do the same.
Since your hooks need to work on unmounted storage so that chkdsk will work, this may imply the legacy filter. On the other hand, you mentioned that you were able to hook the shadow copy and it worked for chkdsk. That implies minifilter has the right stuff.
Here's a link that I think is a bit more informative: http://blogs.msdn.com/b/ntdebugging/archive/2013/03/25/understanding-file-system-minifilter-and-legacy-filter-load-order.aspx It has a direct example about the altitude of an encryption filter. You may just need more hook points, and to lower the altitude of your minifilter.
UPDATE #2
Swapbuffers just hooks a few things: IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL. These are file I/O related, not device I/O related. The example is fine, just not necessarily for your purposes.
The link I gave you to fltmc is one page in MS's entire reference for filters. If you meander around that, you'll find more interesting things like IoGetDeviceAttachmentBaseRef, IoGetDiskDeviceObject. You need to find the object for the device and filter its I/O operations.
I think that you'll have to read the reference material in addition to examples. As I've said previously, your filter needs to hook more or different things.
In the VeraCrypt source, the Driver subdirectory is an example of the types of things you may need to do. In DriveFilter.c, it uses IRP_MJ_READ but also uses IRP_MN_START_DEVICE [A hook when the device is started].
Seriously, this may be more work than you imagine. Is this just for fun, or is this just a test case for a much larger project?

Is it possible to get generator value twice?

We have run into a very embarrassing problem. It seems that some network or server error led the front-end application to get the same generator value twice.
Is it possible that getting (and updating) the generator value happens only in memory, and that in case of a power loss it never gets written to disk, so that when power is restored the generator loses its current value and we can get the same value again?
We are using Firebird 1.5.6, Delphi (BDE and native IBExpert components).
Thanks,
SanTa
Update 1: It turned out that the server runs Linux, if that helps...
Generator values are stored on special dedicated pages inside the database. Updates are atomic, occur outside of normal transaction control, and should be stored immediately. However, when generators change frequently, that page looks to the OS/RAID/HDD like a "hot" page, constantly written and never read. So they have a lot of incentive to keep it cached in memory and little to actually flush it to the media.
If you wanted speed at all costs and disabled FORCED WRITES, or enabled write caching in Device Manager for the drive, or just happened to have a RAID controller that trades safety for speed to get good magazine reviews, then it is quite possible that those pages did not get saved to the HDD before the crash.
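(For reference, forced writes can be switched back on with the standard gfix tool; the credentials and path below are placeholders: gfix -write sync -user SYSDBA -password masterkey /data/mydb.fdb)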
Read the links mentioned in the answer at https://serverfault.com/questions/279571/lvm-dangers-and-caveats : even if FB thinks the data is saved, even if Windows thinks so, it may simply be untrue. Also read http://blogs.msdn.com/b/oldnewthing/archive/2013/04/16/10411267.aspx
Or maybe you have an error in the program (including PSQL code), like:
i := GEN_ID (Name, 0);
i := GEN_ID (Name, 1);
or
i := GEN_ID (Name, +1);
i := GEN_ID (Name, -1);
Or you may have bad options in your backup-restore cycle that reset generator values.
I also suggest reading all the Release Notes of Firebird 2.0 through 3.0 Alpha: if any generator-related bugs are mentioned there, chances are good that you have them in your obsolete 1.5.6.
This can happen if you are connecting to the database without using aliases, via paths that are spelled differently. Firebird then thinks they are two separate databases, and one set of memory cache knows nothing about the other.
This can cause severe database corruption, so it is important to ensure that all access to a database uses the same path. Or use aliases.

Remote Profiling with D7 (NOT memory profiling, but timing...)

We have one machine in-house that is 20 times slower starting our Delphi 7 app than any other machine.
We would like to get a performance profile (not memory profile) to locate where it's spending its time.
We've discovered that AQTime, which we own, doesn't do remote profiling.
We'd prefer not to take the time to build up an entire D7 IDE development environment just so we can use AQTime to profile our app on this one in-house machine.
The code is a bit too complex for us to want to meter it ourselves.
Any suggestions on a profiler that will gather high level (procedure or line number) statistics remotely?
Take a look at SamplingProfiler. It doesn't do "remote" profiling, but it also doesn't require a development environment. It just needs to be able to launch the program to be profiled (so it has to run on the same machine) and the program has to have a .MAP file generated by the linker in the same folder as the .EXE. If this is in-house, that shouldn't be a problem for you.
And if you look at the helpfile, you'll even find ways to have it only profile certain sections of your program, which AQTime can't do. That helps if you know the issue is in one specific place, such as the startup code.
Why not install AQTime on the machine and use it as a standalone profiler? No need for an "entire D7 IDE development environment".
You can also try my free/open source sampling profiler:
http://code.google.com/p/asmprofiler/wiki/AsmProfilerSamplingMode
(I get better results with it than with SamplingProfiler)
It uses all kinds of Delphi debug symbols (.map, TD32, .jdbg, etc)
You can use our Open Source TSynLog class to add profiling to any application, not only on the developer computer.
It is not an automated profiler like the other tools: you'll have to modify your code. But it can be run on request, remotely, and even with no communication at all, even from the end-customer side.
You add some profiling calls to some method code, and entering and leaving those methods will be logged to a text file. A log viewer is supplied; it has dedicated features for profiling and identifying the slow methods.
[Screenshot of the supplied log viewer (source: synopse.info)]
The logging mechanism can be used to trace recursive calls. It can use an interface-based mechanism to log when you enter and leave any method:
procedure TMyDB.SQLExecute(const SQL: RawUTF8);
var
  ILog: ISynLog;
begin
  ILog := TSynLogDB.Enter(self, 'SQLExecute');
  // do some stuff
  ILog.Log(sllInfo, 'SQL=%', [SQL]);
end; // when you leave the method, the corresponding event is written to the log
It will be logged as such:
20110325 19325801 + MyDBUnit.TMyDB(004E11F4).SQLExecute
20110325 19325801 info SQL=SELECT * FROM Table;
20110325 19325801 - MyDBUnit.TMyDB(004E11F4).SQLExecute 00.000.507
Here the method name is set in the code ('SQLExecute'). But if you have an associated .map file, the logging mechanism is able to read this symbol information, and write the exact line number of the event. You can even use a highly compressed version of the .map file (900 KB .map -> 70 KB .mab, i.e. much better than zip or lzma), or embed its content to the executable at build time.
Adding profiling at method level is therefore just the matter of adding one line of code at the beginning of the method, as such:
procedure TMyDB.SQLExecute(const SQL: RawUTF8);
begin
  TSynLogDB.Enter;
  // do some stuff
end; // when you leave the method, the corresponding event is written to the log
High-resolution timestamps are also logged in the file (here 00.000.507). With this, you'll be able to profile your application with data coming from the customer side, on their real computer. Via the Enter method (and its auto-Leave feature), you have all the information needed for this.
By proceeding step by step, you'll get to your application's bottlenecks very quickly. And it is possible to do the same on the end-customer side, on request.
I used this on several applications and easily found several bottlenecks, even with specific hardware, software, and network configurations (you never know what your customers use).

moving data between processes

The reason I ask is that Windows does not provide a good method to communicate between processes, so I want to create a DLL as a communication point between Windows processes. A thread is owned by a process and cannot be given to another process.
Each thread has a stack of its own.
If a DLL is loaded (LoadLibrary) and a DLL function is called that asks Windows for memory, am I right to think the thread is still owned by the same process and allocates memory in that same process?
So I'm wondering whether I can turn to assembly to hand a small memory block over to another process: create a critical section, copy the data over to another (already created) memory block, and return the original block to its original process without upsetting Windows. Has anyone done that before? Or is there a better way?
Best regards,
Lex Dean.
I see other methods that might be quite fast, but I would like a very fast method with little overhead. Pipes and TCP/IP will obviously work, but are not the best option, though they are simple to implement (thanks for offering such suggestions, guys). I sometimes want to send quite a few 500-byte blocks at quite regular intervals. I like WM_COPYDATA because it looks fast. My biggest question, which I have been researching all over the internet, is this: GetCurrentProcess and DuplicateHandle to get the real handle, finding the other process, and using messages to set up memory and then use WM_COPYDATA. I only need two messages: a) the pointer and size, b) "the data has been copied".
I get my application's process handle easily with GetCurrentProcess, except it's a pseudo-handle, always $FFFFFFE. I need the real process handle, and nobody on the internet gives an example of DuplicateHandle. Can you show me an example of DuplicateHandle? That's what's got me stumped.
I don't like turning to a form to get a handle, as an application does not always have a current form.
In Delphi I have seen message sending with TSpeedButton used to set up a simple, fast communication method between applications, which most probably takes about 80 instructions, I guess. So I'm still thinking of DLLs. The example Mads Elvheim sent is along the same lines as what I already know.
I'm still willing to consider any other options for using my own DLL.
My applications, importantly, can simply register/unregister their own process with the DLL, rather than searching all the time to see whether a process is current.
What I'm not being told about is how to manage memory with a DLL between processes.
DLLs are not hard for me to implement, as I already have one of my own in operation.
The real bottom line is getting access to Windows facilities to create a good solution. I'm very open to ideas, even assembly instructions for crossing between processes, or a Windows call. But I do not want to get caught crashing Windows by doing something illegal.
So please show an example of what you have done that meets my needs. If it is fast, I'm interested, as I will most probably use it anyway.
I have a very fast IPC (interprocess communication) solution based on named pipes. It is very fast and very easy to use (it hides the actual implementation from you; you just work with data packets). Also tested and proven. You can find the code and the demo here:
http://www.cromis.net/blog/downloads/cromis-ipc/
It also works across computers in the same LAN.
If your processes have message loops (with windows), you can send/receive serialized data with the WM_COPYDATA message: http://msdn.microsoft.com/en-us/library/ms649011(VS.85).aspx
Just remember that only the allocated memory for the COPYDATASTRUCT::lpData member is allowed to be read. Again, you cannot pass a structure that has pointers; the data must be serialized instead. And the receiving side can only read this structure, it cannot write to it. Example:
/* Both are conceptual window procedures. */
/* For sending: */
{
    ...
    TCHAR msg[] = _T("This is a test\r\n");
    HWND target;
    COPYDATASTRUCT cd = {0};
    cd.lpData = msg;                                 /* flat data only; no embedded pointers */
    cd.cbData = (_tcslen(msg) + 1) * sizeof(TCHAR);  /* size in BYTES, incl. the terminator */
    target = FindWindow(..); /* or EnumProcesses */
    /* wParam carries the sender's window handle, lParam points to the struct */
    SendMessage(target, WM_COPYDATA, (WPARAM)hwnd, (LPARAM)&cd);
}
/* For receiving: */
{
    ...
    case WM_COPYDATA:
    {
        TCHAR* msg;
        COPYDATASTRUCT* cd = (COPYDATASTRUCT*)lParam;  /* the struct arrives in lParam */
        sender = FindWindow(..); /* or EnumProcesses */
        /* check that this message was sent from the window/process we want */
        if (sender == (HWND)wParam) {
            msg = _tcsdup((TCHAR*)cd->lpData);  /* copy out: the buffer is read-only and
                                                   only valid for the duration of the call */
            ...
        }
        break;
    }
}
Otherwise, use memory mapped files, or network sockets.
I currently use Mailslots in Delphi to do it and it is very efficient.
"Win32 DLLs are mapped into the address space of the calling process. By default, each process using a DLL has its own instance of all the DLLs global and static variables. If your DLL needs to share data with other instances of it loaded by other applications, you can use either of the following approaches:
•Create named data sections using the data_seg pragma.
•Use memory mapped files. See the Win32 documentation about memory mapped files."
http://msdn.microsoft.com/en-us/library/h90dkhs0(VS.80).aspx
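A minimal Delphi sketch of the memory-mapped-file approach (the section name 'MySharedBlock' and the 4 KB size are arbitrary): every process runs the same code, the first caller creates the page-file-backed section, and later callers simply open it by name. Guard the actual reads/writes with a named mutex.

var
  hMap: THandle;
  P: Pointer;
begin
  hMap := CreateFileMapping(INVALID_HANDLE_VALUE, nil, PAGE_READWRITE,
    0, 4096, 'MySharedBlock');      // create, or open if it already exists
  if hMap = 0 then RaiseLastOSError;
  P := MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);
  if P = nil then RaiseLastOSError;
  // ... read/write up to 4096 bytes through P (flat data only: offsets,
  // not pointers, since each process maps it at its own address) ...
  UnmapViewOfFile(P);
  CloseHandle(hMap);
end;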
You cannot share pointers between processes; they only make sense to the process that allocated them. You're likely to run into issues.
Win32 is not different from any other modern OS in this aspect. There are plenty IPC services at your disposal in Windows.
Try to describe the task you want to solve, not "...then I think that I need to copy that block of memory here...". That's not your task. Your customer didn't tell you: "I want to transfer a thread from one process to another."

Techniques for writing critical text data

We collect text/CSV-like data over long periods (~days) from costly experiments, so file corruption is to be avoided at all costs.
Recently, a file was copied in Explorer on XP while the experiment was in progress, and the data was partially lost, presumably due to a multiple-access conflict.
What are some good techniques to avoid such loss? We are using Delphi on Windows XP systems.
Some ideas we came up with are listed below - we'd welcome comments as well as your own input.
Use a database as a secondary data storage mechanism and take advantage of the atomic transaction mechanisms
How about splitting the large file into separate files, one for each day?
If these machines are on a network: send a HTTP post with the logging data to a webserver.
(sending UDP packets would be even simpler).
Make sure you only copy old data. If you have a timestamp on the filename with a 1 hour resolution, you can safely copy the data older than 1 hour.
If a write fails, cache the result for a later write; that way, if the file is opened externally, the data is still stored internally, or could even be written to another disk.
I think what you're looking for is the Win32 CreateFile API, with these flags:
FILE_FLAG_WRITE_THROUGH : Write operations will not go through any intermediate cache; they will go directly to disk.
FILE_FLAG_NO_BUFFERING : The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.
There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.
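A sketch of the write-through variant (the file name is a placeholder; FILE_FLAG_NO_BUFFERING is deliberately left out because of the alignment requirements just mentioned). FILE_SHARE_READ lets another program copy the file without blocking the writer:

var
  H: THandle;
  Written: DWORD;
  Line: String;
begin
  H := CreateFile('C:\Data\experiment.csv', GENERIC_WRITE, FILE_SHARE_READ,
    nil, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL or FILE_FLAG_WRITE_THROUGH, 0);
  if H = INVALID_HANDLE_VALUE then RaiseLastOSError;
  try
    SetFilePointer(H, 0, nil, FILE_END);           // append to the end
    Line := '2010-06-25 14:00:00,42.17' + #13#10;  // one data point
    WriteFile(H, Line[1], Length(Line), Written, nil);
  finally
    CloseHandle(H);
  end;
end;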
Each experiment must use a "work" file and a "done" file. The work file is opened exclusively, and the done file is copied to a place on the network. An application on the receiving machine would feed those files into a database. If Explorer tries to move or copy the work file, it will receive an "Access denied" error.
The "work" file becomes "done" after a certain period (say, 6/12/24 hours or whatever period suits). The application then creates another work file (the name must contain a timestamp) and sends the "done" file over the network (or a human can do that, which, if I understand your text correctly, is what you are doing now).
Copying a file while it is in use is asking for it to be corrupted.
Write data to a buffer file in an obscure directory and copy the data to the "public" data file periodically (every 10 points, for instance), thereby reducing writes and also providing a backup.
Write data points discretely, i.e. open and close the file handle for every data-point write; this reduces the amount of time the file is held open, provided data points arrive at a modest rate. A sketch of this discipline follows.
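A sketch of that discipline (the helper name is hypothetical; error handling is left to the caller):

procedure AppendDataPoint(const FileName, Line: String);
var
  F: TextFile;
begin
  AssignFile(F, FileName);
  if FileExists(FileName) then
    Append(F)
  else
    Rewrite(F);
  try
    WriteLn(F, Line);
  finally
    CloseFile(F);  // the file is closed again between data points
  end;
end;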
