Delphi - Check each line of a file againt another file

Delphi - Check each line of a file againt another file - delphi

I have to check each line of a file against another file.
If one line from the first file exists in the second file I have to delete it.
Right now i'm using 2 listboxes and the "for listbox1.items.count-1 downto do begin..."
My program works but I have to check this for huge files with over 1 milion lines.
Is there a faster approach to this method?
I want to load the files inside memory in order to be extremely fast!
Thanks

You can use TStringList for this. List for second file should be sorted for faster search. Try this:
var
l1, l2: TStringList;
i: integer;
begin
l1 := nil;
l2 := nil;
try
l1 := TStringList.Create;
l1.loadfromFile('file1');
l2 := TStringList.Create;
l2.LoadFromFile('file2');
l2.Sorted := True;
for i := l1.Count -1 downto 0 do
begin
if l2.IndexOf(l1[i]) <> -1 then
l1.Delete(i);
end;
l1.SaveToFile('file1');
finally
FreeEndNil(l1);
FreeEndNil(l2);
end
end

A quick solution (but not the fastest one) is to use two TStringList lists instead of list boxes.
var
a, b: TStringList;
i: Integer;
begin
a := TStringList.Create;
b := TStringList.Create;
try
a.LoadFromFile('C:\1.txt');
b.LoadFromFile('C:\2.txt');
b.Sorted := True;
for i := a.Count - 1 downto 0 do
begin
// Check if line of file 'a' are present in file 'b'
// and delete line if true
if b.IndexOf(a[i]) > -1 then
a.Delete(i);
end;
a.SaveToFile('C:\1.txt');
finally
b.Free;
a.Free;
end;
end;
Again, this is a slow and simple solution that loads whole files in RAM. It still will be much faster than using a ListBox. Sometimes simple is just enough for solving a one-time problem.
A faster method would be to create an index (eg. binary tree) of both files on hard disk and use this index to compare. That way you will not need to store the whole files on disk.

Related

What is the most elegant way to reposition after filtering a TClientDataset

I'm working on a TClientDataset that the user can filter at any time based on some criterias. My problem is, we'd like the dataset's cursor to remain positionned "mostly" at the same place after filtering. ("Mostly" in double quote since, of course, it can't stay at the same place if the record is filtered out).
After doing some research, the best I could come up with is the following :
procedure RefreshFilter;
var
I : Integer;
sFilter : string;
vIndexValue: array of TVarRec;
vIndexValueAsVar : Array of Variant;
begin
sFilter := GenerateNewFilterExpression;
if sFilter <> MyDataset.Filter then
begin
if MyDataset.IndexFieldCount > 0 then
begin
SetLength(vIndexValueAsVar, MyDataset.IndexFieldCount);
SetLength(vIndexValue, MyDataset.IndexFieldCount);
for I := 0 to MyDataset.IndexFieldCount - 1 do
begin
vIndexValueAsVar[I] := MyDataset.IndexFields[I].AsVariant;
vIndexValue[I].VType := vtVariant;
vIndexValue[I].VVariant := #vIndexValueAsVar[I];
end;
end;
MyDataset.Filtered := sFilter <> '';
Mydataset.Filter := sFilter;
if MyDataset.IndexFieldCount > 0 then
begin
MyDataset.FindNearest(vIndexValue);
end;
end;
end;
Even though it works pretty well, I find the solution a bit "bulky". I was wondering if there was a some built-in function or a different approach that might be more elegant and less "heavy".
And please, don't mention bookmarks... Bookmarks don't work properly after changing the active filter, and not at all if your record gets filtered out.

How to reduce CPU usage when scanning for folders/sub-folders/files?

I have developed an application that scans basically everywhere for a file or list of files.
When I scan small folders like 10 000 files and sub files there is no problem. But when I scan for instance my entire users folder with more than 100 000 items, it is very heavy on my processor. It takes about 40% of my processor's power.
Is there a way to optimize this code so that it uses less CPU?
procedure GetAllSubFolders(sPath: String);
var
Path: String;
Rec: TSearchRec;
begin
try
Path := IncludeTrailingBackslash(sPath);
if FindFirst(Path + '*.*', faAnyFile, Rec) = 0 then
try
repeat
Application.ProcessMessages;
if (Rec.Name <> '.') and (Rec.Name <> '..') then
begin
if (ExtractFileExt(Path + Rec.Name) <> '') And
(ExtractFileExt(Path + Rec.Name).ToLower <> '.lnk') And
(Directoryexists(Path + Rec.Name + '\') = False) then
begin
if (Pos(Path + Rec.Name, main.Memo1.Lines.Text) = 0) then
begin
main.ListBox1.Items.Add(Path + Rec.Name);
main.Memo1.Lines.Add(Path + Rec.Name)
end;
end;
GetAllSubFolders(Path + Rec.Name);
end;
until FindNext(Rec) <> 0;
finally
FindClose(Rec);
end;
except
on e: Exception do
ShowMessage(e.Message);
end;
end;
My app searches for all the files in a selected folder and sub-folder, zip's them and copies them to another location you specify.
The Application.ProcessMessages command is there to make sure the application doesn't look like it is hanging and the user closes it. Because finding 100 000 files for instance can take an hour or so...
I am concerned about the processor usage, The memory is not really affected.
Note: The memo is to make sure the same files are not selected twice.

I see the following performance problems:
The call to Application.ProcessMessages is somewhat expensive. You are polling for messages rather than using a blocking wait, i.e. GetMessage. As well as the performance issue, the use of Application.ProcessMessages is generally an indication of poor design for various reasons and one should, in general, avoid the need to call it.
A non-virtual list box performs badly with a lot of files.
Using a memo control (a GUI control) to store a list of strings is exceptionally expensive.
Every time you add to the GUI controls they update and refresh which is very expensive.
The evaluation of Memo1.Lines.Text is extraordinarily expensive.
The use of Pos is likewise massively expensive.
The use of DirectoryExists is expensive and spurious. The attributes returned in the search record contain that information.
I would make the following changes:
Move the search code into a thread to avoid the need for ProcessMessages. You'll need to devise some way to transport the information back to the main thread for display in the GUI.
Use a virtual list view to display the files.
Store the list of files that you wish to search for duplicates in a dictionary which gives you O(1) lookup. Take care with case-insensitivity of file names, an issue that you have perhaps neglected so far. This replaces the memo.
Check whether an item is a directory by using Rec.Attr. That is check that Rec.Attr and faDirectory <> 0.

I agree with the answer which says you would do best to do what you're doing in a background thread and I don't want to encourage you to persist in doing it in your main thread.
However, if you go to a command prompt and do this:
dir c:\*.* /s > dump.txt & notepad dump.txt
you may be surprised quite how quickly Notepad pops into view.
So there are few things you could do to speed up your GetAllSubFolders, even if you keep it in your main thread, e.g. to bracket the code by calls to main.Memo1.Lines.BeginUpdate and main.Memo1.Lines.EndUpdate, likewise main.Listbox1.Items.BeginUpdate and EndUpdate. This will stop these controls being updated while it executes (which is actually what your code is spending most of its time doing, that and the "if Pos( ...)" business I've commented on below). And, if you haven't gathered already, Application.ProcessMessages is evil (mostly).
I did some timings on my D: drive, which is a 500Gb SSD with 263562 files in 35949 directories.
The code in your q: 6777 secs
Doing a dir to Notepad as per the above: 15 secs
The code below, in main thread: 9.7 secs
The reason I've included the code below in this answer is that you'll find it much easier to execute in a thread because it gathers the results into a TStringlist, whose contents you can then assign to your memo and listbox once the thread has completed.
A few comments on the code in your q, which I imagine you might have got from somewhere.
It pointlessly recurses even when the current entry in Rec is a plain file. The code below only recurses if the current Rec entry is a directory.
It apparently tries to avoid duplicates by the "if Pos( ...)" business, which shouldn't be necessary (except maybe if there's a symbolic link (e.g created with the MkLink command) somewhere that points elsewhere on the drive) and does it in a highly inefficient manner, i.e. by searching for the filename in the memo contents - those will get longer and longer as it finds more files). In the code below, the stringlist is set up to discard duplicates and has its Sorted property set to True, which makes its checking for duplicates much quicker, becauseit can then do a binary search through its contents rather than a serial one.
It calculates Path + Rec.Name 6 times for each thing it finds, which is avoidably inefficient at r/t and inflates the source code. This is only a minor point, though, compared to the first two.
Code:
function GetAllSubFolders(sPath: String) : TStringList;
procedure GetAllSubFoldersInner(sPath : String);
var
Path,
AFileName,
Ext: String;
Rec: TSearchRec;
Done: Boolean;
begin
Path := IncludeTrailingBackslash(sPath);
if FindFirst(Path + '*.*', faAnyFile, Rec) = 0 then begin
Done := False;
while not Done do begin
if (Rec.Name <> '.') and (Rec.Name <> '..') then begin
AFileName := Path + Rec.Name;
Ext := ExtractFileExt(AFileName).ToLower;
if not ((Rec.Attr and faDirectory) = faDirectory) then begin
Result.Add(AFileName)
end
else begin
GetAllSubFoldersInner(AFileName);
end;
end;
Done := FindNext(Rec) <> 0;
end;
FindClose(Rec);
end;
end;
begin
Result := TStringList.Create;
Result.BeginUpdate;
Result.Sorted := True;
Result.Duplicates := dupIgnore; // don't add duplicate filenames to the list
GetAllSubFoldersInner(sPath);
Result.EndUpdate;
end;
procedure TMain.Button1Click(Sender: TObject);
var
T1,
T2 : Integer;
TL : TStringList;
begin
T1 := GetTickCount;
TL := GetAllSubfolders('D:\');
try
Memo1.Lines.BeginUpdate;
try
Memo1.Lines.Text := TL.Text;
finally
Memo1.Lines.EndUpdate;
end;
T2 := GetTickCount;
Caption := Format('GetAll: %d, Load: %d, Files: %d', [T2 - T1, GetTickCount - T2, TL.Count]);
finally
TL.Free;
end;
end;

Find duplicates in a stringlist very fast

what is the fastest way to find duplicates in a Tstringlist. I get the data I need to search for duplicates in a Stringlist. My current idea goes like this :
var TestStringList, DataStringList : TstringList;
for i := 0 to DataStringList.Items-1 do
begin
if TestStringList.Indexof(DataStringList[i])< 0 < 0 then
begin
TestStringList.Add(DataStringList[i])
end
else
begin
memo1.ines.add('duplicate item found');
end;
end;
....

Just for completeness, (and because your code doesn't actually use the duplicate, but just indicates one has been found): Delphi's TStringList has the built-in ability to deal with duplicate entries, in it's Duplicates property. Setting it to dupIgnore will simply discard any duplicates you attempt to add. Note that the destination list has to be sorted, or Duplicates has no effect.
TestStringList.Sorted := True;
TestStringList.Duplicates := dupIgnore;
for i := 0 to DataStringList.Items-1 do
TestStringList.Add(DataStringList[i]);
Memo1.Lines.Add(Format('%d duplicates discarded',
[DataStringList.Count - TestStringList.Count]));
A quick test shows that the entire loop can be removed if you use Sorted and Duplicates:
TestStringList.Sorted := True;
TestStringList.Duplicates := dupIgnore;
TestStringList.AddStrings(DataStringList);
Memo1.Lines.Add(Format('%d duplicates discarded',
[DataStringList.Count - TestStringList.Count]));
See the TStringList.Duplicates documentation for more info.

I think that you are looking for duplicates. If so then you do the following:
Case 1: The string list is ordered
In this scenario, duplicates must appear at adjacent indices. In which case you simply loop from 1 to Count-1 and check whether or not the elements of index i is the same as that at index i-1.
Case 2: The string list is not ordered
In this scenario we need a double for loop. It looks like this:
for i := 0 to List.Count-1 do
for j := i+1 to List.Count-1 do
if List[i]=List[j] then
// duplicate found
There are performance considerations. If the list is ordered the search is O(N). If the list is not ordered the search is O(N2). Clearly the former is preferable. Since a list can be sorted with complexity O(N log N), if performance becomes a factor then it will be advantageous to sort the list before searching for duplicates.

Judging by the use of IndexOf you use an unsorted list. The scaling factor of your algorithm then is n^2. That is slow. You can optimize it as David shown by limiting search area in the internal search and then the average factor would be n^2/2 - but that still scales badly.
Note: scaling factor here makes sense for limited workloads, say dozen or hundreds of strings per list. For larger sets of data asymptotic analysis O(...) measure would suit better. However finding O-measures for QuickSort and for hash-lists is a trivial task.
Option 1: Sort the list. Using quick-sort it would have scaling factor n + n*log(n) or O(n*log(n)) for large loads.
Set Duplicates to accept
Set Sorted to True
Iterate the sorted list and check if the next string exists and is the same
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Duplicates
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Sorted
Option 2: use hashed list helper. In modern Delphi that would be TDictionary<String,Boolean>, in older Delphi there is a class used by TMemIniFile
You iterate your stringlist and then check if the string was already added into the helper collection.
The scaling factor would be a constant for small data chunks and O(1) for large ones - see http://docwiki.embarcadero.com/Libraries/XE2/en/System.Generics.Collections.TDictionary.ContainsKey
If it was not - you add it with "false" value.
If it was - you switch the value to "true"
For older Delphi you can use THashedStringList in a similar pattern (thanks #FreeConsulting)
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/IniFiles_THashedStringList_IndexOf.html

Unfortunately it is unclear what you want to do with the duplicates. Your else clause suggests you just want to know whether there is one (or more) duplicate(s). Although that could be the end goal, I assume you want more.
Extracting duplicates
The previously given answers delete or count the duplicate items. Here an answer for keeping them.
procedure ExtractDuplicates1(List1, List2: TStringList; Dupes: TStrings);
var
Both: TStringList;
I: Integer;
begin
Both := TStringList.Create;
try
Both.Sorted := True;
Both.Duplicates := dupAccept;
Both.AddStrings(List1);
Both.AddStrings(List2);
for I := 0 to Both.Count - 2 do
if (Both[I] = Both[I + 1]) then
if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
Dupes.Add(Both[I]);
finally
Both.Free;
end;
end;
Performance
The following alternatives are tried in order to compare performance of the above routine.
procedure ExtractDuplicates2(List1, List2: TStringList; Dupes: TStrings);
var
Both: TStringList;
I: Integer;
begin
Both := TStringList.Create;
try
Both.AddStrings(List1);
Both.AddStrings(List2);
Both.Sort;
for I := 0 to Both.Count - 2 do
if (Both[I] = Both[I + 1]) then
if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
Dupes.Add(Both[I]);
finally
Both.Free;
end;
end;
procedure ExtractDuplicates3(List1, List2, Dupes: TStringList);
var
I: Integer;
begin
Dupes.Sorted := True;
Dupes.Duplicates := dupAccept;
Dupes.AddStrings(List1);
Dupes.AddStrings(List2);
for I := Dupes.Count - 1 downto 1 do
if (Dupes[I] <> Dupes[I - 1]) or (I > 1) and (Dupes[I] = Dupes[I - 2]) then
Dupes.Delete(I);
if (Dupes.Count > 1) and (Dupes[0] <> Dupes[1]) then
Dupes.Delete(0);
while (Dupes.Count > 1) and (Dupes[0] = Dupes[1]) do
Dupes.Delete(0);
end;
Although ExtractDuplicates3 marginally performs better, I prefer ExtractDuplicates1 because it reeds better and the TStrings parameter provides more usability. ExtractDuplicates2 performs noticeable worst, which demonstrates that sorting all items afterwards in a single run takes more time then continuously sorting every single item added.
Note
This answer is part of this recent answer for which I was about to ask the same question: "how to keep duplicates?". I didn't, but if anyone knows or finds a better solution, please comment, add or update this answer.

This is an old thread but I thought this solution may be useful.
An option is to pump the values from one stringlist to another one with the setting of TestStringList.Duplicates := dupError; and then trap the exception.
var TestStringList, DataStringList : TstringList;
TestStringList.Sorted := True;
TestStringList.Duplicates := dupError;
for i := 0 to DataStringList.Items-1 do
begin
try
TestStringList.Add(DataStringList[i])
except
on E : EStringListError do begin
memo1.Lines.Add('duplicate item found');
end;
end;
end;
....
Just note that the trapping of the exception also masks the following errors:
There is not enough memory to expand the list, the list tried to grow beyond its maximal capacity, a non-existent element of the list was referenced. (i.e. the list index was out of bounds).

function TestDuplicates(const dataStrList: TStringList): integer;
begin
with TStringlist.create do begin
{Duplicates:= dupIgnore;}
for it:= 0 to DataStrList.count-1 do begin
if IndexOf(DataStrList[it])< 0 then
Add(DataStrList[it])
else
inc(result)
end;
Free;
end;
end;

Remove duplicates from combobox

Say i have a combobox with
apples
apples
pears
oranges
oranges
i would like to have it show
apples
pears
oranges
how can i do this?

for iter := combobox.Items.Count - 1 downto 0 do
begin
index := combobox.Items.IndexOf(combobox.Items[iter]);
if index < iter then
combobox.Items.Delete(iter);
end;

I suggest that you simply refill the combo box each time. That makes the logic simpler:
ComboBox.Items.BeginUpdate;
try
ComboBox.Clear;
for Str in Values do
begin
if ComboBox.Items.IndexOf (Str) = -1 then
ComboBox.Items.Add (Str);
end;
finally
ComboBox.Items.EndUpdate;
end;

Just to put methods against eachother: one keeps the order but is increasingly slow with larger number of items. The other stays relatively faster but doesn't keep order:
procedure SortStringlist;
var
i,index,itimer: integer;
sl : TStringlist;
const
numberofitems = 10000;
begin
sl := TStringlist.Create;
for i := 0 to numberofitems-1 do begin
sl.Add(IntToStr(random(2000)));
end;
Showmessage(IntToStr(sl.Count));
itimer := GetTickCount;
sl.Sort;
for I := sl.Count-1 downto 1 do begin
if sl[i]=sl[i-1] then sl.Delete(i);
end;
Showmessage(IntToStr(sl.Count)+' Time taken in ms: '+IntToStr(GetTickCount-itimer));
sl.free;
sl := TStringlist.Create;
for i := 0 to numberofitems-1 do begin
sl.Add(IntToStr(random(2000)));
end;
Showmessage(IntToStr(sl.Count));
itimer := GetTickCount;
for i := sl.Count - 1 downto 0 do
begin
index := sl.IndexOf(sl[i]);
if index < i then
sl.Delete(i);
end;
Showmessage(IntToStr(sl.Count)+' Time taken in ms: '+IntToStr(GetTickCount-itimer));
end;

If you don't care if the items get reordered (or they're sorted already), TStrings can do the work for you - it eliminates all of the looping, deletion, and other work. (Of course, it requires the creation/destruction of a temporary TStringList, so if that's an issue for you it won't work.)
var
SL: TStringList;
begin
ComboBox1.Items.BeginUpdate;
try
SL := TStringList.Create;
try
SL.Sorted := True; // Required for Duplicates to work
SL.Duplicates := dupIgnore;
SL.AddStrings(ComboBox1.Items);
ComboBox1.Items.Assign(SL);
finally
SL.Free;
end;
finally
ComboBox1.Items.EndUpdate;
end;
end;
To properly compare with Igor's answer (which includes no BeginUpdate/EndUpdate), remove those things:
var
SL: TStringList;
begin
SL := TStringList.Create;
try
SL.Sorted := True; // Required for Duplicates to work
SL.Duplicates := dupIgnore;
SL.AddStrings(ComboBox1.Items);
ComboBox1.Items.Assign(SL);
finally
SL.Free;
end;
end;

You have to remove duplicates from the source data.
In most scenarios, a ComboBox is filled with data in run-time, which means, data is coming from some source. There are basically 2 scenarios here: a dataset from database and a collection of strings from any other source. In both cases you filter out duplicates before inserting anything into the ComboBox.
If source is a dataset from database, simply use the SQL DISTINCT keyword.
If source is any collection of strings, use a peace of code provided in the answer by #Smasher.

I faced this problem several times before, and i used all the previous approaches and I'm still using them, but do you know : i think the best approach , though not mentioned here, is to subclass TComboBox, creating a new method (say AddUnique ) that add the string to the combo ONLY if it does not exist previously , otherwise it will drop it.
This solution may cost some extra time in the beginning , but it will solve the problem once and for all.

Search TextFile for Line with a specific string

Is it possible to search a TextFile line by line to find a specific string.
ie. Hello
It searches line by line to find if any line has Hello in it. There will only be one string per line.
Is this possible?
If so how do I attempt this?

It's certainly easiest to load the entire file into memory. Provided that your file is small enough then you can do it like this:
found := false;
sl := TStringList.Create;
try
sl.LoadFromFile(fileName);
for line in sl do
if Pos('Hello', line)<>0 then
begin
found := true;
break;
end;
finally
sl.Free;
end;
I assume that when you say
if any line has Hello in it
that you are looking for lines that contain the search string rather than lines that equal the search string.
In a comment you ask:
Ok can I then ask if it would be possible to ask on how to
delete a string that is typed from an edit box out of a file? So
you enter a string into an edit box then it searches the file
for it and deletes that line?
That's an easy enough variation of the above:
procedure RemoveLinesContaining(const fileName, searchText: string);
var
sl: TStringList;
i: Integer;
begin
sl := TStringList.Create;
try
sl.LoadFromFile(fileName);
for i := sl.Count-1 downto 0 do
if Pos(searchText, sl[i])<>0 then
sl.Delete(i);
sl.SaveToFile(fileName);
finally
sl.Free;
end;
end;
This function deletes all lines that contain the search string. If you only want to delete the first such line, then break out of the for loop after the call to Delete.
The loop variable is descending to allow the loop to modify the list.
If you want to use a different test, say equality rather than contains, then simply modify the if statement.

An easy way is to use TStringList.LoadFromFile to load the file, then check IndexOf('hello') - if it's greater than -1, the string is in the file.
var
sl : TStringList;
ix : Integer;
begin
sl := TStringList.Create;
try
sl.LoadFromFile('test.txt');
ix := sl.IndexOf('Hello');
if ix > -1 then ShowMessage('Yup, the file contains a greeting.');
finally
sl.Free;
end;
end;

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart