Optimizing looping through dataset - delphi

I have code something like this:
dxMemOrdered : TdxMemData;
while not qrySandbox2.EOF do
begin
dxMemOrdered.append;
dxMemOrderedTotal.asCurrency := qrySandbox2.FieldByName('TOTAL').asCurrency;
dxMemOrdered.post;
qrySandbox2.Next;
end;
This code executes in a thread. When there is a huge number of records, say 400,000, it takes around 25 minutes to loop through them. Is there any way I can reduce the time by optimizing the loop? Any help would be appreciated.
Update
Based on the suggestions, I made the following changes:
dxMemOrdered : TdxMemData;
qrySandbox2.DisableControls;
while not qrySandbox2.Recordset.EOF do
begin
dxMemOrdered.append;
dxMemOrderedTotal.asCurrency := qrySandbox2.Recordset.Fields['TOTAL'].Value;
dxMemOrdered.post;
qrySandbox2.Next;
end;
qrySandbox2.EnableControls;
and my run time has improved from 15 minutes to 2 minutes. Thank you, guys.

Without seeing more code, the only suggestion I can make is to make sure that any visual control that uses the memory table is disabled. Suppose you have a cxGrid called Grid that is linked to your dxMemOrdered memory table:
var
dxMemOrdered: TdxMemData;
...
Grid.BeginUpdate;
try
while not qrySandbox2.EOF do
begin
dxMemOrdered.append;
dxMemOrderedTotal.asCurrency := qrySandbox2.FieldByName('TOTAL').asCurrency;
dxMemOrdered.Post;
qrySandbox2.Next;
end;
finally
Grid.EndUpdate;
end;

Some ideas in order of performance gain vs work to do by you:
1) Check if the SQL dialect that you are using lets you use queries that directly SELECT from/INSERT to. This depends on the database you're using.
2) Make sure that your datasets are not coupled to visual controls, or that you call DisableControls/EnableControls around this loop.
3) Does this code have to run in the main program thread? Maybe you can send it off to a separate thread while the user/program continues doing something else.
4) When you have to deal with really large data, bulk insertion is the way to go. Many databases have options to bulk insert data from text files. Writing to a text file first and then bulk inserting is way faster than individual inserts. Again, this depends on your database type.
[Edit: I just saw that you added the info that it's a TdxMemData, so some of these no longer apply. And you're already threading; I missed that ;-). I'll leave these suggestions in for other readers with similar problems.]

It's much better to let SQL do the work instead of iterating through a loop in Delphi. Try a query such as
insert into dxMemOrdered (total)
select total from qrySandbox2
Is 'total' the only field in dxMemOrdered? I hope that it's not the primary key otherwise you are likely to have collisions, meaning that rows will not be added.

There's actually a lot you could do to speed up your thread.
The first would be to look at the problem in a broader perspective:
Am I fetching data from a cached / fast disk, possibly moved in memory?
Am I doing the right thing when aggregating totals by hand? SQL engines are especially optimized to do those things; all you'd need to do is define an additional logical field in which to store the SQL-aggregated result.
Another little optimization that may bring an improvement over large amounts of looping is to not use constructs like:
Recordset.Fields['TOTAL'].Value
Recordset.FieldByName('TOTAL').Value
but to add the fields with the Fields Editor and then access the right field directly. You'll save a whole search through the fields collection, which is otherwise performed for every field access, on every record.
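As a rough illustration of the point about field lookups, the sketch below resolves each TField once before the loop instead of calling FieldByName for every record. CopyTotals is a hypothetical helper, the destination field name 'TOTAL' is an assumption, and the unit name Data.DB assumes Delphi XE2 or later (older versions call it DB).

```pascal
unit CopyTotalsSketch;

interface

uses
  Data.DB;

// Hypothetical helper: copies the TOTAL column from Source into Dest.
procedure CopyTotals(Source, Dest: TDataSet);

implementation

procedure CopyTotals(Source, Dest: TDataSet);
var
  SrcTotal, DstTotal: TField;
begin
  // Resolve the fields once; FieldByName searches the field list by name.
  SrcTotal := Source.FieldByName('TOTAL');
  DstTotal := Dest.FieldByName('TOTAL'); // assumed field name in the target
  Source.DisableControls;
  Dest.DisableControls;
  try
    Source.First;
    while not Source.Eof do
    begin
      Dest.Append;
      DstTotal.AsCurrency := SrcTotal.AsCurrency;
      Dest.Post;
      Source.Next;
    end;
  finally
    // Always re-enable, even if an exception interrupts the loop.
    Dest.EnableControls;
    Source.EnableControls;
  end;
end;

end.
```

Persistent fields created in the Fields Editor give the same effect without the two lookups; the local variables are just the code-only equivalent.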

Related

Writing many stored procedures

I need to write a long procedure which generates a report for a company.
Since the report involves multiple sets of data to be fetched, I have written many small procedures to fetch the different records.
Is it the right approach to write many subprograms and call them from the main program?
Please help, or is there any other way to do this?
Unless you really go wild (**) and build a 'tree' of stored procedures each calling the other one I don't see any problems with this. There might in fact be benefits to this as
it's easier to maintain smaller pieces of code
(re)compilation of smaller stored procedures is going to be faster
**: There is a 'limit' in MSSQL in that the stack is limited to 32 levels. That is, if procedure1 calls procedure1_1 and that procedure calls procedure1_1_1 and that one calls another etc... you'll get an error when you get over 32 calls 'deep'. Calling multiple stored procedures sequentially isn't a problem though.
The only thing to keep in mind is the context of the variables/temporary tables you're using. If you want to pass values around you'll need to use parameters. (Using OUTPUT parameters can be useful to keep track of a @rowcount variable, for instance.)

TClientDataSet and limit by memory

We have a system that creates reports out of our data. And we can deal with a lot of data. The idea of over 150,000 rows is not out of the question.
Unfortunately, in our experience TClientDataSet has its limitations: it often results in an 'insufficient memory for this operation' error when the data gets too big.
So the question is this: Does there exist a generally available implementation of TDataSet that can handle a large amount of data (such as streaming directly to a file and not keeping the entire dataset in memory)?
I am willing to implement such a class myself. But as far as I understand TClientDataSet, it needs to be able to contain the data itself before it can save it to a file/stream. In addition, loading the data again should also be possible as a stream rather than loading in an entire TClientDataSet object, because then we wouldn't have solved the issue.
You can use either FireBird or Interbase in embedded mode.
Is there really any need to cache all the data on the client before reporting? If not, maybe rethink how you're querying and processing data to generate these reports and see if there's a way that involves less client-side data (which comes with a bonus of less data transmitted over the network).
If you've been down that road before and you really do need all this data client side, then you could look at custom data structures. A TList<T> of records (even if you need to build your own indexes) takes a lot less memory than a TClientDataSet does.
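A minimal sketch of the TList<T>-of-records idea, assuming Delphi XE2 or later for the unit names. TReportRow and its fields are hypothetical; the real layout depends on the report.

```pascal
program RecordListSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

type
  // Hypothetical report row; replace with the fields the report needs.
  TReportRow = record
    Id: Integer;
    Total: Currency;
  end;

procedure Run;
var
  Rows: TList<TReportRow>;
  Row: TReportRow;
  Sum: Currency;
  I: Integer;
begin
  Rows := TList<TReportRow>.Create;
  try
    // Reserve the space once to avoid repeated reallocation while filling.
    Rows.Capacity := 150000;
    for I := 1 to 150000 do
    begin
      Row.Id := I;
      Row.Total := I * 0.5;
      Rows.Add(Row);
    end;

    // Walk the list the way a report generator would.
    Sum := 0;
    for Row in Rows do
      Sum := Sum + Row.Total;
    Writeln('Rows: ', Rows.Count, ', total: ', CurrToStr(Sum));
  finally
    Rows.Free;
  end;
end;

begin
  Run;
end.
```

Each row here costs only the size of the record itself, which is the main reason this tends to need less memory than a dataset component.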
KBMMemTable is a nice alternative to TClientDataset
http://www.components4programmers.com/products/kbmmemtable/
We have been using it for years and it is very useful and fast.
I wanted to point out that the capacity of TClientDataset may be larger than expected.
I tested the limits of TClientDataset by appending xxx,xxx records, inserting the same record repeatedly to get an idea of the size.
// Begin Load Record to TCLientDataset for Reverse (on Reverse) Processing
dxMemData1.Append;
dxMemData1['NT_Rec_No'] := 1000;
dxMemData1['NT_User'] := 'DEV\Administrator';
dxMemData1['NT_Type'] := 'Information';
dxMemData1['Ora_Timestamp'] := '20170706033859.000';
dxMemData1['Ora_Host'] := 'DEV0001';
dxMemData1['Ora_SID'] := 'Oracle.orcl';
dxMemData1['Ora_Event_Id'] := '34';
dxMemData1['NT_Message'] := Memo1.Text;
dxMemData1.Post;
// End Load Record to TCLientDataset for Reverse (on Reverse) Processing
The string in Memo1 is 100 ANSI characters long.
I ran several tests and managed to get to somewhere between 600,000 and 900,000 records without crashing.
Making the text in the memo bigger decreased the maximum number of records before the crash, which suggests the limit is not an exact maximum record count but the memory consumed. That is my guess.
I tested the same with TdxMemData (DevExpress); this time I could reach almost double the number of records.

Trying to hack my way around SQLite3 concurrent writing, any better way to do this?

I use Delphi XE2 along with DISQLite v3 (which is basically a port of SQLite3). I love everything about SQLite3, except the lack of concurrent writing, especially that I extensively rely on multi-threading in this project :(
My profiler made it clear I needed to do something about it, so I decided to use this approach:
Whenever I need to insert a record in the DB, instead of doing an INSERT, I write the SQL query to a file in a special folder, i.e.
WriteToFile_Inline(SPECIAL_FOLDER_PATH + '\' + GUID, FileName + '|' + IntToStr(ID) + '|' + Hash + '|' + FloatToStr(ModifDate) + '|' + ...);
I added a timer (in the main app thread) that fires every minute, parses these files and then INSERTs the queries using a transaction.
The temporary files are deleted at the end.
The result is I have like 500% performance gain. Plus this technique is ACID, as I can always scan the SPECIAL_FOLDER_PATH after a power failure and execute the INSERTs I find.
Despite the good results, I'm not very happy with the method used (hackish to say the least). I keep thinking that if I could have a generics-like list that is thread-safe and ACID, with fast lookup access, this would be much cleaner (and possibly faster?)
So my question is: do you know anything like that for Delphi XE2?
PS. I trust many of you reading the code above will be in shock and will start insulting me at this point! Please be my guest, but if you know a better (i.e. faster) ACID approach, please share your thoughts!
Your idea of sending the inserts to a queue, which will rearrange the inserts, and join them via prepared statements is very good. Using a timer in the main thread or a separated thread is up to you. It will avoid any locking.
Do not forget to use a transaction, then commit it every 100/1000 inserts for instance.
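The queue idea above could be sketched as follows. This is only an assumption about one possible shape, not the answerer's code: TInsertQueue is a hypothetical class, rows are carried as plain strings, and the actual database write is left to the caller. Note that an in-memory queue gives up the crash persistence that the file-based approach in the question provides.

```pascal
unit InsertQueueSketch;

interface

uses
  System.Classes, System.SyncObjs;

type
  // Hypothetical helper: worker threads enqueue rows, one writer drains them.
  TInsertQueue = class
  private
    FLock: TCriticalSection;
    FItems: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enqueue(const Row: string);
    // Moves all pending rows into Target and returns how many were moved.
    function Drain(Target: TStrings): Integer;
  end;

implementation

constructor TInsertQueue.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FItems := TStringList.Create;
end;

destructor TInsertQueue.Destroy;
begin
  FItems.Free;
  FLock.Free;
  inherited;
end;

procedure TInsertQueue.Enqueue(const Row: string);
begin
  FLock.Acquire;
  try
    FItems.Add(Row);
  finally
    FLock.Release;
  end;
end;

function TInsertQueue.Drain(Target: TStrings): Integer;
begin
  // Hold the lock only while moving the rows, not while writing to the DB.
  FLock.Acquire;
  try
    Result := FItems.Count;
    Target.AddStrings(FItems);
    FItems.Clear;
  finally
    FLock.Release;
  end;
end;

end.
```

The single writer (timer or background thread) would call Drain, then run every drained row through one prepared INSERT inside one transaction, so the SQLite lock is taken once per batch rather than once per row.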
About high performance using SQLite3, see e.g. this blog article (and graphic below):
In this graphic, best performance (file off) comes from:
PRAGMA synchronous = OFF
Using prepared statements
Inside a transaction
In WAL mode (especially in concurrency mode)
You may also change the page size, or the journal size, but settings above are the best. See https://stackoverflow.com/search?q=sqlite3+performance
If you do not want to use a background thread, ensure WAL is ON, prepare your statements, use batches, and regroup your process to release the SQLite3 lock as soon as possible.
The best performance will be achieved by adding a Client-Server layer, just as we did for mORMot.
With files you have organized an asynchronous job queue with persistence. It allows you to avoid one-by-one inserts and use a batch (group of records) approach instead. Comparing one-by-one and batch:
the first (probably) works in auto-commit mode for each record; the second wraps a batch into a single transaction, which gives the greatest performance gain.
the first (probably) prepares an INSERT command each time you need to insert a record; the second prepares it once per batch, which gives the second-largest gain.
I don't think that SQLite concurrency is a problem in your case (at least not the main issue), because in SQLite a single insert is comparatively fast, and you will only get concurrency performance issues under high workload. You would probably get similar results with other DBMSs, like Oracle.
To improve your batch approach, consider the following:
consider setting journal_mode to WAL and disabling shared cache mode.
use a background thread to process your queue. Instead of a fixed time interval (1 min), check SPECIAL_FOLDER_PATH more often, and if the queue has more than X KB of data, start processing. Or use a count of queued records and an event to notify the thread that the queue should start processing.
use a multi-record prepared INSERT instead of a single-record INSERT. You can build an INSERT for 100 records and process your queue data in a single batch, but in chunks of 100 records.
consider writing / reading binary field values instead of text values.
consider using a set of files with preallocated size.
etc
sqlite3_busy_timeout is pretty inefficient because it doesn't return immediately when the table it's waiting on is unlocked.
I would try creating a critical section (TCriticalSection?) to protect each table. If you enter the critical section before inserting a row and exit it immediately thereafter, you will create better table locks than SQLite provides.
Without knowing your access patterns, though, it's hard to say if this will be faster than batching up a minute's worth of inserts into single transactions.

How to handle billions of objects without "Outofmemory" error

I have an application which may need to process billions of objects. Each object is of the TRange class type. These ranges are created at different parts of an algorithm, depending on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add them to a collection, the system will throw an Outofmemory error. Now I want to iterate through each object, mainly for two purposes:
To apply an operation to each TRange object (e.g. output certain properties)
To get a cumulative sum of a certain property (e.g. each range has a weight property and I want to retrieve the total weight, the sum of all the range weights).
How do I effectively create an iterator for these objects without raising Outofmemory?
I have handled the first case by passing a function pointer to the algorithm function. For example:
procedure createRanges(aProc: TRangeProc); // aProc is a pointer to a function that takes a TRange
var
  range: TRange;
  rangerec: TRangeRec;
begin
  range := TRange.Create;
  try
    while canCreateRange do // certain conditions needed to create a range
    begin
      rangerec := ReturnRangeRec;
      range.Update(rangerec); // don't create a new object, reuse the same one
      if Assigned(aProc) then
        aProc(range);
    end;
  finally
    range.Free;
  end;
end;
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.
Thank you all in advance
Pradeep
For such large amounts of data you need to keep only a portion of the data in memory. The rest should be serialized to the hard drive. I tackled such a problem like this:
I created an extended storage that can store a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live simultaneously in memory.
Then I derived the record classes from the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
Every time you need a new or already existing record, you ask the extended storage for it. If the maximum number of objects is exceeded, the storage streams some of the least-used records back to the hard drive.
This way the records are transparent. You always access them as if they were in memory, but they may get loaded from the hard drive first. It works really well. By the way, RAM works in a very similar way: it only holds a certain subset of all the data on your hard drive. This is your working set.
I did not post any code because it is beyond the scope of the question itself and would only confuse.
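For readers who want a starting point anyway, here is a minimal sketch of just the streaming building block, under stated assumptions: it is not the storage described above, the TRangeRec layout is hypothetical (the question does not show it), and a TMemoryStream stands in for the TFileStream that a real disk-backed store would use.

```pascal
program RangeStreamSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes;

type
  // Hypothetical fixed-size layout, so record N lives at offset N * SizeOf.
  TRangeRec = packed record
    Lo, Hi: Int64;
    Weight: Int64;
  end;

// Appends one record at the end of the stream.
procedure AppendRec(Stream: TStream; const Rec: TRangeRec);
begin
  Stream.Position := Stream.Size;
  Stream.WriteBuffer(Rec, SizeOf(TRangeRec));
end;

// Reads the record with the given zero-based index back from the stream.
procedure ReadRec(Stream: TStream; Index: Int64; out Rec: TRangeRec);
begin
  Stream.Position := Index * SizeOf(TRangeRec);
  Stream.ReadBuffer(Rec, SizeOf(TRangeRec));
end;

var
  Store: TMemoryStream;
  Rec: TRangeRec;
  I: Integer;
begin
  Store := TMemoryStream.Create; // swap for a TFileStream to spill to disk
  try
    for I := 0 to 9 do
    begin
      Rec.Lo := I * 10;
      Rec.Hi := I * 10 + 9;
      Rec.Weight := I;
      AppendRec(Store, Rec);
    end;
    ReadRec(Store, 7, Rec); // random access by index, no full load needed
    Writeln(Rec.Lo, '..', Rec.Hi, ' weight ', Rec.Weight);
  finally
    Store.Free;
  end;
end.
```

A cache that decides which records stay in memory and which are written back would sit on top of these two routines.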
Look at TgsStream64. This class can handle huge amounts of data through file mapping.
http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter.
It's usually done like this: you write an enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: pointer"). You define the callback function so that its first parameter is the same untyped pointer:
TRangeProc = procedure(Data: pointer; range: TRange);
procedure enumRanges(aProc: TRangeProc; Data: pointer);
begin
  {for each range}
  aProc(Data, range); // arguments in the order TRangeProc declares them
end;
Then if you want to, say, sum all ranges, you do it like this:
TSumRecord = record
  Sum: int64;
end;
PSumRecord = ^TSumRecord;
procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
  SumRecord.Sum := SumRecord.Sum + range.Value;
end;
function SumRanges(): int64;
var
  SumRec: TSumRecord;
begin
  SumRec.Sum := 0;
  enumRanges(TRangeProc(@SumProc), @SumRec);
  Result := SumRec.Sum;
end;
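On Delphi 2009 or later, the same enumerate-with-callback idea can be written with an anonymous method instead of an untyped pointer. This is a swapped-in variant, not the code from the answer: the TRange class, its Weight field, and the Count-driven loop are stand-ins for the question's real class and algorithm, and the unit name assumes XE2 or later.

```pascal
program RangeEnumSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Minimal stand-in for the question's TRange; Weight is the summed property.
  TRange = class
  public
    Weight: Int64;
  end;

// Stand-in for the question's algorithm: reuses one TRange instance and
// hands it to the callback for every generated range.
procedure EnumRanges(Count: Integer; const Visit: TProc<TRange>);
var
  Range: TRange;
  I: Integer;
begin
  Range := TRange.Create;
  try
    for I := 1 to Count do
    begin
      Range.Weight := I; // placeholder for range.Update(rangerec)
      if Assigned(Visit) then
        Visit(Range);
    end;
  finally
    Range.Free;
  end;
end;

function TotalWeight(Count: Integer): Int64;
var
  Sum: Int64;
begin
  Sum := 0;
  EnumRanges(Count,
    procedure(R: TRange)
    begin
      Sum := Sum + R.Weight; // captured local, no untyped pointer needed
    end);
  Result := Sum;
end;

begin
  Writeln(TotalWeight(100)); // weights 1..100, prints 5050
end.
```

Each new operation (output, sum, count) becomes another small closure passed to the same enumerator, so the algorithm itself is written only once.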
Anyway, if you need to create billions of ANYTHING you're probably doing it wrong (unless you're a scientist, modelling something extremely large scale and detailed). Even more so if you need to create billions of stuff every time you want one of those. This is never good. Try to think of alternative solutions.
"Runner" has a good answer how to handle this!
But I would like to know whether you could do a quick fix: make smaller TRange objects.
Maybe you have a big ancestor? Can you take a look at the instance size of TRange object?
Maybe you'd be better off using packed records?
This part:
As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects.
sounds a bit like calculating Fibonacci. Maybe you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach - it works by storing already calculated intermediate results in a hash map.
Handling billions of objects is possible but you should avoid it as much as possible. Do this only if you absolutely have to...
I did create a system once that needed to be able to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around it was used to decide when an object would be saved to disk and thus removed from memory. Basically, when I would call an object, this class would check whether it was loaded or not. If not, it would re-create the object from disk, put it on top of a stack and then move/write the bottom object from this stack to disk. As a result, my stack had a fixed (maximum) size. And it allowed me to use an unlimited number of objects, with reasonably good performance too.
Unfortunately, I don't have that code available anymore. I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support plus a bunch more for the stack controller which maintains all those objects. But it technically would allow you to create an unlimited number of objects, since you're trading RAM memory for disk space.

Difference between cfstoredproc and cfquery

I've found that previous programmers used cfstoredproc in our existing project to insert records into the database.
For example, he/she used it like this:
<cfstoredproc procedure="myProc" datasource="myDsn">
<cfprocparam type="In" cfsqltype="CF_SQL_CHAR" value="ppshein" dbvarname="username">
<cfprocparam type="In" cfsqltype="CF_SQL_CHAR" value="28 yrs" dbvarname="age">
</cfstoredproc>
I can't understand why he/she used the above code instead of:
<cfquery datasource="myDsn">
insert usertb
(name, age)
values
(<cfqueryparam cfsqltype="CF_SQL_CHAR" value="ppshein">, <cfqueryparam cfsqltype="CF_SQL_CHAR" value="28 yrs">)
</cfquery>
I feel there may be a hidden performance difference between cfstoredproc and cfquery for complex data manipulation. Please let me know how cfstoredproc performs compared with cfquery in ColdFusion for complex data manipulation. What I do know is that it is reusable.
CFSTOREDPROC should have better performance for the same reason a stored procedure will have better performance at the database level -- when created, the stored procedure is optimized internally by the database.
Whether this is noticeable depends on your database and your query. Use of CFQUERYPARAM inside your CFQUERY (as in your example) also speeds execution (at the db driver level).
Unless the application is VERY performance-sensitive, I tend to run my SQL code in a profiler first to optimize it, then put it into my CFQUERY parametrized with CFQUERYPARAM tags, rather than use a storedproc. That way, all the logic is in the CF code. This is a matter of taste, of course, and it's easy to move the SQL into a storedproc later when the application matures.
Some shops prefer to have all data logic controlled by the database, leaving CF to act almost exclusively as the front-end generator. Some places are so controlling that they won't let you write any SQL in your CF code.
Update: There might be more to that Stored Proc than a simple INSERT INTO. There may be some data lookup in another table. There could be validation. There may be conditional logic. There may be multiple transactions going on, like a log. A failure to perform the insert may return a specific status code rather than throwing an error.
Honestly, it's simply a matter of style. There are reasons for and against either way, and I've found that it usually comes down to who has more/more competent coders: The CF guys or the database guys.
Basically, if you use a stored procedure, it will be pre-compiled by the database and the execution plan stored. This means that subsequent calls to the stored procedure do not incur that overhead. For large complex queries this can be substantial.
So as a rule of thumb: queries that are...
large and complex or
called very frequently or
both of the above
...are very good candidates for conversion to a stored procedure.
Hope that helps!

Resources