Can TCustomClientDataset apply updates in a batch mode? - delphi

I've got a DB Express TSimpleDataset connected to a Firebird database. I've just added several thousand rows of data to the dataset, and now it's time to call ApplyUpdates.
Unfortunately, this results in several thousand database hits as it tries to INSERT each row individually. That's a bit disappointing. What I'd really like to see is the dataset generate a single transaction with a few thousand INSERT statements in it and send the whole thing at once. I could set that up myself if I had to, but first I'd like to know if there's any method for it built in to the dataset or the DBX framework.

Don't know if it is possible with a TSimpleDataset (never used it), but surely you can if you use a TClientDataset + TDatasetProvider + <put your db dataset here>. You can write a BeforeUpdateRecord to handle the apply process yourself. Basically, it allows you to bypass the standard apply process, access the dataset delta with changes made to records, and then use your own code and components to apply changes to the database. For example you could call stored procedures to modify data, and so on.
However, there is a difference between a transaction and what is called "array DML", "bulk insert" or the like. Even if you use a single transaction (and an "apply" AFAIK happens in a single transaction), within the transaction you may still need to send "n" INSERTs. Some databases supports a way of sending a single INSERT (or update, delete) with an array of parameters to be inserted, reducing the number of single statements to be used - but that may be very database specific and AFAIK dbExpress/Datasnap do not support it - you still could use the BeforeUpdateRecord event to take advantage of specific database capabililties.

Related

"Transactional safety" in influxDB

We have a scenario where we want to frequently change the tag of a (single) measurement value.
Our goal is to create a database which is storing prognosis values. But it should never loose data and track changes to already written data, like changes or overwriting.
Our current plan is to have an additional field "write_ts", which indicates at which point in time the measurement value was inserted or changed, and a tag "version" which is updated with each change.
Furthermore the version '0' should always contain the latest value.
name: temperature
-----------------
time write_ts (val) current_mA (val) version (tag) machine (tag)
2015-10-21T19:28:08Z 1445506564 25 0 injection_molding_1
So let's assume I have an updated prognosis value for this example value.
So, I do:
SELECT curr_measurement
INSERT curr_measurement with new tag (version = 1)
DROP curr_mesurement
//then
INSERT new_measurement with version = 0
Now my question:
If I loose the connection in between for whatever reason in between the SELECT, INSERT, DROP:
I would get double records.
(Or if I do SELECT, DROP, INSERT: I loose data)
Is there any method to prevent that?
Transactions don't exist in InfluxDB
InfluxDB is a time-series database, not a relational database. Its main use case is not one where users are editing old data.
In a relational database that supports transactions, you are protecting yourself against UPDATE and similar operations. Data comes in, existing data gets changed, you need to reliably read these updates.
The main use case in time-series databases is a lot of raw data coming in, followed by some filtering or transforming to other measurements or databases. Picture a one-way data stream. In this scenario, there isn't much need for transactions, because old data isn't getting updated much.
How you can use InfluxDB
In cases like yours, where there is additional data being calculated based on live data, it's common to place this new data in its own measurement rather than as a new field in a "live data" measurement.
As for version tracking and reliably getting updates:
1) Does the version number tell you anything the write_ts number doesn't? Consider not using it, if it's simply a proxy for write_ts. If version only ever increases, it might be duplicating the info given by write_ts, minus the usefulness of knowing when the change was made. If version is expected to decrease from time to time, then it makes sense to keep it.
2) Similarly, if you're keeping old records: does write_ts tell you anything that the time value doesn't?
3) Logging. Do you need to over-write (update) values? Or can you get what you need by adding new lines, increasing write_ts or version as appropriate. The latter is a more "InfluxDB-ish" approach.
4) Reading values. You can read all values as they change with updates. If a client app only needs to know the latest value of something that's being updated (and the time it was updated), querying becomes something like:
SELECT LAST(write_ts), current_mA, machine FROM temperature
You could also try grouping the machine values together:
SELECT LAST(*) FROM temperature GROUP BY machine
So what happens instead of transactions?
In InfluxDB, inserting a point with the same tag keys and timestamp over-writes any existing data with the same field keys, and adds new field keys. So when duplicate entries are written, the last write "wins".
So instead of the traditional SELECT, UPDATE approach, it's more like SELECT A, then calculate on A, and put the results in B, possibly with a new timestamp INSERT B.
Personally, I've found InfluxDB excellent for its ability to accept streams of data from all directions, and its simple protocol and schema-free storage means that new data sources are almost trivial to add. But if my use case has old data being regularly updated, I use a relational database.
Hope that clear up the differences.

Assign, memcpy or other method required for TClientDataset or TDataSetProvider (dbexpress)

I'm usign the dbExpress components within the Embarcadero C++Builder XE environment.
I have a relatively large table with something between 20k and 100k records, which I display in a DBGrid.
I am using a DataSetProvider, which is connected to a SQLQuery and a ClientDataSet, which is connected to the DataSetProvider.
I also need to analyze the data and therefore I need to run through the whole table. For smaller tables I always used code, which is basically something like this:
Form1->ClientDataSet1->First();
while(!Form1->ClientDataSet1->Eof){
temp=Form1->ClientDataSet1->FieldByName("FailReason")->AsLargeInt;
//do something with temp
Form1->ClientDataSet1->Next();
}
Of course this works out, but it is very slow, when I need to run through the whole DBGrid. For some 50000 records in can take up to some minutes. My suspicion is that the most perform is lost since the DBGrid needs to be repainted as the actual Dataset increments its address.
Therefore I am looking for a method which allows me to either read the data without manipulating the actual ClientDataSet. Maybe a method which copies the data of a column into a variable, or another way to run through the datasets, which is more efficient. I am sure if I would have a copy in a variable the operation would take less than a few seconds...
I googled now for hours, but didn't find anything useful so far.
Best regards,
Bodo
if your cds is connected to some db-aware control(-s) (via TDataSource) then first of all consider using DisableControls()
another option would be to avoid utilizing FieldByName within the loop

Find changes quickly in larger SQL database?

There is a Java Swing application which uses an Informix database. I have user rights granted for the Swing application (i.e. no source code), and read only access to a mirror of the database.
Sometimes I need to find a database column, which is backing a GUI element (TextBox, TableField, Label...). What would be best approach to find out which database column and table is holding the data shown e.g. in a TextBox?
My general approach is to capture the state of the database. Commit a change using the GUI and then capture the state of the database again. Then I need to examine the difference. I've already tried:
Use the nrows field of systables: Didn't work, because the number in nrows does not seem to be a realtime representation of the row count.
Create a script with SELECT COUNT(*) ... for all tables: didn't work because too many tables (> 5000). Also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check if this suits your needs
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty — at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables which contain longer character strings, and they'd be the primary targets of your searches. If the schema is badly designed with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).

Are there any extensions to TADOQuery that include client indexes

Quick question (hopefully)
I have a large dataset (>100,000 records) that I would like to use as a lookup to determine existence or non-existence of multiple keys. The purpose of this is to find FK violations before trying to commit them to the database to try and avoid the resultant EDatabaseError messing up my transaction.
I had been using TClientDataSet/TDatasetProvider with the FindKey method, as this allowed a client-side index to be set up and was faster (2s to scan each key rather than 10s for ADO). However, moving to large datasets the population of the CDS is starting to take far more time than the local index is saving.
I see that I have a few options for alternatives:
client cursor with TADOQuery.locate method
ADO SELECT statements for each check (no client cache)
ADO SEEK method
Extend TADOQuery to mimic FindKey
The Locate method seems easiest and doesn't spam the server with the SELECT/SEEK methods. I like the idea of extending the TADOQuery, but was wondering whether anyone knew of any ready-made solutions for this rather than having to create my own?
I would create a temporary table in the database server. Insert all 100,000 records into this temp table. Do bulk inserts of say 3000 records at a time, to minimise round trips to the server. Then run select statements on this temp table to check for foreign key violations etc. If all okay, do an insert SQL from the temp table to the main table.

Avoiding round-trips when importing data from Excel

I'm using EF 4.1 (Code First). I need to add/update products in a database based on data from an Excel file. Discussing here, one way to achieve this is to use dbContext.Products.ToList() to force loading all products from the database then use db.Products.Local.FirstOrDefault(...) to check if product from Excel exists in database and proceed accordingly with an insert or add. This is only one round-trip.
Now, my problem is there are two many products in the database so it's not possible to load all products in memory. What's the way to achieve this without multiplying round-trips to the database. My understanding is that if I just do a search with db.Products.FirstOrDefault(...) for each excel product to process, this will perform a round-trip each time even if I issue the statement for the exact same product several times ! What's the purpose of the EF caching objects and returning the cached value if it goes to the database anyway !
There is actually no way to make this better. EF is not a good solution for this kind of tasks. You must know if product already exists in database to use correct operation so you always need to do additional query - you can group multiple products to single query using .Contains (like SQL IN) but that will solve only check problem. The worse problem is that each INSERT or UPDATE is executed in separate roundtrip as well and there is no way to solve this because EF doesn't support command batching.
Create stored procedure and pass information about product to that stored procedure. The stored procedure will perform insert or update based on the existence of the record in the database.
You can even use some more advanced features like table valued parameters to pass multiple records from excel into procedure with single call or import Excel to temporary table (for example with SSIS) and process them all directly on SQL server. As last you can use bulk insert to get all records to special import table and again process them with single stored procedures call.

Resources