I have a simple Delphi (2007) procedure that, given a TDataSet and a (sub)list of fields, returns a new TClientDataSet with the distinct values from the given TDataSet.
This works quite well.
In my proc I used the TClientDataSet index to populate the distinct values.
It was fast and easy.
The problem is that a TClientDataSet index supports at most 16 fields.
If you add more, they are silently ignored.
I need more than 16 fields in the dataset (and thus in the index).
Is there any solution? Some hack?
Maybe some open source library to use as workaround?
I'm working offline, so I must do it in memory. The size of the dataset is not huge.
If you need to get distinct occurrences of records across more than 16 fields, and you want to use an index to keep things fast, you'll need to consider concatenating some of those fields. For example:
Test Field                 Field 1   Field 2   Field 3   Field 4
Apple~Banana~Carrot~Donut  Apple     Banana    Carrot    Donut
Create your index on the Test Field.
You might need to create multiple test fields if the total length of your other fields exceeds the maximum length of a text field.
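A minimal sketch of the idea, assuming the ClientDataSet already has an extra string field named TestField wide enough to hold the concatenation (the field and index names here are invented):

procedure BuildConcatKey(CDS: TClientDataSet; const SrcFields: array of string);
var
  I: Integer;
  Key: string;
begin
  CDS.First;
  while not CDS.Eof do
  begin
    Key := '';
    for I := Low(SrcFields) to High(SrcFields) do
    begin
      if I > Low(SrcFields) then
        Key := Key + '~';  // separator keeps ('ab','c') distinct from ('a','bc')
      Key := Key + CDS.FieldByName(SrcFields[I]).AsString;
    end;
    CDS.Edit;
    CDS.FieldByName('TestField').AsString := Key;
    CDS.Post;
    CDS.Next;
  end;
  // A single one-field index now stands in for the >16-field index.
  CDS.AddIndex('ixTest', 'TestField', []);
  CDS.IndexName := 'ixTest';
end;

The separator character matters: without it, two different field combinations could concatenate to the same key and be wrongly treated as duplicates.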
You could swap out the TClientDataSet for a TJvCsvDataSet from the JVCL. It can be used as a pure "in-memory dataset" replacement for ClientDataSets, without any need to read or write CSV files on disk.
It is not quite like the ClientDataSet in design. I am not sure what benefit all these "indexes" in a ClientDataSet offer you, other than that you can't have a field without an index definition; but if that is all you need, you can set the TJvCsvDataSet.FieldDef property to 'Field1,Field2,...,FieldN', then open the dataset and add as many rows as you like. It is practically limited only by the amount of memory you can address in a 32-bit process.
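A rough sketch of that setup; note that the field-definition property is spelled CsvFieldDef in recent JVCL releases, and whether an unset FileName is accepted for pure in-memory use is an assumption to verify against your JVCL version:

var
  CsvDS: TJvCsvDataSet;
begin
  CsvDS := TJvCsvDataSet.Create(nil);
  CsvDS.FieldDef := 'Field1,Field2,Field3';  // 'CsvFieldDef' in newer JVCL releases
  CsvDS.Open;  // no FileName assigned: used purely in memory, as described above
  CsvDS.Append;
  CsvDS.FieldByName('Field1').AsString := 'apple';
  CsvDS.FieldByName('Field2').AsString := 'banana';
  CsvDS.FieldByName('Field3').AsString := 'carrot';
  CsvDS.Post;
end;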
What's the point in adding fields to a TClientDataSet if you can do Cds.FieldByName('field').Value?
Is it faster to have the reference?
Is it 'clearer'?
The problem with
DataSet.FieldByName('field').Value
is a performance one. Each time this is executed, it causes a serial search through the fields collection of the dataset to locate the one with the required name. This search is not optimised in any way, for instance by using a binary search or hashing algorithm. So, if there are many fields and/or you are doing this access while iterating the records in the dataset, it can have a significant impact on performance.
That's one reason, but not the only one, to define "persistent" TFields using the Object Inspector. You can obtain a reference to a particular TField by using the symbolic name known to the compiler, and this binding happens only once, at compile time. So yes, it is faster than FieldByName; it's up to you whether you find it clearer.
Other reasons to use persistent TFields include the ease with which calculated fields can be set up and, more importantly, the fact that the calculated field(s) do not need to be accessed via FieldByName in the OnCalcFields event. The performance hit of FieldByName versus persistent fields is, of course, multiplied by the number of fields referenced in the OnCalcFields event, and OnCalcFields is called at least once for each record in the dataset, even if you do not iterate the dataset records in your own code.
The above is true of all TDataSet descendants, not just TClientDataSets.
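And where persistent fields are not practical, you can still avoid the per-record lookup by calling FieldByName once and caching the result; a minimal sketch (DoSomething stands in for whatever per-row work you do):

var
  NameField: TField;
begin
  NameField := DataSet.FieldByName('field');  // one serial search, outside the loop
  DataSet.First;
  while not DataSet.Eof do
  begin
    DoSomething(NameField.AsString);  // cached reference: no repeated lookup
    DataSet.Next;
  end;
end;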
Well, I'm studying the PacketRecords property of TClientDataSet and I have a doubt about it. I will explain how I think it works; if I'm wrong, please correct me.
1 - If I configure PacketRecords = 50, when I do "SELECT * FROM history" and the history table has 200k rows, the TClientDataSet will do something like "SELECT * FROM history LIMIT 50", and when I need 50 more rows the ClientDataSet will fetch another 50 from the database.
The PacketRecords property only makes sense if the TClientDataSet doesn't fetch all rows from the database, at least to me.
Am I correct?
It will probably execute the entire query and ask to return just 50 records, but that is an implementation detail that I think is chosen not by the ClientDataSet but by the provider, the dataset or the driver.
But in general, that is more or less how it works, yes.
Did some browsing through the code. If the ClientDataSet is linked to a (local) TDataSetProvider, that provider just opens the dataset it is connected to. After opening it, it sets the DataSet.BlockReadSize property to the number of records to retrieve (=packetRecords).
So in the end it comes down on the implementation of BlockReadSize and the dsBlockRead state of the TDataSet that is used.
With a client-server setup it must work the same way. In fact, there doesn't even have to be a dataset or even a database. There's also a TXMLTransformProvider, and people can implement custom providers too; TXMLTransformProvider ignores this value completely.
So, like I said above, there is no general rule for how this works, or even whether this property has any effect.
See TDataPacketWriter.WriteDataSet: no matter whether the underlying dataset supports block-read mode or not, it (the data packet writer) will stop writing the data packet as soon as the requested number of records has been processed (or the Eof state is reached).
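For reference, the client-side knobs discussed above look like this (a minimal sketch; -1 and 0 are the documented special values):

ClientDataSet1.PacketRecords := 50;  // records per data packet (-1 = all, 0 = metadata only)
ClientDataSet1.Open;                 // fetches the first packet of 50
// Scrolling past the last fetched record pulls further packets automatically;
// you can also request the next packet explicitly:
ClientDataSet1.GetNextPacket;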
I am scanning an SQLite database looking for all matches and using
OneFound := False;
tbl1.First;
while not tbl1.Eof do
begin
  if tbl1.FieldByName('Name').AsString = 'jones' then
    OneFound := True;
  tbl1.Next;
end;
if OneFound then // Do something
or should I be using
if not(OneFound) then OneFound:=True;
Is it faster to just assign "True" to OneFound no matter how many times it is assigned, or should I do the comparison and only change OneFound the first time?
I know a better way would be to use FTS3, but for now I have to scan the database, and the question is more about the approach: setting OneFound on every match versus comparing and setting it just once.
Thanks
Your question is, which is faster:
if not(OneFound) then OneFound:=True;
or
OneFound := True;
The answer is probably that the second is faster: conditional statements involve branches, which risk branch mis-prediction.
However, that line of code is trivial compared to what is around it. Running across a database one row at a time is going to be outrageously expensive. I bet that you will not be able to measure the difference between the two options because the handling of that little Boolean is simply swamped by the rest of the code. In which case choose the more readable and simpler version.
But if you care about the performance of this code you should be asking the database to do the work, as you yourself state. Write a query to perform the work.
It would be better to change your SQL statement so that the work is done in the database. If you want to know whether there is a tuple containing the value 'jones' in the field 'name', a quicker query would be
with tquery.create (nil) do
begin
  sql.add ('select name from tbl1 where name = :p1 limit 1');
  params[0].asstring := 'jones';  // Params belongs to the query, not to SQL
  open;
  onefound := not isempty;
  close;
  free
end;
Your syntax may vary regarding the 'limit' clause, but the idea is to return only one tuple from the database which matches the 'where' statement - it doesn't matter which one.
I used a parameter to avoid problems delimiting the value.
1. Search one field
If you want to search one particular field content, using an INDEX and a SELECT will be the fastest.
SELECT * FROM MYTABLE WHERE NAME='Jones';
Do not forget to create an INDEX on the column, first!
2. Fast reading
But if you want to search within a field, or within several fields, you may have to read and check the whole content. In this case, what will be slow is calling FieldByName() for each data row: you should use a local TField variable instead.
Or forget about TDataSet and switch to direct access to SQLite3. Using DB.pas and TDataSet requires a lot of data marshalling, so it is slower than direct access.
See e.g. DiSQLite3 or our DB classes, which are very fast but a bit higher-level. Or you can use our ORM on top of those classes. Our classes are able to read more than 500,000 rows per second from a SQLite3 database, including JSON marshalling into object fields.
3. FTS3/FTS4
But, as you guessed, the fastest would indeed be to use the FTS3/FTS4 feature of SQLite3.
You can think of FTS3/FTS4 as a "meta-index" or a "full-text index" on a supplied blob of text. Just as Google is able to find a word in millions of web pages: it does not use a regular database, but full-text indexing.
In short, you create a virtual FTS3/FTS4 table in your database, then you insert into this table the whole text of your main records in the FTS TEXT field, forcing the ID field to be that of the original data row.
Then, you will query for some words on your FTS3/FTS4 table, which will give you the matching IDs, much faster than a regular scan.
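As a rough illustration of that flow (the table and column names are invented, and Exec stands in for whatever raw-SQL call your SQLite3 access layer provides):

// Hypothetical Exec() helper: any wrapper that can run raw SQL will do.
Exec('CREATE VIRTUAL TABLE docs_fts USING fts4(txt TEXT)');
// Mirror the main table into the FTS table, keeping the original row IDs:
Exec('INSERT INTO docs_fts(rowid, txt) SELECT id, txt FROM docs');
// MATCH uses the full-text index, far faster than scanning every row:
Exec('SELECT rowid FROM docs_fts WHERE txt MATCH ''jones''');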
Note that our ORM has dedicated TSQLRecordFTS3 / TSQLRecordFTS4 classes for direct FTS processing.
Is it possible to provide the following type of functionality with informix client tools?
As the user types the first two characters of a name, the drop-down list is empty. At the third character, the list fills with just the names beginning with those three characters. At the fourth character, MS-Access completes the first matching name (assuming the combo's AutoExpand is on). Once enough characters are typed to identify the customer, the user tabs to the next field.
The time taken to load the combo between keystrokes is minimal. This occurs once only for each entry, unless the user backspaces through the first three characters again.
If your list still contains too many records, you can reduce them by another order of magnitude by changing the value of constant conSuburbMin from 3 to 4.
This requires a combination of two things, only one of which is partially under the control of Informix the DBMS or Informix the Client API.
First of all, you need the gadget that accepts user input to asynchronously generate a query matching what the user has typed, fetch some of the results from the DBMS, and show them. Secondly, you need the DBMS to respond rapidly to such queries. Part of the issue is 'what form does the query take'. But the basic functionality is:
SELECT TitleCaseName
FROM ReferenceTable
WHERE LowerCaseName[1,3] = 'abc';
You might or might not bother with 'first rows optimization'; you might or might not bother with an ORDER BY. Your code would only select the first N rows. You might do it with some prioritization information - most frequently used names, etc.
But this logic is basically the same for any DBMS, give or take details such as the choice of technique for dealing with case-mapping (function call vs column) and the notation for substrings vs LIKE 'abc%'.
The tricky stuff, though, is the asynchronous combination of user-input plus collecting data from the DBMS; that is best handled with multiple threads, one dealing with the user input, one dealing with the DBMS and (possibly) one dealing with the display (or that might also be the one dealing with user input). And that requires hooking into the UI API - not something that the Informix APIs do of their own accord. The UI can get at Informix (or any other DBMS) easily enough through ODBC or any other faintly similar API.
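A minimal, single-threaded sketch of the client side in Delphi terms (the control, query and constant names are all invented; Informix's FIRST n clause limits the rows returned). The asynchronous version described above would move the query onto its own thread:

procedure TForm1.ComboChange(Sender: TObject);
const
  MinChars = 3;  // like conSuburbMin in the question
begin
  Combo.Items.Clear;
  if Length(Combo.Text) < MinChars then
    Exit;  // fewer than three characters: leave the list empty
  Qry.Close;
  Qry.SQL.Text := 'SELECT FIRST 50 TitleCaseName FROM ReferenceTable' +
    ' WHERE LowerCaseName LIKE :p ORDER BY TitleCaseName';
  Qry.ParamByName('p').AsString := LowerCase(Combo.Text) + '%';
  Qry.Open;
  while not Qry.Eof do
  begin
    Combo.Items.Add(Qry.Fields[0].AsString);
    Qry.Next;
  end;
  Qry.Close;
end;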
Good morning!
What is the best way to remove duplicate records from a grid control? I use Delphi 2009 and the DevExpress QuantumGrid component.
I tried looping through all the records and, when a duplicate record is found, adding it to a list and applying a filter on the grid. I found this logic time-consuming, and there are two further downsides to this approach.
[1] When the duplicate records are considerably more numerous, say 10K records, applying the filter takes a lot of time, because there are so many entries to filter out.
[2] Looping through all the records is itself time-consuming for a big result set like 1M rows.
The SQL query returns me distinct rows, but when the user hides a column in the grid, it looks as if there are duplicate records (internally they are distinct).
Is there any other way of doing this?
Any ideas on this are greatly helpful !
Thanks & Regards,
Pavan.
Can you alter your dataset to not return duplicate records in the first place? I would normally only return the records I want displayed instead of returning unwanted records from the database and then using a database grid to try to suppress unwanted records.
With thousands of rows I would add an additional field to the DB called, say, Sum or Hash; or, if you can't change the DB, add a calculated field if it is a ClientDataSet, though this carries overhead at display time.
Calculate the contents of the hash field with something fast and simple, like a sum of all the chars in your text field. All dupes are now easily identified. Add this field to your UNIQUE or DISTINCT query parameters, or filter on it.
Just an Idea.
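A sketch of the calculated-field variant of that idea (RowHash is an invented name; it must first be added as a calculated Integer field on the ClientDataSet):

procedure TForm1.ClientDataSet1CalcFields(DataSet: TDataSet);
var
  S: string;
  I, Sum: Integer;
begin
  // Cheap "hash": the sum of the character codes in the text field.
  S := DataSet.FieldByName('TXT').AsString;
  Sum := 0;
  for I := 1 to Length(S) do
    Inc(Sum, Ord(S[I]));
  DataSet.FieldByName('RowHash').AsInteger := Sum;
end;

Bear in mind that a plain character sum collides easily ('ab' and 'ba' get the same value), so treat equal hashes as duplicate candidates and compare the actual field contents before dropping a row.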
Checking for duplicates is always a bit tricky, for the reasons you just mentioned. The best way to do it in this particular case is probably to filter before the data reaches the grid.
If this grid is getting its records from a database, try tweaking your SQL query to not return any duplicate records. (The "distinct" keyword can be useful here.) The database server can usually do a much better job of it than you can.
If not, then you're probably loading your result set from some sort of object list. Try filtering the list and culling duplicate objects before you load it into the grid. Then it's over with and you don't have to filter the grid itself. This is a lot less time-consuming.
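A sketch of that pre-filtering step, using the generics available in Delphi 2009 (the key is built from the TXT field only; in practice use whichever columns are visible; AddRowToGrid is a hypothetical helper):

uses
  DB, Generics.Collections;

procedure LoadDistinctRows(DataSet: TDataSet);
var
  Seen: TDictionary<string, Boolean>;
  Key: string;
begin
  Seen := TDictionary<string, Boolean>.Create;
  try
    DataSet.First;
    while not DataSet.Eof do
    begin
      Key := DataSet.FieldByName('TXT').AsString;
      if not Seen.ContainsKey(Key) then
      begin
        Seen.Add(Key, True);
        AddRowToGrid(DataSet);  // hypothetical: copy the current row to the grid's source
      end;
      DataSet.Next;
    end;
  finally
    Seen.Free;
  end;
end;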
I have worked with DevExpress's QuantumGrid for some time, and their support forum http://www.devexpress.com/Support/Center/ is excellent. When you post questions, the DevExpress staff will answer you directly. With that said, I did a quick search for you and found some relevant articles.
how to hide duplicate row values: http://www.devexpress.com/Support/Center/p/Q142379.aspx?searchtext=Duplicate+Rows&p=T1|P0|83
highlight duplicate records in a grid: http://www.devexpress.com/Support/Center/p/Q98776.aspx
Unfortunately, it looks like you will have to iterate through the table in order to hide duplicate values. I would suggest that you try to clean the data up before it reaches the grid. Ideally you would update the code/SQL that produces the data. If that is not possible, you could write a TcxCustomDataSource that scrubs the data when it is first loaded. This should perform better because you will not be using the grid's API to access the data.
Edit
ExpressQuantumGrid will not automatically hide rows that look like duplicates because the user hid a column. See: http://www.devexpress.com/Support/Center/p/Q205956.aspx?searchtext=Duplicate+Rows&p=T1|P0|83.
Poster
For example, I have a dataset which contains two fields, ID and TXT. ID is a unique field and the TXT field may contain duplicate values. So when the dataset is connected to the grid with all columns visible, the records are unique. See image1.bmp. But if I hide the ID column then the grid shows duplicate rows. See image2.bmp.
DevExpress Team
I'm sorry, but our ExpressQuantumGrid Suite doesn't support such functionality, because this task is very specific. However, you can implement it manually.