Is it faster to constantly assign a value or compare - delphi

I am scanning an SQLite database looking for all matches and using
OneFound := False;
if tbl1.FieldByName('Name').AsString = 'jones' then
begin
  OneFound := True;
  tbl1.Next;
end;
if OneFound then // Do something
or should I be using
if not(OneFound) then OneFound:=True;
Is it faster to just assign "True" to OneFound no matter how many times it is assigned, or should I do the comparison and only change OneFound the first time?
I know a better way would be to use FTS3, but for now I have to scan the database, and the question is more about the approach: setting OneFound every time a match is encountered, or using the compare approach and setting it just once.
Thanks

Your question is, which is faster:
if not(OneFound) then OneFound:=True;
or
OneFound := True;
The answer is probably that the second is faster. Conditional statements involve branches, which risk branch misprediction.
However, that line of code is trivial compared to what is around it. Running across a database one row at a time is going to be outrageously expensive. I bet that you will not be able to measure the difference between the two options because the handling of that little Boolean is simply swamped by the rest of the code. In which case choose the more readable and simpler version.
But if you care about the performance of this code you should be asking the database to do the work, as you yourself state. Write a query to perform the work.

It would be better to change your SQL statement so that the work is done in the database. If you want to know whether there is a tuple which contains the value 'jones' in the field 'name', then a quicker query would be
with TQuery.Create(nil) do
try
  SQL.Add('select name from tbl1 where name = :p1 limit 1');
  ParamByName('p1').AsString := 'jones';
  Open;
  OneFound := not IsEmpty;
  Close;
finally
  Free;
end;
Your syntax may vary regarding the 'limit' clause but the idea is to return only one tuple from the database which matches the 'where' statement - it doesn't matter which one.
I used a parameter to avoid problems delimiting the value.

1. Search one field
If you want to search the content of one particular field, using an INDEX and a SELECT will be the fastest.
SELECT * FROM MYTABLE WHERE NAME='Jones';
Do not forget to create an INDEX on the column, first!
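For instance, the index could be created once using the same kind of query component as in the answer above (the index name is just an illustration; adapt to whatever data-access components you actually use):
with TQuery.Create(nil) do
try
  // one-off: index the NAME column so the WHERE clause can use it
  SQL.Add('CREATE INDEX IF NOT EXISTS IDX_MYTABLE_NAME ON MYTABLE(NAME)');
  ExecSQL;
finally
  Free;
end;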
2. Fast reading
But if you want to search within a field, or within several fields, you may have to read and check the whole content. In this case, what will be slow is calling FieldByName() for every data row: you had better use a local TField variable.
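A minimal sketch of that idea, reusing the table and field names from the question (tbl1 and 'Name'), so FieldByName() is resolved only once instead of once per row:
var
  NameField: TField;
  OneFound: Boolean;
begin
  OneFound := False;
  NameField := tbl1.FieldByName('Name'); // looked up once, outside the loop
  tbl1.First;
  while not tbl1.Eof do
  begin
    if NameField.AsString = 'jones' then
      OneFound := True;
    tbl1.Next;
  end;
end;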
Or forget about TDataSet, and switch to direct access to SQLite3. In fact, using DB.pas and TDataSet requires a lot of data marshalling, so it is slower than direct access.
See e.g. DiSQLite3 or our DB classes, which are very fast, but a bit higher level. Or you can use our ORM on top of those classes. Our classes are able to read more than 500,000 rows per second from a SQLite3 database, including JSON marshalling into object fields.
3. FTS3/FTS4
But, as you guessed, the fastest would indeed be to use the FTS3/FTS4 feature of SQLite3.
You can think of FTS3/FTS4 as a "meta-index" or a "full-text index" on a supplied blob of text. Just like Google is able to find a word in millions of web pages: it does not use a regular database, but full-text indexing.
In short, you create a virtual FTS3/FTS4 table in your database, then you insert into this table the whole text of your main records in the FTS TEXT field, forcing the ID field to be that of the original data row.
Then, you will query for some words on your FTS3/FTS4 table, which will give you the matching IDs, much faster than a regular scan.
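As a rough sketch only, in the same query-component style as above (the MYTABLE_FTS name and the ID column are assumptions, and the virtual table plus its content only need to be created once):
with TQuery.Create(nil) do
try
  SQL.Text := 'CREATE VIRTUAL TABLE MYTABLE_FTS USING fts4(NAME)';
  ExecSQL;
  SQL.Text := 'INSERT INTO MYTABLE_FTS(docid, NAME) SELECT ID, NAME FROM MYTABLE';
  ExecSQL;
  SQL.Text := 'SELECT docid FROM MYTABLE_FTS WHERE NAME MATCH :p1';
  ParamByName('p1').AsString := 'jones';
  Open;
  // each returned docid is the ID of a matching row in MYTABLE
finally
  Free;
end;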
Note that our ORM has dedicated TSQLRecordFTS3 / TSQLRecordFTS4 kind of classes for direct FTS process.

Related

How to go to the last record in results of ado.locate (Delphi)

I located some records by this code:
ADOQuery1.Locate('field1',ADOQuery2.FieldByName('field2').Value,[])
How to go to the last one of these records?
You have a number of options. The best depends on a whole lot of considerations you haven't mentioned in your question. I'll provide a very brief overview of the options to avoid this becoming "too broad". It'll be up to you to make your choice and figure out the details. If you get stuck, you can ask a new, more specific question.
Using Locate
A solution involving Locate is only feasible if your dataset is sorted by the same field you're searching on.
Clearly your search value is not a unique key. So I'm guessing that you're trying to find the last row matching the search value in data sorted by some other unique field. (Otherwise the concept of last is meaningless.)
So it's highly probable this is not appropriate for you; unless your data is ordered by a composite key of your search field followed by a unique key.
The approach is simple: navigate forwards until you find a row where the search value doesn't match, then backtrack by 1 row.
if not DataSet.Locate(SearchField, SearchValue, []) then
  { handle not found case as desired }
else
begin
  while (not DataSet.Eof) and (DataSet.FieldByName(SearchField).Value = SearchValue) do
    DataSet.Next;
  { Watch out for case that last row in dataset matches search value }
  if (DataSet.FieldByName(SearchField).Value <> SearchValue) then
    DataSet.Prior;
end;
Implement your own search
This is straight-forward and will always work. But it is inefficient, having O(n) complexity. So not advised for large datasets.
DataSet.Last;
while (not DataSet.Bof) and (DataSet.FieldByName(SearchField).Value <> SearchValue) do
DataSet.Prior;
NOTE: In order to mirror the behaviour of Locate, it would be advisable to enhance this method to deal with the case where a match is not found at all. In that case the active record should not be inadvertently changed as a side-effect of the search.
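For example, one way to preserve the active record when nothing matches is a bookmark (a sketch only, using the generic TDataSet bookmark calls):
var
  BM: TBookmark;
begin
  BM := DataSet.GetBookmark;
  try
    DataSet.Last;
    while (not DataSet.Bof) and (DataSet.FieldByName(SearchField).Value <> SearchValue) do
      DataSet.Prior;
    if DataSet.FieldByName(SearchField).Value <> SearchValue then
      DataSet.GotoBookmark(BM); // no match: restore the original active record
  finally
    DataSet.FreeBookmark(BM);
  end;
end;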
Use filtering
Obviously this solution depends on whether filtering the dataset is appropriate to the rest of your code. But it is a fairly simple option, and depending on factors beyond the scope of this answer, it can be more performant than the previous option.
DataSet.Filtered := False;
{ The next line may be a little tricky.
Ensure the filter string is appropriate for the data-types involved. }
DataSet.Filter := '<string of the form SearchField = SearchValue>';
DataSet.Filtered := True;
DataSet.Last;
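For instance, assuming SearchField is a string field called NAME and the value is 'jones' (both names are illustrative), the filter string could be built like this:
DataSet.Filter := 'NAME = ' + QuotedStr('jones');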
See documentation on the Filter property.
NOTE: It may be advisable to take precaution against setting the filter redundantly.
Use a master-detail relationship
This option is included because your question code indicates the SearchValue comes from the active record of another dataset. You're using ADO, so this option is available to you.
DataSet.MasterSource := <Appropriate DataSource>;
DataSet.MasterFields := SearchField;
DataSet.Last;
See documentation on master-detail relationships and on ADO MasterFields.
Offload the work to the RDBMS
Finally, it's worth considering using a stored procedure to get the information you need directly from the database. The advantage is that the server can leverage available indexes and have the potential to provide the most performant option. Again though, a lot depends on the particulars of your application.
A query along the following lines can form the basis of your stored procedure.
select MAX(UniqueField) as RowKey
from Table
where SearchField = SearchValue
Then call your stored procedure, and use its result to find the desired row.
DataSet.Locate(UniqueField, RowKey, []);
NOTE: Don't forget to consider the stored procedure returning NULL if no rows with SearchValue exist.
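A sketch of the calling side, assuming ADO and a hypothetical procedure name and parameter (none of these names come from the question), including the NULL check mentioned above:
var
  SP: TADOStoredProc;
begin
  SP := TADOStoredProc.Create(nil);
  try
    SP.Connection := ADOConnection1;             // assumed connection component
    SP.ProcedureName := 'FindLastMatch';         // hypothetical procedure wrapping the query above
    SP.Parameters.Refresh;
    SP.Parameters.ParamByName('@SearchValue').Value := SearchValue;
    SP.Open;
    if not SP.FieldByName('RowKey').IsNull then  // NULL means no row matched SearchValue
      DataSet.Locate(UniqueField, SP.FieldByName('RowKey').Value, []);
  finally
    SP.Free;
  end;
end;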
General Disclaimer
All the above code is extremely brief and for illustrative purposes only. In many cases additional code is required for a robust implementation.
E.g. It might be necessary to DisableControls and enable them again.
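For example, wrapping any of the navigation code above like this avoids repainting data-aware controls while the cursor moves:
DataSet.DisableControls;
try
  // ... perform the Locate / Next / Prior navigation here ...
finally
  DataSet.EnableControls;
end;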
NOTE: It's very important with the above to be aware of the actual ordering of the data in your datasets. Failure to take this into account can lead to incorrect behaviour. Even the last option may exhibit worse than expected performance if your dataset is not sorted by UniqueKey.
If your table has an autoincrement identity field you can do this:
ADOQuery1.SQL.Clear;
ADOQuery1.SQL.Add('select top 1 * from yourtablename where field1 = value1 and field2 = value2 order by yourAIcolumn desc');
ADOQuery1.Open;
value1 and value2 are your desired values; pass them as parameters or put them in the command text.
This way you get only the row you want and there is no need to loop.
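If you go the parameter route, a sketch might look like this (SomeValue1, SomeValue2 and the column names are placeholders):
ADOQuery1.SQL.Text := 'select top 1 * from yourtablename ' +
  'where field1 = :value1 and field2 = :value2 order by yourAIcolumn desc';
ADOQuery1.Parameters.ParamByName('value1').Value := SomeValue1;
ADOQuery1.Parameters.ParamByName('value2').Value := SomeValue2;
ADOQuery1.Open;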

Find changes quickly in larger SQL database?

There is a Java Swing application which uses an Informix database. I have user rights granted for the Swing application (i.e. no source code), and read only access to a mirror of the database.
Sometimes I need to find a database column, which is backing a GUI element (TextBox, TableField, Label...). What would be best approach to find out which database column and table is holding the data shown e.g. in a TextBox?
My general approach is to capture the state of the database. Commit a change using the GUI and then capture the state of the database again. Then I need to examine the difference. I've already tried:
Use the nrows field of systables: Didn't work, because the number in nrows does not seem to be a realtime representation of the row count.
Create a script with SELECT COUNT(*) ... for all tables: didn't work because too many tables (> 5000). Also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check if this suits your needs.
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty, at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables contain longer character strings, and they'd be the primary targets of your searches. If the schema is badly designed with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).

SSIS Merge Join component wrote 0 rows

First of all, thanks to the community for the amount of information on the site; it has helped me a lot with C# and SSIS. The second thing is that I'm not very good with English, so please be patient; if you don't understand something, please ask and I'll try to make it clearer.
I have two OLEDB connection sources from different databases, and both tables have a column with an ID that I use as a join key. In RUT CRUZADOS, the ID is a float datatype, while in the other source (CTACTE AÑO PAS) I don't know which type of data it is (I can't open the database with SQL Server, I can only do SELECT operations).
When I combine them in the Merge Join, it doesn't give me any error, but when I run the program, this happens:
[SSIS.Pipeline] Information: "component "CARGOS ABONOS" (239)" wrote 0 rows.
In Microsoft Access, the "Inner Join" returns about 4 million rows. I think the problem is the metadata, but I don't know how to use the "Data Conversion" component. Can someone help me please?
Thank you all
You can view the data types, at least as far as SSIS is concerned by double clicking on connector lines. In the Data Flow Path Editor that pops up, the Metadata tab will describe the column types.
That said, it doesn't matter because the Merge Join transformation is only going to allow you to merge data of the same type.
A Merge Join requires the source system data to be sorted. This is accomplished by either adding sort components into the stream (not recommended as this is an asynchronous transform that eats all your memory and kills your performance) or by explicitly sorting in your source systems and then marking them as sorted in the Advanced tab.
Since I don't see a Sort, that leads me to believe the sort is done in the source systems. Or, the sorts are not done there but someone has marked the output as sorted. There must be explicit ORDER BY clauses in those source queries. Sometimes, SQL Server will return data in the same order, but unless there is an ORDER BY, it cannot be guaranteed. (I wish I could use the flash tag to emphasize the last point.)
Future readers, if you have a sort in both systems and they are both sorted on the same column, then you need to examine collations. Case Insensitive is a different beast than Case Sensitive, and a sort on an ASCII-based system yields a different sort than one using EBCDIC for mixed alpha-numeric data, like I once had...
As the source data type appears to be float, sorting is not the likely culprit. The realization is dawning on me: instead of sort issues, you likely have an uglier and more insidious comparison issue. Floating point numbers are approximations. 1=1, but 1.00000000000(etc) may or may not be equal to 1.0000000000(etc)1.
Do you actually need the decimal places to make the match? If not, casting to an integer in both systems (and sorting on the CAST'ed value) should make these matches work. If there are decimal places that matter, then you're going to need to cast to an exact numeric type (and pray that they both convert in the same way). The fact that Access does it leads me to believe the Integer data type will be your salvation.

Using multiple key value stores

I am using Ruby on Rails and have a situation that I am wondering if is appropriate for using some sort of Key Value Store instead of MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words and I want users to be able to copy a list. This is a heavy MySQL task b/c it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a list-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would think would be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
This holds regardless of whether you use sequences, as in PostgreSQL and Oracle, or another mechanism such as MySQL's auto-increment.
However, sequence values are most often acquired in blocks of, for example, 20 numbers.
So with PostgreSQL you cannot determine which row was inserted first. There might even be gaps in the IDs of inserted records.
Therefore you shouldn't use a generated id field for a task like that in order to not rely on database implementation details.
Generating a created or updated field during command execution is much better for sorting by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES ('something', CURRENT_TIMESTAMP)
UPDATE A SET data = 'something', updated = CURRENT_TIMESTAMP
That depends on your database vendor.
MySQL, I believe, absolutely orders auto-increment keys. SQL Server I don't know about for sure, but I believe it does too.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle, which uses sequences that are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
I believe the answer to your question is yes... Reading between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' in the sequence. If you had used 1, 2, 3, 5, 6, 7 as ID numbers, then in all the implementations I know of, the next ID number will always be 8 (or possibly higher); I don't know of any DB that would try to figure out that record ID #4 is missing and attempt to re-use that ID number.
Though I am most familiar with SQL Server, I don't know of any vendor who would try to fill the gaps in a sequence - think of the overhead of keeping that list of unused IDs, as opposed to just always keeping track of the last ID number used and adding 1.
I'd say you could safely rely on the next ID assigned number always being higher than the last - not just unique.
Yes the id will be unique and no, you can not and should not rely on it for sorting - it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
  id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT NOW(),
  updated TIMESTAMP,
  PRIMARY KEY (id)
) engine=InnoDB; ## whatever :P
Now, that takes care of creation time. With update time I would suggest a BEFORE UPDATE trigger like this one (an AFTER UPDATE trigger cannot modify NEW values); of course you can do it in a separate query, but the trigger, in my opinion, is a better solution - more transparent:
DELIMITER $$
CREATE TRIGGER foo_bu BEFORE UPDATE ON foo
FOR EACH ROW
BEGIN
  SET NEW.updated = NOW();
END$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly I've not specified that this is for MySQL; there might be some differences in the function names (namely 'NOW') and other subtle itty-bitty details.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an ORDER BY column. E.g. if you delete some early rows, then insert new ones, they may end up living in the same place in the db as the old ones did (albeit with new IDs), and that's what it may use as its default sort.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index to a datetime field (which is bigger and therefore slower than a simple integer primary key index), guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in some slightly different order.
This is probably DB engine dependent. I would check how your DB implements sequences and, if there are no documented problems, then I would decide to rely on the ID.
E.g. Postgresql sequence is OK unless you play with the sequence cache parameters.
There is a possibility that another programmer will manually create or copy records from a different DB with the wrong ID column. However I would simplify the problem. Do not bother with low-probability cases where someone will manually destroy data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory yes, the highest id number is the last created. Remember though that databases do have the ability to temporarily turn off the insert of the autogenerated value, insert some records manually and then turn it back on. These inserts are not typically used on a production system but can happen occasionally when moving a large chunk of data from another system.
