Informix: "Load from file" and indexes

I need to load a large amount of data from a file into a table that has many indexes on it.
Which way is faster: load the data and create the indexes afterwards, or create the indexes first?

There's no question that loading the data first and creating the indexes subsequently is much faster. If it's messy to create the indexes separately for some reason, you can create them and then disable them for the period of the load:
SET INDEXES, CONSTRAINTS ON table DISABLED;
Load table and then run:
SET INDEXES, CONSTRAINTS ON table ENABLED;
Then UPDATE STATISTICS on the table according to best practice.
Having said all that, if speed is the issue, look at the High Performance Loader or even DBLoad. Either will be far more efficient than LOAD FROM file.unl INSERT INTO table.
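For example, a minimal dbaccess sketch of the load-then-index sequence (the table, column and file names are placeholders):

-- Load the raw data first; no index maintenance happens during the load.
LOAD FROM "file.unl" DELIMITER "|" INSERT INTO mytable;
-- Build the index once, over the complete data set.
CREATE INDEX ix_mytable_col1 ON mytable (col1);
-- Refresh the optimizer's statistics now that data and index are in place.
UPDATE STATISTICS MEDIUM FOR TABLE mytable;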

Related

Merging without rewriting one table

I'm wondering about something that doesn't seem efficient to me.
I have two tables: one very large table, DATA (millions of rows and hundreds of columns), with an id as its primary key.
I then have another table, NEW_COL, with a variable number of rows (1 to millions) but always two columns: id and new_col_name.
I want to update the first table, adding the new data to it.
Of course, I know how to do it with a PROC SQL left join, or a DATA step merge.
Yet it seems inefficient: as far as I can tell from the execution times (and I may be wrong), both of these approaches rewrite the huge table completely, even when NEW_COL contains only one row (almost 1 minute).
I tried doing it with two SQL statements, ALTER TABLE ADD COLUMN followed by UPDATE, but it's way too slow, as UPDATE with a join doesn't seem efficient at all.
So, is there an efficient way to "add a column" to an existing table WITHOUT rewriting this huge table?
Thanks!
SAS datasets are row stores, not columnar stores like tables in some other databases. As such, adding rows is far easier and more efficient than adding columns. A key-joined view could be argued to be the most 'efficient' way to add a column to a data rectangle.
If you are adding columns so often that the one-minute resource cost is a problem, you may need to upgrade the hardware with faster drives, a less contentious operating environment, or more memory and SASFILE if the new columns are frequent yet temporary in nature.
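A minimal PROC SQL sketch of such a key-joined view (the view name data_v is just a placeholder; DATA, NEW_COL, id and new_col_name are the names from the question):

proc sql;
  /* The view stores no data: the join is resolved each time the view is read, */
  /* so the huge table is never rewritten when NEW_COL changes.                */
  create view data_v as
    select d.*, n.new_col_name
    from data as d
    left join new_col as n
      on d.id = n.id;
quit;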
@Richard's answer is perfect. If you are adding columns on a regular basis then there is a problem with your design; you need to give more details on what you are doing so that someone can suggest an alternative.
I would try a hash join; you can find code for a simple hash join easily. It is an efficient way of joining because in your case you have one large table and one small table: if the small table fits into memory, it is much better than a left join. I have done various joins this way, and query run times were considerably lower (on the order of 10x).
With the ALTER TABLE approach you are rewriting the table, and it also takes a lock on the table so that nobody else can use it.
You should perform these joins when the workload is lower, which means outside office hours; you may need to schedule the jobs at night, when more SAS resources are available.
Thanks for your answers, guys.
To add some information: I don't have any constraints about table locking, load balancing or anything else, as this is a "project tool" script that I use.
The goal, in a data-prep 'starting point data generator' step, is to recompute an already existing column, or to add a new one (less often, but still quite regularly). I just don't want to "lose" time waiting for the whole table to be rewritten when I only need to update one column for specific rows.
When I monitor the server, the computation of the data and the joining step are very fast. But when I want to update only one row, I see the whole table being rewritten. That seems a waste of resources to me.
But it seems it's a mandatory step, so I can't do much about it.
Too bad.

Auto indexing vs batch importer indexing. What's the difference?

I see that there are two ways of creating indexes on node and relationship properties. One is to create the header row with columns in the format
Property:Type:Index
on the first line of nodes.csv or rels.csv, and then uncomment the auto-indexing lines in the batch.properties file.
The other way is to specify which properties need to be indexed in the neo4j.properties file.
Yet another way is to create the indexes in Cypher. Given that there are at least these three ways of creating indexes, which one should I use? When I do a batch import of the graph with indexes specified in the header, it takes an awfully long time to insert the graph: without indexes specified it took 10 minutes, and with them it took 5 hours on a machine with 250 GB of memory.
If I go the second way, server startup takes forever and sometimes fails with an "auto upgrade failed" message after some time.
So please advise on the best way to create indexes.
Also, should I create indexes for the id, label and type columns, or is that not needed since they are created automatically?
Unless you have a good reason, go with the schema indexes - those based on a label and a property.
I've written a blog post on the different types of indexes, see blog.armbruster-it.de/2013/12/indexing-in-neo4j-an-overview/.

Deletion of rows from Informix Database

I have around 3 million rows in a table in an Informix DB.
We have to delete them before loading new data.
The table has a primary key on one of its columns.
For the deletion, I thought of using the rowid. But when I tried
select rowid from table
it responded with error -857 [Rowid does not exist].
So I am not sure how to go about the deletion. I would prefer not to use the primary key, as deletion by primary key is costly compared with rowid-based deletion.
Any suggestion on the above would be helpful.
If you get error -857, the chances are that the table is fragmented, and was created without the WITH ROWIDS option.
Which version of Informix are you using, and on which platform?
The chances are high that you have the TRUNCATE TABLE statement, which is designed to drop all the rows from a table very quickly indeed.
Failing that, you can use a straight-forward:
DELETE FROM TableName;
as long as you have sufficient logical log space available. If that won't work, then you'll need to do repeated DELETE statements based on ranges of the primary key (or any other convenient column).
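For example (a minimal sketch; TableName and pk_col are placeholders):

-- Fastest, where the TRUNCATE TABLE statement is available:
TRUNCATE TABLE TableName;

-- Otherwise, delete in primary-key ranges to keep the logical-log usage
-- of each transaction bounded:
DELETE FROM TableName WHERE pk_col >= 1       AND pk_col < 1000000;
DELETE FROM TableName WHERE pk_col >= 1000000 AND pk_col < 2000000;
-- ...and so on for the remaining ranges.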
Or you could consider dropping the table and then creating it afresh, possibly with the WITH ROWIDS clause (though I would not particularly recommend using the WITH ROWIDS clause - it becomes a physical column with an index instead of being a virtual column as it is in a non-fragmented table). One of the downsides of dropping and rebuilding a table is that the referential constraints have to be reinstated, and any views built on the table are automatically dropped when the table is dropped, so they have to be reinstated too.
I'm assuming this is IDS? How many new rows will be loaded, and how often is this process repeated? Despite having to re-establish referential constraints and views, in my opinion it is much better to drop the table, create it from scratch, load the data and then create the indexes, because if you just delete all the rows, the deleted rows still remain physically in the table with a NULL \0 flag at the end of each row; the table will therefore be even larger once the new rows are loaded, and performance will suffer. It's also a good opportunity to create fresh indexes and, if possible, to pre-sort the load data so that it is in the most desirable order (as when creating a CLUSTERED INDEX). If you're going to fragment your tables by expression or some other scheme, then rowids go out the window, but use WITH ROWIDS if you're sure the table will never be fragmented. If your table has a serial column, are there any other tables using that serial column as a foreign key?
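A rough sketch of that drop-and-rebuild sequence (all names and column definitions are placeholders, and any constraints and views on the table would have to be recreated afterwards):

DROP TABLE TableName;
CREATE TABLE TableName (
    id   INTEGER PRIMARY KEY,
    col1 VARCHAR(40)
);  -- add a FRAGMENT BY / WITH ROWIDS clause here only if your design calls for it
-- Load before indexing, as discussed in the first answer above.
LOAD FROM "new_data.unl" DELIMITER "|" INSERT INTO TableName;
CREATE INDEX ix_tablename_col1 ON TableName (col1);
UPDATE STATISTICS MEDIUM FOR TABLE TableName;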

Optimize Searching Through Rails Database

I'm building a Rails project, and I have a database with a set of tables, each holding between 500k and 1M rows, and I am constantly creating new rows.
By the nature of the project, before each creation I have to search the table for duplicates (on one field) so I don't create the same row twice. Unfortunately, as my table grows, this is taking longer and longer.
I was thinking that I could optimize the search by adding indexes to the specific string fields I search on, but I have heard that adding indexes increases creation time.
So my question is as follows:
What is the trade-off with finding and creating rows whose fields are indexed? I know that adding an index to a field will make Model.find_by_name faster, but how much slower will it make my row creation?
Indexing slows down the insertion of entries because each entry also has to be added to the index, and that needs some resources; but once added, indexes speed up your select queries, just like you said. BUT maybe the B-tree isn't the right choice for you, because a B-tree indexes the first X units of the indexed value. That's great when you have integers, but text search is tricky. When you do queries like
Model.where("name LIKE ?", "#{params[:name]}%")
it will speed up selection, but when you use queries like this:
Model.where("name LIKE ?", "%#{params[:name]}%")
it won't help you, because the whole string has to be searched, and it can be longer than a few hundred characters; then it is no improvement to have only the first 8 units of a 250-character string indexed! So that's one thing. But there's another...
You should add a UNIQUE INDEX, because the database is better at finding duplicates than Ruby is! It is optimized for sorting, and it is definitely the shorter and cleaner way to deal with this problem. Of course you should also add a validation to the relevant model, but that's no reason to let things slide on the database side.
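For example, in raw SQL (a minimal sketch assuming a hypothetical models table with a name column; in a Rails project you would normally do this in a migration):

-- Reject duplicate names at the database level, regardless of what Ruby does.
CREATE UNIQUE INDEX index_models_on_name ON models (name);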
// about index speed
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You don't have a large set of options. I don't think the insert-speed loss will be that great when you only need one index, but the select speed will increase proportionally!

Indexed View vs. Aggregate Table

It appears that indexed views and aggregate tables are used for the same purpose: to precompute aggregates in order to improve query performance. What are the benefits of using one approach over the other? Is it ease of maintenance when using views, versus having to maintain the ETL required for the aggregate table?
You seem to be using SQL Server, so here are some points to consider.
An indexed view may or may not contain aggregations.
There is a list of functions (operators, keywords) that cannot be used in indexed views, many of them aggregates.
An indexed view is schema-bound to the tables it references, so those tables cannot be altered in ways that would affect the view definition.
Also, disabling an index on the view physically deletes the data. In data warehousing, all indexes are usually dropped or disabled during loading, so rebuilding this index would have to re-aggregate the whole table after every major (daily?) load, as opposed to an aggregate table, which may be updated only for the last day or so.
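As a minimal sketch (assuming a hypothetical dbo.Sales table with non-nullable SaleDate and Amount columns), an aggregating indexed view looks like this; note the SCHEMABINDING and COUNT_BIG(*) requirements, and that it is the unique clustered index that actually materializes the data:

CREATE VIEW dbo.SalesByDay
WITH SCHEMABINDING
AS
SELECT SaleDate,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt   -- required when the view uses GROUP BY
FROM dbo.Sales
GROUP BY SaleDate;
GO

CREATE UNIQUE CLUSTERED INDEX IX_SalesByDay ON dbo.SalesByDay (SaleDate);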
