looping through a bunch of DB changes in a Rails project - ruby-on-rails

I have an import function that takes a batch of data in XML format and writes it into my database. The problem is that, depending on the amount of data, the process can take quite a long time. I can see in the server log that a huge number of SQL statements are being executed to save all the data.
How can I improve performance for that case? Is it possible to do all the operations only in memory and save them back with only one statement?
Update:
In response to HLGEM's answer:
I read up on the bulk insert approach, but it does not seem very practical to me because I have a lot of relations between the data. To put 100 records into one table, I also have to set their relations to other tables.
Is there a way to solve that? Can I do encapsulated inserts?

When working with a database, NEVER work row by row; that is what is causing your problem. A bulk insert of some type will be much faster than a row-by-row process. Without knowing which database you are using, it's hard to be more specific about exactly how to do this. Don't think in terms of row-by-row processing; think about how to affect a whole set of data.
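On the follow-up about relations: one common way to keep a bulk insert set-based even with foreign keys is to insert the parent rows in one batch, read the generated keys back once, and then insert the child rows in a second batch. The sketch below uses Python's built-in sqlite3 module as a stand-in (not Rails); the `authors`/`books` tables and the sample records are made up for illustration, and the same shape should carry over to whatever bulk-insert facility your own stack offers.

```python
import sqlite3

# Hypothetical parsed import data: (author name, book title) pairs.
records = [
    ("Ann", "Alpha"),
    ("Ann", "Beta"),
    ("Bob", "Gamma"),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES authors(id),
        title TEXT NOT NULL
    );
    """
)

with conn:  # one transaction for the whole import
    # 1. Insert all parent rows in one batch.
    names = sorted({name for name, _ in records})
    conn.executemany("INSERT INTO authors (name) VALUES (?)", [(n,) for n in names])

    # 2. Read the generated keys back once, instead of once per child row.
    author_ids = dict(conn.execute("SELECT name, id FROM authors"))

    # 3. Insert all child rows in one batch, using the in-memory key map.
    conn.executemany(
        "INSERT INTO books (author_id, title) VALUES (?, ?)",
        [(author_ids[name], title) for name, title in records],
    )

book_count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

The number of statements now grows with the number of tables rather than the number of rows, which is usually where the time goes.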

Well, it depends :)
If the processing (before import) of the XML data is expensive, you can run the processing once, import to the DB, and then export plain SQL from the database. Then future imports can use this SQL for a bulk import, avoiding the need for XML processing.
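A minimal sketch of that "process once, then export plain SQL" idea, using Python's sqlite3 module purely for illustration. The `items` table and its rows are invented, and your own database will have its own dump tool; `iterdump()` is the sqlite3 way of getting the SQL text.

```python
import sqlite3

# Stand-in for the database that the slow XML import has already filled.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
with source:
    source.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",), ("c",)])

# Export the imported data as plain SQL text (this would be saved to a file).
sql_dump = "\n".join(source.iterdump())

# A later import replays the SQL directly; no XML parsing is involved.
target = sqlite3.connect(":memory:")
target.executescript(sql_dump)
restored = target.execute("SELECT name FROM items ORDER BY id").fetchall()
```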
If the data import itself is expensive, then you probably want to get specific to your database in order to figure out how to speed it up. If you are using MySQL you might check out: https://serverfault.com/questions/37769/fast-bulk-import-of-a-large-dataset-into-mysql

Related

Import delphi data to access [duplicate]

I need to insert 800000 records into an MS Access table. I am using Delphi 2007 and the TAdoXxxx components. The table contains some integer fields, one float field and one text field with only one character. There is a primary key on one of the integer fields (which is not autoinc) and two indexes on another integer and the float field.
Inserting the data using AdoTable.AppendRecord(...) takes > 10 Minutes which is not acceptable since this is done every time the user starts using a new database with the program. I cannot prefill the table because the data comes from another database (which is not accessible through ADO).
I managed to get down to around 1 minute by writing the records to a tab separated text file and using a tAdoCommand object to execute
insert into table (...) select * from [filename.txt] in "c:\somedir" "Text;HDR=Yes"
But I don't like the overhead of this.
There must be a better way, I think.
EDIT:
Some additional information:
MS Access was chosen because it does not need any additional installation on the target machine(s) and the whole database is contained in one file which can be easily copied.
This is a single user application.
The data will be inserted only once and will not change for the lifetime of the database. Though, the table contains one additional field that is used as a flag to indicate that the corresponding record in another database has been processed by the user.
One minute is acceptable (up to 3 minutes would be too) and my solution works, but it seems too complicated to me, so I thought there should be an easier way to do this.
Once the data has been inserted, the performance of the table is quite good.
When I started planning/implementing the feature of the program working with the Access database the table was not required. It only became necessary later on, when another feature was requested by the customer. (Isn't that always the case?)
EDIT:
From all the answers I got so far, it seems that I already got the fastest method for inserting that much data into an Access table. Thanks to everybody, I appreciate your help.
Since you've said that the 800K records data won't change for the life of the database, I'd suggest linking to the text file as a table, and skip the insert altogether.
If you insist on pulling it into the database, then 800,000 records in 1 minute is over 13,000 / second. I don't think you're gonna beat that in MS Access.
If you want it to be more responsive for the user, then you might want to consider loading some minimal set of data, and setting up a background thread to load the rest while they work.
It would be quicker without the indexes. Can you add them after the import?
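To illustrate the ordering only (load first, index afterwards), here is a small sketch in Python with sqlite3 rather than Access. The `readings` table and its columns are hypothetical, and whether this helps in Access would need to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Create the table with its primary key only; secondary indexes come later.
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor INTEGER, value REAL)")

rows = [(i, i % 10, i * 0.5) for i in range(1, 10001)]
with conn:  # single transaction around the bulk load
    conn.executemany("INSERT INTO readings (id, sensor, value) VALUES (?, ?, ?)", rows)

# Build the secondary indexes once, after all rows are in place.
with conn:
    conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
    conn.execute("CREATE INDEX idx_readings_value ON readings (value)")

index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
```

The idea is that the engine sorts and writes each index once, instead of updating it for every inserted row.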
There are a number of suggestions that may be of interest in this thread: Slow MSAccess disk writing
What about skipping the text file and using ODBC or OLEDB to import directly from the source table? That would mean altering your FROM clause to use the source table name and an appropriate connect string as the IN '' part of the FROM clause.
EDIT:
Actually I see you say the original format is xBase, so it should be possible to use the xBase ISAM that is part of Jet instead of needing ODBC or OLEDB. That would look something like this:
INSERT INTO table (...)
SELECT *
FROM tablename IN 'c:\somedir\'[dBase 5.0;HDR=NO;IMEX=2;];
You might have to tweak that -- I just grabbed the connect string for a linked table pointing at a DBF file, so the parameters might be slightly different.
Your text-based solution seems the fastest, but you can make it quicker if you start from a preallocated MS Access file whose size is close to the final one. You can do that by filling a typical user database, closing the application (so the buffers are flushed) and manually deleting all records of that big table, but without shrinking/compacting it.
Then use that file to start the real filling: Access will request no (or very little) additional disk space. I don't remember whether MS Access has a way to automate this, but it can help a lot.
How about an alternate arrangement...
Would it be an option to make a copy of an existing Access database file that has this table you need and then just delete all the other data in there besides this one large table (don't know if Access has an equivalent to something like "truncate table" in SQL server)?
I would replace MS Access with another database. For your situation I think SQLite is the best choice: it doesn't require any installation on the client machine, it's a very fast database, and it's one of the best embedded database solutions.
You can use it in Delphi in two ways:
You can download the database engine DLL from the SQLite website and use a free Delphi component to access it, such as Delphi SQLite components or SQLite4Delphi.
Use DISQLite3, which has the engine built in, so you don't have to distribute the DLL with your application. They have a free version ;-)
If you still need to use MS Access, try using TAdoCommand with an SQL INSERT statement directly instead of TADOTable; that should be faster than using TADOTable.Append.
You won't be importing 800,000 records in less than a minute, as someone mentioned; that's really fast already.
You can skip the annoying translate-to-text-file step however if you use the right method (DAO recordsets) for doing the inserts. See a previous question I asked and had answered on StackOverflow: MS Access: Why is ADODB.Recordset.BatchUpdate so much slower than Application.ImportXML?
Don't use INSERT INTO even with DAO; it's slow. Don't use ADO either; it's slow. But DAO + Delphi + Recordsets + instantiating the DbEngine COM object directly (instead of via the Access.Application object) will give you lots of speed.
You're looking in the right direction in one way. Using a single statement to bulk insert will be faster than trying to iterate through the data and insert it row by row. Access, being a file-based database, will be exceedingly slow in iterative writes.
The problem is that Access is handling how it optimizes writes internally and there's not really any way to control it. You've probably reached the maximum efficiency of an INSERT statement. For additional speed, you should probably evaluate if there's any way around writing 800,000 records to the database every time you start the application.
Get SQL Server Express (free) and connect to it from Access as an external table. SQL Server Express is much faster than MS Access.
I would prefill the database, and hand them the file itself, rather than filling an existing (but empty) database.
If the data you have to fill changes, then keep an ODBC Access database (MDB file) synchronized on the server, using a bit of code to detect changes in the main database and copy them to the Access database.
When the user requests a new database, zip up the MDB, transfer it to them, and open it.
Alternately, you may be able to find code that opens and inserts data into databases directly.
Alternately, alternately, you may be able to find another format (other than CSV) that Access can import faster.
-Adam
Also check to see how long it takes to copy the file. That will be the lower bound of how fast you can write data. In db's like SQL, it usually takes a bulk load utility to get close to that speed. As far as I know, MS never created a tool to write directly to MS Access tables the way bcp does. Specialized ETL tools will also optimize some of the steps surrounding the insert, such as the way SSIS does transformations in memory, DTS likewise has some optimizations.
Perhaps you could open an ADO Recordset to the table with lock mode adLockBatchOptimistic and CursorLocation adUseClient, write all the data to the recordset, then do a batch update (rs.UpdateBatch).
If it's coming from dbase, can you just copy the data and index files and attach directly without loading? Should be pretty efficient (from the people who bring you FoxPro.) I imagine it would use the existing indexes too.
At the least, it should be a pretty efficient single-command Import.
How much do the 800,000 records change from one creation to the next? Would it be possible to prepopulate the records and then just update the ones that have changed in the external database when creating the new database?
This may allow you to create the new database file quicker.
How fast is your disk turning? If it's 7200RPM, then 800,000 rows in 3 minutes is still 37 rows per disk revolution. I don't think you're going to do much better than that.
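The throughput figures quoted in these answers are easy to check. The few lines below only redo the arithmetic from the numbers already given (800,000 rows, 1 minute, 3 minutes, 7200 RPM); nothing here is measured.

```python
rows = 800_000

# Rate achieved by the text-file import: 800,000 rows in 1 minute.
rows_per_second = rows / 60

# A 7200 RPM disk makes 120 revolutions per second.
revolutions_per_second = 7200 / 60
# Revolutions available in the 3-minute budget.
revolutions = revolutions_per_second * 3 * 60
rows_per_revolution = rows / revolutions

print(f"{rows_per_second:.0f} rows/s, {rows_per_revolution:.1f} rows/revolution")
```

So "over 13,000 / second" and "37 rows per disk revolution" both hold up.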
Meanwhile, if the goal is to streamline the process, how about a table link?
You say you can't access the source database via ADO. Can you set up a table link in MS Access to a table or view in the source database? Then a simple append query from the table link would copy the data over from the source database to the target database for you. I'm not sure, but I think this would be pretty fast.
If you can't set up a table link until runtime, maybe you could build the table link programmatically via ADO, then build the append query programmatically, then invoke the append query.
Hi,
The best way is a bulk insert from a text file, as others have said.
You should write your records to a text file and then bulk insert the text file into the table.
That should take less than 3 seconds.

Read huge file from disk

I am developing an iOS app and I have a text file with one city name per line. The file contains about 3 million cities. To be able to perform searches and operations on it I am using a B-Tree, but the tree takes a long time to build. Making the user wait for this every time they use the app is not good for the user experience. All this is without using Core Data!
Any tips on how I can speed up this process?
Thank you
My recommendation is that you use SQLite with an index on the fields you want to query (or some other type of permanent, indexed storage) so that the user only has to wait the first time the app is opened, and then you can query the database, which will be much faster. I am also fairly certain that you can install a SQLite database from a pre-generated file, so you might be able to generate this index offline, bundle it with your application, and then the user has no wait time at all. I'm not 100% sure about this option though, so you should investigate.
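As a rough sketch of the indexed-storage idea, here is what the table, index and a prefix lookup could look like. It is written in Python with sqlite3 for brevity rather than in iOS code; the `cities` table, the sample names and the `cities_with_prefix` helper are all made up for illustration.

```python
import sqlite3

# Hypothetical sample; the real input would be the 3-million-line text file.
city_lines = ["Berlin", "Bern", "Bergen", "Paris", "Prague"]

conn = sqlite3.connect(":memory:")  # a file path would make this persistent
conn.execute("CREATE TABLE cities (name TEXT NOT NULL)")
with conn:
    conn.executemany("INSERT INTO cities (name) VALUES (?)", [(c,) for c in city_lines])
    conn.execute("CREATE INDEX idx_cities_name ON cities (name)")

def cities_with_prefix(prefix):
    """Return city names starting with prefix, written as a range the index can serve."""
    upper = prefix + "\uffff"  # upper bound of the range; assumes names stay in the BMP
    cur = conn.execute(
        "SELECT name FROM cities WHERE name >= ? AND name < ? ORDER BY name",
        (prefix, upper),
    )
    return [row[0] for row in cur]

matches = cities_with_prefix("Ber")
```

With a database file built this way ahead of time, the app would only open the file and query it, with no tree to rebuild at launch.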
Either way, there is no magic solution here. If the data you want is on line 2 million of the file, you will have to read 2 million lines of text in order to get to that line. I would recommend finding a way to make the UX of your app acceptable so that the user feels better about waiting for the data to load. If you display some sort of pretty screen with a progress bar while the data indexes, the user will be more forgiving of this wait.
The B-Tree will obviously take some time to be created. If you don't want to use a database but stick with your own B-Tree implementation you could dump the tree data to a separate file and load that when the program starts instead of recreating it every time. However, you will have to update the cached tree every time the source data is modified.
In Python the pickle module can help you, but most programming languages will have a serialisation module.
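A minimal sketch of that caching approach with pickle. A sorted list stands in for the real B-Tree, and the file names and the `load_or_build` helper are hypothetical; the cache is rebuilt whenever the source file is newer than the cached copy.

```python
import os
import pickle
import tempfile

def build_index(lines):
    """Stand-in for the expensive B-tree build: here just a sorted list."""
    return sorted(line.strip() for line in lines if line.strip())

def load_or_build(source_path, cache_path):
    """Load the cached index if it is at least as new as the source, else rebuild it."""
    if (os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    with open(source_path, encoding="utf-8") as f:
        index = build_index(f)
    with open(cache_path, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    return index, False

# Demo with a tiny temporary source file.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "cities.txt")
cache_path = os.path.join(workdir, "cities.idx")
with open(source_path, "w", encoding="utf-8") as f:
    f.write("Paris\nBerlin\nOslo\n")

first, first_from_cache = load_or_build(source_path, cache_path)
second, second_from_cache = load_or_build(source_path, cache_path)
```

The first call builds and saves the index; the second call just loads it. Only unpickle files your own program wrote, since pickle can execute arbitrary code on load.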
Does this file come with the application? If it does, then you could process the file into an SQLite database before you ship the app, and ship the app containing the database. You can then use SELECT statements to search the data using indexed fields (like cityname).
If the file changes, then still ship with a database and just send amendments as a file, which would edit the database to bring it back up to date. You may need to add a command to each line of the file, such as REPLACE, NEW, or DELETE.
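One possible shape for such an amendments file, sketched in Python with sqlite3. The tab-separated `COMMAND<TAB>id<TAB>name` line format, the `cities` table and the `apply_amendments` helper are assumptions made for this example, not an established format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the database shipped with the app
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
with conn:
    conn.executemany("INSERT INTO cities (id, name) VALUES (?, ?)",
                     [(1, "Berlin"), (2, "Pariss"), (3, "Atlantis")])

# Hypothetical amendment format: COMMAND<TAB>id<TAB>name (name empty for DELETE).
amendments = "NEW\t4\tOslo\nREPLACE\t2\tParis\nDELETE\t3\t\n"

def apply_amendments(conn, text):
    """Apply every amendment line inside a single transaction."""
    with conn:
        for line in text.splitlines():
            command, city_id, name = line.split("\t")
            if command == "NEW":
                conn.execute("INSERT INTO cities (id, name) VALUES (?, ?)",
                             (int(city_id), name))
            elif command == "REPLACE":
                conn.execute("UPDATE cities SET name = ? WHERE id = ?",
                             (name, int(city_id)))
            elif command == "DELETE":
                conn.execute("DELETE FROM cities WHERE id = ?", (int(city_id),))
            else:
                raise ValueError(f"unknown command: {command!r}")

apply_amendments(conn, amendments)
final_rows = conn.execute("SELECT id, name FROM cities ORDER BY id").fetchall()
```

Running all amendments in one transaction means a malformed file leaves the shipped database unchanged instead of half-updated.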

Inputting Incremental Database into Apache Storm Project

I searched a lot but couldn't find what I was specifically looking for. The question is simple and straightforward.
I have a database table which gets populated every second!
Next, I have mostly defined the analysis methods/classes in the Apache Storm spout/bolt classes.
All I wish to do is send the new rows being inserted every second to the spout class as a stream input.
How do I do this?
Thanks,
There are several ways you could accomplish this, but without knowing more about the nature of the data it's hard to give a good answer. One way would be to use another table to track which records have already been processed by storm based on some field in the original table. For instance, if you used a timestamp column you could track the maximum timestamp you have already processed. There are some potential race conditions you have to be careful of with both the reading/updating of the metadata table as well as the actual data table, but both of those can be managed with transactions and proper time synchronization.
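Here is a rough sketch of the tracking-table idea. It is in Python with sqlite3 rather than Storm's Java API, and it tracks an auto-incrementing id instead of a timestamp to sidestep clock issues; the `events` and `checkpoint` tables and the `fetch_new_rows` helper are invented for illustration. A spout could call something like this on each poll and emit the returned rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT);
    -- Metadata table holding the highest id already handed to the consumer.
    CREATE TABLE checkpoint (name TEXT PRIMARY KEY, last_id INTEGER NOT NULL);
    INSERT INTO checkpoint (name, last_id) VALUES ('storm', 0);
    """
)

def fetch_new_rows(conn, batch_size=100):
    """Return rows added since the last call and advance the checkpoint."""
    with conn:  # commits the checkpoint update, or rolls it back on error
        (last_id,) = conn.execute(
            "SELECT last_id FROM checkpoint WHERE name = 'storm'").fetchone()
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if rows:
            conn.execute("UPDATE checkpoint SET last_id = ? WHERE name = 'storm'",
                         (rows[-1][0],))
    return rows

# Simulate the table being populated between polls.
with conn:
    conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])
first_batch = fetch_new_rows(conn)
with conn:
    conn.execute("INSERT INTO events (payload) VALUES (?)", ("c",))
second_batch = fetch_new_rows(conn)
third_batch = fetch_new_rows(conn)
```

Each poll only sees rows past the stored high-water mark, so nothing is emitted twice as long as the checkpoint update succeeds.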
Teradata provides queue tables. These tables support a "select and consume" operation, which removes rows from the table as soon as you select them. For more information: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1146_111A/ch01.032.045.html#ww798205
This approach assumes that the table in Teradata is used as a buffer and nobody else needs it.
If you need both a permanent full table (for some other application) and a stream of this data to Storm, you may want to modify your loading process to populate a permanent table as well as a queue table. Other applications can then use the full history in the permanent table, while Storm consumes data from the queue table with minimal space impact.
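To make the permanent-table-plus-queue-table arrangement concrete, here is an emulation in Python with sqlite3. This is not Teradata's queue table feature or syntax; it only imitates the "select and consume" behaviour with a read followed by a delete, and the table names and helpers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE measurements (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE measurements_queue (id INTEGER PRIMARY KEY, payload TEXT);
    """
)

def load(conn, rows):
    """Loader writes each row to both the permanent table and the queue table."""
    with conn:
        conn.executemany("INSERT INTO measurements (id, payload) VALUES (?, ?)", rows)
        conn.executemany("INSERT INTO measurements_queue (id, payload) VALUES (?, ?)", rows)

def select_and_consume(conn, limit=100):
    """Read up to `limit` queued rows, then delete them; the deletes commit together."""
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM measurements_queue ORDER BY id LIMIT ?",
            (limit,)).fetchall()
        conn.executemany("DELETE FROM measurements_queue WHERE id = ?",
                         [(row[0],) for row in rows])
    return rows

load(conn, [(1, "a"), (2, "b"), (3, "c")])
consumed = select_and_consume(conn, limit=2)
queue_left = conn.execute("SELECT COUNT(*) FROM measurements_queue").fetchone()[0]
permanent = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
```

The queue table stays small because consumed rows disappear, while the permanent table keeps everything for other applications.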

Core data migration without reading all of the data into memory?

We have a core data DB (sqlite store) which, for some users, is about 100-150 MB. I wouldn't think that would be too big for a storage system to deal with (even on a mobile device), but we've found that with that size core data DB, ANY lightweight migration takes ~10+ seconds to complete. Even something as simple as adding a completely new independent entity (not related to any other entity). With raw sqlite this would be a create table statement. So, my question is whether anyone else has seen this and, if so, have they found a workaround to make such simple migrations faster? Specifically, I'm looking for a way to handle adding a new independent entity to an existing core data DB that's ~100-150 MB and have it be quick (i.e., under 5 seconds).
I believe that core data migrations ALWAYS have to read all of the data from the source and write it all to a destination for a migration (which is terrible BTW), but I'm hoping someone can prove me wrong. :) I couldn't find any way to do it with a custom migration either.
I've considered munging the DB with sql directly to basically make the model look like what CoreData would expect (I've done stuff like this to manually "downgrade" a core data DB for debugging purposes), but we really want to avoid doing something like that in production.
UPDATE:
For reference, this is the current approach we are taking. This is not a generic solution, but will work for our use case. Unless I get a better answer I'll add this as an answer at some point in the future and accept it.
We're going to deal with this by essentially making the DB smaller. There are 2 out of 15 entities that take up the bulk of the space in the DB (~95%). We're going to create completely separate data models each with one of those entities. This is done without changing the main model at all (hence, no core data migration). We'll then make a task that runs with background priority in GCD that, if any of those entities are found in the main DB, they're moved to the appropriate separate DB and removed from the main DB. This is done in batches with some sleeps between batches so it's less resource intensive and doesn't affect normal app operation. We'll modify the code that accesses those entities to try to get them from the new DB and fall back to the main DB if they're not in there.
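The batching and fallback logic described above can be sketched roughly as follows. This is Python against plain SQLite databases, not Core Data or GCD, so treat it only as an outline of the control flow; the `attachments` table and both helper functions are made up for the example.

```python
import sqlite3
import time

main_db = sqlite3.connect(":memory:")
big_db = sqlite3.connect(":memory:")
main_db.execute("CREATE TABLE attachments (id INTEGER PRIMARY KEY, data BLOB)")
big_db.execute("CREATE TABLE attachments (id INTEGER PRIMARY KEY, data BLOB)")
with main_db:
    main_db.executemany("INSERT INTO attachments (id, data) VALUES (?, ?)",
                        [(i, b"x" * 16) for i in range(1, 26)])

def move_in_batches(src, dst, batch_size=10, pause=0.01):
    """Move rows from src to dst a batch at a time, pausing between batches."""
    moved = 0
    while True:
        batch = src.execute(
            "SELECT id, data FROM attachments ORDER BY id LIMIT ?",
            (batch_size,)).fetchall()
        if not batch:
            return moved
        with dst:  # copy first, so an interruption never loses rows
            dst.executemany(
                "INSERT OR REPLACE INTO attachments (id, data) VALUES (?, ?)", batch)
        with src:  # then remove the copied rows from the main database
            src.executemany("DELETE FROM attachments WHERE id = ?",
                            [(row[0],) for row in batch])
        moved += len(batch)
        time.sleep(pause)  # back off so normal app work is not starved

def get_attachment(row_id):
    """Look in the new database first, falling back to the main one."""
    for db in (big_db, main_db):
        row = db.execute("SELECT data FROM attachments WHERE id = ?",
                         (row_id,)).fetchone()
        if row:
            return row[0]
    return None

found_before = get_attachment(7)
moved = move_in_batches(main_db, big_db)
found_after = get_attachment(7)
main_left = main_db.execute("SELECT COUNT(*) FROM attachments").fetchone()[0]
```

Copying before deleting means an interrupted run can leave a row in both databases but never in neither, and the lookup order makes that harmless.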
In a future update after we find that all, or at least most, of our users have updated their data in the new DBs we'll drop those entities from the main DB.
This leaves us with a small main DB that can have migrations applied quickly and two large DBs that have migrations done more slowly. Those large DBs, in our case, should change less often (maybe never?) and even if they do change there are limited places in the app that require them so we can work around it in the UI (e.g., report some feature as unavailable until we move data).
A 10-20 second delay for an update to a huge dataset seems perfectly reasonable to me. Just don't do it on the main thread.
This means you'll have to modify the boilerplate Core Data stack setup that you get in the usual Xcode templates. Instead of always setting up the stack on the main thread at launch time, check to see if migration is needed. If so, put up appropriate UI, do the migration in a background thread, and be ready to invoke beginBackgroundTaskWithExpirationHandler: if needed.
