Inputting an Incremental Database into an Apache Storm Project as a Stream

I have searched a lot but couldn't find exactly what I was looking for. The question is simple and straightforward.
I have a database table which gets populated every second!
Next, I have mostly defined the analysis methods/classes in the Apache Storm Spout/Bolt classes.
All I wish to do is send those new rows, inserted every second, to the Spout class as a stream input.
How do I do this?
Thanks,

There are several ways you could accomplish this, but without knowing more about the nature of the data it's hard to give a good answer. One way would be to use another table to track which records have already been processed by Storm, based on some field in the original table. For instance, if you used a timestamp column you could track the maximum timestamp you have already processed. There are some potential race conditions to be careful of with both the reading/updating of the metadata table and the actual data table, but both can be managed with transactions and proper time synchronization.
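As a rough sketch of that idea (the table and column names here are invented for illustration), the polling query and the bookmark update could look something like this, wrapped in a single transaction:

-- Assumed schema: source table events(event_id, event_ts, payload)
-- and a one-row bookmark table storm_progress(last_ts).
BEGIN TRANSACTION;

-- Fetch everything newer than the bookmark, oldest first.
SELECT event_id, event_ts, payload
FROM events
WHERE event_ts > (SELECT last_ts FROM storm_progress)
ORDER BY event_ts;

-- After the spout has emitted the batch, advance the bookmark to the
-- largest event_ts it actually read (bound as a parameter here).
UPDATE storm_progress SET last_ts = ?;

COMMIT;

The same pattern works with an auto-incrementing id instead of a timestamp, which sidesteps clock-skew issues.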

Teradata provides queue tables. These support a "select and consume" operation, which removes rows from the table as soon as you select them. For more information: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/SQL_Reference/B035_1146_111A/ch01.032.045.html#ww798205
This approach assumes the Teradata table is used purely as a buffer and nobody else needs it.
If you need both a permanent full table (for some other application) and a stream of this data into Storm, you may want to modify your loading process to populate the permanent table as well as the queue table. Other applications can then use the full data depth in the permanent table, and Storm will consume data from the queue table with minimal space impact.
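For illustration only (table and column names are made up, and the exact syntax is worth checking against the linked manual), a queue table has a queue-insertion-timestamp (QITS) column as its first column and is drained with SELECT AND CONSUME:

CREATE TABLE storm_feed, QUEUE (
    qits     TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    event_id INTEGER,
    payload  VARCHAR(1000)
);

-- Returns the oldest row and removes it from the queue in one step;
-- if the queue is empty the request waits until a row arrives.
SELECT AND CONSUME TOP 1 * FROM storm_feed;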

Related

Rails saving and/or caching complicated query results

I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, at some arbitrary cutoff, I want to "offline" the user's query, and perform it in the background, save the result set to a new table, and use a job to email a link to the user which could use this as a cached result to skip directly to the browser rendering the graphs. These jobs/results could be saved for a long time in case people wanted to revisit the particular problem they were working on, or emailed to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
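For example, something along these lines is what I have in mind, with one shared results table keyed by a job id (all names below are just placeholders):

CREATE TABLE cached_query_results (
    job_id   INTEGER NOT NULL,   -- identifies the offline run
    item_id  INTEGER NOT NULL,   -- the underlying mined item
    bucket   VARCHAR(100),       -- whatever dimension the graphs group by
    value    NUMERIC
);

-- The background job runs the slow ad-hoc query once and materializes it.
INSERT INTO cached_query_results (job_id, item_id, bucket, value)
SELECT 12345, items.id, items.category, items.score
FROM items
WHERE items.score > 0;           -- the user's ad-hoc conditions go here

-- The emailed link then renders the graphs from the cheap cached read.
SELECT bucket, COUNT(*) AS n
FROM cached_query_results
WHERE job_id = 12345
GROUP BY bucket;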
It sounds like you are loading data through lots of joined tables, which is why it takes so long to load.
You are also performing calculation/visualization work on the data you fetch from the DB before showing it in the UI.
I can recommend a few approaches to your problem:
Minimize the number of joins/nested joins in your DB queries.
Denormalize with some extra tables/columns. For example, if you are showing a count of a user's comments, you can add a new column to the users table to store that count, and keep it updated with a scheduled job or a callback (see the sketch after this list).
Also try to minimize any calculations performed on the UI side.
You can also use lazy loading to fetch the data in chunks.
Thanks, hope this helps you decide where to start 🙂
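As a rough illustration of the denormalized count idea above (table and column names are assumptions), plain SQL for it could be:

-- Add the denormalized counter column.
ALTER TABLE users ADD COLUMN comments_count INTEGER DEFAULT 0;

-- Backfill it once; a scheduled job can re-run this same UPDATE, or an
-- application callback can adjust the count on comment create/delete.
UPDATE users
SET comments_count = (
    SELECT COUNT(*) FROM comments WHERE comments.user_id = users.id
);

In Rails specifically, the built-in counter_cache option on belongs_to does much the same thing.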

Import Delphi data to Access [duplicate]

I need to insert 800000 records into an MS Access table. I am using Delphi 2007 and the TAdoXxxx components. The table contains some integer fields, one float field and one text field with only one character. There is a primary key on one of the integer fields (which is not autoinc) and two indexes on another integer and the float field.
Inserting the data using AdoTable.AppendRecord(...) takes > 10 Minutes which is not acceptable since this is done every time the user starts using a new database with the program. I cannot prefill the table because the data comes from another database (which is not accessible through ADO).
I managed to get down to around 1 minute by writing the records to a tab separated text file and using a tAdoCommand object to execute
insert into table (...) select * from [filename.txt] in "c:\somedir" "Text;HDR=Yes"
But I don't like the overhead of this.
There must be a better way, I think.
EDIT:
Some additional information:
MS Access was chosen because it does not need any additional installation on the target machine(s) and the whole database is contained in one file which can be easily copied.
This is a single user application.
The data will be inserted only once and will not change for the lifetime of the database. Though, the table contains one additional field that is used as a flag to indicate that the corresponding record in another database has been processed by the user.
One minute is acceptable (up to 3 minutes would be too) and my solution works, but it seems too complicated to me, so I thought there should be an easier way to do this.
Once the data has been inserted, the performance of the table is quite good.
When I started planning/implementing the feature of the program working with the Access database the table was not required. It only became necessary later on, when another feature was requested by the customer. (Isn't that always the case?)
EDIT:
From all the answers I got so far, it seems that I already got the fastest method for inserting that much data into an Access table. Thanks to everybody, I appreciate your help.
Since you've said that the 800K records data won't change for the life of the database, I'd suggest linking to the text file as a table, and skip the insert altogether.
If you insist on pulling it into the database, then 800,000 records in 1 minute is over 13,000 / second. I don't think you're gonna beat that in MS Access.
If you want it to be more responsive for the user, then you might want to consider loading some minimal set of data, and setting up a background thread to load the rest while they work.
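If you do link to (or query) the text file directly, the Jet text ISAM lets you address it straight from SQL; if I remember the alternate FROM syntax correctly it looks roughly like this, possibly with a schema.ini alongside the file to declare the tab-delimited format:

SELECT *
FROM [Text;HDR=Yes;Database=c:\somedir\].[filename#txt];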
It would be quicker without the indexes. Can you add them after the import?
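If you try that, dropping the secondary indexes before the bulk insert and re-creating them afterwards is plain Jet DDL (index, table and field names below are invented):

DROP INDEX idxIntField ON MyTable;
DROP INDEX idxFloatField ON MyTable;

-- ... run the bulk INSERT here ...

CREATE INDEX idxIntField ON MyTable (IntField);
CREATE INDEX idxFloatField ON MyTable (FloatField);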
There are a number of suggestions that may be of interest in this thread: Slow MSAccess disk writing
What about skipping the text file and using ODBC or OLEDB to import directly from the source table? That would mean altering your FROM clause to use the source table name and an appropriate connect string as the IN '' part of the FROM clause.
EDIT:
Actually I see you say the original format is xBase, so it should be possible to use the xBase ISAM that is part of Jet instead of needing ODBC or OLEDB. That would look something like this:
INSERT INTO table (...)
SELECT *
FROM tablename IN 'c:\somedir\'[dBase 5.0;HDR=NO;IMEX=2;];
You might have to tweak that -- I just grabbed the connect string for a linked table pointing at a DBF file, so the parameters might be slightly different.
Your text-based solution seems the fastest, but you could make it quicker still if you started from a preallocated MS Access file close to its final size. You can do that by filling a typical user database, closing the application (so the buffers are flushed) and then manually deleting all records of that big table, without shrinking/compacting the file.
Then use that file as the starting point for the real filling; Access will request little or no additional disk space. I don't remember whether MS Access has a way to automate this, but it can help a lot...
How about an alternate arrangement...
Would it be an option to make a copy of an existing Access database file that already has this table, and then just delete all the other data in there besides this one large table? (I don't know if Access has an equivalent of something like "truncate table" in SQL Server.)
I would replace MS Access with another database; for your situation SQLite looks like the best choice. It doesn't require any installation on the client machine, and it is a very fast database and one of the best embedded database solutions.
You can use it in Delphi in two ways:
Download the database engine DLL from the SQLite website and use a free Delphi component to access it, such as Delphi SQLite components or SQLite4Delphi.
Use DISQLite3, which has the engine built in, so you don't have to distribute the DLL with your application; they have a free version ;-)
If you still need to use MS Access, try TADOCommand with an SQL INSERT statement directly instead of TADOTable; that should be faster than TADOTable.Append.
You won't be importing 800,000 records in less than a minute, as someone mentioned; that's really fast already.
You can skip the annoying translate-to-text-file step however if you use the right method (DAO recordsets) for doing the inserts. See a previous question I asked and had answered on StackOverflow: MS Access: Why is ADODB.Recordset.BatchUpdate so much slower than Application.ImportXML?
Don't use INSERT INTO even with DAO; it's slow. Don't use ADO either; it's slow. But DAO + Delphi + Recordsets + instantiating the DbEngine COM object directly (instead of via the Access.Application object) will give you lots of speed.
You're looking in the right direction in one way. Using a single statement to bulk insert will be faster than trying to iterate through the data and insert it row by row. Access, being a file-based database, will be exceedingly slow at iterative writes.
The problem is that Access is handling how it optimizes writes internally and there's not really any way to control it. You've probably reached the maximum efficiency of an INSERT statement. For additional speed, you should probably evaluate if there's any way around writing 800,000 records to the database every time you start the application.
Get SQL Server Express (free) and connect to it from Access as an external table. SQL Server Express is much faster than MS Access.
I would prefill the database, and hand them the file itself, rather than filling an existing (but empty) database.
If the data you have to fill changes, then keep an ODBC access database (MDB file) synchronized on the server using a bit of code to see changes in the main database and copy them to the access database.
When the user requests a new database zip up the MDB, transfer it to them, and open it.
Alternately, you may be able to find code that opens and inserts data into databases directly.
Alternately, alternately, you may be able to find another format (other than CSV) which Access can import faster.
-Adam
Also check how long it takes to copy the file. That will be the lower bound on how fast you can write data. In databases like SQL Server, it usually takes a bulk-load utility to get close to that speed. As far as I know, MS never created a tool to write directly to MS Access tables the way bcp does. Specialized ETL tools will also optimize some of the steps surrounding the insert; for example, SSIS does transformations in memory, and DTS likewise has some optimizations.
Perhaps you could open an ADO Recordset on the table with lock mode adLockBatchOptimistic and CursorLocation adUseClient, write all the data to the recordset, then do a batch update (rs.UpdateBatch).
If it's coming from dbase, can you just copy the data and index files and attach directly without loading? Should be pretty efficient (from the people who bring you FoxPro.) I imagine it would use the existing indexes too.
At the least, it should be a pretty efficient single-command Import.
How much do the 800,000 records change from one creation to the next? Would it be possible to pre-populate the records and then just update the ones that have changed in the external database when creating the new database?
This may allow you to create the new database file quicker.
How fast is your disk turning? If it's 7200RPM, then 800,000 rows in 3 minutes is still 37 rows per disk revolution. I don't think you're going to do much better than that.
Meanwhile, if the goal is to streamline the process, how about a table link?
You say you can't access the source database via ADO. Can you set up a table link in MS Access to a table or view in the source database? Then a simple append query from the table link would copy the data over from the source database to the target database for you. I'm not sure, but I think this would be pretty fast.
If you can't set up a table link until runtime, maybe you could build the table link programmatically via ADO, then build the append query programmatically, then invoke the append query.
Hi,
The best way is a bulk insert from a text file, as others have said:
write your records to a text file, then bulk insert the text file into the table.
That should take less than 3 seconds.

How much data can a column of an Mnesia table store?

How much data can a column of an Mnesia table store? Is there any limit on it, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table was a painfully naive decision to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I were using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database, which means it is not designed to store large amounts of data. If you are asking yourself this question, you should probably look at another ejabberd backend.

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce impact on the database. The fewer round trips and less extraneous data stored the better
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this, and if so which approach would you suggest, and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database, it tracks changes on every record on the live db. Essentially every insert, update and delete actually updates 2 records, one in the live db and one in the audit db.
Since we have this data in real time in the audit db, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It duplicates data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am, ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records; once again, all from the secondary database.
I agree that you should create a separate database for your statistics, it will reduce the impact on your database.
You can go with your idea of having "staging" tables and "aggregate" tables; that way, when you want near-real-time data you go to the staging table, and when you want historical data you go to the aggregates.
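As a sketch of that split (all table and column names are assumptions, in a T-SQL flavour), the periodic job can roll staged hits up into a per-product-per-week table and then clear what it has processed:

-- Raw hits land in a narrow staging table as they happen:
-- product_views_staging(product_id, viewed_at)

DECLARE @cutoff datetime = DATEADD(minute, -5, GETUTCDATE());

-- Roll everything older than the cutoff into the weekly aggregate
-- (in practice this would be an upsert/MERGE; a plain INSERT keeps the sketch short).
INSERT INTO product_views_weekly (product_id, view_year, view_week, view_count)
SELECT product_id, DATEPART(year, viewed_at), DATEPART(week, viewed_at), COUNT(*)
FROM product_views_staging
WHERE viewed_at < @cutoff
GROUP BY product_id, DATEPART(year, viewed_at), DATEPART(week, viewed_at);

-- Then clear only the rows that were just aggregated.
DELETE FROM product_views_staging WHERE viewed_at < @cutoff;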
Finally, I would recommend using an asynchronous call to save your statistics; that way your page response times will not be impacted.
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.
