I have about 20 million rows of data. The shell appears to hang indefinitely when I fire CREATE relationship queries. Is there a better approach for handling data this large?
If you're importing from CSV, you can add USING PERIODIC COMMIT to your import. This is supposed to allow imports on the order of millions of records per second, and it should prevent memory from being eaten up, because it batches the import for you.
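For example, here is a rough sketch of what such an import could look like when driven from Python with the official neo4j driver; the file name, labels, relationship type, and credentials are assumptions about your data, not something taken from your question:

```python
# Minimal sketch: batch a CSV import with USING PERIODIC COMMIT so Neo4j
# commits every 10,000 rows instead of building one giant transaction.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # assumed credentials

IMPORT_QUERY = """
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person  {id: row.personId})
MERGE (c:Company {id: row.companyId})
MERGE (p)-[:WORKS_AT]->(c)
"""

with driver.session() as session:
    # Run as an auto-commit query; USING PERIODIC COMMIT cannot be used
    # inside an explicit transaction.
    session.run(IMPORT_QUERY).consume()

driver.close()
```

The important part is the USING PERIODIC COMMIT 10000 hint: it flushes the import every 10,000 rows, so the 20 million rows never sit in a single transaction (and therefore in memory) all at once.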
I have an application that, at its core, is a sort of data warehouse and report generator. People use it to "mine" through a large amount of data with ad-hoc queries, produce a report page with a bunch of distribution graphs, and click through those graphs to look at specific result sets of the underlying items being "mined." The problem is that the database is now many hundreds of millions of rows of data, and even with indexing, some queries can take longer than a browser is willing to wait for a response.
Ideally, at some arbitrary cutoff, I want to "offline" the user's query: perform it in the background, save the result set to a new table, and have a job email the user a link that uses this cached result to skip directly to rendering the graphs in the browser. These jobs/results could be kept for a long time, in case people wanted to revisit the particular problem they were working on or email them to coworkers. I would be tempted to just create a PDF of the result, but it's the interactive clicking of the graphs that I'm trying to preserve here.
None of the standard Rails caching techniques really captures this idea, so maybe I just have to do this all by hand, but I wanted to check to see if I wasn't missing something that I could start with. I could create a keyed model result in the in-memory cache, but I want these results to be preserved on the order of months, and I deploy at least once a week.
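For example, what I have in mind for the keyed result is roughly this kind of thing (a sketch with hypothetical names, just to show deriving a stable key from the query parameters):

```python
# Hypothetical sketch: derive a stable key from the ad-hoc query's parameters,
# so a background job can store the result set under that key and a later
# request (or an emailed link) can find it again months from now.
import hashlib
import json

def result_key(params: dict) -> str:
    canonical = json.dumps(params, sort_keys=True)  # same filters -> same key
    return "mined-result-" + hashlib.sha256(canonical.encode()).hexdigest()

print(result_key({"region": "EU", "status": "open"}))
```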
Considering that you are loading data from lots of joined tables, that's why it's taking time to load.
You are also performing calculation/visualization tasks on the data you fetch from the DB before showing it in the UI.
I'd like to recommend some approaches to your problem:
Minimize the number of joins/nested joins in your DB queries.
Add some denormalized tables/columns. For example, if you are showing the count of a user's comments, you can add a new column to the user table to store that count in the user table itself. You can add a scheduled job to update the data, or a callback to update the count.
Also try to minimize the calculations (if any) performed on the UI side.
You can also use lazy loading to fetch the data in chunks (see the sketch after this list).
Thanks, I hope this helps you decide where to start 🙂
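To make the chunked/lazy-loading idea concrete, here is a minimal sketch of fetching in fixed-size pages. It's written in Python with an in-memory SQLite table purely for illustration (the table, columns, and page size are made up); in Rails itself, find_each / find_in_batches give you the same batched access pattern out of the box.

```python
# Fetch rows page by page instead of loading the whole result set at once.
import sqlite3

def fetch_in_chunks(conn, chunk_size=1000):
    """Yield rows from the (hypothetical) reports table one page at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, value FROM reports ORDER BY id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size

# Tiny in-memory table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO reports (value) VALUES (?)",
                 [(i * 0.5,) for i in range(2500)])

for chunk in fetch_in_chunks(conn):
    # Process and discard each chunk instead of holding everything in memory.
    print(len(chunk), "rows in this chunk")
```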
I would like to append data to a published dask dataset from a queue (like Redis). Then other Python programs would be able to fetch the latest data (e.g. once per second/minute) and do some further operations.
Would that be possible?
Which append interface should be used? Should I load the data into a pd.DataFrame first, or is it better to use some text importer?
What are the expected append speeds? Is it possible to append, let's say, 1k/10k rows in a second?
Are there other good suggestions for exchanging huge, rapidly updating datasets within a dask cluster?
Thanks for any tips and advice.
You have a few options here.
You might take a look at the streamz project
You might take a look at Dask's coordination primitives
What are the expected append speeds? Is it possible to append, let's say, 1k/10k rows in a second?
Dask is just tracking remote data. The speed of your application has a lot more to do with how you choose to represent that data (like python lists vs pandas dataframes) than with Dask. Dask can handle thousands of tasks a second. Each of those tasks could have a single row, or millions of rows. It's up to how you build it.
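As a very rough sketch (not a benchmark), one way to wire a Redis list to a published dask dataframe could look like the following. The queue name, dataset name, scheduler address, and JSON row format are all assumptions for the sake of illustration:

```python
# A small consumer drains JSON rows from a Redis list, turns each batch into
# a pandas DataFrame, and republishes the growing dask dataframe under a fixed
# name so other clients on the same scheduler can fetch it.
import json
import time

import pandas as pd
import dask.dataframe as dd
import redis
from dask.distributed import Client

client = Client("tcp://scheduler:8786")   # assumed scheduler address
r = redis.Redis()                         # assumed local Redis
DATASET = "latest"                        # name other programs will look up

def drain(queue="rows", max_rows=10_000):
    """Pop up to max_rows JSON-encoded rows off the Redis list."""
    rows = []
    while len(rows) < max_rows:
        item = r.lpop(queue)
        if item is None:
            break
        rows.append(json.loads(item))
    return pd.DataFrame(rows)

while True:
    batch = drain()
    if batch.empty:
        time.sleep(1)                     # nothing queued; avoid a busy loop
        continue
    new_part = dd.from_pandas(batch, npartitions=1)
    if DATASET in client.list_datasets():
        current = client.get_dataset(DATASET)
        client.unpublish_dataset(DATASET)
        updated = dd.concat([current, new_part])
    else:
        updated = new_part
    updated = client.persist(updated)     # keep the data on the workers
    client.publish_dataset(updated, name=DATASET)
```

Other processes can then call client.get_dataset("latest") against the same scheduler to pick up the current dataframe. Republishing the whole collection on every batch is the simple-but-blunt approach; for a genuinely streaming workflow, the streamz project mentioned above is usually a better fit.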
We have a Grails application and we are importing users through a CSV feed. The User domain object has a huge list of fields (85 total), and we are writing the import functionality using the Hibernate API ... to insert into the User table.
Right now we are getting out-of-memory errors if we load around 2,000 users, since every User domain object is kept in memory.
Is there any way to optimize the design to remove the out-of-memory errors, for example by not loading all columns of the User domain object and only loading each column when required? How can we reduce the memory consumption?
Take a look at Ted Naleid's great article on batch processing:
http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql/
The last part of the article (Grails Performance Tweaks) REALLY helped me solve a similar problem. Adding the "cleanUpGorm" function and running it every 100 records has let us process tens of thousands of records with no memory problems.
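The underlying idea is simply "insert in fixed-size batches and flush/clear after each batch so nothing accumulates in memory". Stripped of the Grails/GORM specifics, the same pattern looks like this; the sketch is in Python with SQLite purely as an illustration, and it assumes a users.csv with id, name, and email columns (in your app the flush/clear step is the cleanUpGorm call from the article):

```python
# Batch the CSV import and release memory after every batch.
import csv
import sqlite3

BATCH = 100  # mirrors the "every 100 records" cleanup interval

conn = sqlite3.connect("users.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (id TEXT, name TEXT, email TEXT)")

with open("users.csv", newline="") as f:
    reader = csv.DictReader(f)
    batch = []
    for row in reader:
        batch.append((row["id"], row["name"], row["email"]))
        if len(batch) == BATCH:
            conn.executemany("INSERT INTO users VALUES (?, ?, ?)", batch)
            conn.commit()   # flush the batch to the database...
            batch.clear()   # ...and drop the in-memory references
    if batch:               # whatever is left after the last full batch
        conn.executemany("INSERT INTO users VALUES (?, ?, ?)", batch)
        conn.commit()
```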
I have an import function which takes a bunch of data in XML format and inserts it into my DB. The problem is that, depending on the amount of data, the process can take quite a long time. I see in the server log that an incredibly large number of SQL statements are being executed to save all the data.
How can I improve performance for that case? Is it possible to do all the operations only in memory and save them back with only one statement?
Update:
In response to HLGEM's answer:
I read up on the bulk insert approach, but it doesn't seem very practical to me because I have a lot of relations between the data... in order to put 100 rows into a table, I have to set up their relations to other tables...
Is there a way to solve that? Can I do encapsulated inserts?
When doing things to a database, NEVER work row by row; that is what is causing your problem. A bulk insert of some type will be much faster than a row-by-row process. Not knowing which database you are using, it's hard to be more specific about exactly how to do this. You should not think in terms of row-by-row processing at all, but rather in terms of how to affect a whole set of data.
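To make "affect a set of data" concrete for the relations problem in your update: load the raw rows into a staging table in one round trip, then resolve the foreign keys with a single set-based INSERT ... SELECT instead of one INSERT per row. The sketch below uses Python with an in-memory SQLite database purely for illustration; the tables, columns, and values are made up, and the exact bulk-load tool will depend on your actual database.

```python
# Staging-table pattern: one bulk load, then one set-based insert with a JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
CREATE TABLE staging_items (name TEXT, category_name TEXT);
INSERT INTO categories (name) VALUES ('books'), ('music');
""")

# Rows parsed from the XML (the parsing itself is omitted here).
parsed = [("Dune", "books"), ("Kind of Blue", "music")]

# One bulk round trip into the staging table...
conn.executemany("INSERT INTO staging_items VALUES (?, ?)", parsed)

# ...then a single statement resolves the relation for every row at once.
conn.execute("""
INSERT INTO items (name, category_id)
SELECT s.name, c.id
FROM staging_items s
JOIN categories c ON c.name = s.category_name
""")
conn.commit()

print(conn.execute("SELECT name, category_id FROM items").fetchall())
```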
Well, it depends :)
If the processing (before import) of the XML data is expensive, you can run the processing once, import to the DB, and then export plain SQL from the database. Then future imports can use this SQL for a bulk import, avoiding the need for XML processing.
If the data import itself is expensive, then you probably want to get specific to your database in order to figure out how to speed it up. If you are using MySQL you might check out: https://serverfault.com/questions/37769/fast-bulk-import-of-a-large-dataset-into-mysql
I have two applications (server and client) that use a TQuery connected to a TClientDataSet through a TDCOMConnection,
and in some cases the ClientDataSet opens about 300,000 records and then the application throws the exception "Temporary table resource limit".
Is there any workaround to fix this (other than "do not open such a huge dataset")?
Update: oops, I'm sorry, there are 300K records, not 3 million.
The error might be from the TQuery rather than the TClientDataSet. When using a TQuery, it creates a temporary table, and it might be this limit that you are hitting. Having said that, loading 3,000,000 records into a TClientDataSet is a bad idea too, as it will try to load every record into memory - which may be possible if they are only a few bytes each, but it is probably still going to kill your machine (obviously, at 1 KB each you are going to need 3 GB of RAM minimum).
You should try to break your data into smaller chunks. If it is the TQuery that is failing, this will mean adjusting the SQL (fewer fields / fewer records) or moving to a better database (the BDE is getting a little tired, after all).
You have the answer already. Don't open such a huge dataset in a ClientDataSet (CDS).
Three million rows in a CDS is a huge memory load (depending on the size of each row, it can be gigantic).
The whole purpose of using a CDS is to work quickly with small datasets that can be manipulated in memory. Adding that many rows is ridiculous; use a real dataset instead, or redesign things so you don't need to retrieve so many rows at a time.
Over 3 million records is way too many to handle at once. My guess is that you are performing an export or something similar that requires that many records to be sent down the wire. One method you could use to reduce this issue would be to have the middle tier generate an export file and then deliver that file to the client (preferably compressing it first using ZLIB or something similar).
If you are pulling data back to the client for viewing purposes, then consider sending summary information only, and allowing the client to dig their way through the data a portion at a time. The users would thank you, because your performance will go way up and they won't have to dig through records they don't care about looking at.
EDIT
Even 300,000 records is way too many to handle at once. If you had that many pennies, would you be able to carry them all? But if you converted them into larger denominations, you could. If you're sending data to the client for a report, then I strongly suggest a summary approach... give them the big picture and let them drill down into the data slowly. Send grouped data, and then let them open it up gradually.
If this is a search results screen, then set a limit on the number of records to be returned, plus one. For example, to display 100 records, set the limit to 101. Still only display 100; if that last record comes back, it means there were MORE than 100 matches, so the customer needs to adjust their search criteria to return a smaller subset.
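A quick sketch of that limit-plus-one trick, written in Python with an in-memory SQLite table just to show the shape of it (the table, columns, and page size are assumptions):

```python
# Ask for one row more than the page size; the extra row signals "truncated".
import sqlite3

PAGE_SIZE = 100

def search(conn, name_like):
    rows = conn.execute(
        "SELECT id, name FROM customers WHERE name LIKE ? ORDER BY name LIMIT ?",
        (name_like, PAGE_SIZE + 1),
    ).fetchall()
    truncated = len(rows) > PAGE_SIZE
    return rows[:PAGE_SIZE], truncated

# Tiny in-memory table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers (name) VALUES (?)",
                 [("Smith %03d" % i,) for i in range(150)])

results, truncated = search(conn, "Smith%")
print(len(results), "shown; more available:", truncated)
```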
"Temporary table resource limit" is not a limit on one single query; it is the limit for all open queries together. So one solution may be to close all other queries at the time.
If it is not possible for you to use an ADO connection, you can also design a paging mechanism to query the data page by page.
GOOD LUCK