I have to unload the output of a query that joins 4 tables, 2 of which are very large. I have tried to optimize the unload in the ways listed below, yet the query still runs for more than 10 hours on the cluster.
Removed PARALLEL OFF so that the unload runs in parallel
Used PARQUET to write the output in an optimized, Redshift-friendly data format
Created temp tables for the 2 massive tables, with the dist key set to the column they are joined on in the query.
Please let me know if there are any further ways to make this unload more efficient. Thanks!
I currently have a tricky problem and need ideas for the most efficient way to solve it.
We periodically iterate through large CSV files (~50,000 to 2 million rows), and for each row, we need to check a database table for matching columns.
So for example, each CSV row could have details about an Event (artist, venue, date/time etc.), and for each row, we check our database (PG) for the rows that most closely match the artist, venue and date/time, and then perform operations if any match is found.
Currently, the entire process is highly CPU-, memory- and time-intensive when pulling row by row, so we perform the matching in batches, but we are still looking for a more efficient way to perform the comparison, both memory-wise and time-wise.
Load the complete CSV file into a temporary table in your database (using a DB tool, see e.g. "How to import CSV file data into a PostgreSQL table?")
Perform matching and operations in-database, i.e. in SQL
If necessary, truncate the temporary table afterwards
This would move most of the load into the DB server, avoiding all the ActiveRecord overhead (network traffic, result parsing, model instantiation etc.)
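The three steps above can be sketched as a handful of SQL statements built in Ruby. This is only a minimal sketch under stated assumptions: the staging table `csv_events`, the target table `events` and the columns `artist`, `venue` and `starts_at` are hypothetical names, and the join uses exact equality, whereas the question asks for the closest match, which would need a looser condition.

```ruby
# Hypothetical names: csv_events (staging table), events (existing table),
# and the artist/venue/starts_at columns are assumptions for illustration.
STAGING_TABLE = "csv_events"
MATCH_COLUMNS = %w[artist venue starts_at].freeze

# Step 1: a session-local staging table that holds the raw CSV rows as text.
def create_staging_sql
  cols = MATCH_COLUMNS.map { |c| "#{c} text" }.join(", ")
  "CREATE TEMP TABLE #{STAGING_TABLE} (#{cols})"
end

# Step 2: bulk-load with COPY; the client then streams the CSV on STDIN.
def copy_sql
  "COPY #{STAGING_TABLE} (#{MATCH_COLUMNS.join(', ')}) FROM STDIN WITH (FORMAT csv, HEADER true)"
end

# Step 3: one set-based join replaces the per-row lookups.
def match_sql(target_table = "events")
  conditions = MATCH_COLUMNS.map { |c| "e.#{c}::text = s.#{c}" }.join(" AND ")
  "SELECT e.id, s.* FROM #{target_table} e JOIN #{STAGING_TABLE} s ON #{conditions}"
end

puts create_staging_sql, copy_sql, match_sql
```

From Rails, the generated statements could be sent through ActiveRecord::Base.connection.execute, with the COPY data streamed by the underlying driver; the matching itself then runs entirely inside PostgreSQL.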
I'm wondering about something that doesn't seem efficient to me.
I have 2 tables, one very large table DATA (millions of rows and hundreds of cols), with an id as primary key.
I then have another table, NEW_COL, with a variable number of rows (1 to millions) but always 2 cols: id and new_col_name.
I want to update the first table, adding the new_data to it.
Of course, I know how to do it with a proc sql/left join, or a data step/merge.
Yet it seems inefficient: as far as I can tell from the execution time (I may be wrong), both approaches rewrite the huge table completely, even when NEW_DATA has only 1 row (almost 1 min).
I tried doing it in 2 SQL steps, alter table add column and then update, but it's waaaaaaaay too slow, as an update with a join doesn't seem efficient at all.
So, is there an efficient way to "add a column" to an existing table WITHOUT rewriting this huge table?
SAS datasets are row stores, not columnar stores like tables in some other databases. As such, adding rows is far easier and more efficient than adding columns. A key-joined view could be argued to be the most 'efficient' way to add a column to a data rectangle.
If you are adding columns so often that the 1 min cost is a problem, you may need to upgrade your hardware (faster drives, a less contended operating environment, or more memory) and use SASFILE if the new columns are frequent yet temporary in nature.
@Richard's answer is perfect. If you are adding columns on a regular basis, then there is a problem with your design. You need to give more details on what you are doing so that someone can suggest an alternative.
I would try a hash join; you can find code for a simple hash join easily. It is an efficient way of joining because in your case you have one large table and one small table, and if the small one fits into memory it is much better than a left join. I have done various joins this way, and query run times were considerably lower (by roughly a factor of 10).
With the alter table approach you are rewriting the table, and it also puts a lock on the table, so nobody else can use it.
You should perform these joins when the workload is low, i.e. outside office hours, and you may need to schedule the jobs at night, when more SAS resources are available.
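The idea behind a hash join can be illustrated with a short sketch. It is written in plain Ruby rather than SAS hash-object syntax, and the column names (`:id`, `:new_col`) are hypothetical. The small table is indexed in memory once, then the large table is read in a single pass with a constant-time lookup per row. Note that this only speeds up the matching; the output rows are still all written.

```ruby
# Hash join sketch: index the small table in memory, then stream the large one once.
# Column names (:id, :new_col) are hypothetical.
def hash_join(large_rows, small_rows, key: :id)
  # Build phase: one hash entry per key of the small table.
  lookup = small_rows.each_with_object({}) { |row, h| h[row[key]] = row }

  # Probe phase: a single pass over the large table, one hash lookup per row.
  large_rows.map do |row|
    match = lookup[row[key]]
    match ? row.merge(match) : row # left-join semantics: unmatched rows are kept
  end
end

data    = [{ id: 1, a: "x" }, { id: 2, a: "y" }, { id: 3, a: "z" }]
new_col = [{ id: 2, new_col: 42 }]

p hash_join(data, new_col)
```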
Thanks for your answers guys.
To add information, I don't have any constraints about table locking, load balancing or anything like that, as it's a "project tool" script I use.
The goal is, in the data prep step 'starting point data generator', to recompute an already existing data item, or add a new one (less often but still quite regularly). Thus, I just don't want to "lose" time waiting for the whole table to be rewritten when I only need to update one data item for specific rows.
When I monitor the server, the computation of the data and the joining step are very fast. But when I want to update only 1 row, I see the whole table being rewritten. Seems a waste of resources to me.
But it seems it's a mandatory step, so can't do much about it.
Too bad.
First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, controlled from inside Rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating database records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
It's extremely slow now, and it didn't use to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc.) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. Not sure exactly how I would do that, but I think it's possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive number of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
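A minimal sketch of the compress-on-write, decompress-on-read idea, using Ruby's standard Zlib. The `Region` class and its `boundary` attribute are hypothetical, and a plain hash stands in for the model's attribute store; in a Rails model the two accessors would call write_attribute and read_attribute instead, and the column would presumably need a binary type to hold the compressed bytes.

```ruby
require "zlib"

# Plain-Ruby stand-in for a model (hypothetical class and attribute names).
# In Rails, the accessors would use write_attribute / read_attribute
# instead of touching @attributes directly.
class Region
  def initialize
    @attributes = {}
  end

  # Setter: compress before the value is stored.
  def boundary=(text)
    @attributes[:boundary] = Zlib::Deflate.deflate(text)
  end

  # Getter: decompress on the way out and restore the text encoding.
  def boundary
    raw = @attributes[:boundary]
    raw && Zlib::Inflate.inflate(raw).force_encoding(Encoding::UTF_8)
  end

  # Size of what would actually be written to the database.
  def stored_bytes
    @attributes[:boundary].bytesize
  end
end

region = Region.new
region.boundary = "POLYGON((0 0, 0 1, 1 1, 1 0, 0 0)) " * 1000
puts "stored #{region.stored_bytes} bytes for #{region.boundary.bytesize} bytes of text"
```

The trade-off is the one stated above: once the column is compressed, PostgreSQL functions can no longer operate on its contents.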
Indexing. If you are using Postgres to store such large chunks of data in single fields, consider storing them as Array, JSON or Hstore fields. You can then index them with the GIN index type so you can search effectively within a given field.
How to persist large amounts of data by reading from a CSV file (say 20 million rows).
This has been running for close to 1 1/2 days so far and has persisted only 10 million rows. How can I batch this so that it becomes faster, and is there a possibility to run it in parallel?
I am using the code here to read the CSV, I would like to know if there is a better way to achieve this.
Refer: dealing with large CSV files (20G) in ruby
You can try to first split the file into several smaller files, then you will be able to process several files in parallel.
For splitting the file it will probably be faster to use a tool like split
split -l 1000000 ./test.txt ./out-files-
Then, while you are processing each of the files, and assuming you are inserting records, instead of inserting them one by one you can combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1, 'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
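Building the statement yourself might look like the sketch below. It is an illustration under assumptions: the table and column names are hypothetical, and `quote` is a simplified stand-in for the database adapter's own quoting (in Rails, ActiveRecord::Base.connection.quote), which should be preferred for real data.

```ruby
# quote() is a simplified, hypothetical stand-in for the adapter's quoting.
def quote(value)
  case value
  when nil     then "NULL"
  when Numeric then value.to_s
  else "'#{value.to_s.gsub("'", "''")}'" # double any embedded single quotes
  end
end

# Builds one multi-row INSERT for a batch of rows.
def bulk_insert_sql(table, columns, rows)
  values = rows.map { |row| "(#{row.map { |v| quote(v) }.join(', ')})" }.join(", ")
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{values}"
end

rows = [[1, "data1"], [2, "data2"], [3, "O'Brien"]]

# One statement per batch instead of one per row; the batch size of 2 is only for the demo.
rows.each_slice(2) do |batch|
  puts bulk_insert_sql("some_table", %w[id name], batch)
end
```

Each generated string would then be passed to ActiveRecord::Base.connection.execute. A batch size in the hundreds or thousands of rows is a common starting point, to be tuned against the server's maximum packet or statement size.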
Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following, adapted to your schema:
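sql = "LOAD DATA LOCAL INFILE 'big_data.csv' INTO TABLE your_table FIELDS TERMINATED BY ','"  # your_table is a placeholder for your schema
con = ActiveRecord::Base.connection
con.execute(sql)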
Key points:
If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY. InnoDB uses a clustered index to store the data in the table, and a clustered index determines the physical order of data in a table.
refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
Configure your MySQL server parameters; the most important ones are
(1) disable the MySQL binary log (binlog)
(2) innodb_buffer_pool_size.
(3) innodb_flush_log_at_trx_commit
(4) bulk_insert_buffer_size
You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
You should use a producer-consumer scenario: one thread reads the file while several others persist the rows.
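A minimal producer-consumer sketch using only Ruby's standard library. The `persist_batch` lambda is a hypothetical stand-in for the real bulk insert, and the worker count and batch size are arbitrary demo values. The bounded queue keeps the reader from running far ahead of the writers.

```ruby
require "thread"

# persist_batch is a hypothetical stand-in for the real bulk insert.
def run_pipeline(lines, workers: 2, batch_size: 3)
  queue     = SizedQueue.new(workers * 2) # bounded, so the producer cannot run far ahead
  persisted = Queue.new                   # thread-safe collector for this demo
  persist_batch = ->(batch) { batch.each { |line| persisted << line } }

  # Consumers: each pops batches until it sees its stop marker.
  consumers = Array.new(workers) do
    Thread.new do
      while (batch = queue.pop) != :done
        persist_batch.call(batch)
      end
    end
  end

  # Producer: for a real file, iterate File.foreach(path).each_slice(batch_size).
  lines.each_slice(batch_size) { |batch| queue << batch }
  workers.times { queue << :done }        # one stop marker per consumer
  consumers.each(&:join)

  Array.new(persisted.size) { persisted.pop }
end

rows = (1..10).map { |i| "row#{i}" }
puts "persisted #{run_pipeline(rows).size} rows"
```

Threads are a reasonable fit here because the consumers spend most of their time waiting on the database; each consumer would need its own database connection.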
Sorry for my poor English.
I am trying to join 6 tables, each with approximately 5 million rows. The join is on account number, which is sorted in ascending order in all tables. The map tasks finish successfully, but the reducers stop making progress at 66.68%. I tried increasing the number of reducers, and also tried set hive.auto.convert.join = true; set hive.hashtable.max.memory.usage = 0.9; and set hive.smalltable.filesize = 25000000L; but the result is the same. With a small number of records (about 5000 rows) the query works really well.
Please suggest what can be done here to make it work.
Reducers at 66% start doing the actual reduce (0-33% is shuffle, 33-66% is sort). In a join with Hive, the reducer performs a Cartesian product between the data sets for each join key.
I'm going to guess that there is at least one foreign key that is appearing frequently in all of the data sets. Watch for NULL and default values.
For example, in a join, imagine the key "abc" appears ten times in each of the six tables (10^6). That's a million output records for that one key. If "abc" appears 1000 times in one table, 1000 in another, 1000 in another, then twice in the other three tables, you get 8 billion records (1000^3 * 2^3). You can see how this gets out of hand. I'm guessing there is at least one key that is resulting in a massive number of output records.
This is general good practice to avoid in RDBMS outside of Hive as well. Doing multiple inner joins between many-to-many relationships can get you in a lot of trouble.
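The arithmetic above can be checked with a short sketch, which also shows one way to spot the offending keys before running the join. It is plain Ruby over in-memory arrays, so `hottest_keys` is a hypothetical helper for illustration only; on real data the same per-key counts would come from a GROUP BY on each table.

```ruby
# Rows a multi-way inner join emits for ONE key value:
# the product of that key's occurrence count in each table.
def rows_for_key(counts_per_table)
  counts_per_table.reduce(1, :*)
end

# Given each table's join-key column, list the keys that produce the most output rows.
def hottest_keys(tables, top: 3)
  tallies = tables.map(&:tally)
  shared  = tallies.map(&:keys).reduce(:&) # inner join: the key must appear in every table
  shared.map { |k| [k, rows_for_key(tallies.map { |t| t[k] })] }
        .max_by(top) { |_, n| n }
end

puts rows_for_key([10] * 6)                    # 10^6 = 1000000
puts rows_for_key([1000, 1000, 1000, 2, 2, 2]) # 1000^3 * 2^3 = 8000000000
```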
For debugging this now, and in the future, you could use the JobTracker to find and examine the logs for the Reducer(s) in question. You can then instrument the reduce operation to get a better handle on what's going on. Be careful you don't blow it up with logging, of course!
Try looking at the number of records input to the reduce operation for example.