I am thinking of creating a denormalized table for our BI purposes.
While building the business logic from several tables, I noticed that queries perform better when the denormalized table is updated in batches (a stored procedure with multiple business-logic SQL statements) using MERGE statements, as below.
e.g. the sproc contains multiple statements like
MERGE INTO denormalized_data USING (SELECT /* business logic 1 */) ...
MERGE INTO denormalized_data USING (SELECT /* business logic 2 */) ...
etc.
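For reference, a fully spelled-out version of one of those batched MERGE statements might look like the sketch below; the table and column names (other than denormalized_data) are hypothetical, and the subquery stands in for one of the business-logic queries.

-- Hypothetical example: upsert the result of one business-logic query
-- into the denormalized table as a single batched statement.
MERGE INTO denormalized_data tgt
USING (
    SELECT c.customer_id,
           SUM(o.amount) AS total_sales   -- business logic 1
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
) src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
    UPDATE SET total_sales = src.total_sales
WHEN NOT MATCHED THEN
    INSERT (customer_id, total_sales) VALUES (src.customer_id, src.total_sales);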
Is it better to include the business logic in one huge SQL statement, or to divide it so that each query handles a smaller number of rows?
Is there any overhead if I use a sproc?
Speaking very generally, Snowflake is optimized to work in large batches. For example, I've had cases where it takes about as long to insert 1 record as 100,000 records, so inserting 1 record 100,000 times is a LOT slower.
There is certainly going to be some limit (a 1 TB batch should be split up), and your mileage may vary depending on how and when you are updating the table. In general, though, you'll find batches are more performant.
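To make the batching point concrete, here is a hedged sketch of the two approaches; the table, columns, and staging source are hypothetical.

-- Row-by-row: 100,000 separate statements, each paying fixed per-statement overhead.
INSERT INTO target_table (id, val) VALUES (1, 'a');
-- ...repeated 100,000 times from application code...

-- Batched: one statement, one pass over the data.
INSERT INTO target_table (id, val)
SELECT id, val FROM staging_table;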
The only real overhead I know of for procedures has to do with converting data types from SQL to JavaScript and back again, and then with how you have to manage the output. In most cases this won't be significant.
I'm new to alasql (which is amazing). While the documentation shows you how to do things, it doesn't provide a lot of information on best practices.
To date I have simply been running queries against an array of arrays (of JS objects); I haven't created a database object or table objects.
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
Here is a real-world example. I have 2 sets of data that I am loading: Employees (10 columns) and Employee Sales (5 columns), which are joined on an EmployeeID column. Employees will be relatively small (say, 100 rows), whereas Employee Sales will have 10,000 records. My current approach is to simply run a query that joins those 2 sets of data together and end up with one big result set: 10,000 rows of data with 14 columns per row (repeating every column of the Employee data set), which I then pull data from using dynamic filters, interactivity, etc.
This big data set is stored in memory the whole time, but this has the advantage that I don't need to rerun that query over and over. Alternatively, I could simply run the join against the 2 data sets each time I need it, then remove it from memory after.
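For reference, the join described above is a query along these lines in alasql's SQL dialect (only EmployeeID is named in the question; the table names are placeholders, whether they refer to alasql tables or to arrays passed in as parameters):

-- Hypothetical join of the two data sets on EmployeeID.
SELECT *
FROM Employees e
JOIN EmployeeSales s ON e.EmployeeID = s.EmployeeID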
Also, if I am joining together multiple tables, can I create indexes on the join columns to speed up performance? I see examples where indexes are created, but there is nothing else in the documentation (nothing on this page: https://github.com/agershun/alasql/wiki/Sql). What is the memory impact of indexes? What is the performance impact on insertions?
Primary keys are supported, but there is no documentation. Does this create an index?
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
If you put indexes on your tables then, yes, you get performance benefits. How much depends on your data.
if I am joining together multiple tables, can I create indexes on the join columns to speed up performance?
Yes. And the same goes for any other columns you put into a "where" condition.
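As a sketch, assuming Employees and EmployeeSales have been created as alasql tables (the exact index syntax is worth verifying against the alasql wiki and tests):

CREATE INDEX idx_employees_id ON Employees (EmployeeID);
CREATE INDEX idx_sales_employee_id ON EmployeeSales (EmployeeID);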
Imagine the following situation I am planning:
1. Have two rather large tables stored in Hive, both containing different types of customer-related information (say, although this is not exactly the case, a record of customer transactions in one and customer-owned data in the other). Let's call the tables A and B.
2. Tables are large in the sense that neither table fits completely in memory. (There are 10 million customers and there are a few kilobytes of info associated with each of them in each of the two tables.)
3. Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets (100).
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are stored on the same (sets of) nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes, and the same held for their replicas.
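For concreteness, the bucketed setup described in item 3 might be declared like the sketch below; everything other than customer_id, the table names A and B, and the bucket count is hypothetical, and SORTED BY is included because it is what later enables a sort-merge join.

-- Hypothetical DDL for the two bucketed (and sorted) tables.
CREATE TABLE a (
  customer_id BIGINT,
  txn_info    STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 100 BUCKETS;

CREATE TABLE b (
  customer_id BIGINT,
  owned_info  STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 100 BUCKETS;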
Notes:
I am aware that partitioning and bucketing are different, and that the first essentially determines the structure of subdirectories, whereas the second determines which file each record goes to. This question is about bucketing only.
Also, by number 2, map-side joins are not an option here, since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join entirely there.
Bucketing is used when the column you would like to partition by has too many distinct values, or when there are no good candidate partition columns.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers, so each partition would contain only a small amount of data (inefficient), and you would also end up with far too many partitions (also inefficient). Instead, you can hash the customerID and bucket into, say, 50 buckets. Then, when you are joining on customerID, the job only has to scan what is in a single bucket rather than the entire data set.
With ideal bucketing, each bucket should contain some multiple of the file system block size. Remember also that too many buckets, or buckets built over columns not used as join keys, can be detrimental to other queries.
I have used buckets when I need to execute large jobs repeatedly, and my query times have been reduced significantly. I tend to use them only when my data is very big, and big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the buckets are of similar size. If you partition over State, for example, California will have a huge partition while other states have very small ones.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
Tables that are bucketed and sorted on the join key in the same way can be merge-joined (a sort-merge bucket join), which works in linear time, O(n); otherwise the tables have to be sorted the same way first, which is usually O(n log n).
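To actually get that behavior, Hive usually needs the sort-merge bucket join settings enabled. A sketch, reusing the hypothetical DDL above (double-check the property names against your Hive version):

-- Enable sort-merge bucket (SMB) joins for bucketed, sorted tables.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;

SELECT a.customer_id, a.txn_info, b.owned_info
FROM a JOIN b ON a.customer_id = b.customer_id;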
First things first: I am an amateur, self-taught Ruby programmer who came of age as a novice engineer in the era of super-fast computers, when program efficiency was not an issue in the early stages of my primary GIS software development project. That technical debt is starting to tax my project, and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, controlled from Rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
It's extremely slow now, and it didn't use to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc.) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. I'm not sure exactly how I would do that, but I think it's possible.
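For illustration only, one way to do that split is to move the heavy geometry into its own table keyed by the parent's id; the table names, columns, and SRID below are hypothetical.

-- Hypothetical split: keep light attributes in counties,
-- move the heavy geometry into a one-to-one side table.
CREATE TABLE county_geometries (
  county_id integer PRIMARY KEY REFERENCES counties (id),
  geom      geometry(MultiPolygon, 4326)
);
CREATE INDEX index_county_geometries_on_geom
  ON county_geometries USING gist (geom);

-- Spatial queries then join back to the side table and still use the index:
SELECT c.id, c.name
FROM counties c
JOIN county_geometries g ON g.county_id = c.id
WHERE ST_Intersects(g.geom, ST_MakeEnvelope(-122.6, 37.2, -121.7, 38.0, 4326));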
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes on tables I'm not currently using slow down my queries? How about having too many on one table?
These tables have a massive number of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (Rails) database performance.
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem on the database side if you have full table scans, in which case you might consider moving that data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
Indexing. If you are using Postgres to store such large chunks of data in single fields, consider storing them as array, JSON, or hstore fields. Index them using the GIN index type so you can search effectively within a given field.
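As a sketch (the table and column names are hypothetical; GIN indexes over jsonb, hstore, and array columns are standard PostgreSQL):

-- Hypothetical GIN indexes on structured columns.
CREATE INDEX index_counties_on_properties ON counties USING gin (properties);  -- jsonb column
CREATE INDEX index_counties_on_tags ON counties USING gin (tags);              -- hstore column

-- Containment queries can then use the index, e.g.:
SELECT id FROM counties WHERE properties @> '{"state": "CA"}'::jsonb;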
After realizing that an application suffers from the N+1 problem because of the ORM, I would like more information about the improvements that can be made, and statistics comparing the time before the improvements (with the N+1 problem) and after. So what is the time difference before and after such improvements? Can anyone give me a link to some paper that analyzes the problem and provides statistics on it?
You really don't need statistical data for this, just math. N+1 (or better, 1+N) stands for
1 query to get a record, and
N queries to get all records associated with it
The bigger N is, the bigger the performance hit becomes, particularly if your queries are sent across the network to a remote database. That's why N+1 problems keep cropping up in production: they're usually insignificant in development mode with little data in the DB, but as your data grows in production to thousands or millions of rows, your queries will slowly choke your server.
You can instead use
a single query (via a join) or
2 queries (one for the primary record, one for all associated records)
The first query will return more data than strictly needed (the data of the first record will be duplicated in each row), but that's usually a good tradeoff to make. The second query might get a bit cumbersome for large data sets, since all the foreign keys are passed in as one long list, but again, it's usually a tradeoff worth making.
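Using the Cars/Wheel example quoted further down, the two alternatives look roughly like this (a sketch; the Cars primary key column is assumed to be Id):

-- Alternative 1: a single query via a join (car data repeated on each wheel row)
SELECT c.*, w.*
FROM Cars c
JOIN Wheel w ON w.CarId = c.Id;

-- Alternative 2: two queries (what ORM eager loading typically issues)
SELECT * FROM Cars;
SELECT * FROM Wheel WHERE CarId IN (1, 2, 3 /* ...all the car ids from the first query... */);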
The actual numbers depend on too many variables for statistics to be meaningful: number of records, DB version, hardware, etc.
Since you tagged this question with rails: ActiveRecord does a good job of avoiding N+1 queries if you know how to use it. Check out the explanation of eager loading.
The time difference would depend on how many additional selects were performed because of the N+1 problem. Here's a quote from an answer given to another Stack Overflow question regarding N+1:
Quote Start
SELECT * FROM Cars;
/* for each car */
SELECT * FROM Wheel WHERE CarId = ?
In other words, you have one select for the Cars, and then N additional selects, where N is the total number of cars.
Quote End
In the example above, the time difference would depend on how many car records were in the database and how long it took to query the 'Wheel' table each time the code/ORM fetched a new record. If you only had 2 car records, then the difference after removing the N+1 problem would be negligible, but if you had a million car records it would have a significant effect.
I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 (slowly changing dimension) ETL, whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce LastUpdated into source system (OLTP) tables. This way you have less to extract using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte, database-scope-unique counter).
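A minimal sketch of that incremental extract, assuming a hypothetical Students source table joined to a hypothetical address table, plus an ETL control table that records the last extract time:

-- Hypothetical incremental extract driven by LastUpdated columns.
DECLARE @last_extract datetime;
SELECT @last_extract = LastExtractTime FROM dbo.EtlControl WHERE TableName = 'Students';

SELECT s.StudentId, s.FirstName, s.LastName, a.AddressLine1, a.Email
FROM dbo.Students s
JOIN dbo.StudentAddresses a ON a.StudentId = s.StudentId
WHERE s.LastUpdated >= @last_extract
   OR a.LastUpdated >= @last_extract;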
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For an SSIS example see here. The transformation does exactly what you are trying to do: all you have to do is specify which columns to monitor and what to do when it detects a change.
It sounds like you are approaching this somewhat backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" data from your OLTP database
Compare ("Transform") your extracted data against the dimensional data to determine whether there are changes, and perform whatever other validation is needed
Insert the data ("Load") into your dimension table.
Effectively, in step #1, you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine if a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes; any RDBMS and ETL process should be able to work through that in no more than a few seconds. I know SQL Server has DTS (renamed SSIS as of SQL Server 2005), and that is the perfect tool for doing something like this.
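For illustration, a hedged sketch of that compare-and-load step in plain T-SQL (SQL Server 2005 has no MERGE statement, so this uses UPDATE + INSERT; the staging table, the DimStudents columns, and the SCD2 bookkeeping columns are all hypothetical, and NULL-safe comparisons are omitted for brevity):

-- 1. Expire the current dimension row for students whose attributes changed.
UPDATE d
SET    d.IsCurrent = 0,
       d.ValidTo   = GETDATE()
FROM   dbo.DimStudents d
JOIN   dbo.StagingStudents s ON s.StudentId = d.StudentId
WHERE  d.IsCurrent = 1
  AND (d.Email <> s.Email OR d.AddressLine1 <> s.AddressLine1);

-- 2. Insert a new current row for changed or brand-new students.
INSERT INTO dbo.DimStudents (StudentId, Email, AddressLine1, ValidFrom, ValidTo, IsCurrent)
SELECT s.StudentId, s.Email, s.AddressLine1, GETDATE(), NULL, 1
FROM   dbo.StagingStudents s
LEFT JOIN dbo.DimStudents d
       ON d.StudentId = s.StudentId AND d.IsCurrent = 1
WHERE  d.StudentId IS NULL;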
Does your OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.