How to improve the performance in big table join?

Please help me out with this big data problem.
I have a very large table (500 GB) that stores cookie information collected from one website, and I am trying to provide a service to many other clients. Each client has their own cookies, so in the end I need to query 500 GB + 300 GB (client_data).
Since some queries use both my cookie data and the client's cookie data, I may need to join my table with theirs, and the performance of that join is bad. To solve this problem, I put the entire 800 GB of data into one giant table. Since there is no join, the performance is good. But when I expand my service to multiple clients, it takes too much storage.
Currently I am using Vertica as my data source, and I use bitmaps to store my information.
Any suggestions that would maintain my current performance but also support around 40 clients? My storage is about 12 TB and each client in the current solution takes 1.5 TB.
What I want is either a replacement for Vertica that supports bitmap operations and fast table joins, or a better way to represent my data.

If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
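The arithmetic behind that statement can be written out as a quick check. This is only a sketch using the figures quoted in the question (40 clients, 1.5 TB per client, 12 TB of storage); the variable names are made up for illustration.

```python
# Figures taken from the question; nothing here is measured.
clients = 40
tb_per_client = 1.5
available_tb = 12

# Total storage if every client keeps a full 1.5 TB copy.
required_tb = clients * tb_per_client

# How many full per-client copies fit into the available storage.
max_clients = int(available_tb // tb_per_client)

print(f"required: {required_tb} TB, available: {available_tb} TB")
print(f"clients that fit at {tb_per_client} TB each: {max_clients}")
```

Any design that supports 40 clients therefore has to stop duplicating data per client, which brings the join back.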

This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.

You have a few ways forward, depending on the specifics of your case. Points 1 and 3 are the easiest to deal with:
1) You can properly set projections, to make sure that both tables are identically segmented:
2) You can set up pre-join projections, where the join cost is paid during data load, not during data retrieval, see
3) Make sure that your data types are the best possible. Matching on ints is faster than matching on strings, and matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, which speeds up your query considerably and uses a lot less memory.


Merging without rewriting one table

I'm wondering about something that doesn't seem efficient to me.
I have 2 tables: one very large table DATA (millions of rows and hundreds of columns), with an id as primary key.
I then have another table, NEW_COL, with a variable number of rows (1 to millions) but always 2 columns: id and new_col_name.
I want to update the first table, adding the new_data to it.
Of course, I know how to do it with a proc sql/left join, or a data step/merge.
Yet it seems inefficient: as far as I can tell from the execution time (I may be wrong), both approaches rewrite the huge table completely, even when NEW_DATA is only 1 row (almost 1 min).
I tried doing it in 2 SQL statements, with alter table add column followed by update, but it's way too slow, as an update with a join doesn't seem efficient at all.
So, is there an efficient way to "add a column" to an existing table WITHOUT rewriting this huge table?
SAS datasets are row stores, not columnar stores like tables in some other databases. As such, adding rows is far easier and more efficient than adding columns. A key-joined view could be argued to be the most 'efficient' way to add a column to a data rectangle.
If you are adding columns so often that the 1 min resource incursion is a problem you may need to upgrade hardware with faster drives, less contentious operating environment, or more memory and SASFILE if the new columns are often yet temporary in nature.
@Richard's answer is perfect. If you are adding columns on a regular basis, then there is a problem with your design. You need to give more details on what you are doing so that someone can suggest an alternative.
I would try a hash join. You can find code for a simple hash join. This is an efficient way of joining because in your case you have one large table and one small table; if the small one fits into memory, it is much better than a left join. I have done various joins using this approach, and query run times were considerably lower (by about a factor of 10).
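The idea of a hash join can be sketched as follows. This is a Python illustration of the technique, not SAS code: the function name `hash_join` and the sample rows are made up, and it assumes the small table has unique ids and fits in memory.

```python
def hash_join(large_rows, small_rows, key, new_col):
    """Left-join one column from small_rows onto large_rows by key."""
    # Build phase: load the small table into an in-memory hash table.
    # Assumes keys in small_rows are unique and the table fits in memory.
    lookup = {row[key]: row[new_col] for row in small_rows}
    # Probe phase: a single pass over the large table, one lookup per row.
    for row in large_rows:
        # Rows without a match get None in the new column.
        yield {**row, new_col: lookup.get(row[key])}


# Hypothetical sample data standing in for DATA and NEW_COL.
data = [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 3, "x": 30}]
new_col = [{"id": 2, "new_col_name": "b"}]

result = list(hash_join(data, new_col, key="id", new_col="new_col_name"))
print(result)
```

Note that this only makes the matching step cheap. Every row of the large table is still read and written once, so it does not avoid the rewrite the question asks about.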
With the alter table approach you are rewriting the table, and it also causes a lock on your table so nobody can use it.
You should perform these joins when the workload is low, which means outside office hours; you may need to schedule the jobs at night, when more SAS resources are available.
Thanks for your answers guys.
To add information: I don't have any constraints about table locking, load balancing or anything like that, as it's a "project tool" script I use.
The goal is, in the data prep step 'starting point data generator', to recompute already existing data, or add new data (less often, but still quite regularly). Thus, I just don't want to "lose" time waiting for the whole table to be rewritten when I only need to update one value for specific rows.
When I monitor the server, the computation of the data and the joining step are very fast. But when I want to update only 1 row, I see the whole table being rewritten. Seems a waste of resources to me.
But it seems it's a mandatory step, so can't do much about it.
Too bad.

Does a data warehouse need to satisfy 2NF or another normal form?

I'm investigating data warehouses. And I have an issue about star schemas.
It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC & QUARTER_DSC? I think that we can look up these values from the YEAR & QUARTER tables. And it breaks 2NF.
2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy so that when we update a database we don't have to say the same thing in multiple places and so that we can't accidentally erroneously not say the same thing where it would need to be said in multiple places. That is not a problem in query results because we are not updating them. The same is true for a data warehouse's base tables. (Which are also just queries on its original database's base tables.)
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database to avoid recomputation at the expense of space. (Notice though that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods might have their own rules about which parts should be denormalized and by how much.
(Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.

Surrogate Keys in Datawarehouse

I want to understand how surrogate keys are leveraged in real-time DWH environments. I get that they add the benefit of not being dependent on source-generated data to store each dimension key, and that they avoid having a composite key built out of natural keys from dimensions in the fact table, e.g. (prod id + cust id + time id).
But does it not add the complexity of having to maintain the lookup of (natural key, surrogate key) while we load data into facts? I have been working in BI/DW teams for the last 3 years and we do not maintain any surrogate keys in our systems. We leverage natural keys to build our datamarts. One sample use case is revenue data which is stored in a transactional system and loaded into the warehouse at customer, product, time period granularity using the same natural keys from the source. We use the same keys to join with the corresponding dimensions to build a star schema.
The main reason I think it makes sense in our case is that the business uses EDW data to do micro-analysis of data at account level, not just trending analysis. We would need to maintain data integrity in that case, which we achieve using natural keys. I want to understand how other DW environments work. How do you leverage surrogate keys or natural keys in your systems?
One reason is to maintain and be able to compare historical changes.
For example, if one of your product attributes changes and you wanted to look at and compare revenue before and after the attribute change, how would you do that without using surrogate product keys? Using a natural key would just overwrite the old value when you ETL.
The lookup doesn't have to be very complex to maintain. Most ETL tools have support for this and usually have some caching mechanism built in to cache lookup values.
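A cached lookup of this kind can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular ETL tool implements it; the class `SurrogateKeyLookup`, the column names and the sample rows are hypothetical, and it assumes one surrogate key per natural key (no Type 2 versioning).

```python
class SurrogateKeyLookup:
    """In-memory cache mapping natural keys to integer surrogate keys."""

    def __init__(self):
        self._keys = {}      # natural key -> surrogate key
        self._next_key = 1   # next unused surrogate key

    def get_or_create(self, natural_key):
        # Return the cached surrogate key, assigning a new one on first sight.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next_key
            self._next_key += 1
        return self._keys[natural_key]


lookup = SurrogateKeyLookup()

# Hypothetical source rows: the natural key is (source system, customer id).
source_rows = [
    {"source": "crm", "cust_id": "C-17", "revenue": 100.0},
    {"source": "billing", "cust_id": "C-17", "revenue": 40.0},
    {"source": "crm", "cust_id": "C-17", "revenue": 60.0},
]

# Fact rows carry only the surrogate key, resolved during the load.
fact_rows = [
    {"customer_sk": lookup.get_or_create((r["source"], r["cust_id"])),
     "revenue": r["revenue"]}
    for r in source_rows
]
print(fact_rows)
```

In a real load the mapping would be persisted in the dimension table and the cache warmed from it; here it lives only for the duration of the script.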
Also, what do you mean when you say "real-time" data warehouse? Are you using ROLAP, DirectQuery or something similar? If so, you might be building your marts directly on your OLTP system and de-normalize in some semantic model. Then you could use your natural keys because there is no traditional ETL/data warehouse to do lookups and store your surrogate keys.
Lastly, granularity is not related to what type of key you are using.
If your business is stable and runs on top of a single application for everything, natural keys will work just fine, as your experience tells you.
Most businesses are not in such a state or not for very long. Mergers happen, new applications are introduced, legacy stuff refuses to die. New lines of business are started or split off and require wholesale renaming of existing natural key schemes.
Surrogate keys provide great benefits in keeping reporting dimensions stable and usable across the business when you have a bunch of separate new and legacy applications that all have their own versions of your customers and products and regularly get migrated or swapped out for similar systems with new natural key definitions. The major work is linking the various natural keys of a customer/product/whatever, assigning a surrogate key is just a simple and very helpful step in that.
Even in your scenario, I would use surrogate keys as they prepare you for future changes and are very helpful with historical data (as NITHIN B also answered) in Type 2 Dimensions.
It's quite possible to do versioning with natural keys by adding a version field to your dimension and fact tables, but it makes the joins harder to write for reporting and your whole system still gets messy if business or application changes cause the natural keys to change.
To illustrate:
Select bla
from Fact F
inner join Dim_Customer DC
on F.Surrogate_key = DC.Surrogate_Key
is almost foolproof. If you mess this up, it will be immediately obvious in your report.
Select bla
from Fact F
inner join Dim_Customer DC
on F.Natural_key = DC.Natural_Key
and F.Version = DC.Version
does the same job, but if you forget that last line, everything will look normal but your numbers will be inflated depending on how many versions there are on average. Kinda painful when that 25% sales increase turns out to be an error.
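The inflation described above is easy to reproduce. The following sketch uses Python's built-in sqlite3 module with made-up tables and values (one customer with two dimension versions); the table and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (natural_key TEXT, version INTEGER, segment TEXT);
    CREATE TABLE fact (natural_key TEXT, version INTEGER, revenue REAL);
    -- Customer C1 exists in two versions in the dimension.
    INSERT INTO dim_customer VALUES ('C1', 1, 'small'), ('C1', 2, 'large');
    INSERT INTO fact VALUES ('C1', 1, 100.0), ('C1', 2, 50.0);
""")

# Join on natural key AND version: each fact row matches exactly one dimension row.
correct = conn.execute("""
    SELECT SUM(f.revenue)
    FROM fact f
    INNER JOIN dim_customer dc
        ON f.natural_key = dc.natural_key
       AND f.version = dc.version
""").fetchone()[0]

# Version condition forgotten: each fact row matches both dimension versions.
inflated = conn.execute("""
    SELECT SUM(f.revenue)
    FROM fact f
    INNER JOIN dim_customer dc
        ON f.natural_key = dc.natural_key
""").fetchone()[0]

print(correct, inflated)
```

With two versions per customer, the second query counts every fact row twice, and nothing in the result hints that the join is wrong.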
An additional reason, which has not been mentioned yet, is performance. Sometimes (very often in my experience) natural keys are strings, sometimes long strings.
It may not seem like a big deal to use a 10, 20 or 30 byte string instead of a 4 byte integer, but when you have 10 dimensions and hundreds of millions of rows, it adds up fast.
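As a rough back-of-the-envelope check of how fast it adds up, here is the calculation for one assumed case: 100 million fact rows, 10 dimension keys, 20-byte strings versus 4-byte integers. The row count and key width are assumptions picked from the ranges mentioned above, and compression and encoding are ignored.

```python
# Assumed sizes, chosen from the ranges mentioned in the answer.
fact_rows = 100_000_000
dimensions = 10
string_key_bytes = 20
int_key_bytes = 4

# Raw, uncompressed bytes spent on key columns in the fact table.
string_total = fact_rows * dimensions * string_key_bytes
int_total = fact_rows * dimensions * int_key_bytes
extra_bytes = string_total - int_total

print(f"string keys: {string_total / 1e9:.0f} GB")
print(f"integer keys: {int_total / 1e9:.0f} GB")
print(f"extra: {extra_bytes / 1e9:.0f} GB")
```

The same factor applies to any index built on those columns and to the data moved during each join.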
Could you please post a sample design?
I would be interested to see how you can load a fact table with dimension keys which are natural keys. Kimball design never recommends it.
My stand on surrogate keys in a DWH:
Surrogate keys give you a lot of flexibility with Type 2 Dimensions, i.e. if you have Type 2 Dimensions. For example, you can track changes of a customer if he or she changes their second name. You can have rows with the old values and the new values.
Fact tables usually hold keys which are surrogate keys. It makes your star schema neat, tidy and robust.
However, I am not jumping the gun here; I would wait for your design before going for or against your stand.

Limit the growth of ETS storage

I'm considering using Erlang's ETS as a cache for user searches in a new Elixir project. Based on user input, the system will do lookups using an expensive third-party API.
In order to avoid making duplicate calls for the same user input, I intend to put a cache layer in front of the external API, and ETS seems like a good option for this. However, since there is no limit to the variations of user input, I'm concerned that the storage space required for the ETS table will grow without bound.
In my reading about ETS, I haven't seen anyone else discuss concern about the size of tables in ETS. Is that because this would be an abnormal use case for ETS?
At first blush, my preference would be to limit the number of entries in the ETS table, and reject (i.e. delete) the oldest entries once the limit is reached…
Is there a common strategy for dealing with unbounded number of entries in ETS?
I use ETS tables in production as a 'smart invalidated cache' with a Redis API (it also has master-master replication, like a SQL WAL log).
The biggest tables are ~200-300 MB and hold more than 1 million items. There have been no problems for the last 2 years. I know about the ERL_MAX_ETS_TABLES limit but don't have any information about size limits.
I have special 'smart indexes' for these tables. ETS select/match/etc. is slow because these functions pass over all the elements in the table.
Use the ets:tab2list(TableId) function to convert the ETS table to an ordinary list. After doing that, you can check the size of the list with the well-known BIF length(List).
Last but not least, you can now set a limit: just check the size of the list with pattern matching, or an if or case expression.
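The strategy from the question (cap the number of entries and drop the oldest once the cap is reached) can be sketched independently of ETS. The following is a Python illustration of the eviction logic only, not Erlang or Elixir code; the class name `BoundedCache` is made up, and in ETS the insertion order would have to be tracked by the application, since ETS does not evict entries on its own.

```python
from collections import OrderedDict


class BoundedCache:
    """Cache that keeps at most max_entries items, evicting the oldest first."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # remembers insertion order

    def get(self, key):
        # Returns None on a cache miss.
        return self._entries.get(key)

    def put(self, key, value):
        if key in self._entries:
            # Overwriting keeps the entry's original position (age).
            self._entries[key] = value
            return
        self._entries[key] = value
        # Evict from the front (oldest insert) until we are within the limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


cache = BoundedCache(max_entries=2)
cache.put("query a", 1)
cache.put("query b", 2)
cache.put("query c", 3)  # evicts "query a"
print(cache.get("query a"), cache.get("query c"), len(cache))
```

Checking the size on every insert keeps the memory bound strict; a periodic cleanup pass would be a cheaper but looser alternative.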

Does bucketing two *large* tables in Hive *in the same way* help perform much more efficient joins?

Imagine the following situation I am planning:
1) Have two rather large tables stored in Hive, both containing different types of customer-related information (say, although this is not exactly the case, a record of customer transactions in one and customer-owned data in the other). Let's call the tables A and B.
2) Tables are large in the sense that neither of them fits completely in memory. (There are 10 million customers and there are a few kilobytes of info associated with each of them in each of the two tables.)
3) Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets (100).
I wonder whether this setup will, in any way, guarantee that a join (by customer_id) between both tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are physically stored on the same (sets of) nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes and the same held for their replicas.
I am aware that partitioning and bucketing are different and that the first essentially determines the structure of subdirectories, whereas the second only determines which file each record goes to. This question is about bucketing only.
Also, by number 2, map-side joins are not an option here, since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join completely there.
Bucketing is used when there are too many distinct values in the column you want to partition by, or there are no good candidate partitions.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers. Partitions would contain small amounts of data, which is inefficient, and having too many partitions is also inefficient. However, you can hash the customerID and partition into 50 buckets, for example. Then when you are merging on customerID the job will only have to scan what is in a bucket rather than the entire sum of all your data.
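The effect of hashing the key into a fixed number of buckets can be sketched in Python. This illustrates the idea only, not Hive's implementation: the modulo stands in for Hive's hash function, and the function names and sample rows are made up.

```python
from collections import defaultdict


def bucketize(rows, key, num_buckets):
    """Assign each row to a bucket by key modulo num_buckets."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucketed_join(a_rows, b_rows, key, num_buckets):
    """Inner join that only ever compares rows from the same bucket number."""
    a_buckets = bucketize(a_rows, key, num_buckets)
    b_buckets = bucketize(b_rows, key, num_buckets)
    joined = []
    for i in range(num_buckets):
        # Index one side of bucket i, then probe it with the other side.
        b_index = defaultdict(list)
        for b in b_buckets.get(i, []):
            b_index[b[key]].append(b)
        for a in a_buckets.get(i, []):
            for b in b_index.get(a[key], []):
                joined.append({**a, **b})
    return joined


# Hypothetical rows; customer ids 1 and 101 land in the same bucket (1).
sales = [{"customer_id": 1, "amount": 5}, {"customer_id": 101, "amount": 7}]
owners = [{"customer_id": 1, "name": "x"}, {"customer_id": 2, "name": "y"}]

joined = bucketed_join(sales, owners, key="customer_id", num_buckets=50)
print(joined)
```

Because both sides use the same key and the same bucket count, matching rows can only ever be in buckets with the same number, so each bucket pair can be processed on its own.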
With ideal bucketing, your buckets should contain some multiple of the file system block size. Remember also that too many buckets, or buckets that are built over variables not used as keys, can be detrimental for other queries.
I have used them when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to only use them when my data is very big. And big is relative to cluster size and capacity.
One great thing about bucketing is that it helps ensure the bucketed partitions are of similar size. If you partition over State, for example, California will have huge partitions while other states are very small.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
Yes, it will definitely help.
Bucketed tables are partitioned and sorted the same way, so they can be joined with a merge join, which works in linear time, O(n); otherwise the tables have to be sorted the same way first, which is usually O(n log n).
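To make the linear-time claim concrete, here is a minimal Python sketch of a merge join over two inputs that are already sorted by the join key. It illustrates the general technique rather than Hive's implementation; the function name and sample rows are made up.

```python
def merge_join(a_sorted, b_sorted, key):
    """Inner join of two lists of dicts, both already sorted by key."""
    i = j = 0
    out = []
    while i < len(a_sorted) and j < len(b_sorted):
        ka, kb = a_sorted[i][key], b_sorted[j][key]
        if ka < kb:
            i += 1          # key only on the left side: skip it
        elif ka > kb:
            j += 1          # key only on the right side: skip it
        else:
            # Same key on both sides: emit every pairing for this key.
            j_start = j
            while i < len(a_sorted) and a_sorted[i][key] == ka:
                j = j_start
                while j < len(b_sorted) and b_sorted[j][key] == ka:
                    out.append({**a_sorted[i], **b_sorted[j]})
                    j += 1
                i += 1
    return out


# Hypothetical pre-sorted inputs.
a = [{"customer_id": 1, "t": "a1"}, {"customer_id": 2, "t": "a2"},
     {"customer_id": 2, "t": "a3"}, {"customer_id": 4, "t": "a4"}]
b = [{"customer_id": 2, "n": "b2"}, {"customer_id": 3, "n": "b3"},
     {"customer_id": 4, "n": "b4"}]

rows = merge_join(a, b, key="customer_id")
print(rows)
```

Apart from the pairs it emits, the function advances through each input once, which is where the linear running time comes from; without pre-sorted inputs a sort would have to run first.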
