Something similar to DISTKEY and SORTKEY - data-warehouse

Can anyone explain features similar to DISTKEY and SORTKEY that I can use in a data warehouse? Any examples or sources I can use would be helpful.

DISTKEY and SORTKEY are specific to AWS Redshift. DISTKEY controls how data is distributed across the nodes in the Redshift cluster, and SORTKEY is a form of indexing on specific columns.
On Azure SQL Data Warehouse, DISTKEY maps to the WITH (DISTRIBUTION = ?) clause of the CREATE TABLE statement: DISTSTYLE EVEN maps to ROUND_ROBIN, DISTKEY(column) maps to HASH(column), and there is no equivalent to DISTSTYLE ALL.
SQL Server Parallel Data Warehouse is very similar to Azure SQL Data Warehouse, except that DISTSTYLE ALL maps to REPLICATE.
On both platforms, indexing (including a clustered columnstore index) replaces the SORTKEY functionality.
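For illustration, here is a minimal sketch of the same hypothetical table defined on both platforms (table and column names are made up), distributing on customer_id and sorting/indexing on event_date:

-- Redshift: hash-distribute on customer_id, sort on event_date
CREATE TABLE sales_fact (
    sale_id     BIGINT,
    customer_id BIGINT,
    event_date  DATE,
    amount      DECIMAL(18,2)
)
DISTKEY (customer_id)
SORTKEY (event_date);

-- Azure SQL Data Warehouse: the same intent expressed with
-- DISTRIBUTION = HASH plus a clustered columnstore index
CREATE TABLE sales_fact (
    sale_id     BIGINT,
    customer_id BIGINT,
    event_date  DATE,
    amount      DECIMAL(18,2)
)
WITH (
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX
);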
Snowflake Elastic Data Warehouse manages to get great performance without having to worry about these concepts.

Related

Apache Druid vs Snowflake

I'm choosing the proper tools for BI/OLAP and need to understand whether Snowflake or Druid is more suitable for my goals.
Currently I'm using Snowflake as my Data Warehouse, and it serves both:
• raw data queries (with massive dataset responses)
• aggregated results
To achieve performance for the second type, I'm creating additional aggregation tables, which act as an OLAP cube (one is sketched below).
My data is time-based.
However, this method demands additional work, and it imposes data duplication and static query requirements.
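For example, one such aggregation table might look roughly like this (table and column names are illustrative, not my actual schema):

-- illustrative only: a daily rollup maintained alongside the raw table
CREATE TABLE events_daily_agg AS
SELECT
    DATE_TRUNC('day', event_ts) AS event_day,
    device_id,
    COUNT(*)          AS event_count,
    SUM(metric_value) AS metric_sum
FROM raw_events
GROUP BY 1, 2;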
Therefore, I'm considering adopting Apache Druid, which would provide a solution for the aggregation.
Is Druid capable of replacing Snowflake for the raw dataset, assuming the queries will always be bounded by a time range and that I can use scan queries? Or must I keep both DBs?

How do I connect an iOS app to Google Cloud SQL?

I had been building my database on Cloud Firestore because it was the easiest to implement. However, Firestore's querying capabilities are insufficient for what I want to build, mostly because it can't handle inequality filters on multiple fields in a single query. I need a SQL database.
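For example, a query of this shape (hypothetical table and columns) is a single statement in SQL, whereas Firestore only allows a range filter on one field per query:

-- hypothetical schema: two inequality filters in one query,
-- which Firestore cannot express but SQL handles directly
SELECT *
FROM listings
WHERE price < 100
  AND rating > 4.0;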
I have an instance of Google Cloud SQL set up. The integration is far harder than Firebase, where you just add a CocoaPods pod. From my research it looks like I need to set up a Cloud SQL Proxy, although if there is a simpler way of connecting, I'd be glad to hear about it.
Essentially, I need a way for an iOS client to read from and write to a SQL database. Cloud SQL seemed like the best, most scalable option (though I'd be open to alternatives that are easy to implement).
You probably don't want your application to rely on connecting directly to a SQL database. Firestore is a highly scalable database that can handle thousands of simultaneous connections; MySQL and Postgres do not scale as cleanly.
Instead, consider building a simple front-end service that queries the database and returns formatted results. There are a variety of benefits to structuring it this way, including being able to further optimize or distribute your queries. Google App Engine and Google Cloud Functions can both be used to stand up such a service quickly, and both provide easy connection options to Cloud SQL.
I've found that querying with Firestore is best designed around your front-end needs. Nested subcollections, the ref property, or document/collection ID relationships can get you most of what you need on the front end.
You could also use Firebase Functions, written in most of the major languages, to perform stateless transactions against Cloud SQL, Spanner, or any other GCP database instance.
Alternatively, you could push container images to Google Container Registry and deploy them to Kubernetes Engine, Compute Engine, or Cloud Run, each of which has its own trade-offs and advantages.
One advantage of staying on Firestore is that you can easily tie users to authentication ({uid}), protect the backend with security rules, use custom claims for role-based permissions on the front end, and get access to real-time streams as observables with extremely low latency.

Hive HBase JOIN performance & Kudu

Reading the Cloudera documentation on using Impala to join Hive tables against smaller HBase tables, as quoted below, and assuming no Big Data appliance such as an OBDA plus a largish HBase dimension table that is mutable:
If you have join queries that do aggregation operations on large fact tables and join the results against small dimension tables, consider using Impala for the fact tables and HBase for the dimension tables. (Because Impala does a full scan on the HBase table in this case, rather than doing single-row HBase lookups based on the join column, only use this technique where the HBase table is small enough that doing a full table scan does not cause a performance bottleneck for the query.)
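In practice, that pattern looks something like the following sketch (hypothetical table names), with the fact table in Impala on HDFS and the dimension table backed by HBase:

-- sketch only: aggregate a large Impala/HDFS fact table and join
-- the result against a small HBase-backed dimension table
SELECT d.region, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN hbase_dim_store d ON f.store_id = d.store_id
GROUP BY d.region;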
Is there any way to get that single-key lookup behaviour another way?
In addition, I noted the following on Kudu and HDFS, presumably Hive. Does anybody have experience here? Keen to know. I will be trying it myself in due course, but installing parcels on non-parcelled quickstarts is not so easy...
Mix and match storage managers within a single application (or query):
SELECT COUNT(*) FROM my_fact_table_on_hdfs JOIN my_dim_table_in_kudu ON ...
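Filled out with hypothetical names (the ON clause is elided in the original), that pattern would read:

-- hypothetical completion of the quoted pattern: fact table on HDFS,
-- mutable dimension table in Kudu, joined in a single Impala query
SELECT COUNT(*)
FROM my_fact_table_on_hdfs f
JOIN my_dim_table_in_kudu d ON f.dim_id = d.id;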
Erring on the side of caution, joining with Kudu for dimensions would be the way to go, so as to avoid a scan on a large HBase dimension when only a lookup is required.
I am retracting the latter point; I am sure that a JOIN will not cause an HBase scan if it is an equijoin.
That said, Impala's MPP architecture allows joining dimensions with fact tables without MapReduce. The advantage of the OBDA is less obvious now, in my opinion.

Graph database referencing Cassandra tables

I have a scenario where I would like to model my IoT assets in the graph database of DataStax Enterprise. This is a perfect fit for my hierarchical data structure. However, my time series data is already stored in a separate Cassandra table. Is there a way to bridge the gap between data in the graph database and data in a standard Cassandra table?
Thanks
At the moment, all data needs to reside in DSE Graph tables to be available via Gremlin traversals for OLTP or OLAP use cases. We have features coming out soon that will help with the OLAP scenario. We'd love to learn more about your use case to enhance the product for this type of scenario. If you'd like, please join the DataStax Academy Graph channel and we can discuss this requirement further - https://academy.datastax.com/slack

Stored Procedure AND/OR ORM for a BI web app

I am building a business intelligence web app in PHP 5 that displays information retrieved from a highly normalized data warehouse (60+ tables in MySQL).
We use MODx as our CMF to organize the code. So far the code is mainly procedural: each page is essentially composed of a bunch of SQL queries directly in the PHP code (a Snippet, in MODx terminology) plus code to display the info in tables and charts.
We are in the process of creating objects for our main components, putting the SQL queries there, and using PDO. This is easy when the query maps to a real object of the domain.
For more BI-oriented queries (aggregation with subqueries, joins on 5+ tables) or search-oriented queries, I find it more difficult to see how to replace the dynamically created SQL. For example, we have a search feature in the web app with a lot of criteria; depending on which criteria are selected, the PHP code adds or removes joined tables and subqueries and changes the WHERE clause.
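To make that concrete, the SQL generated for one combination of criteria might look like this (schema is illustrative):

-- illustrative output of the dynamic query builder: the joined tables
-- and WHERE clause below vary with the criteria the user selects
SELECT c.id, c.name, SUM(o.total) AS revenue
FROM customer c
JOIN orders o ON o.customer_id = c.id
JOIN region r ON r.id = c.region_id
WHERE r.code = 'EU'
  AND o.created_at >= '2010-01-01'
GROUP BY c.id, c.name;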
Do you think an ORM or stored procedures could improve performance and/or code quality in that context?
Is our model (60+ highly normalized tables) too complicated to be accessed directly from the web app, and would a kind of data mart (basically a denormalized view of the data) bring more benefits than an ORM?
This question is related to: stored-procedures-or-or-mappers
The bottleneck would surely be the level of normalization. If the option is available to you, adopting a more star-schema-style DWH would greatly increase performance, as it pre-prepares the data for consumption by your BI app.
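As a sketch of what that pre-preparation could look like (all names hypothetical), a denormalized summary table built periodically from the normalized tables:

-- sketch: a denormalized data mart table refreshed on a schedule,
-- collapsing the multi-table joins into one wide, query-friendly table
CREATE TABLE mart_sales_summary AS
SELECT d.year, d.month, p.category, r.region_name,
       SUM(f.amount) AS total_amount,
       COUNT(*)      AS sale_count
FROM fact_sales f
JOIN dim_date d    ON f.date_id = d.id
JOIN dim_product p ON f.product_id = p.id
JOIN dim_region r  ON f.region_id = r.id
GROUP BY d.year, d.month, p.category, r.region_name;

The BI app then queries mart_sales_summary directly instead of reconstructing the joins on every page load.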
