Reading the Cloudera documentation on using Impala to join a Hive table against smaller HBase tables (quoted below), and assuming the absence of a Big Data appliance such as an OBDA, plus a largish HBase dimension table that is mutable:
If you have join queries that do aggregation operations on large fact
tables and join the results against small dimension tables, consider
using Impala for the fact tables and HBase for the dimension tables.
(Because Impala does a full scan on the HBase table in this case,
rather than doing single-row HBase lookups based on the join column,
only use this technique where the HBase table is small enough that
doing a full table scan does not cause a performance bottleneck for
the query.)
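To make the quoted recommendation concrete, here is a minimal sketch of the query shape it describes: aggregate over a large fact table and join against a small dimension table on an equality key. Table and column names are my own illustration, and sqlite3 stands in for Impala (where the fact table would live on HDFS and the dimension table in HBase).

```python
import sqlite3

# Toy stand-in for the recommended pattern: a large fact table joined
# against a small dimension table on an equality key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (sale_id INTEGER, store_id INTEGER, amount REAL);
    CREATE TABLE store_dim  (store_id INTEGER, region TEXT);
""")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (2, 10, 7.5), (3, 20, 2.5)])
conn.executemany("INSERT INTO store_dim VALUES (?, ?)",
                 [(10, "east"), (20, "west")])

# Aggregate on the fact table and join the result against the (small)
# dimension table -- the shape of query the docs recommend.
rows = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM sales_fact f
    JOIN store_dim d ON f.store_id = d.store_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
print(rows)  # [('east', 12.5), ('west', 2.5)]
```

The docs' caveat is that Impala would answer this by scanning `store_dim` in full rather than doing per-key HBase lookups, which is only acceptable while the dimension table stays small.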
Is there any way to get that single-key lookup in another way?
In addition I noted the following about Kudu and HDFS (presumably Hive). Does anybody have experience here? Keen to know. I will be trying it myself in due course, but installing parcels on non-parcelled quickstarts is not so easy...
Mix and match storage managers within a single application (or query)
• SELECT COUNT(*) FROM my_fact_table_on_hdfs JOIN my_dim_table_in_kudu ON ...
Erring on the side of caution, joining with Kudu for dimensions would be the way to go, so as to avoid a scan on a large dimension table in HBase when only a lookup is required.
I am retracting the latter point; I am sure that a JOIN will not cause an HBase scan if it is an equijoin.
That said, Impala allows an MPP approach without MapReduce, and joining of dimension tables with fact tables. The advantage of the OBDA is less obvious now, in my opinion.
Related
I am new in Cassandra. Although I can do some stuff in SQL, I am finding it pretty hard to do simple join in Cassandra. My schema looks like this:
Now I have to find, for each department how many emails in total were sent out from employees working there. The output per department shall contain the corresponding number of emails.
Maybe I am missing some simple thing, but no matter what I do, I am not even being able to retrieve data from two tables.
Cassandra has no join operation. It is designed this way to maximize the performance of basic operations like reading and writing, but with the caveat that at any given moment you write to and read from a single table.
If your model is relational, i.e. based on relations between tables, then Cassandra is not the way to go. In that case you should use an RDBMS (Relational Database Management System) such as PostgreSQL, MySQL, or SQL Server.
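That said, if you must stay on Cassandra, the usual workaround is to run one query per table and do the "join" and aggregation in application code. The original schema wasn't shown, so the table shapes below (employees with a department, emails with a sender) are assumptions; plain lists stand in for the two Cassandra result sets.

```python
from collections import Counter

# Hypothetical stand-ins for two Cassandra result sets; the real schema
# was not shown, so these table/column names are assumptions.
employees = [                       # employees table: emp_id -> dept
    {"emp_id": 1, "dept": "sales"},
    {"emp_id": 2, "dept": "sales"},
    {"emp_id": 3, "dept": "hr"},
]
emails = [                          # emails table: one row per sent email
    {"email_id": 100, "sender_id": 1},
    {"email_id": 101, "sender_id": 1},
    {"email_id": 102, "sender_id": 3},
]

# With no JOIN in CQL, the join happens in application code:
dept_by_emp = {e["emp_id"]: e["dept"] for e in employees}
emails_per_dept = Counter(dept_by_emp[m["sender_id"]] for m in emails)
print(dict(emails_per_dept))  # {'sales': 2, 'hr': 1}
```

In practice people often avoid even this by denormalizing, i.e. writing the department onto each email row at insert time, which is the Cassandra-idiomatic answer.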
Most of the reasons for using a graph database seem to be that relational databases are slow when making graph like queries.
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins. I might even be using a No-SQL database which is usually pretty fast at these kinds of flat queries.
If this is the case, is there a use case for Graph databases anymore when combined with GraphQL? Neo4j seems to be promoting GraphQL. I'd like to understand the advantages if any.
GraphQL doesn't negate the need for graph databases at all, the connection is very powerful and makes GraphQL more performant and powerful.
You mentioned:
However, if I am using GraphQL with a data loader, all my queries are flattened and combined using the data loader, so you end up making simpler SELECT * FROM X type queries instead of doing any heavy joins.
This is a curious point, because if you do a lot of SELECT * FROM X and the data is connected by a graph loader, you're still doing the joins; you're just doing them in software outside of the database, at another layer, by another means. If even that software layer isn't joining anything, then whatever you gain by not doing joins in the database you lose by executing many queries against the database, plus the overhead of the additional layer. Look into the performance profile of sequencing a series of those individual "easy selects". By not doing those joins, you may have lost 30 years' worth of computer science research: rather than letting the RDBMS optimize the query execution path, the software layer above it forces a particular path by choosing which selects to execute, in which order, at which time.
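For readers unfamiliar with what a data loader actually does, here is a minimal sketch of the batching-and-caching idea: individual key lookups are collected and resolved in one batched query instead of N separate round trips. This is a simplified, synchronous illustration, not the API of Facebook's DataLoader (which is async and more featureful); `fetch_users` is a made-up stand-in for a `SELECT ... WHERE id IN (...)` query.

```python
# Minimal sketch of a data loader: collect individual key lookups and
# resolve them with one batched query, caching the results.
class TinyLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn        # takes a list of keys, returns values
        self.cache = {}

    def load_many(self, keys):
        # Dedupe and skip keys already cached.
        missing = list(dict.fromkeys(k for k in keys if k not in self.cache))
        if missing:
            # One round trip for all missing keys -- the "batched select".
            for k, v in zip(missing, self.batch_fn(missing)):
                self.cache[k] = v
        return [self.cache[k] for k in keys]

calls = []
def fetch_users(ids):  # stand-in for SELECT * FROM users WHERE id IN (...)
    calls.append(list(ids))
    return [{"id": i, "name": f"user{i}"} for i in ids]

loader = TinyLoader(fetch_users)
users = loader.load_many([1, 2, 1, 3])
print(len(calls), calls[0])  # 1 [1, 2, 3] -- one query, duplicates removed
```

Note what the loader does not do: it batches and caches lookups, but the joining of results into a connected shape still happens in your application code, which is the point made above.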
It stands to reason that if you don't have to go through any layer of formalism transformation (relational -> graph) you're going to be in a better position. Because that formalism translation is a cost you must pay every time, every query, no exceptions. This is sort of equivalent to the obvious observation that XML databases are going to be better at executing XPath expressions than relational databases that have some XPath abstraction on top. The computer science of this is straightforward; purpose-built data structures for the task typically outperform generic data structures adapted to a new task.
I recommend Jim Webber's article on the motivations for a native graph database if you want to go deeper on why the storage format and query processing approach matters.
What if it's not a native graph database? If you have a graph abstraction on top of an RDBMS, and then you use GraphQL to do graph queries against that, then you've shifted where and how the graph traversal happens, but you still can't get around the fact that the underlying data structure (tables) isn't optimized for that, and you're incurring extra overhead in translation.
So for all of these reasons, a native graph database + GraphQL is going to be the most performant option, and as a result I'd conclude that GraphQL doesn't make graph databases unnecessary, it's the opposite, it shows where they shine.
They're like chocolate and peanut butter. Both great, but really fantastic together. :)
Yes, GraphQL allows you to make a kind of graph query: you can start from one entity, then explore its neighborhood, and so on.
But if you need performance in graph queries, you need a native graph database.
With GraphQL you give a lot of power to the end user, who can make arbitrarily deep GraphQL queries.
If you have an SQL database, you will have two choices:
compute one big SQL query with a lot of joins (a really bad idea), or
make a lot of SQL queries to retrieve the neighborhood of the neighborhood, ...
If you have a native graph database, it will be just one query with good performance! It's a graph traversal, and native graph databases are made for this.
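To see why the "neighborhood of the neighborhood" case favors a traversal, here is a sketch of that query as a single walk over an in-memory adjacency list (the shape of work a native graph database performs internally), instead of issuing one SQL query per hop. The graph data here is invented for illustration.

```python
from collections import deque

# Toy adjacency list: node -> neighbors.
graph = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": [],
}

def neighborhood(start, depth):
    """All nodes reachable from `start` within `depth` hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen - {start}

print(sorted(neighborhood("a", 2)))  # ['b', 'c', 'd', 'e']
```

The depth is just a parameter of one traversal; in the SQL alternatives above, each extra hop means either another join in the query or another round trip from the application.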
Moreover, if you use GraphQL, you already think of your data model as a graph. So storing it as a graph seems natural and gives you fewer headaches :)
I recommend you to read this post: The Motivation for Native Graph Databases
Answer for Graph Loader
With a graph loader you will make a lot of small queries (the second choice in my answer above), except that there is batching and a cache.
Graph loaders just batch and cache.
For comparison:
you need to add another library and implement the logic (more code)
you need to manage the cache. There is a lot of documentation about this topic. (more memory and complexity)
due to SELECT * in loaders, you will always fetch more data than needed. Example: I only want the id and name of a user, not their email, birthday, ... (less performant)
...
The answer from FrobberOfBits is very good. There are many reasons to add (or avoid) using GraphQL, whether or not a graph database is involved. I wanted to add a small consideration against putting GraphQL in front of a graph. Of course, this is just one of what ought to be many other considerations involved with making a decision.
If the starting point is a relational database, then GraphQL (in front of that database) can provide a lot of flexibility to the caller – great for apps, clients, etc. to interact with data. But in order to do that, GraphQL needs to be aligned closely with the database behind it, and specifically with the database schema. The database schema is sort of "projected out" to apps, clients, etc. in GraphQL.
However, if the starting point is a native graph database (Neo4j, etc.) there's a world of schema flexibility available to you because it's a graph. No more database migrations, schema updates, etc. If you have new things to model in the data, just go ahead and do it. This is a really, really powerful aspect of graphs. If you were to put GraphQL in front of a graph database, you also introduce the schema concept – GraphQL needs to be shown what is / isn't allowed in the data. While your graph database would allow you to continue evolving your data model as product needs change and evolve, your GraphQL interactions would need to be updated along the way to "know" about what new things are possible. So there's a cost of less flexibility, and something else to maintain over time.
It might be great to use a graph + GraphQL, or it might be great to just use a graph by itself. Of course, like all things, this is a question of trade-offs.
What is the time complexity of a search query in a graph database (especially Neo4j)?
I have relational data. I'm unsure whether to use a relational database or a graph database to store it, and I want to decide based on the performance and time complexity of the queries for each. But I'm unable to find the performance and time complexity of queries for graph databases.
Can anyone help me out?
The answer isn't so simple because the time complexities typically depend upon what you're doing in the query (the results of the query planner), there isn't a one-size-fits-all time complexity for all queries.
I can speak some for Neo4j (disclaimer: I am a Neo4j employee).
I won't say much on Lucene index lookups in Neo4j, as these are typically performed only to find starting nodes by index, and represent a fraction of execution time of a query. Relationship traversals tend to be where the real differences show themselves.
Once starting nodes are found, Neo4j walks the graph through relationship traversal, which for Neo4j is basically pointer chasing through memory. This tends to be where native graph dbs outperform relational dbs: the cost of chasing pointers is constant per traversal.
For relational dbs (including graph layers built on top of relational dbs), relationship traversal is usually accomplished by various table join algorithms, which have their own time complexity, typically O(log n) when the join is backed by a B-tree index. This can be quite efficient, but we are in the age of big data and data lakes, so data is getting larger, and the cost of the join does grow with the size of the tables being joined. This is fine for smaller numbers of joins, but some queries require many joins (sometimes without an upper bound on when to stop joining), and we may want to traverse through many kinds of tables/nodes (sometimes without caring what kind they are), so we may need the ability to join to or through any table arbitrarily, which isn't easily expressed or performed in a relational db.
This makes native graph databases like Neo4j well-positioned for handling queries over connected data, especially with a non-trivial or growing number of relationship traversals (or if the traversals are unbounded, such as for reachability queries, shortest-path, and others). The cost of a query is associated with the part of the graph touched or walked by the query, not the total number of nodes in the database, so it helps when you can adequately constrain the query to touch the smallest possible subgraph in the db.
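The contrast between pointer chasing and join-backed traversal can be sketched in a few lines: one hop via a sorted "edge table" needs an index search (O(log n) per hop, the B-tree analogy), while index-free adjacency follows a stored reference directly (constant cost per hop). The data structures below are toy stand-ins, not how either system is actually implemented.

```python
import bisect

# Join-style hop: search a sorted "edge table" for each traversal
# (a stand-in for a B-tree-backed index lookup, O(log n) per hop).
edges = sorted([(1, 2), (2, 3), (3, 4)])
starts = [e[0] for e in edges]

def hop_via_index(node):
    i = bisect.bisect_left(starts, node)
    return edges[i][1] if i < len(edges) and edges[i][0] == node else None

# Index-free adjacency: each record carries a direct reference to its
# neighbor, so a hop is constant cost regardless of total graph size.
adjacency = {1: 2, 2: 3, 3: 4}

def hop_via_pointer(node):
    return adjacency.get(node)

path_idx, path_ptr = [1], [1]
while (n := hop_via_index(path_idx[-1])) is not None:
    path_idx.append(n)
while (n := hop_via_pointer(path_ptr[-1])) is not None:
    path_ptr.append(n)
print(path_idx, path_ptr)  # [1, 2, 3, 4] [1, 2, 3, 4]
```

Both walks produce the same path; the difference is that the per-hop cost of the index version grows with the size of the edge table, while the pointer version does not, which is the point about connected data above.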
As far as your question of whether to use a relational or graph database, that depends upon the connectedness of your data and the queries you plan to run.
If you do decide upon a graph database, then you have choices here as well, and a different set of criteria, such as native vs non-native implementation (Neo4j is a native graph database and takes advantage of index-free adjacency for relationship traversals), whether you need ACID (Neo4j is an ACID database), and whether a rich and expressive query language is desired (Cypher is Neo4j's query language; feel free to learn it and compare against others).
For more in-depth info, here's a DZone article on Why Graph Databases Outperform RDBMS on Connected Data, and an article on Explaining Graph Databases to a Developer by Neo4j's Chief Scientist Jim Webber.
Actually, the most probable scenario is to use both Neo4j and some DBMS (relational, or NoSQL like Mongo), because it is often impractical to store the entire dataset in Neo4j.
Speed-wise, traversing nodes in a conventional DBMS is 10-100+ times slower than in Neo4j. Neo4j also has built-in shortestPath (and many other) methods.
It is also worth mentioning hybrid solutions like ArangoDB, which has a graph engine plus a document-based engine. But under the hood these are two separate stores, so it is almost as inconvenient as Neo4j plus a separate DBMS.
It would actually depend upon the size of your data and the complexity.
In graph databases like Neo4j, the time complexity depends on the kind of query and on the planners (executors) behind it. Graph databases, particularly Neo4j, perform join-like traversals more easily, which gives a clearer view of the data.
For more information please visit this reference blog by Neo4j.
And you can also refer to this question as it is similar to yours.
Hope this helps!
I have a scenario where I would like to model my IoT assets in the graph database of DataStax Enterprise. It is a perfect fit for my hierarchical data structure. However, my time series data is already stored in a separate Cassandra table. Is there a way to bridge the gap between data in the graph database and data in a standard Cassandra table?
Thanks
At the moment, all data needs to reside in DSE Graph tables to be available via Gremlin traversals for OLTP or OLAP use cases. We have features coming out soon, though, that will help provide an OLAP scenario. We'd love to learn more about your use case to enhance the product for this type of scenario. If you'd like, please join the DataStax Academy Graph channel and we can discuss this requirement further - https://academy.datastax.com/slack
We have a data model stored in a relational database that effectively looks like a graph. There is a small number of tables, but the tables are quite large, and the queries we run are often 5 join levels deep. It would be most performant if this data were stored in a graph database, but we don't have that option. How does one achieve graph-database-level performance with an RDBMS? What tools can you add on top of the database, e.g. caching, search indexes, or an OLAP server, that will give you anything close to the performance of a graph database in this situation?
How does one achieve graph database-level performance with an RDBMS?
You don't, at least not in the same way. Both systems have their strengths and weaknesses and trying to apply one to the other will usually end up with a mess. If you are stuck using RDBMS, then design your data model for RDBMS.
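That said, if the traversals must stay in the RDBMS, one common technique for multi-level traversals in plain SQL is a recursive CTE, which expresses the "5 join-levels deep" query once instead of as five explicit joins. The sketch below uses sqlite3 as a stand-in for any RDBMS that supports WITH RECURSIVE; the edge table and node values are invented for illustration.

```python
import sqlite3

# Toy edge table standing in for the relational graph-shaped data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)])

# All nodes reachable from node 1 within 5 hops, via a recursive CTE.
rows = conn.execute("""
    WITH RECURSIVE reach(node, depth) AS (
        SELECT 1, 0
        UNION
        SELECT e.dst, r.depth + 1
        FROM reach r JOIN edges e ON e.src = r.node
        WHERE r.depth < 5
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()
print([n for (n,) in rows])  # [1, 2, 3, 4, 5, 6]
```

This doesn't change the underlying cost model (each recursive step is still an index-backed join), but it keeps the traversal in one statement the planner can optimize, rather than N round trips from the application.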
What tools can you add on top of the database e.g. caching, search indexes, use an OLAP server that will give you anything close to the performance of a graph database in this situation?
It depends on your data model. Perhaps you can elaborate on why you cannot use a graph database? Can you have both running side by side?