Joining data in Apache Beam / Cloud Dataflow

I have two sources that I need to join together. Let's say each of these sources has about 100M rows of data, and I want to join the results of two queries that run against these sources. So, conceptually speaking, the join query looks like this:
SELECT *
FROM
(query1 against ElasticSearch source -- results are of unknown size)
JOIN
(query2 against BigQuery source -- results are of unknown size)
ON query1.joinkey = query2.joinkey
In other words, the results of query1 could be anywhere from 0 bytes/rows to 10GB/100M rows. Same with the results of query2.
How does Apache Beam or Cloud Dataflow deal with 'unknown-sized' joins, for example in the case where I define two run-time queries? Additionally, for the above case, is Apache Beam a good tool to use, or might there be better options?
I suppose in the case where the two inputs may be of unbounded size, it might work best to do the join (conceptually at least) as:
WITH query1 AS (
es_query -> materialize to BigQuery
)
SELECT *
FROM
query1 JOIN query2 USING (joinkey)

There are several ways you can approach a join in an Apache Beam pipeline. For example,
Using side inputs
Using CoGroupByKey operation
Neither approach has a set size limit, so both should work for arbitrarily large datasets. However, side inputs are better suited to cases where a relatively small dataset is fed in (and iterated over) while processing a large dataset as the main input. Hence, if both datasets are large and roughly the same size, CoGroupByKey might better suit your case; a rough sketch of each approach is given below.
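As a rough illustration of the CoGroupByKey approach, here is a minimal sketch only: Create.of(...) stands in for the real Elasticsearch and BigQuery reads, both inputs are assumed to already be re-keyed by the join key, and all element types, tags, and names are illustrative rather than taken from your pipeline.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupByKeyJoinSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Placeholder inputs: in a real pipeline these would be the results of the
    // Elasticsearch and BigQuery queries, each re-keyed by the join key.
    PCollection<KV<String, String>> query1 =
        p.apply("Query1", Create.of(KV.of("k1", "es-row-1"), KV.of("k2", "es-row-2")));
    PCollection<KV<String, String>> query2 =
        p.apply("Query2", Create.of(KV.of("k1", "bq-row-1"), KV.of("k3", "bq-row-3")));

    final TupleTag<String> q1Tag = new TupleTag<>();
    final TupleTag<String> q2Tag = new TupleTag<>();

    // Group both collections by the join key. Neither side needs to fit in
    // memory; the runner shuffles both inputs by key.
    PCollection<KV<String, CoGbkResult>> grouped =
        KeyedPCollectionTuple.of(q1Tag, query1)
            .and(q2Tag, query2)
            .apply(CoGroupByKey.create());

    // Emit one output per (query1 row, query2 row) pair sharing a key,
    // i.e. an inner join on the join key.
    PCollection<String> joined =
        grouped.apply(
            "InnerJoin",
            ParDo.of(
                new DoFn<KV<String, CoGbkResult>, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    String key = c.element().getKey();
                    CoGbkResult rows = c.element().getValue();
                    for (String left : rows.getAll(q1Tag)) {
                      for (String right : rows.getAll(q2Tag)) {
                        c.output(key + "," + left + "," + right);
                      }
                    }
                  }
                }));

    p.run().waitUntilFinish();
  }
}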
For additional code examples for Java and Python see here and here.
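For the other case, where one of the two inputs is known to be small enough to broadcast to every worker, a side-input join might look roughly like the sketch below. Again, Create.of(...) stands in for the real reads, View.asMultimap() turns the small keyed input into a map-valued side input, and all names are illustrative.

import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class SideInputJoinSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Large main input (e.g. the BigQuery side), keyed by the join key.
    PCollection<KV<String, String>> big =
        p.apply("BigInput", Create.of(KV.of("k1", "big-row-1"), KV.of("k2", "big-row-2")));

    // Small input (e.g. a small Elasticsearch result) turned into a map-valued
    // side input that is broadcast to every worker.
    PCollectionView<Map<String, Iterable<String>>> smallView =
        p.apply("SmallInput", Create.of(KV.of("k1", "small-row-1")))
            .apply(View.asMultimap());

    PCollection<String> joined =
        big.apply(
            "JoinAgainstSideInput",
            ParDo.of(
                    new DoFn<KV<String, String>, String>() {
                      @ProcessElement
                      public void processElement(ProcessContext c) {
                        Map<String, Iterable<String>> small = c.sideInput(smallView);
                        Iterable<String> matches = small.get(c.element().getKey());
                        if (matches == null) {
                          return; // inner join: drop unmatched keys
                        }
                        for (String match : matches) {
                          c.output(
                              c.element().getKey() + "," + c.element().getValue() + "," + match);
                        }
                      }
                    })
                .withSideInputs(smallView));

    p.run().waitUntilFinish();
  }
}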

Related

Query optimization that collects and orders nodes on very large graph

I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) on which I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a Neo4j Docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created for Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work, I have access to Neo4j Enterprise, so one approach would be to ingest this data into a Neo4j Enterprise instance where I can tweak advanced configuration settings.
In general, does anyone have any tips on how I might improve performance, whether by optimizing the Cypher query itself, increasing workers, or changing other settings in neo4j.conf?
Thanks in advance.
For anyone interested: I posted this question in the Neo4j forums as well, and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward indexing, and using pattern comprehension instead of collect()).
Initial thoughts
You are using a string field to store the PMID, but PMIDs are numeric; it might reduce the database size, and possibly perform better, if they were stored as int (and indexed as int, and searched as int).
If the PMID list is usually large, and the server has over half a dozen cores, it might be worth looking into the APOC parallel Cypher functions.
Do you really need every property from the Mention nodes? If not, try gathering just what you need.
What is the size of the database in GB? (Some context is required in terms of the memory settings.) And what did neo4j-admin memrec recommend?
If this is how the DB is always used, all the time, a SQL database might be better, and when building that SQL DB, collect the mentions into one field (once and done).
Note: Go PubMed!

Internal working of Pentaho Mondrian for query generation

I am trying to analyse how SQL queries are generated by Pentaho Mondrian. Let us assume there are no aggregate tables for now. I have noticed two types of behaviour when I try to fetch data from a data warehouse (star schema) using Pentaho.
Case 1: I apply various filters and try to get the fact count corresponding to them, which is the default measure in my case.
Case 2: I apply the same filters as mentioned in case 1 and try to get some other measure by explicitly putting it into the measures selection box.
Observation: In both cases, the SQL queries generated in the back-end include joins of the fact table with multiple dimension tables, as per the filters applied and the columns and rows selected in Pentaho.
However, the join order is different in the two cases. In case 1, the fact table is placed at the left-most position of the join, whereas in case 2 it is placed somewhere between the dimension tables.
I have connected Pentaho with AWS Athena at the back-end to execute queries on data stored on S3 with the help of a JDBC connection. Since Athena has Presto at the back-end and Presto does not do automatic JOIN re-ordering, the queries in case 2 are failing.
(http://docs.qubole.com/en/latest/user-guide/presto/best-practices.html)
I noticed that hash joins are being performed by Presto here. For hash joins to be effective, the largest table should be placed on the left side of the join so that the smaller table is cached in memory while performing the join. This is not happening in the second case: Presto is trying to hash the fact table, which contains a large amount of data compared to any of the dimension tables. This causes the query to fail whenever I add a measure explicitly (other than the default measure) and the data range is large (across a year, for example).
Can someone please give some insight into the logic behind Mondrian's query formation in both cases? Also, is there a way to make the fact table always remain in the left-most position of the joins in the SQL queries generated by Mondrian? Or is there any Presto property that could be set through Athena to change the join type from a hash join to some other kind of join that could solve this problem?
Pentaho version - 6.1.0
Saiku version - 3.10

Get arbitrary first N values of a PCollection in Google Dataflow

I am wondering whether Google Dataflow can do something that is the equivalent of the following SQL:
SELECT * FROM A INNER JOIN B ON A.a = B.b LIMIT 1000
I know that Dataflow has a very standard programming paradigm for doing a join. However, the part I am interested in is the LIMIT 1000. Since I don't need all of the joined results but only any 1000 of them, I am wondering whether I can use this to speed up my job (assuming the join is between very large tables and will produce a very large result on a full join).
So I assume that a very naive way to achieve the above SQL result is some template code as follows:
PCollection A = ...
PCollection B = ...
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<..., CoGbkResult>, ...>() {
    }))
    .apply(Sample.any(1000));
However, my concern is how this Sample transform, hooked up after the ParDo, is handled internally by Dataflow. Will Dataflow be able to optimize in such a way that it stops processing the join once it knows it will definitely have enough output? Or is there simply no optimization for this use case, and Dataflow will just compute the full join result and then select 1000 rows from it? (In that case, the Sample transform would only be overhead.)
Or, long question short, is it possible for me to use this to do a partial join in Dataflow?
EDIT:
Or, essentially, I am wondering whether the Sample.any() transform is able to hint any optimization to the upstream PCollection. For example, if I do
pipeline.apply(TextIO.Read.from("gs://path/to/my/file*"))
.apply(Sample.any(N))
Will Dataflow first load all the data in and then select N, or will it be able to take advantage of Sample.any(), do some optimization, and prune out some useless reads?
Currently neither Cloud Dataflow, nor any of the other Apache Beam runners (as far as I'm aware) implement such an optimization.

Will Gremlin graph queries always perform operations in their own address space?

Admittedly, most of my database experience is relational. One of the tenets in that space is to avoid moving data over the network. This manifests in using something like:
select * from person order by last_name limit 10
which will presumably order and limit within the database engine, versus using something like:
select * from person
and subsequently ordering and taking the top 10 at the client, which could have disastrous effects if there are a million person records.
So, with Gremlin (from Groovy), if I do something like:
g.V().has('#class', 'Person').order{println('!'); it.a.last_name <=> it.b.last_name}[0..9]
I am seeing the ! printed, so I am assuming that this brings all Person records into the address space of my client prior to the order and limit steps, which is not the desired effect.
Do my options for processing queries entirely in the database engine become product-specific (e.g. for OrientDB, perhaps submitting the query in their flavor of SQL), or is there something about Gremlin that I am missing?
If you want the implementer's query optimizer to kick in, you need to use as many Gremlin steps as possible and avoid pure Groovy/in-memory processing of your graph traversals.
You're most likely looking for something like this (as of TinkerPop v3.2.0):
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10)
If you find yourself using lambdas, chances are often high that this could be done with pure Gremlin steps. Favor Gremlin steps over lambdas.
See TinkerPop v3.2.0 documentation:
Order By step
Limit step

What is the default MapReduce join used by Apache Hive?

What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper nor the Hive wiki on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
The 'default' join would be the shuffle join, a.k.a. the common join. See JoinOperator.java. It relies on the M/R shuffle to partition the data, and the join is done on the reduce side. Since there is a size-of-data copy during the shuffle, it is slow.
A much better option is the MapJoin, see MapJoinOperator.java. This works if you have only one big table and one or more small tables to join against (e.g. a typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache, and then the M/R job is launched, which only needs to split one table (the big table). It is much more efficient than the shuffle join, but it requires the small table(s) to fit in the memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use MapJoin, but it depends on your configs.
A specialized join is the bucket sort-merge join, a.k.a. SMBJoin, see SMBJoinOperator.java. This works if you have two big tables whose bucketing matches on the join key. The M/R job splits can then be arranged so that a map task gets only splits from the two big tables that are guaranteed to overlap on the join key, so the map task can use a hash table to do the join.
There are more details, like skew join support and fallback in out-of-memory conditions, but this should probably get you started on investigating your needs.
A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast, and a presentation from 2011 is a bit outdated.
Do an explain on the Hive query and you can see the execution plan.
