Get arbitrary first N values of a PCollection in Google Dataflow - join

I am wondering whether Google Dataflow can do something that is equivalent to the following SQL:
SELECT * FROM A INNER JOIN B ON A.a = B.b **LIMIT 1000**
I know that Dataflow has a very standard programming paradigm for doing a join. However, the part I am interested in is the LIMIT 1000. Since I don't need all of the joined results but only any 1000 of them, I am wondering whether I can exploit this to speed up my job (assuming the join is between very large tables and would produce a very large result for the full join).
So I assume that a very naive way to achieve the above SQL result is some template code like the following:
PCollection A = ...;
PCollection B = ...;
PCollection result = KeyedPCollectionTuple.of(ATag, A).and(BTag, B)
    .apply(CoGroupByKey.create())
    .apply(ParDo.of(new DoFn<KV<..., CoGbkResult>, ...>() {
        // emit joined records from each CoGbkResult here
    }))
    .apply(Sample.any(1000));
However, my concern is how Dataflow internally handles the way this Sample transform hooks up with the ParDo. Will Dataflow be able to optimize it so that it stops processing the join as soon as it knows it will definitely have enough output? Or is there simply no optimization for this use case, so Dataflow will just compute the full join result and then select 1000 rows from it? (In that case, the Sample transform would only be an overhead.)
Or, to make a long question short: is it possible for me to exploit this use case to do a partial join in Dataflow?
EDIT:
Or, essentially, I am wondering whether the Sample.any() transform is able to hint any optimization to the upstream PCollection. For example, if I do
pipeline.apply(TextIO.Read.from("gs://path/to/my/file*"))
        .apply(Sample.any(N))
Will Dataflow first load all the data and then select N, or will it be able to take advantage of Sample.any() and prune out some of the useless reads?

Currently neither Cloud Dataflow nor any of the other Apache Beam runners (as far as I'm aware) implements such an optimization.
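If you only need a bounded number of joined outputs, the closest workaround today is to bound the work yourself, for example by applying Sample.any() to each input before the CoGroupByKey instead of after the ParDo. Note that this is not semantically equivalent to LIMIT 1000 on the joined output (it samples each side before joining, so some keys may lose their matches); it is only a sketch of the kind of explicit pruning you would have to write, since the runner will not push the sample upstream for you. The helper name and element types below are placeholders:

import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

// Hypothetical helper; A and B are assumed to already be keyed by the join key.
static PCollection<KV<String, CoGbkResult>> boundedCoGroup(
    PCollection<KV<String, String>> a, TupleTag<String> aTag,
    PCollection<KV<String, String>> b, TupleTag<String> bTag,
    long maxPerSide) {
  return KeyedPCollectionTuple
      // Sampling the inputs is explicit pruning, not a runner optimization.
      .of(aTag, a.apply("SampleA", Sample.<KV<String, String>>any(maxPerSide)))
      .and(bTag, b.apply("SampleB", Sample.<KV<String, String>>any(maxPerSide)))
      .apply(CoGroupByKey.create());
}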

Related

Joining data in apache beam / cloud dataflow

I have two sources that I need to join together. Let's say each of these sources is about 100M rows of data, and I want to join the results of two queries that run against these sources. So, conceptually speaking, the join query looks like this:
SELECT *
FROM
(query1 against ElasticSearch source -- results are of unknown size)
JOIN
(query2 against BigQuery source -- results are of unknown size)
ON query1.joinkey = query2.joinkey
In other words, the results of query1 could be anywhere from 0 bytes/rows to 10GB/100M rows. Same with the results of query2.
How does Apache Beam or Cloud Dataflow deal with 'unknown-sized' joins, for example in the case where I define two run-time queries? Additionally, for the above case, is Apache Beam a good tool to use, or might there be better options?
I suppose that in the case where the two results may be of unbounded size, it might work best to do the join (conceptually at least) as:
WITH query1 AS (
es_query -> materialize to BigQuery
)
SELECT *
FROM
query1 JOIN query2 USING (joinkey)
There are several ways you can approach a join in an Apache Beam pipeline. For example,
Using side inputs
Using CoGroupByKey operation
Neither approach has a set size limit, so both should work for arbitrarily large datasets. However, side inputs are better suited for cases where a relatively small dataset is fed (and iterated over) while a large dataset is processed as the main input. So if both datasets are large and roughly the same size, CoGroupByKey might suit your case better (a minimal CoGroupByKey sketch follows below).
For additional code examples for Java and Python see here and here.
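For illustration, here is a minimal CoGroupByKey join sketch in Java (Beam 2.x style), assuming both query results have already been read and keyed by joinkey as KV<String, String>; the element types, tag names, and output format are placeholders, not a prescribed implementation:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

static PCollection<String> joinByKey(
    PCollection<KV<String, String>> esRows,   // Elasticsearch results keyed by joinkey
    PCollection<KV<String, String>> bqRows) { // BigQuery results keyed by joinkey
  final TupleTag<String> esTag = new TupleTag<String>() {};
  final TupleTag<String> bqTag = new TupleTag<String>() {};
  return KeyedPCollectionTuple.of(esTag, esRows).and(bqTag, bqRows)
      .apply(CoGroupByKey.create())
      .apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
          String key = c.element().getKey();
          CoGbkResult grouped = c.element().getValue();
          // Emit the cross product of matching rows for this key.
          for (String es : grouped.getAll(esTag)) {
            for (String bq : grouped.getAll(bqTag)) {
              c.output(key + "," + es + "," + bq);
            }
          }
        }
      }));
}

Neither side needs to fit in memory for this to work; CoGroupByKey shuffles both inputs by key, which is why it is the safer choice when both results are of unknown size.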

Are multiple vertex labels in Gremlin/Janusgraph possible, or is an alternative solution better?

I am working on an import runner for a new graph database.
It needs to work with:
Amazon Neptune - a Gremlin implementation with great infrastructure support in production, but a pain to work with locally, and it does not support Cypher. No visualization tool provided.
JanusGraph - easy to work with locally as a Gremlin implementation, but requires heavy investment to support in production, hence the use of Amazon Neptune. No visualization tool provided.
Neo4j - excellent visualization tool, the Cypher language feels very familiar, and it even works with Gremlin clients, but it requires heavy investment to support in production, and there appears to be no visualization tool for Gremlin implementations that is anywhere near as good as the one found in Neo4j.
So I am creating a graph where the Entities (Nodes/Vertices) have multiple Types (Labels), some being orthogonal to each other, as well as multi-dimensional.
For example, an Entity representing an order made online would be labeled as Order, Online, Spend, Transaction.
            | Spend     Chargeback
----------------------------------------
Transaction | Purchase  Refund
Line        | Sale      Return
Zooming into the Spend column.
         | Online      Instore
----------------------------------------
Purchase | Order       InstorePurchase
Sale     | OnlineSale  InstoreSale
In Neo4j and its Cypher query language, this proves to be very powerful for creating Relationships/Edges across multiple types without explicitly knowing what transaction_id values are in the graph:
MATCH (a:Transaction), (b:Line)
WHERE a.transaction_id = b.transaction_id
MERGE (a)<-[edge:TRANSACTED_IN]-(b)
RETURN count(edge);
The problem is, Gremlin/TinkerPop does not natively support multiple labels for its vertices.
Server implementations like AWS Neptune support this using a delimiter, e.g. Order::Online::Spend::Transaction, and the Gremlin client supports it against a Neo4j server, but I haven't been able to find an example where this works for JanusGraph.
Ultimately, I need to be able to run a Gremlin query equivalent to the Cypher one above:
g
  .V().hasLabel("Line").as("b")
  .V().hasLabel("Transaction").as("a")
  .where("b", eq("a")).by("transaction_id")
  .addE("TRANSACTED_IN").from("b").to("a")
So there are multiple questions here:
Is there a way to make JanusGraph accept multiple vertex labels?
If not possible, or this is not the best approach, should there be an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label (Transaction) or the low-level label (Order)?
Is there a way to make JanusGraph accept multiple vertex labels?
No, there is not a way to have multiple vertex labels in JanusGraph.
If not possible, or this is not the best approach, should there be
an additional vertex property containing a list of labels?
In the case of option 2, should the label name be the high-level label
(Transaction) or the low-level label (Order)?
I'll answer these two together. Based on what you have described above, I would create a single label, probably named Transaction, with additional properties such as Location (Online or InStore) and Type (Purchase, Refund, Return, Chargeback, etc.). Looking at how you describe the problem, you are really talking about a single entity, a Transaction, where all the other items you are using as labels (Online/InStore, Spend/Refund) are just additional metadata about how that Transaction occurred. This approach allows simple filtering on one or more of these attributes to achieve anything that could be done with the multiple labels you are using in Neo4j; a minimal Gremlin sketch of that model follows below.
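As a rough Gremlin sketch of that single-label model (Java/TinkerPop; the property keys location and type and the sample values are assumptions standing in for the extra labels):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

// Sketch only: a single vertex label, Transaction; Online/Spend/Order become properties.
static void transactionModelSketch(GraphTraversalSource g) {
  g.addV("Transaction")
      .property("transaction_id", "t-123")   // placeholder id
      .property("location", "Online")
      .property("type", "Purchase")
      .iterate();

  // Filtering on the metadata gives the same selectivity the extra labels would have:
  g.V().hasLabel("Transaction")
      .has("location", "Online")
      .has("type", "Purchase")
      .valueMap()
      .toList();
}

Any query that previously matched on one of the extra labels can instead add a has() filter on these properties.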

One single Azure SQL query is consuming almost all query_stats.total_worker_time and query_stats.execution_count

I've been running a production website on Azure SQL for 4 years.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I found one super slow query according to Azure query stats.
The one at the top is the one that uses a lot of CPU. When looking at the LINQ query, the execution plans, and the live stats, I can't find the bottleneck yet.
The join from results to project is not direct; there is a projectsession table in between, not visible in the query, but perhaps under the hood of Entity Framework.
Might I be affected by parameter sniffing? Can I reset a plan hash? Maybe the optimized query plan was created in 2014, and now the result table is about 4 million rows and the plan is far from optimal?
If I run this query in Management Studio, it's very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding option(hash join) at the end of the query, if possible. Once you start getting into large arity, a loop join is not particularly efficient. That would prove out whether there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, run DBCC FREEPROCCACHE, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if this particular query + parameter values are executed often enough to be sniffed during plan compilation. Your other option is to add an OPTIMIZE FOR UNKNOWN hint, which disables sniffing and directs the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loop joins, since the cardinality estimates of the operators in the tree will likely increase.

Neo4J using properties on relationships for quicker lookup?

I am currently trying to use Neo4j to perform a complex query (similar to a shortest-path search, except I have very strange conditions applied to this search, like a minimum path length in terms of the number of nodes traversed).
My dataset contains around 2.5M nodes of a single type and around 1.5 billion edges (also of a single type). Each node has on average 1000 directed relationships to a "next" node.
So far, I have a query that allows me to retrieve this shortest path given all of my conditions, but the only way I found to get a decent response time (under one second) is to limit the number of results after each new node is added to the path, filter them, order them, and then proceed to the next node (this is a kind of greedy algorithm, I suppose).
I'd like to limit them a lot less than I do in order to yield more paths as a result, but the problem is the exponential complexity of this search, which makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
That being said, I am currently evaluating several solutions to increase the speed of the request, but I'm quite unsure of the results they will yield, as I'm not sure how Neo4j really stores my data internally.
The solution I am currently considering is to add a property to my relationships, an integer between 1 and 15, because I will usually only query relationships that have one or, at most, two specific values for this property (e.g. only relationships whose property is 8 or 9).
As far as I can guess, for each relationship Neo4j then has to gather the original node's properties and use them to apply my further filters, which takes a very long time when crossing 4-node-long paths with 1000 relationships each (I guess O(1000^4)). Am I right?
With relationship properties, will it have direct access to them without further data fetching? Is there any chance this will make my queries faster? How are Neo4j edge properties stored?
UPDATE
Following @logisima's advice, I wrote a procedure directly with the Java traversal API of Neo4j. I then switched to the raw Java procedure API of Neo4j to get even more power and flexibility, as my use case required it.
The results are really good: the lower-bound complexity is overall a little lower than it was before, but the upper bound is about ten times faster, and when at least some of the nodes that will be used for the traversal are in Neo4j's cache, the performance becomes astonishing (depth 20 in less than a second for one of my tests, when I usually only need depth 4).
But that's not all. The procedure is very easily customisable while keeping performance at its best and optimizing every single operation. The result is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least, procedures are very easily pluggable with spring-data for Neo4j (which I use to connect Neo4j to my HTTP API). Whereas with Cypher, I would have had to auto-generate the queries (being very complex, it took about 30 Java classes to do this properly) and use JDBC for Neo4j while handling a separate connection pool just for this request. I cannot recommend the awesome Neo4j Java API enough.
Thanks again @logisima
If you're trying to implement a custom shortest-path algorithm, then you should write a Cypher procedure with the traversal API.
The principle of Cypher is pattern matching, whereas you want to traverse the graph in a specific way to find your solution.
The response time should be much faster for your use case!
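For reference, a minimal sketch of such a procedure against the Neo4j 3.x procedure and traversal APIs might look like the following; the procedure name, the NEXT relationship type, and the depth bound are placeholders, and the custom filtering/limiting logic from the question would go into additional evaluators:

import java.util.stream.Stream;
import java.util.stream.StreamSupport;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.graphdb.traversal.Uniqueness;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Mode;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;

public class BoundedPathsProcedure {

    @Context
    public GraphDatabaseService db;

    public static class PathResult {
        public Path path;
        public PathResult(Path path) { this.path = path; }
    }

    // Callable from Cypher as: CALL custom.boundedPaths(id(n), 4)
    @Procedure(name = "custom.boundedPaths", mode = Mode.READ)
    public Stream<PathResult> boundedPaths(@Name("startId") long startId,
                                           @Name("maxDepth") long maxDepth) {
        Node start = db.getNodeById(startId);
        TraversalDescription td = db.traversalDescription()
                .breadthFirst()
                .relationships(RelationshipType.withName("NEXT"), Direction.OUTGOING)
                .evaluator(Evaluators.toDepth((int) maxDepth))
                // The query-specific pruning/ordering (the "greedy" limiting) would be
                // expressed as further evaluators here.
                .uniqueness(Uniqueness.NODE_PATH);
        return StreamSupport.stream(td.traverse(start).spliterator(), false)
                .map(PathResult::new);
    }
}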

Will Gremlin graph queries always perform operations in their own address space?

Admittedly, most of my database experience is relational. One of the tenets in that space is to avoid moving data over the network. This manifests in using something like:
select * from person order by last_name limit 10
which will presumably order and limit within the database engine, versus using something like:
select * from person
and subsequently ordering and taking the top 10 at the client, which could have disastrous effects if there are a million person records.
So, with Gremlin (from Groovy), if I do something like:
g.V().has('#class', 'Person').order{println('!'); it.a.last_name <=> it.b.last_name}[0..9]
I am seeing the ! printed, so I am assuming that this is bringing all Person records into the address space of my client prior to the order and limit steps, which is not the desired effect.
Do my options for processing queries entirely in the database engine become product-specific (e.g. for OrientDB, perhaps submit the query in its flavor of SQL), or is there something about Gremlin that I am missing?
If you want the implementer's query optimizer to kick in, you need to use as many Gremlin steps as possible and avoid pure Groovy/in-memory processing of your graph traversals.
You're most likely looking for something like this (as of TinkerPop v3.2.0):
g.V().has('#class', 'Person').order().by('last_name', incr).limit(10)
If you find yourself using lambdas, chances are often high that the same thing could be done with pure Gremlin steps. Favor Gremlin steps over lambdas (see the self-contained sketch after the documentation links below).
See TinkerPop v3.2.0 documentation:
Order By step
Limit step
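As a self-contained illustration (TinkerGraph stands in for OrientDB or any remote store, and the vertices and property values are made up), the step-based form keeps ordering and limiting inside the traversal, where a provider's strategies can translate them into its native query instead of pulling every Person record to the client:

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class OrderLimitExample {
  public static void main(String[] args) {
    GraphTraversalSource g = TinkerGraph.open().traversal();
    g.addV("Person").property("#class", "Person").property("last_name", "Smith").iterate();
    g.addV("Person").property("#class", "Person").property("last_name", "Adams").iterate();

    // Pure Gremlin steps -- no Groovy closures -- so a provider can optimize them.
    // (Order.incr matches TinkerPop 3.2.x; newer versions use Order.asc.)
    System.out.println(
        g.V().has("#class", "Person")
            .order().by("last_name", Order.incr)
            .limit(10)
            .values("last_name")
            .toList());
  }
}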
