It's certainly possible to view a Dask graph at any stage while holding onto the object. However, once .compute() is called on a Dask object, additional optimizations may be applied to the graph before the computation runs. Any optimizations applied at this stage affect how the computation is executed, but the resulting optimized graph is not necessarily attached to a corresponding Dask object available to the user. Is there a way to also view the final Dask graph that was actually used for the computation?
The graph is not easily accessible after it has been submitted.
If you are using the distributed scheduler you can inspect the state there after submission, but it is no longer in a form that matches the traditional graph specification.
The best option I can think of is to run the optimization yourself before computing and inspect that graph. It isn't guaranteed to be exactly the same as what the scheduler ultimately runs, but it is likely close.
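For example, here is a rough sketch of that workaround (the array computation is just a placeholder):

import dask
import dask.array as da

# Build an example collection; any Dask object works the same way.
x = da.random.random((1000, 1000), chunks=(100, 100))
y = (x + x.T).sum(axis=0)

# Apply the standard optimization passes yourself before computing,
# so you can inspect a graph that should be close to what actually runs.
(y_opt,) = dask.optimize(y)

print(dict(y_opt.__dask_graph__()))        # the optimized task graph as a mapping
y_opt.visualize(filename="optimized.png")  # or render it (requires graphviz)
result = y_opt.compute()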
I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and want to have a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any concerns around efficiency in batch mode that I should be aware of?
Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?
I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform uses KV pairs, and the parallelism you want to achieve will be limited by the number of keys. I suggest taking a look at this answer.
Regarding the batch size: yes, the final batch may be smaller if there are not enough elements left to fill it. Take a look at the examples in the Beam Python documentation.
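For illustration, a minimal sketch along those lines (not the documentation's exact example; the key and values are made up). The input must be keyed, and the final, smaller batch for the key is still emitted when the bounded input runs out:

import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([("shard", i) for i in range(25)])  # keyed input
     | "Batch" >> beam.GroupIntoBatches(10)                        # batches of at most 10 values
     | "Print" >> beam.Map(print))                                 # the last batch has only 5 values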
GroupIntoBatches will work. However, if you're running a batch pipeline and don't have a natural key on which to group (making up a random one will often result in batches that are too small, or parallelism that is too small and can interact poorly with liquid sharding), you should consider using BatchElements instead, which can batch without keys and can be configured with either a fixed or dynamic batch size.
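As a rough sketch of that alternative (the element values and the RPC stand-in are illustrative):

import apache_beam as beam

def send_rpc(batch):
    # Stand-in for the real third-party call; receives a plain list of elements.
    print(f"sending {len(batch)} elements")

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(range(100))
     | "Batch" >> beam.BatchElements(min_batch_size=10, max_batch_size=25)  # no key needed
     | "SendRPC" >> beam.Map(send_rpc))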
I know this is very specific to the environment running the code, but given that Dask calculates its execution plan into a DAG in advance, is there a way to understand how long that execution should take?
The progress bar is a great help once execution is running, but is it possible to estimate beforehand how long a series of operations should take?
Short Answer
No.
Explanation
The Dask scheduler just executes Python functions. It doesn't think about where they came from or the broader context of what they represent (for example, a dataframe join or matrix multiply). From its perspective it has just been asked to execute a graph of opaque function calls. This generality is a weakness (hard to perform high level analysis) but also Dask's main strength, because it can be applied to a broad variety of problems outside of any particular domain or specialty.
The distributed scheduler does maintain an exponentially weighted average of each function's duration, which could be used to build an estimate for a task graph's runtime. I would search the scheduler.py file for task_duration if you're interested in building this.
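As a very rough sketch of that idea (this pokes at internal, version-dependent scheduler state, so the attribute name and layout may differ in your version of distributed):

from dask.distributed import Client

client = Client()

def get_durations(dask_scheduler=None):
    # dask_scheduler is injected by run_on_scheduler; task_duration (if present)
    # maps task prefixes (e.g. 'sum') to their exponentially weighted average durations.
    return dict(getattr(dask_scheduler, "task_duration", {}))

durations = client.run_on_scheduler(get_durations)
# Combine these averages with the task counts in your own graph to guess a total runtime.
print(durations)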
I'm looking at creative ways to speed up training time for my neural nets and also maybe reduce vanishing gradients. I was considering breaking up the net onto different nodes, using classifiers on each node as backprop "boosters", and then stacking the nodes on top of each other with sparse connections between each node (as many as I can get away with without ethernet network saturation making it pointless). If I do this, I am uncertain whether I have to maintain some kind of state between nodes and train synchronously on the same example (which probably defeats the purpose of speeding up the process), or whether I can simply train on the same data but asynchronously. I think I can, and that the weight space can still be updated and propagated down my sparse connections between nodes even if they are training on different examples, but I'm not certain. Can someone confirm this is possible or explain why not?
It is possible to do what you suggest, however it is a formidable amount of work for one person to undertake. The most recent example that I'm aware of is the "DistBelief" framework, developed by a large research/engineering team at Google -- see the 2012 NIPS paper at http://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf.
Briefly, the DistBelief approach partitions the units in a neural network so that each worker machine in a cluster is responsible for some disjoint subset of the overall architecture. Ideally the partitions are chosen to minimize the amount of cross-machine communication required (i.e., a min-cut through the network graph).
Workers perform computations locally for their part of the network, and then send updates to the other workers as needed for links that cross machine boundaries.
Parameter updates are handled by a separate "parameter server." The workers send gradient computations to the parameter server, and periodically receive updated parameter values from the server.
The entire setup runs asynchronously and works pretty well. Due to the async nature of the computations, the parameter values for a given computation might be "stale," but they're usually not too far off. And the speedup is pretty good.
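As a toy, single-machine sketch of the asynchronous parameter-server pattern described above (not the DistBelief implementation; the gradient and data are placeholders):

import threading
import numpy as np

params = np.zeros(10)    # shared "parameter server" state
lock = threading.Lock()  # a real system shards parameters across separate server machines
lr = 0.01

def worker(shard):
    global params
    for x in shard:
        local = params.copy()     # pull parameters (possibly slightly stale)
        grad = 2 * (local - x)    # toy gradient for a squared-error loss
        with lock:
            params -= lr * grad   # push the update asynchronously

# Each worker trains on different examples, yet all updates land in the same weight space.
shards = [np.random.randn(100, 10) for _ in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()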
Is there a way to physically separate Neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted in Docker):
Match (b:User:Google)
This is the use case:
I want to store data for several clients under Neo4j, hopefully lots of them.
Now, I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer will forget the ":Partition1", for example, in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another client has a small amount of data, or if a "heavy" query of one client is currently running, I don't want other "light" queries of another client to suffer from slow performance.
In other words, storing everything under one node will, I think, run into scalability problems at some point in the future when I have more clients.
By the way, is it common to have a few clusters?
Also, what's the advantage of partitioning over creating a different label for each client? For example: Users_client_1, Users_client_2, etc.
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines, and then serve many requests against that copy quickly, but they don't partition a really huge graph so some of it is stored here, some other parts there, and then connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is, how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quoting the Wikipedia article on graph partitioning):
Typically, graph partition problems fall under the category of NP-hard problems. Solutions to these problems are generally derived using heuristics and approximation algorithms.[3] However, uniform graph partitioning or a balanced graph partition problem can be shown to be NP-complete to approximate within any finite factor.[1] Even for special graph classes such as trees and grids, no reasonable approximation algorithms exist,[4] unless P=NP. Grids are a particularly interesting case since they model the graphs resulting from Finite Element Model (FEM) simulations. When not only the number of edges between the components is approximated, but also the sizes of the components, it can be shown that no reasonable fully polynomial algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of relevant Facebook research. But to summarize and cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something the software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some sense; if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well... cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly because all of those partitions have to be traversed; that's exactly the performance situation you usually try to avoid by partitioning wisely in the first place.
If I am building a project applying the Lambda architecture now, should I split the batch layer and the serving layer, i.e. have program A do the batch layer's work and program B do the serving layer's? They would be physically independent but logically related, since program A can tell B to work after A finishes the pre-compute work.
If so, would you please tell me how to implement it? I am thinking about IPC. If IPC could help, what is the specific way?
BTW, what does "batch view" mean exactly? Why and how does the serving layer index it?
What is the best way to implement Lambda Architecture batch_layer and serving_layer? That totally depends upon the specific requirements, system environment, and so on. I can address how to design Lambda Architecture batch_layer and serving_layer, though.
Incidentally, I was just discussing this with a colleague yesterday, and this answer is based on that discussion. I will explain in three parts, and for the sake of this discussion let's say we are interested in designing a system that computes the most read stories (a) of the day, (b) of the week, and (c) of the year:
Firstly, in a lambda architecture it is important to divide the problem you are trying to solve with respect to time first and features second. If you model your data as an incoming stream, then the speed layer deals with the 'head' of the stream, e.g. the current day's data, while the batch layer deals with the 'head' plus the 'tail': the masterset.
Secondly, divide the features along these time-based lines. Some features can be computed using the 'head' of the stream alone, while other features require a wider breadth of data than the 'head', i.e. the masterset. In our example, let's say that we define the speed layer to compute one day's worth of data. Then the speed layer would compute the most read stories (a) of the day in the so-called Speed View, while the batch layer would compute the most read stories (a) of the day, (b) of the week, and (c) of the year in the so-called Batch View. Note that yes, there may appear to be a bit of redundancy, but hold on to that thought.
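As an illustrative sketch of that split (all names are made up; each read event is a (story_id, timestamp) pair):

from collections import Counter

def speed_view(recent_reads):
    # Computed over the 'head' of the stream only (e.g. today's events).
    return Counter(story_id for story_id, ts in recent_reads)

def batch_view(master_reads, start_of_day, start_of_week, start_of_year):
    # Recomputed in bulk over the masterset ('head' + 'tail').
    return {
        "day":  Counter(s for s, ts in master_reads if ts >= start_of_day),
        "week": Counter(s for s, ts in master_reads if ts >= start_of_week),
        "year": Counter(s for s, ts in master_reads if ts >= start_of_year),
    }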
Thirdly, the serving layer responds to queries from clients against the Speed View and the Batch View and merges the results accordingly. There will necessarily be overlap between the results from the Speed View and the Batch View. No matter: this divide of speed vs. batch, among other benefits, allows us to minimize exposure to risks such as (1) rolling out bugs, (2) corrupt data delivery, and (3) long-running batch processes. Ideally, issues will be caught in the Speed View and fixes applied prior to the Batch View re-compute. If so, then all is well and good.
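Continuing the sketch above, the serving-layer merge might look something like this (again purely illustrative; speed_view_after_cutoff should be computed only over reads newer than the time up to which the Batch View is current, so the overlapping window is not double-counted):

from collections import Counter

def most_read_today(batch_view_day, speed_view_after_cutoff, top_n=10):
    # Sum the batch-layer counts with the speed-layer counts for the
    # not-yet-batched window, then return the top stories.
    merged = Counter(batch_view_day) + Counter(speed_view_after_cutoff)
    return merged.most_common(top_n)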
In summary, no IPC needs to be used, since the two programs are completely independent of each other. Program A does not need to communicate with program B. Instead, the system relies upon some overlap of processing. For instance, if program B computes its Batch View on a daily basis, then program A needs to compute the Speed View for that day plus any additional time the batch processing may take. This extra time needs to include any downtime in the batch layer.
Hope this helps!
Notes:
Redundancy of the batch layer - it is necessary to have at least some redundancy in the batch layer since the serving layer must be able to provide a single cohesive view of results to queries. At the least, the redundancy may help avoid time-gaps in query responses.
Assessing which features are in the speed layer - this step will not always be as convenient as in the 'most read stories' example here. This is more of an art form.