My Combine process always runs on a single machine.
Is this process inherently difficult to parallelize?
The attached figure shows Dataflow's CPU usage; on the right side of the chart, the Combine step is running on only one machine.
The Combine operation is parallelized across multiple workers via an optimization called "combiner lifting." Say, for example, one is taking a sum, has three workers, and the data is distributed as follows
Worker 1: ("A", 1), ("A", 2), and ("B", 100)
Worker 2: ("A", 3), ("B", 200), and ("B", 300)
Worker 3: ("A", 4) and ("A", 5)
These workers will do pre-combining and emit partial sums per key, specifically
Worker 1: ("A", 3) and ("B", 100)
Worker 2: ("A", 3) and ("B", 500)
Worker 3: ("A", 9)
which then get shuffled to workers according to their key for a final combine
Worker 1: ("A", 3), ("A", 3), and ("A", 9) -> ("A", 15)
Worker 2: ("B", 100) and ("B", 500) -> ("B", 600)
Worker 3: other keys
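For reference, here is a minimal Python SDK sketch of the kind of pipeline this example corresponds to (the input data is made up to match the example above; in a real job the source spreads these elements across many workers):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # Elements as in the example above.
        | beam.Create([("A", 1), ("A", 2), ("B", 100),
                       ("A", 3), ("B", 200), ("B", 300),
                       ("A", 4), ("A", 5)])
        # CombinePerKey(sum) is eligible for combiner lifting: each worker
        # pre-combines its local values per key before the shuffle.
        | beam.CombinePerKey(sum)
        | beam.Map(print)  # expect ("A", 15) and ("B", 600)
    )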
If there are not very many keys (e.g. a global combine, which has exactly one key), the combining operation is expensive (e.g. its cost grows with the number of values added), and there are many upstream workers, this can result in limited parallelism for the final combine. Beam's combining transforms have a with_hot_key_fanout option (with_fanout for global combines) that can help in this case by introducing intermediate keys to do an intermediate level of combining before the final combine.
It's also possible that other operations that get fused into the same stage running this combine are expensive. A Reshuffle can be used to redistribute work if parallelism is possible after such a constrained stage.
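As a rough sketch of those two mitigations in the Python SDK (the fanout value of 16 is just an illustrative choice, not a recommendation):

import apache_beam as beam

# Hot-key fanout: adds an intermediate combining level for skewed keys.
per_key_sum = beam.CombinePerKey(sum).with_hot_key_fanout(16)

# Global combines have the same issue (exactly one key); with_fanout
# introduces intermediate partial combines before the final one.
global_sum = beam.CombineGlobally(sum).with_fanout(16)

# Reshuffle breaks fusion so downstream work can be redistributed
# across workers after a constrained stage.
redistribute = beam.Reshuffle()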
I have run into an issue: I want to eliminate the completed part of the task graph when handling an iterative problem with a large matrix. The minimal example code and the corresponding task graph are shown below.
from dask.distributed import Client
client = Client()
import numpy as np
import dask.array as da
x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
x = da.from_array(x, chunks=5)
def derivative(x):
    return x - np.roll(x, 1)

for i in range(10):
    y = x.map_overlap(derivative, depth=1, boundary='periodic')
    x = y.persist()
(figure: the corresponding task graph)
The task graph grows during the iterative process, and rebuilding the data from the initial array is not practical in this case once the data is missing. I want to eliminate the completed tasks from the graph and keep only the task graph of the ongoing loop iteration.
I tried to clear the dependencies of x in the for loop, but it did not work:
dsk = x.__dask_graph__()
dsk.dependencies={}
How exactly should I break the dependencies to cut away the unwanted part of the graph?
Thanks in advance!
import torch
import torch.nn as nn

input = torch.randn(8, 3, 50, 100)
m = nn.Conv2d(3, 3, kernel_size=(3, 3), padding=(1, 1))
m2 = nn.Conv2d(3, 3, kernel_size=3, padding=1)
output = m(input)
output2 = m2(input)
torch.equal(output, output2)  # False
I would expect m and m2 above to produce exactly the same output, but in practice they do not. What is the reason?
You have initialized two nn.Conv2d layers with identical settings, that's true. The initialization of the weights, however, is done randomly! You have two different layers here, m and m2. Namely, m.weight and m2.weight have different components, and the same goes for m.bias and m2.bias.
One way to get the same results is to copy the underlying parameters from one layer to the other:
>>> m.weight = m2.weight
>>> m.bias = m2.bias
Which, of course, results in torch.equal(m(input), m2(input)) being True.
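Here is a self-contained sketch of that idea; copying through the state dict works just as well and avoids reassigning the Parameter objects directly:

import torch
import torch.nn as nn

input = torch.randn(8, 3, 50, 100)
m = nn.Conv2d(3, 3, kernel_size=(3, 3), padding=(1, 1))
m2 = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Copy m's weight and bias into m2 so both layers hold the same values.
m2.load_state_dict(m.state_dict())

print(torch.equal(m(input), m2(input)))  # True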
The "problem" here isn't related to int vs tuple. In fact, if you print m and m2 you'll see
>>> m
Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
>>> m2
Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
that the integer got expanded as the documentation promises.
What actually differs is the initial weights, which I believe are just random. You can view them via m.weight and m2.weight. These will differ every time you create a new Conv2d, even if you use the same arguments.
You can initialize the weights if you want to play around with these objects in a predictable way, see
How to initialize weights in PyTorch?
e.g.
m.weight.data.fill_(0.01)
m2.weight.data.fill_(0.01)
m.bias.data.fill_(0.1)
m2.bias.data.fill_(0.1)
and they should now be identical.
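Another option (not mentioned above, just an assumption on my part that you want random-looking but reproducible weights) is to reset PyTorch's RNG seed before constructing each layer, so both draw identical random initial values:

import torch
import torch.nn as nn

torch.manual_seed(0)
m = nn.Conv2d(3, 3, kernel_size=(3, 3), padding=(1, 1))

torch.manual_seed(0)  # reset the seed so m2 draws the same random numbers
m2 = nn.Conv2d(3, 3, kernel_size=3, padding=1)

input = torch.randn(8, 3, 50, 100)
print(torch.equal(m(input), m2(input)))  # True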
I'm trying to identify strongly connected communities within a large group (an undirected weighted graph), or alternatively to identify the vertices that connect sub-groups (communities) which would otherwise be unrelated.
The problem is part of a broader Databricks solution, so Spark GraphX and GraphFrames are the first choice for solving it.
As you can see from the attached picture, I need to find vertex "X" as the point where the big contiguous group identified by the connected components algorithm (val result = g.connectedComponents.run()) can be split.
The strongly connected components method (for directed graphs only), triangle counting, and the LPA community detection algorithm are not suitable, even if all weights are the same, e.g. 1.
(Picture showing the point where the big group ST0 should be cut)
Similar logic is nicely described in the question "Cut in a Weighted Undirected Connected Graph", but only as a mathematical expression.
Thanks for any hint.
// Vertex DataFrame
val v = sqlContext.createDataFrame(List(
(1L, "A-1", 1), // "St-1"
(2L, "B-1", 1),
(3L, "C-1", 1),
(4L, "D-1", 1),
(5L, "G-2", 1), // "St-2"
(6L, "H-2", 1),
(7L, "I-2", 1),
(8L, "J-2", 1),
(9L, "K-2", 1),
(10L, "E-3", 1), // St-3
(11L, "F-3", 1),
(12L, "Z-3", 1),
(13L, "X-0", 1) // split point
)).toDF("id", "name", "myGrp")
// Edge DataFrame
val e = sqlContext.createDataFrame(List(
(1L, 2L, 1),
(1L, 3L, 1),
(1L, 4L, 1),
(1L, 13L, 5), // critical edge
(2L, 4L, 1),
(5L, 6L, 1),
(5L, 7L, 1),
(5L, 13L, 7), // critical edge
(6L, 9L, 1),
(6L, 8L, 1),
(7L, 8L, 1),
(12L, 10L, 1),
(12L, 11L, 1),
(12L, 13L, 9), // critical edge
(10L, 11L, 1)
)).toDF("src", "dst", "relationship")
val g = GraphFrame(v, e)
Betweenness centrality seems to be one of the algorithms fitting this problem. This method counts how many of the shortest paths connecting any pair of other vertices pass through each vertex.
As far as I know, GraphFrames does not have betweenness centrality, and its shortest-path method just provides the number of hops between vertices without listing the actual path. Using the bfs (breadth-first search) method can give us a reasonable approximation (note: bfs doesn't reflect distance/edge length either, and it treats every graph as directed):
Ensure each edge is defined in both directions so that bfs treats the graph as undirected
Declare a mutable structure (e.g. an ArrayBuffer) pathMembers with the following fields: [fromId, toId, pathId, vertexId]
For each vertex o in your graph g.vertices (outer loop)
For each vertex i in your graph g.vertices.filter($"id" < lit(o.id)) (inner loop - it only looks at i.id smaller than o.id, because shortestPath(o.id, i.id) is exactly the same as shortestPath(i.id, o.id) in an undirected graph)
apply val paths = g.bfs.fromExpr("id = " + o.id).toExpr("id = " + i.id).run()
transpose paths to collect all vertices on each path and store them in pathMembers
Calculate how many times each vertexId appears in each fromId, toId path (i.e. vertexId count divided by pathId count for each fromId, toId pair)
Sum up the results for each vertexId to obtain the betweenness centrality measure
Vertex "X" for the schema will get highest value. Value for vertices directly connected to "X" will drop. Difference will be highes if most of the groups cross-connected by "X" have comparable size.
Note: if your graph is so large the full Betweenness centrality algorithm will be prohibitively long, sub-set of pairs for shortest path calculation could be selected randomly. Sample size is compromise between acceptable processing time and probability choosing majority of pairs within single branch of the graph.
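Just to illustrate what the metric identifies, here is a small sketch using NetworkX on the driver rather than GraphFrames, so it is only a sanity check on the sample graph above, not the distributed approach described in the steps:

import networkx as nx

# Same edges as in the sample GraphFrame above (treated as undirected, unweighted).
edges = [(1, 2), (1, 3), (1, 4), (1, 13), (2, 4),
         (5, 6), (5, 7), (5, 13), (6, 9), (6, 8), (7, 8),
         (12, 10), (12, 11), (12, 13), (10, 11)]

G = nx.Graph()
G.add_edges_from(edges)

# Fraction of all-pairs shortest paths passing through each vertex.
bc = nx.betweenness_centrality(G)

# Vertex 13 ("X-0") should come out on top, since every path between
# the St-1, St-2 and St-3 groups has to go through it.
for node, score in sorted(bc.items(), key=lambda kv: -kv[1])[:3]:
    print(node, round(score, 3))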
import dask.distributed

def f(x, y):
    return x, y

client = dask.distributed.Client()
client.map(f, [(1, 2), (2, 3)])
Does not work.
[<Future: status: pending, key: f-137239e2f6eafbe900c0087f550bc0ca>,
<Future: status: pending, key: f-64f918a0c730c63955da91694fcf7acc>]
distributed.worker - WARNING - Compute Failed
Function: f
args: ((1, 2))
kwargs: {}
Exception: TypeError("f() missing 1 required positional argument: 'y'",)
distributed.worker - WARNING - Compute Failed
Function: f
args: ((2, 3))
kwargs: {}
Exception: TypeError("f() missing 1 required positional argument: 'y'",)
You do not quite have the signature right - perhaps the doc is not clear (suggestions welcome). Client.map() takes a variable number of iterables, one per positional argument of the function, not a single iterable of argument tuples. You should phrase this as
client.map(f, (1, 2), (2, 3))
or, if you wanted to stay closer to your original pattern
client.map(f, *[(1, 2), (2, 3)])
Ok, the documentation is definitely a bit confusing on this one. And I couldn't find an example that clearly demonstrated this problem. So let me break it down below:
def test_fn(a, b, c, d, **kwargs):
    return a + b + c + d + kwargs["special"]
futures = client.map(test_fn, *[[1, 2, 3, 4], (1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 4)], special=100)
output = [f.result() for f in futures]
# output = [104, 108, 112, 116]
futures = client.map(test_fn, [1, 2, 3, 4], (1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 4), special=100)
output = [f.result() for f in futures]
# output = [104, 108, 112, 116]
Things to note:
Doesn't matter if you use lists or tuples. And like I did above, you can mix them.
You have to group arguments by their position. So if you're passing in 4 sets of arguments, the first list will contain the first argument from all 4 sets. (In this case, the "first" call to test_fn gets a=b=c=d=1.)
Extra **kwargs (like special) are passed through to the function. But it'll be the same value for all function calls.
Now that I think about it, this isn't that surprising. I think it's just following Python's concurrent.futures.ProcessPoolExecutor.map() signature.
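So if you do start from a list of argument tuples (like the original [(1, 2), (2, 3)]), a small sketch of the conversion is simply to unzip it into one iterable per parameter:

from dask.distributed import Client

def f(x, y):
    return x + y

client = Client()

pairs = [(1, 10), (2, 20), (3, 30)]   # one tuple per intended call
xs, ys = zip(*pairs)                  # -> (1, 2, 3) and (10, 20, 30)
futures = client.map(f, xs, ys)       # calls f(1, 10), f(2, 20), f(3, 30)
print(client.gather(futures))         # [11, 22, 33]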
PS. Note that even though the documentation says "Returns: List, iterator, or Queue of futures, depending on the type of the inputs.", you can actually get this error: "Dask no longer supports mapping over Iterators or Queues. Consider using a normal for loop and Client.submit".
As the documentation states
the last state for each sample at index i in a batch will be used as
initial state for the sample of index i in the following batch
does it mean that, in order to split the data into batches, I need to do it the following way?
e.g. let's assume that I am training a stateful RNN to predict the next integer in range(0, 5) given the previous one
# batch_size = 3
# 0, 1, 2 etc in x are samples (timesteps and features omitted for brevity of the example)
x = [0, 1, 2, 3, 4]
y = [1, 2, 3, 4, 5]
batches_x = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
batches_y = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
then the state after learning on x[0, 0] will be the initial state for x[1, 0],
and the state after x[0, 1] will be the initial state for x[1, 1] (0 for 1, 1 for 2, etc.)?
Is this the right way to do it?
This is based on this answer, for which I performed some tests.
Stateful=False:
Normally (stateful=False), you have one batch with many sequences:
batch_x = [
[[0],[1],[2],[3],[4],[5]],
[[1],[2],[3],[4],[5],[6]],
[[2],[3],[4],[5],[6],[7]],
[[3],[4],[5],[6],[7],[8]]
]
The shape is (4,6,1). This means that you have:
1 batch
4 individual sequences = this is batch size and it can vary
6 steps per sequence
1 feature per step
Every time you train, whether you repeat this batch or pass a new one, the layer will see individual sequences. Every sequence is a unique entry.
Stateful=True:
When you go to a stateful layer, you are not going to pass individual sequences anymore. You are going to pass very long sequences divided into small batches, so you will need more batches:
batch_x1 = [
[[0],[1],[2]],
[[1],[2],[3]],
[[2],[3],[4]],
[[3],[4],[5]]
]
batch_x2 = [
[[3],[4],[5]], #continuation of batch_x1[0]
[[4],[5],[6]], #continuation of batch_x1[1]
[[5],[6],[7]], #continuation of batch_x1[2]
[[6],[7],[8]] #continuation of batch_x1[3]
]
Both shapes are (4,3,1). And this means that you have:
2 batches
4 individual sequences = this is batch size and it must be constant
6 steps per sequence (3 steps in each batch)
1 feature per step
The stateful layers are meant for huge sequences, long enough to exceed your memory or your available time for some task. You then slice your sequences and process them in parts. There is no difference in the results; the layer is not smarter and does not have additional capabilities. It just doesn't consider the sequences to have ended after it processes one batch. It expects the continuation of those sequences.
In this case, you decide yourself when the sequences have ended and call model.reset_states() manually.
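For example, here is a minimal sketch of that training loop with the toy batches above, assuming the tf.keras API; the layer size, number of epochs, and the targets (the next value after each slice) are arbitrary choices of mine:

import numpy as np
from tensorflow import keras

batch_x1 = np.array([[[0], [1], [2]], [[1], [2], [3]], [[2], [3], [4]], [[3], [4], [5]]], dtype="float32")
batch_y1 = np.array([[3], [4], [5], [6]], dtype="float32")  # next value after each slice
batch_x2 = np.array([[[3], [4], [5]], [[4], [5], [6]], [[5], [6], [7]], [[6], [7], [8]]], dtype="float32")
batch_y2 = np.array([[6], [7], [8], [9]], dtype="float32")

model = keras.Sequential([
    # batch size must be fixed when using stateful layers
    keras.Input(batch_shape=(4, 3, 1)),
    keras.layers.LSTM(16, stateful=True),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for epoch in range(10):
    # The state left by batch_x1 is used as the initial state for batch_x2.
    model.train_on_batch(batch_x1, batch_y1)
    model.train_on_batch(batch_x2, batch_y2)
    # The long sequences end here, so reset the state before the next epoch.
    model.reset_states()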