Dask distributed executes tasks sequentially

Dask distributed executes tasks sequentially - dask

I have a pipeline working with LocalCluster:
from distributed import Client
client = Client()
list_of_queries = [...] # say 1_000 queries
loaded_data = client.map(sql_data_loader, list_of_queries)
processed_data = client.map(data_processor, loaded_data)
writer_results = client.map(data_writer, processed_data)
results = client.gather(writer_results)
Everything works, but not quite as I would expect.
Looking at dashboard's status page I see somethings like this:
sql_data_loader 900 / 1000
data_processor 0 / 1000
data_writer 0 / 1000
I.e. tasks are executed sequentially as opposed to "in parallel". As a result data_processor does not start executing until all 1000 queries have been loaded. And data_writer waits until 'data_processor' finishes processing all its futures.
Based on previous experience with dask where dask.delayed was used instead of client.map
expected behavior would be something like:
sql_data_loader 50 / 1000
data_processor 10 / 1000
data_writer 5 / 1000
Is this a false expectation or is there something I am missing with how to set up pipeline to ensure behavior that would be similar to dask.delayed?

If you run the maps one after the other then everything should pipeline nicely.
There is some tension between two desired objectives:
Tasks should pipeline, as you like
Tasks that are submitted first should have higher priority
To balance between these two objectives Dask assigns policies based on the delay between calls. If two map calls happen right after each other then Dask assumes that they're part of the same computation, however if they are separated by a significant amount of time then Dask assumes that they are different computations, and so prioritizes the earlier tasks. You can control this with the fifo_timeout keyword
client.map(f, ..., fifo_timeout='10 minutes')
Here is the relevant documentation page
Here is an example showing the behavior you want if you bundle map calls together:
In [1]: from dask.distributed import Client
In [2]: client = Client(processes=False)
In [3]: import time
In [4]: def f(x):
...: time.sleep(0.1)
...: print('f', x)
...: return x
...:
In [5]: def g(x):
...: time.sleep(0.1)
...: print('g', x)
...: return x
...:
In [6]: futures = client.map(f, range(20))
...: futures = client.map(g, futures)
...:
In [7]: f 0
f 1
f 2
f 3
f 5
f 4
f 6
f 7
g 0
g 1
g 3
g 2
g 4
g 5
g 6
g 7
f 8
f 9
f 10
f 11
f 12
g 8
f 13
g 9
g 10
g 11
g 12
f 14
g 13
f 15
f 16
f 17
g 14
f 18
g 15
f 19
g 16
g 17
g 18
g 19

Related

Find the row of highest numbers from each of names or group who'd has been have a some of similarity of names then sumif their values group of names

I want to make the total of values every each member or names in every each their own group at the first match (or after blank space) or highest values positions of each them on column "D" according to column "B" with the result's row of an output like the exactly as an EXPECT OUTPUT as act of what I've just created on column "E". That's the replace a little bit down of just only one row from the column "B" positions or row must be the same as the column "C" and "D". Could we do this anyway ?
My achievements: I feel I've tried this before and got succeed to achieve this but I've forgot how to solve this when that happened. But it's look like kinda this code of formula:
=FILTER(IF(IFERROR(MATCH($B$3:$B;$B:$B;0);0)=ROW($B$3:$B);SUMIF($B$3:$B;$B$3:$B;$D$3:$D);"");$B$3:$B<>"0")
I don't know if I'm right or wrong but please see the table I'd created at the down below this and also see how I expected with that and feel free as well to edit to my doc file of google sheet I attached down below this.
THIS HERE YOU CAN EDIT TO MY SAMPLE G.SHEET TO SOLVE THIS MY QUIZ. THANKS IN ADVANCE!
A
B
C
D
E
1
2
N U M B
I D   -   M E M B E R
I D      -     C O D E
V A L U E S
E X P E C T     O U T P U T
3
4
4
JYFI7
5
JYFI7
J3573
3
6
6
JYFI7
IYR
1
7
JYFI7
F498S
2
8
9
3
DFJ9F11
10
DFJ9F11
C684J
7
8
11
DFJ9F11
J58
1
12
13
2
H684K
14
H684K
JF585
2
2
15
16
1
FJSR
17
FJSR
4684
7
16
18
FJSR
834
1
19
FJSR
49
2
20
FJSR
9835
6

Here's a possible solution:
=ARRAYFORMULA(LAMBDA(cusum,IF(SCAN(,cusum,
LAMBDA(acc,cur,if(cur="",,acc+1)))=1,cusum,))
(SORT(SCAN(,SORT(D3:D,ROW(D3:D),0),
LAMBDA(acc,cur,if(cur="",,acc+cur))),ROW(D3:D),0)))
You can find it in tab 'z' cell F3.

Clustering to achieve heterogeneous groups

I want to group 100 users based on a categorical variable (which can be low, medium, or high). The group size should be 3. I want to get the maximal heterogeneity within groups, assuming that users are distributed equally. I wonder if I can use some clustering algorithm to group based on the dissimilarity? Any suggestions?

I don't believe you need a clustering algorithm to group the data based upon a categorical variable.
Based on you question, I think this should work.
# Code
from sklearn.model_selection import train_test_split
group1, group23 = train_test_split(data, test_size=2/3., stratify=data['lab'])
group2, group3 = train_test_split(group23, test_size=1/2., stratify=group23['lab'])
Stratify makes sure that the maximum heterogeneity is maintained for the given categorical value.
# Sample output
print(data)
val1 val2 lab
0 1 1 L
1 2 2 L
2 3 3 L
3 4 4 M
4 5 5 M
5 6 6 M
6 7 7 H
7 8 8 H
8 9 9 H
print(group1)
val1 val2 lab
4 5 5 M
1 2 2 L
6 7 7 H
print(group2)
val1 val2 lab
8 9 9 H
2 3 3 L
3 4 4 M
print(group3)
val1 val2 lab
0 1 1 L
7 8 8 H
5 6 6 M
train_test_split() Documentation

When re-inserting into queue - Huffman Code

Example
3 2 5 5
a b c d
Joining first two
5 | 5 5
3 2 | c d
a b |
I have to put the new tree of five into the queue
Am I obligated to put it in the end like this:
5 5 5
c d / \
3 2
a b
Or can I put it in the beginning:
5 5 5
3 2 c d
a b
Or even in the middle of 'c' and 'd'
Is it my choice or is there a rule?

It's not your choice, the Queue needs to be sorted at all times (by it's number of occurrences and in case of equal number of occurrences by the depth of the tree). So it needs to be inserted where it belongs into the order.
This is needed to pick the sub-trees with the least amount of occurrences and if there is choice the most shallow one of them by simply pop-ing them.
If you simply resort after every insertion (this is inefficient and should not be done) the position obviously doesn't matter.

Yes, it's your choice. Whichever way you will get an optimal Huffman code, even though two resulting codes can be manifestly different.
You can get:
a - 00
b - 01
c - 10
d - 11
or you can get:
a - 111
b - 110
c - 10
d - 0
Now if I multiply the number of bits in each symbol times the number of occurrences, I get for the first code: 2*3 + 2*2 + 2*5 + 2*5 = 30 bits. For the second code: 3*3 + 3*2 + 2*5 + 1*5 = 30 bits. So both codes will code the original message to exactly 30 bits.

How to predict multi-label dataset using svm

I'm using a dataset with all decimal values and timestamp which has the following features :
1. sno
2. timestamp
3. v1
4. v2
5. v3
I've the data for 5 months with timestamps for every minute. I need to predict if v1, v2 ,v3 is being used at any time in the future. The values of v1,v2,v3 are between 0 to 25.
How can I do this ?
I've used binary classification before but I've no clue how to process with the multi-label problem to predict. I've used the code below all the time . How should I train the model and how should I use v1,v2,v3 to fit into 'y'?
X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2)
Data:
sno power voltage v1 v2 v3 timestamp
1 3.74 235.24 0 16 18 2006-12-16 18:03:00
2 4.928 237.14 0 37 16 2006-12-16 18:04:00
3 6.052 236.73 0 37 17 2006-12-16 18:05:00
4 6.752 237.06 0 36 17 2006-12-16 18:06:00
5 6.474 237.13 0 37 16 2006-12-16 18:07:00
6 6.308 235.84 0 36 17 2006-12-16 18:08:00
7 4.464 232.69 0 37 16 2006-12-16 18:09:00
8 3.396 230.98 0 22 18 2006-12-16 18:10:00
9 3.09 232.21 0 12 17 2006-12-16 18:11:00
10 3.73 234.19 0 27 17 2006-12-16 18:12:00
11 2.308 234.96 0 1 17 2006-12-16 18:13:00
12 2.388 236.66 0 1 17 2006-12-16 18:14:00
13 4.598 235.84 0 20 17 2006-12-16 18:15:00
14 4.524 235.6 0 9 17 2006-12-16 18:16:00
15 4.202 235.49 0 1 17 2006-12-16 18:17:00

Following the documentation:
The multiclass support is handled according to a one-vs-one scheme (and should thus support one-vs-all strategy).
one-vs-one strat
The one-vs-one scheme basically refers to using a classifier per pair of classes. At a prediction stage, the class that receives the most votes (the outputs of the each classifier) is eventually selected as a prediction. If such a voting has a tie, i.e. having two classes with an equal amount of votes, then the classification confidence plays a role.
To use SVM with such a scheme, one should go:
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsOneClassifier(estimator=subclf)
clf.fit()
one-vs-rest strat
The other way around would be to use a one-vs-all strategy. This strategy fits a classifier per class and against all other classes in the data. It is more popular than the first scheme as it is fairly easier to interpert the results, and the computational time is much weaker. It is as simple to use as the first example:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
...
subclf = SVC(**params)
clf = OneVsRestClassifier(estimator=subclf)
clf.fit()
To read more about multi-label classification and learning proceed here
Aftermath variable coding
So, the basic idea is to instantiate a complex (i.e. multi-label) target variable in a way that:
y equals to 0 if v1 v2 v3 are zeros
y equals to 1 if either v1 or v2 or v3 is one
y equals to 2 if either v1 v2 or v1 v3 or v2 v3 are ones
y equals to 3 if v1 v2 v3 are ones
The workaround may be the following:
import numpy as np
y = []
for i, j, k in zip(data['v1'], data['v2'], data['v3']):
if i and j and k > 0:
y.append(3)
elif i and j or i and k or j and k > 0:
y.append(2)
elif i or j or k > 0:
y.append(1)
else:
y.append(0)

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick to that name.
The current method used by the system I'm on is K-means, but that seems like overkill.
Is there a better way of performing this task?
Answers to some other posts are mentioning KDE (Kernel Density Estimation), but that is a density estimation method, how would that work?
I see how KDE returns a density, but how do I tell it to split the data into bins?
How do I have a fixed number of bins independent of the data (that's one of my requirements) ?
More specifically, how would one pull this off using scikit learn?
My input file looks like:
str ID sls
1 10
2 11
3 9
4 23
5 21
6 11
7 45
8 20
9 11
10 12
I want to group the sls number into clusters or bins, such that:
Cluster 1: [10 11 9 11 11 12]
Cluster 2: [23 21 20]
Cluster 3: [45]
And my output file will look like:
str ID sls Cluster ID Cluster centroid
1 10 1 10.66
2 11 1 10.66
3 9 1 10.66
4 23 2 21.33
5 21 2 21.33
6 11 1 10.66
7 45 3 45
8 20 2 21.33
9 11 1 10.66
10 12 1 10.66

Write code yourself. Then it fits your problem best!
Boilerplate: Never assume code you download from the net to be correct or optimal... make sure to fully understand it before using it.
%matplotlib inline
from numpy import array, linspace
from sklearn.neighbors.kde import KernelDensity
from matplotlib.pyplot import plot
a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,50)
e = kde.score_samples(s.reshape(-1,1))
plot(s, e)
from scipy.signal import argrelextrema
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print "Minima:", s[mi]
print "Maxima:", s[ma]
> Minima: [ 17.34693878 33.67346939]
> Maxima: [ 10.20408163 21.42857143 44.89795918]
Your clusters therefore are
print a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]]
> [10 11 9 11 11 12] [23 21 20] [45]
and visually, we did this split:
plot(s[:mi[0]+1], e[:mi[0]+1], 'r',
s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
s[mi[1]:], e[mi[1]:], 'b',
s[ma], e[ma], 'go',
s[mi], e[mi], 'ro')
We cut at the red markers. The green markers are our best estimates for the cluster centers.

There is a little error in the accepted answer by #Has QUIT--Anony-Mousse (I can't comment nor suggest an edit due my reputation).
The line:
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
Should be edited into:
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])
That's because mi and ma is an index, where s[mi] and s[ma] is the value. If you use mi[0] as the limit, you risk and error splitting if your upper and lower linspace >> your upper and lower data. For example, run this code and see the difference in split result:
import numpy as np
from numpy import array, linspace
from sklearn.neighbors import KernelDensity
from matplotlib.pyplot import plot
from scipy.signal import argrelextrema
a = array([10,11,9,23,21,11,45,20,11,12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = linspace(0,100)
e = kde.score_samples(s.reshape(-1,1))
mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print('Grouping by HAS QUIT:')
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])
print('Grouping by yasirroni:')
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a < s[mi][1])], a[a >= s[mi][1]])
result:
Grouping by Has QUIT:
[] [10 11 9 11 11 12] [23 21 45 20]
Grouping by yasirroni:
[10 11 9 11 11 12] [23 21 20] [45]

Further improving the responses above by #yasirroni, to dynamically print all clusters (not just 3 from the above) the line:
print(a[a < s[mi][0]], a[(a >= s[mi][0]) * (a <= s[mi][1])], a[a >= s[mi][1]])
can be changed into:
print(a[a < s[mi][0]]) # print most left cluster
# print all middle cluster
for i_cluster in range(len(mi)-1):
print(a[(a >= s[mi][i_cluster]) * (a <= s[mi][i_cluster+1])])
print(a[a >= s[mi][-1]]) # print most right cluster
This would ensure that all the clusters are taken into account.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Dask distributed executes tasks sequentially - dask

Related

Find the row of highest numbers from each of names or group who'd has been have a some of similarity of names then sumif their values group of names

Clustering to achieve heterogeneous groups

When re-inserting into queue - Huffman Code

How to predict multi-label dataset using svm

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

Categories

Resources