Best way to parallelize computation over dask blocks that do not return np arrays? - dask

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() can get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then made the fix general! Building off of her comments, I'm including an updated version of the code. Also, in response to Anna's interest in memory usage, I verified that this does not seem to take up more memory than naively expected.
import numpy as np
import pandas as pd
import dask
import dask.array as da
import dask.dataframe as dd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    coordinates = np.stack(np.nonzero(chunk)).T
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]
def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect'] * 5))
df.compute().reset_index()
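One refinement worth noting: as discussed in the original post below, dd.from_delayed infers metadata by computing one partition unless you pass meta explicitly. A sketch of how the updated version could avoid that inference pass (assuming the coordinate columns end up as int64, which np.nonzero-derived data should produce):

# empty frame carrying the expected columns and (assumed) integer dtypes
df_meta = pd.DataFrame(columns=['r', 'c', 'z', 'y', 'x'], dtype=np.int64)
df_meta.index.name = 'feature'
# then, inside my_overlap_generalized:
#     return dd.from_delayed(dfs, meta=df_meta)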
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it relies on low-level information that could change (e.g. the block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1, -1, -1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data,
                                                 depth=(0, 0, 0, depth_xy, depth_xy),
                                                 boundary={3: 'reflect', 4: 'reflect'})
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)
    # All computation is delayed, so downstream computations need to know the format of
    # the data. If the meta information is not specified, a single computation will be
    # done (which could be expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']
    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'
    return dd.from_delayed(dfs, meta=df_meta)
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y + offsets[3] - depth_xy,
                       'x': x + offsets[4] - depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df
data = np.zeros((2,4,8,16,16)) # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()

First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
It returns only points 2 and 4, each of them twice.
This happens because of three mistakes in the extract_features function:
1. The offset is added and the depth subtracted before the overlapping parts are filtered out; the order needs to be swapped.
2. df.y > depth_xy should be replaced with df.y >= depth_xy.
3. df.z should be replaced with df.x, since it is the x dimension that has an overlap.
To take this even further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
                       np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]
def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect'] * len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()

Related

shape of input to calculate information gain

I want to calculate the information gain on the 20_newsgroup data set.
I am using the code here (I also put a copy of the code below the question).
As you see, the input to the algorithm is X, y.
My confusion is that X is a matrix with documents in rows and features as columns (according to 20_newsgroup it is 11314 x 1000, in case I only consider 1000 features).
But according to the concept of information gain, the gain should be calculated for each feature.
(So I was expecting to see the code loop through each feature, so that the input to the function would be a matrix where rows are features and columns are classes.)
But X here is not features; X stands for documents, and I cannot see the part of the code that takes care of this (I mean considering each document, and then going through each feature of that document; like looping through rows but at the same time looping through the columns, as the features are stored in columns).
I have read this and this and many similar questions, but they are not clear in terms of input matrix shape.
This is the code for reading 20_newsgroup:

newsgroup_train = fetch_20newsgroups(subset='train')
X, y = newsgroup_train.data, newsgroup_train.target
cv = CountVectorizer(max_df=0.99, min_df=0.001, max_features=1000,
                     stop_words='english', lowercase=True, analyzer='word')
X_vec = cv.fit_transform(X)

X_vec.shape is (11314, 1000), which is documents by features, not features by classes. Am I calculating information gain in an incorrect way?
This is the code for Information gain:
def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)
    return np.asarray(information_gain)
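For reference, here is how this would be called on the vectorized data from above (my own usage sketch, not part of the linked code):

ig = information_gain(X_vec, y)
print(len(ig))  # one score per feature, up to the last feature with any nonzero count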
Well, after going through the code in detail, I learned more about X.T.nonzero().
It is correct that information gain needs to loop through features, and it is also correct that the matrix scikit-learn gives us here is document-by-feature.
But:
the code uses X.T.nonzero(), which returns the indices of all nonzero values of the transposed (feature-by-document) matrix, and then in the next row loops through the length of that index array with range(0, len(nz[0])).
Overall, X.T.nonzero()[0] returns the feature index of every nonzero entry, grouped feature by feature, so the code does handle one feature at a time. :)
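To make the mechanics concrete, here is a tiny self-contained demo of what X.T.nonzero() returns (toy matrix of my own choosing):

import numpy as np

# 3 documents x 2 features
X = np.array([[1, 0],
              [0, 2],
              [3, 4]])

nz = X.T.nonzero()
# After transposing, rows are features, so nz[0] holds feature indices
# (grouped together) and nz[1] holds the matching document indices.
print(nz[0])  # [0 0 1 1] -> feature 0 appears in two documents, then feature 1
print(nz[1])  # [0 2 1 2] -> the documents containing each occurrence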

How to find accumulator matrix for line in an image?

I am a newbie in the fields of CV and IP. I was writing the Hough transform algorithm for finding lines. I cannot figure out what is wrong with this code, in which I am trying to compute the accumulator array:
numRowsInBW = size(BW,1);
numColsInBW = size(BW,2);
% length of the diagonal of the image
D = sqrt((numRowsInBW - 1)^2 + (numColsInBW - 1)^2);
% number of rows in the accumulator array
nrho = 2*(ceil(D/rhoStep)) + 1;
% number of cols in the accumulator array
ntheta = length(theta);
H = zeros(nrho, ntheta);
% a value of 1 means the particular pixel is white,
% i.e. an edge pixel
[allrows allcols] = find(BW == 1);
for i = (1 : size(allrows))
    y = allrows(i);
    x = allcols(i);
    for th = (1 : 180)
        d = floor(x*cos(th) - y*sin(th));
        H(d+floor(nrho/2),th) += 1;
    end
end
I am applying this to a simple test image, but the accumulator I get does not match the one I expected (the example images are not reproduced here).
I am not able to find the mistake. Please help me. Thanks in advance.
There are several issues with your code. The main issue is here:
ntheta = length(theta);
% ...
for i = (1 : size(allrows))
    % ...
    for th = (1 : 180)
        d = floor(x*cos(th) - y*sin(th));
        % ...
th seems to be an angle in degrees, but cos(th) interprets its argument as radians, so cos(th) is meaningless here. Instead, use cosd and sind.
Another issue is that th iterates from 1 to 180, but there is no guarantee that ntheta is 180. So, loop as follows instead:
for i = 1 : size(allrows)
    % ...
    for j = 1 : numel(theta)
        th = theta(j);
        % ...
and use th as the angle, and j as the index into H.
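If it helps to sanity-check the indexing outside MATLAB, here is a rough NumPy sketch of the same accumulator loop with these fixes applied. Note that it uses the common rho = x*cos(theta) + y*sin(theta) parameterization, and the function and variable names are mine:

import numpy as np

def hough_accumulator(BW, theta_deg, rho_step=1.0):
    """Vote in H[rho_index, theta_index] for every white pixel in BW."""
    D = np.sqrt((BW.shape[0] - 1) ** 2 + (BW.shape[1] - 1) ** 2)
    nrho = 2 * int(np.ceil(D / rho_step)) + 1
    H = np.zeros((nrho, len(theta_deg)))
    t = np.deg2rad(theta_deg)          # convert degrees to radians before cos/sin
    ys, xs = np.nonzero(BW)            # edge (white) pixel coordinates
    for y, x in zip(ys, xs):
        rho = x * np.cos(t) + y * np.sin(t)
        idx = np.round(rho / rho_step).astype(int) + nrho // 2
        H[idx, np.arange(len(theta_deg))] += 1
    return H

H = hough_accumulator(np.eye(8), np.arange(180))  # toy 8x8 diagonal "image"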
Finally, given your image and your expected output, you should apply some edge detection first (Canny, for example). Maybe you already did this?

Is an admissible heuristic always monotone (consistent)?

For the A* search algorithm, given a heuristic h, suppose h is admissible.
That is:
h(n) ≤ h*(n) for every node n, where h* is the real cost from n to the goal.
Does this ensure the heuristic is monotone?
That is:
f(n) ≤ g(n') + h(n') for every successor n' of n, where f(n) = h(n) + g(n) and g(n) is the accumulated cost.
No.
Assume you have a chain of states s1 -> s2 -> s3 -> g, where g is the goal state.
s1 is the starting node, and the step costs are c(s1, s2) = 1, c(s2, s3) = 2, c(s3, g) = 3.
Consider also the following values for h(s) and h*(s) (i.e. the true cost):
h(s1) = 3 , h*(s1) = 6
h(s2) = 4 , h*(s2) = 5
h(s3) = 1 , h*(s3) = 3
h(g) = 0 , h*(g) = 0
Following the only path to the goal we have:
g(s1) = 0, g(s2) = 1, g(s3) = 3, g(g) = 6, coinciding with the true costs above.
Although the heuristic function is admissible (h(s) ≤ h*(s) everywhere), f(n) is not monotonic. For instance f(s2) = h(s2) + g(s2) = 4 + 1 = 5, while f(s3) = h(s3) + g(s3) = 1 + 3 = 4, so f(s2) > f(s3) even though s3 is a successor of s2. Equivalently, h(s2) = 4 > c(s2, s3) + h(s3) = 2 + 1 = 3, which violates consistency.
Of course this means you have a quite uninformative heuristic.
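A quick numeric check of the example above (plain Python, values copied from the answer):

h      = {'s1': 3, 's2': 4, 's3': 1, 'g': 0}
h_star = {'s1': 6, 's2': 5, 's3': 3, 'g': 0}
g      = {'s1': 0, 's2': 1, 's3': 3, 'g': 6}

# admissible: h(n) <= h*(n) for every node
assert all(h[n] <= h_star[n] for n in h)

# but f = g + h decreases along the path, so it is not monotone
f = {n: g[n] + h[n] for n in h}
print(f)  # {'s1': 3, 's2': 5, 's3': 4, 'g': 6} -- f(s2) > f(s3)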

Attempt to get numbers instead of NaN values when printing

I have a simple program with a for loop where I calculate some value that I print to the screen, but only the first value is printed; the rest are just NaN values. Is there any way to fix this? I suppose the numbers might have a lot of decimals, thus the NaN issue.
Output from program:
0.18410
NaN
NaN
NaN
NaN
etc.
This is the code, maybe it helps:
for i=1:30
    t = (100*i)*1.1*0.5;
    b = factorial(round(100*i)) / (factorial(round((100*i)-t)) * factorial(round(t)));
    % binomial distribution
    d = b * 0.5^(t) * 0.5^(100*i-(t));
    % cumulative
    p = binocdf(1.1 * (100*i) * 0.5, 100*i, 0.5);
    % >= AT LEAST
    result = 1 - p + d;
    disp(result);
end
You could do the calculation of the fraction yourself. To do so, you need to calculate d directly: collect all the values of the numerators and the denominators and multiply them by hand, making sure the intermediate result never gets too big. The following code is poor in terms of speed and memory, but it may be a good start:
for i=1:30
    t = (55*i);
    b = factorial(100*i) / (factorial(100*i-t) * factorial(t));
    % binomial distribution
    d = b * 0.5^(t) * 0.5^(100*i-(t));
    numerators = 1:(100*i);
    denominators = [1:(100*i-t), 1:55*i, ones(1,100*i)*2];
    value = 1;
    while length(numerators) > 0 || length(denominators) > 0
        if length(numerators) == 0
            value = value/denominators(1);
            denominators(1) = [];
        elseif length(denominators) == 0
            value = value*numerators(1);
            numerators(1) = [];
        elseif value > 10000
            value = value/denominators(1);
            denominators(1) = [];
        else
            value = value*numerators(1);
            numerators(1) = [];
        end
    end
    % cumulative
    p = binocdf(1.1 * (100*i) * 0.5, 100*i, 0.5);
    % >= AT LEAST
    result = 1 - p + value;
    disp(result);
end
output:
0.1841
0.0895
0.0470
0.0255
0.0142
0.0080
0.0045
...
Take a look at the documentation of factorial:
Note that the factorial function grows large quite quickly, and
even with double precision values overflow will occur if N > 171.
For such cases consider 'gammaln'.
On your second iteration you are already calling factorial(200), which returns Inf, and then Inf/Inf returns NaN.
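Following the documentation's pointer to gammaln, the binomial term can be computed in log space so the factorials never overflow. A minimal sketch of the idea in Python/SciPy (the same gammaln trick applies in MATLAB):

import numpy as np
from scipy.special import gammaln

def log_binom_pmf(k, n, p=0.5):
    """log( nchoosek(n, k) * p^k * (1-p)^(n-k) ), computed without overflow."""
    log_nck = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    return log_nck + k * np.log(p) + (n - k) * np.log(1 - p)

for i in range(1, 31):
    n = 100 * i
    t = round(n * 1.1 * 0.5)          # same t as in the question
    d = np.exp(log_binom_pmf(t, n))   # finite for every i, no Inf/Inf
    print(d)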

Codility: Passing cars in Lua

I'm currently practicing programming problems and out of interest, I'm trying a few Codility exercises in Lua. I've been stuck on the Passing Cars problem for a while.
Problem:
A non-empty zero-indexed array A consisting of N integers is given. The consecutive elements of array A represent consecutive cars on a road.
Array A contains only 0s and/or 1s:
0 represents a car traveling east,
1 represents a car traveling west.
The goal is to count passing cars. We say that a pair of cars (P, Q), where 0 ≤ P < Q < N, is passing when P is traveling to the east and Q is traveling to the west.
For example, consider array A such that:
A[0] = 0
A[1] = 1
A[2] = 0
A[3] = 1
A[4] = 1
We have five pairs of passing cars: (0, 1), (0, 3), (0, 4), (2, 3), (2, 4).
Write a function:
function solution(A)
that, given a non-empty zero-indexed array A of N integers, returns the number of pairs of passing cars.
The function should return −1 if the number of pairs of passing cars exceeds 1,000,000,000.
For example, given:
A[0] = 0
A[1] = 1
A[2] = 0
A[3] = 1
A[4] = 1
the function should return 5, as explained above.
Assume that:
N is an integer within the range [1..100,000];
each element of array A is an integer that can have one of the following values: 0, 1.
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(1), beyond input storage (not counting the storage required for input arguments).
Elements of input arrays can be modified.
My attempt in Lua keeps failing but I can't seem to find the issue.
local function solution(A)
    local zeroes = 0
    local pairs = 0
    for i = 1, #A do
        if A[i] == 0 then
            zeroes = zeroes + 1
        else
            pairs = pairs + zeroes
            if pairs > 1e9 then
                return -1
            end
        end
    end
    return pairs
end
In terms of the time-space complexity constraints, I think it should pass, but I can't seem to find the issue. What am I doing wrong? Any advice or tips to make my code more efficient would be appreciated.
FYI: I keep getting a result of 2 when the desired example result is 5.
The problem statement says A is 0-based, so if we ignore element 0 and start at index 1, the output is 2 instead of 5. 0-based tables should be avoided in Lua; they go against convention and will lead to a lot of off-by-one errors: for i = 1, #A do will not do what you want on a 0-based table.
function solution1based(A)
    local zeroes = 0
    local pairs = 0
    for i = 1, #A do
        if A[i] == 0 then
            zeroes = zeroes + 1
        else
            pairs = pairs + zeroes
            if pairs > 1e9 then
                return -1
            end
        end
    end
    return pairs
end
print(solution1based{0, 1, 0, 1, 1}) -- prints 5 as you wanted
function solution0based(A)
    local zeroes = 0
    local pairs = 0
    for i = 0, #A do
        if A[i] == 0 then
            zeroes = zeroes + 1
        else
            pairs = pairs + zeroes
            if pairs > 1e9 then
                return -1
            end
        end
    end
    return pairs
end
print(solution0based{[0]=0, [1]=1, [2]=0, [3]=1, [4]=1}) -- prints 5
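For comparison, the same prefix-count idea written in Python, where 0-based indexing sidesteps the pitfall entirely (my own sketch, not part of the original answer):

def solution(A):
    zeroes = 0   # eastbound cars seen so far
    pairs = 0
    for a in A:
        if a == 0:
            zeroes += 1
        else:                 # a westbound car passes every eastbound car seen so far
            pairs += zeroes
            if pairs > 1_000_000_000:
                return -1
    return pairs

print(solution([0, 1, 0, 1, 1]))  # 5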
