Related
I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found a combination of da.overlap.overlap and to_delayed().ravel() as able to get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to a #AnnaM who caught bugs in the original post and then made it general! Building off of her comments, I'm including an updated version of the code. Also, in responding to Anna's interest in memory usage, I verified that this does not seem to take up more memory than naively expected.
def extract_features_generalized(chunk, offsets, depth, columns):
shape = np.asarray(chunk.shape)
offsets = np.asarray(offsets)
depth = np.asarray(depth)
coordinates = np.stack(np.nonzero(chunk)).T
keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
data = coordinates + offsets - depth
df = pd.DataFrame(data=data, columns=columns)
return df[keep]
def my_overlap_generalized(data, chunksize, depth, columns, boundary):
data = data.rechunk(chunksize)
data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
dfs = []
for block in data_overlapping_chunks.to_delayed().ravel():
offsets = np.array(block.key[1:]) * np.array(data.chunksize)
df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
depth=depth, columns=columns)
dfs.append(df_block)
return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr,
chunksize=(-1,-1,-1,8,8),
depth=(0,0,0,2,2),
columns=['r', 'c', 'z', 'y', 'x'],
boundary=tuple(['reflect']*5))
df.compute().reset_index()
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?
def my_overlap(data, chunk_xy, depth_xy):
data = data.rechunk((-1,-1,-1, chunk_xy, chunk_xy))
data_overlapping_chunks = da.overlap.overlap(data,
depth=(0,0,0,depth_xy,depth_xy),
boundary={3: 'reflect', 4: 'reflect'})
dfs = []
for block in data_overlapping_chunks.to_delayed().ravel():
offsets = np.array(block.key[1:]) * np.array(data.chunksize)
df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
dfs.append(df_block)
# All computation is delayed, so downstream comptutions need to know the format of the data. If the meta
# information is not specified, a single computation will be done (which could be expensive) at this point
# to infer the metadata.
# This empty dataframe has the index, column, and type information we expect in the computation.
columns = ['r', 'c', 'z', 'y', 'x']
# The dtypes are float64, except for a small number of columns
df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
df_meta.index.name = 'feature'
return dd.from_delayed(dfs, meta=df_meta)
def extract_features(chunk, offsets, depth_xy):
r, c, z, y, x = np.nonzero(chunk)
df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y+offsets[3]-depth_xy,
'x': x+offsets[4]-depth_xy})
df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
(df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
return df
data = np.zeros((2,4,8,16,16)) # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
r, c, z, y, x = np.nonzero(chunk)
df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
(df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
df['y'] = df['y'] + offsets[3] - depth_xy
df['x'] = df['x'] + offsets[4] - depth_xy
return df
The updated code now returns the indices that were set to 1:
index r c z y x
0 0 0 0 4 2 2
1 1 0 1 4 6 2
2 2 0 3 4 2 2
3 1 1 2 4 8 2
For comparison, this is the output of the original version:
index r c z y x
0 1 0 1 4 6 2
1 3 1 2 4 8 2
2 0 0 1 4 6 2
3 1 1 2 4 8 2
It returns lines number 2 and 4, two times each.
The reason why this happens is three mistakes in the extract_features function:
You first add the offset and subtract the depth and then filter out the overlapping parts: the order needs to be swapped
df.y > depth_xy should be replaced with df.y >= depth_xy
df.z should be replaced with df.x, since it is the x dimension that has an overlap
To optimize this even further, here is a generalized version of the code that would work for an arbitrary number of dimension:
def extract_features_generalized(chunk, offsets, depth, columns):
coordinates = np.nonzero(chunk)
df = pd.DataFrame()
rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
for i in range(len(columns)):
df[columns[i]] = coordinates[i]
rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
del coordinates
return df[rows_to_keep > 0]
def my_overlap_generalized(data, chunksize, depth, columns):
data = data.rechunk(chunksize)
data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
boundary=tuple(['reflect']*len(columns)))
dfs = []
for block in data_overlapping_chunks.to_delayed().ravel():
offsets = np.array(block.key[1:]) * np.array(data.chunksize)
df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
depth=depth, columns=columns)
dfs.append(df_block)
return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1,-1,-1,8,8),
depth=(0,0,0,2,2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()
OK, so I am wanting to do a sumif on a column on an array because I don't want to print to the worksheet in order to obtain a range type for a Worksheet.function.Sumif, the idea is to stay completely out in VBA code and write to the worksheet as little as possible. I am trying to optimize for speed: 4814 rows by 40 columns X 60.
The first column total is total of 197,321,164 is correct, the next columns are low and instead of going to quarter 40 the Else kicks in and everything after 8 is 0. The first "To_Quarter" in the array is 9 so with the >= I would think it would go to 9. I tried putting my Next I before the End IF but then it just asks for the For.
image of locals box: https://ibb.co/cgFQhxY
Any help would be much appreciated.
Sub SumifONarray()
Dim arrQuarters, arrNumber_of_Assets As Variant
Dim I As Long, J As Long
arrNumber_of_Assets = Range("Costs_Number_of_Assets")
arrQuarters = Range("Quarters_1to40")
Dim MaxRecov_If_result, arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter, arr_Row10_Resolution__Max_Recovery_Grid As Variant
arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter = Range("_Row10_Resolution_Physical_Possession_Expenses_To_Quarter")
arr_Row10_Resolution__Max_Recovery_Grid = Range("_Row10_Resolution__Max_Recovery_Grid")
ReDim arrIf_Max_Recovery_Amount_Sum(1 To 1, 1 To UBound(arrQuarters, 2))
For J = LBound(arrQuarters, 2) To UBound(arrQuarters, 2)
For I = LBound(arrNumber_of_Assets, 1) To UBound(arrNumber_of_Assets, 1)
If arr_Row10_Resolution_Physical_Possession_Expenses_To_Quarter(I, 1) >= arrQuarters(1, J) Then
MaxRecov_If_result = MaxRecov_If_result + arr_Row10_Resolution__Max_Recovery_Grid(I, J)
Else: MaxRecov_If_result = 0
End If
Next I
arrIf_Max_Recovery_Amount_Sum(1, J) = MaxRecov_If_result
MaxRecov_If_result = 0
Next J
End Sub
I've uploaded a sample below with code with 10 rows.
https://easyupload.io/wfixds
I would like to implement a 2D LSTM as in this paper, specifically I would like to do so dynamically, so using tf.while. In brief this network works as follows.
order the pixels in an image so that pixel i, j -> i * width + j
run a 2D-LSTM over this sequence
The difference between a 2D and regular LSTM is we have a recurrent connection between the previous element in the sequence and the pixel directly above the current pixel, so at pixel i,j are connections to i - 1, j and i, j - 1.
What I have done
I have tried to do this using tf.while where in each iteration of the loop I accumulate the activations and cell states into a tensor whose shape I allow to vary. This is what the following block of code tries to do.
def single_lstm_layer(inputs, height, width, units, direction = 'tl'):
with tf.variable_scope(direction) as scope:
#Get 2D lstm cell
cell = lstm_cell
#position in sequence
row, col = tf.to_int32(0), tf.to_int32(0)
#use for when i - 1 < 0 or j - 1 < 0
zero_state = tf.fill([1, units], 0.0)
#get first activation and cell_state
output, state = cell(inputs.read(row * width + col), zero_state, zero_state, zero_state, zero_state)
#these are currently of shape (1, units) will ultimately be of shape
#(height * width, untis)
activations = output
cell_states = state
col += 1
with tf.variable_scope(direction, reuse = True) as scope:
def loop_fn(activations, cell_states, row, col):
#Read next input in sequence
i = inputs.read(row * width + col)
#if we are not in the first row then we want to get the activation/cell_state
#above us. Otherwise use zero state.
hidden_state_t = tf.cond(tf.greater_equal(row - 1, 0),
lambda:tf.gather(activations, [(row - 1) * (width) + col]),
lambda:tf.identity(zero_state))
cell_state_t = tf.cond(tf.greater_equal(row - 1, 0),
lambda:tf.gather(cell_states, [(row - 1) * (width) + col]),
lambda:tf.identity(zero_state))
#if we are not in the first col then we want to get the activation/cell_state
#left of us. Otherwise use zero state.
hidden_state_l = tf.cond(tf.greater_equal(col - 1, 0),
lambda:tf.gather(activations, [row * (width) + col - 1]),
lambda:tf.identity(zero_state))
cell_state_l = tf.cond(tf.greater_equal(col - 1, 0),
lambda:tf.gather(cell_states, [row * (width) + col - 1]),
lambda:tf.identity(zero_state))
#Using previous activations/cell_states get current activation/cell_state
output, state = cell(i, hidden_state_l, hidden_state_t, cell_state_l, cell_state_t)
#Append to bottom, will increase number of rows by 1
activations = tf.concat(0, [activations, output])
cell_states = tf.concat(0, [cell_states, state])
#move to next item in sequence
col = tf.cond(tf.equal(col, width - 1), lambda:tf.mul(col, 0), lambda:tf.add(col, 1))
row = tf.cond(tf.equal(col, 0), lambda:tf.add(row, 1), lambda:tf.identity(row))
return activations, cell_states, row, col,
row, col = tf.to_int32(0), tf.constant(1)
activations, cell_states, _, _ = tf.while_loop(
cond = lambda activations, cell_states, row, col: tf.logical_and(tf.less_equal(row , (height - 1)), tf.less_equal(col, width -1)) ,
body = loop_fn,
loop_vars = (activations,
cell_states,
row,
col),
shape_invariants = (tf.TensorShape((None, units)),
tf.TensorShape((None, units)),
tf.TensorShape([]),
tf.TensorShape([]),
),
)
#Return activations with shape [height, width, units]
return tf.pack(tf.split(0, height, activations))
This works, at least in the forward direction. That is to say if I look at what is returned in a session then I get what I want which is a 3D tensor, call it T, of shape [height, width, units] where T[i,j,:] contains the activation of the LSTM cell at input i, j.
I then would like to classify each pixel and for this purpose I conv2D across T then reshape the result into [height * width, num_labels] and construct the cross entropy loss.
T = tf.nn.conv2d(T, W, strides = [1, 1, 1, 1], padding = 'VALID')
T = tf.reshape(T, [height * width, num_labels])
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels = tf.reshape(labels, [height * width, num_labels]),
logits = T)
)
optimizer = tf.train.AdagradOptimizer(0.01).minimize(loss)
The problem
However now when I try with an image which 28 x 28 and 32 units
sess.run(optimizer, feed_dict = feed_dict)
I get the following error
File "Assignment2/train_model.py", line 52, in <module>
train_models()
File "/Assignment2/train_model.py", line 12, in train_models
image, out, labels, optomizer, accuracy, prediction, ac = build_graph(28, 28)
File "/Assignment2/multidimensional.py", line 101, in build_graph
optimizer = tf.train.AdagradOptimizer(0.01).minimize(loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 196, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 253, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 491, in gradients
in_grad.set_shape(t_in.get_shape())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 408, in set_shape
self._shape = self._shape.merge_with(shape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.py", line 579, in merge_with
(self, other))
ValueError: Shapes (784, 32) and (1, 32) are not compatible
I think this is a problem with calculating the gradients resulting from the tf.while loop but I am pretty lost at this point.
I am trying to use DBSCAN from scikitlearn to segment an image based on color. The results I'm getting are . As you can see there are 3 clusters. My goal is to separate the buoys in the picture into different clusters. But obviously they are showing up as the same cluster. I've tried a wide range of eps values and min_samples but those two things always cluster together. My code is:
img= cv2.imread("buoy1.jpg)
labimg = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
n = 0
while(n<4):
labimg = cv2.pyrDown(labimg)
n = n+1
feature_image=np.reshape(labimg, [-1, 3])
rows, cols, chs = labimg.shape
db = DBSCAN(eps=5, min_samples=50, metric = 'euclidean',algorithm ='auto')
db.fit(feature_image)
labels = db.labels_
plt.figure(2)
plt.subplot(2, 1, 1)
plt.imshow(img)
plt.axis('off')
plt.subplot(2, 1, 2)
plt.imshow(np.reshape(labels, [rows, cols]))
plt.axis('off')
plt.show()
I assume this is taking the euclidean distance and since its in lab space euclidean distance would be different between different colors. If anyone can give me guidance on this I'd really appreciate it.
Update:
The below answer works. Since DBSCAN requires an array with no more then 2 dimensions I concatenated the columns to the original image and reshaped to produce a n x 5 matrix where n is the x dimension times the y dimension. This seems to work for me.
indices = np.dstack(np.indices(img.shape[:2]))
xycolors = np.concatenate((img, indices), axis=-1)
np.reshape(xycolors, [-1,5])
You need to use both color and position.
Right now, you are using colors only.
Could you please add the enitre code in the answer? Im not able to understand where do I add the those 3 lines which have worked for you – user8306074 Sep 4 at 8:58
Let me answer for you, and here is the full version of the code:
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
img= cv2.imread('your image')
labimg = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
n = 0
while(n<4):
labimg = cv2.pyrDown(labimg)
n = n+1
feature_image=np.reshape(labimg, [-1, 3])
rows, cols, chs = labimg.shape
db = DBSCAN(eps=5, min_samples=50, metric = 'euclidean',algorithm ='auto')
db.fit(feature_image)
labels = db.labels_
indices = np.dstack(np.indices(labimg.shape[:2]))
xycolors = np.concatenate((labimg, indices), axis=-1)
feature_image2 = np.reshape(xycolors, [-1,5])
db.fit(feature_image2)
labels2 = db.labels_
plt.figure(2)
plt.subplot(2, 1, 1)
plt.imshow(img)
plt.axis('off')
# plt.subplot(2, 1, 2)
# plt.imshow(np.reshape(labels, [rows, cols]))
# plt.axis('off')
plt.subplot(2, 1, 2)
plt.imshow(np.reshape(labels2, [rows, cols]))
plt.axis('off')
plt.show()
I wanna combine three graphics on one graph. The data from inside of R which is " nottem ". Can someone help me to write code to put a seasonal mean and harmonic (cosine model) and its time series plots together by using different colors? I already wrote model code just don't know how to combine them together to compare.
Code :library(TSA)
nottem
month.=season(nottem)
model=lm(nottem~month.-1)
summary(nottem)
har.=harmonic(nottem,1)
model1=lm(nottem~har.)
summary(model1)
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
points(y=nottem,x=time(nottem), pch=as.vector(season(nottem)))
Just put your time series inside a matrix:
x = cbind(serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)),
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)))
plot(x)
Or configure the plot region:
par(mfrow = c(2, 1)) # 2 rows, 1 column
serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
require(zoo)
plot(serie1)
lines(rollapply(serie1, width = 10, FUN = mean), col = 'red')
plot(serie2)
lines(rollapply(serie2, width = 10, FUN = mean), col = 'blue')
hope it helps.
PS.: zoo package is not needed in this example, you could use the filter function.
You can extract the seasonal mean with:
s.mean = tapply(serie, cycle(serie), mean)
# January, assuming serie is monthly data
print(s.mean[1])
This graph is pretty hard to read, because your three sets of values are so similar. Still, if you want to simply want to graph all of these on the sample plot, you can do it pretty easily by using the coefficients generated by your models.
Step 1: Plot the raw data. This comes from your original code.
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
Step 2: Set up x-values for the mean and cosine plots.
x <- seq(1920, (1940 - 1/12), by=1/12)
Step 3: Plot the seasonal means by repeating the coefficients from the first model.
lines(x=x, y=rep(model$coefficients, 20), col="blue")
Step 4: Calculate the y-values for the cosine function using the coefficients from the second model, and then plot.
y <- model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[1]
lines(x=x, y=y, col="red")
ggplot variant: If you decide to switch to the popular 'ggplot2' package for your plot, you would do it like so:
x <- seq(1920, (1940 - 1/12), by=1/12)
y.seas.mean <- rep(model$coefficients, 20)
y.har.cos <- model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[1]
plot_Data <- melt(data.frame(x=x, temp=nottem, seas.mean=y.seas.mean, har.cos=y.har.cos), id="x")
ggplot(plot_Data, aes(x=x, y=value, col=variable)) + geom_line()