I'm trying to optimize this loop in Python:
import numpy as np

def apply_kernel(X, Z, kernel):
    d = kernel.shape
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            for k in range(X.shape[2]):
                X[i, j, k] = np.sum(kernel * Z[i:(i + d[0]), j:(j + d[1]), k:(k + d[2])])
    return X
I tried it with Cython:
import cython
import numpy as np
cimport numpy as np

ctypedef np.float64_t cpl_t
cpl = np.float64

@cython.boundscheck(False)  # compiler directive
@cython.wraparound(False)   # compiler directive
def appKernel(np.ndarray[cpl_t, ndim=3] X, np.ndarray[cpl_t, ndim=3] Z, np.ndarray[cpl_t, ndim=3] kernel):
    cdef Py_ssize_t i, j, k, ik, jk, kk
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            for k in range(X.shape[2]):
                for ik in range(kernel.shape[0]):
                    for jk in range(kernel.shape[1]):
                        for kk in range(kernel.shape[2]):
                            X[i, j, k] = X[i, j, k] + Z[i + ik, j + jk, k + kk] * kernel[ik, jk, kk]
    return X
There is an improvement (about 10 times faster), but I think it could still be better. Do you have any ideas on how to improve this?
The next step would be to parallelize this using mpi4py.
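For reference, a minimal sketch of an alternative worth timing first, assuming Z is already padded so that Z.shape[i] == X.shape[i] + kernel.shape[i] - 1 (i.e. the loop only ever reads valid windows): the triple loop is a 3-D sliding-window cross-correlation, which SciPy can compute directly, often via FFT. The helper name below is made up, and it returns a new array of X's shape instead of filling X in place.
import numpy as np
from scipy.signal import correlate

def apply_kernel_scipy(Z, kernel):
    # Same sliding-window sum as the loops above: mode='valid' keeps only the
    # windows that fit entirely inside Z, and method='fft' is usually much
    # faster than a direct triple loop for anything but tiny kernels.
    return correlate(Z, kernel, mode='valid', method='fft')
A quick np.allclose check against the loop version on a small test case should confirm the equivalence before relying on it.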
I'm learning about clustering, KMeans and such, so my knowledge of the topic is very basic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
    [0,'x','f','g'], [1,'a','c','b'], [1,'d','k','a'], [0,'y','v','w'],
    [0,'q','w','e'], [1,'c','a','l'], [0,'t','x','j'], [1,'w','o','a'],
    [0,'z','m','n'], [1,'z','x','a'], [0,'f','g','h'], [1,'h','a','c'],
    [1,'a','r','e'], [0,'g','c','c']
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary  Col1  Col2  Col3
------  ----  ----  ----
     1     a     b     c
     0     x     t     v
     0     s     q     w
     1     n     m     a
     1     u     a     r
Encode the non-numeric columns as integers:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set the number of clusters to two, because the target is either 1 or 0?
X = np.array(df.drop(['Binary'], axis=1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I get it more accurate, to the point where it knows with 99.99% certainty that 'a' means the binary column is 1? More data?
K-means does not even try to predict this value, because it is an unsupervised method. It is not a prediction algorithm; it solves a structure-discovery task. Don't mistake clustering for classification.
The cluster labels have no meaning. They are 0 and 1 simply because those are the first two integers. K-means is also randomized: run it a few times and you will sometimes score just 29%.
Also, k-means is designed for continuous input. You can apply it to label-encoded data, but the results will be pretty poor.
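To make the contrast concrete, here is a minimal sketch (an illustration, not part of the original post) of a supervised approach to the same constructed df, applied before the LabelEncoder step: one-hot encode the letters and fit a classifier, which is the kind of model that can actually learn the rule "'a' somewhere means Binary is 1".
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

X_cat = df[['Col1', 'Col2', 'Col3']]   # keep the raw letters; no LabelEncoder needed
y = df['Binary']

# One indicator column per (column, letter) pair, so the tree can split on the
# Col1_a / Col2_a / Col3_a indicators to recover the rule.
clf = make_pipeline(OneHotEncoder(handle_unknown='ignore'), DecisionTreeClassifier())
clf.fit(X_cat, y)
print(clf.score(X_cat, y))   # training accuracy only; use a train/test split in practice
With only 14 rows the tree will mostly memorize, so more data is still needed for a trustworthy estimate.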
I am attempting to translate a semidefinite programming problem from CVX to CVXPY as described here. My attempt follows:
import cvxpy as cvx
import numpy as np
c = [0, 1]
n = len(c)
# Create optimization variables.
f = cvx.Variable((n, n), hermitian=True)
# Create constraints.
constraints = [f >> 0]
for k in range(1, n):
    indices = [(i * n) + i - (n - k) for i in range(n - k, n)]
    constraints += [cvx.sum(cvx.vec(f)[indices]) == c[n - k]]
# Form objective.
obj = cvx.Maximize(c[0] - cvx.trace(f))
# Form and solve problem.
prob = cvx.Problem(obj, constraints)
sol = prob.solve()
print(sol)
print(f.value)
The issue here is that when I take the coefficients of the Fourier series and translate them into the array c, it fails on complex values. I think this is due to a discrepancy between the maximize functions of CVX and CVXPY. I'm not sure what CVX is maximizing, since the trace of the matrix is a complex value. As pointed out below, the trace is real since the matrix is Hermitian, but the code still fails. Can someone with CVXPY knowledge clear this up?
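One thing worth checking (an assumption, not something established in the question): CVXPY requires the objective to be a real scalar expression, and for a Hermitian variable cvx.trace(f) is complex-typed even though its value is real, so taking the real part explicitly is a common workaround.
# Hypothetical fix: make the objective explicitly real-typed.
obj = cvx.Maximize(cvx.real(c[0] - cvx.trace(f)))
prob = cvx.Problem(obj, constraints)
print(prob.solve())
print(f.value)
If c itself contains complex entries, the cvx.real wrapper keeps the objective real while the constraints can stay complex-valued.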
Is it possible to create a dask array from a delayed value by specifying its shape with another delayed value?
My algorithm won't give me the shape of the array until pretty late in the computation.
Eventually I will be creating some blocks with shapes specified by intermediate results of my computation, and then calling da.concatenate on all the results (well, da.block if it were more flexible).
I don't think it is too detrimental if I can't, but it would be cool if I could.
Sample code
from dask import delayed
from dask import array as da
import numpy as np

n_shape = (3, 3)
shape = delayed(n_shape, nout=2)
d_shape = (delayed(n_shape[0]), delayed(n_shape[1]))
n = delayed(np.zeros)(n_shape, dtype=float)

# this doesn't work
# da.from_delayed(n, shape=shape, dtype=float)

# this doesn't work either, but I think it goes a little deeper
# into the function call
da.from_delayed(n, shape=d_shape, dtype=float)
You cannot provide a delayed shape, but you can state that the shape is unknown by using np.nan as a value wherever you don't know a dimension.
Example
import random

import numpy as np
import dask
import dask.array as da

@dask.delayed
def f():
    return np.ones((5, random.randint(10, 20)))  # a 5 x ? array

values = [f() for _ in range(5)]
arrays = [da.from_delayed(v, shape=(5, np.nan), dtype=float) for v in values]
x = da.concatenate(arrays, axis=1)
>>> x
dask.array<concatenate, shape=(5, nan), dtype=float64, chunksize=(5, nan)>
>>> x.shape
(5, np.nan)
>>> x.compute().shape
(5, 88)
Docs
See http://dask.pydata.org/en/latest/array-chunks.html#unknown-chunks
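A side note, not part of the original answer: if a later step needs concrete chunk sizes, recent Dask versions can resolve the nan dimensions in place by computing them, at the cost of triggering some computation.
x.compute_chunk_sizes()   # computes the unknown (nan) chunk sizes in place
print(x.shape)            # now reports concrete integers instead of nan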
How can I use an sklearn scaler / imputer to impute a tensor? I want to scale / impute within a pipeline. My input is a 3-d numpy array.
I have a tensor of shape (n_samples, n_timesteps, n_feat), a la Keras. This is a sequence that can be learned by an LSTM. I want to scale / impute first, however. In particular, I want to scale on the fly inside a scikit-learn pipeline, since scaling the full dataset, which would be easy, leads to leakage. Keras already integrates with sklearn (see here), but there do not appear to be easy ways to scale and impute the tensors that Keras time series models process.
Unfortunately, the following gives an error
import numpy as np

X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
print(X)

from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X = s.fit_transform(X)
print(X)
The error is to the effect of "the scaler only works on 2-d numpy arrays".
My solution was to add a decorator to the sklearn preprocessing data.py file
def flat(func):
    def wrapper(*args, **kwargs):
        self, X = args
        a, b, c = X.shape
        X = X.reshape(a, b * c)
        r = func(self, X, **kwargs)
        if hasattr(r, 'ndim'):
            X = r.reshape(a, b, c)
            return X
        else:
            return r
    return wrapper
Then use it on the relevant methods, e.g. fit:
@flat
def fit(self, X, y=None):
    """Compute the mean and std to be used for later scaling.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape [n_samples, n_features]
        The data used to compute the mean and standard deviation
        used for later scaling along the features axis.

    y : Passthrough for ``Pipeline`` compatibility.
    """
    # Reset internal state before fitting
    self._reset()
    return self.partial_fit(X, y)
This works well; with the same script as above, I get
[[[ 3.  5.]
  [ 6.  2.]]

 [[ 8. 23.]
  [ 7. 23.]]

 [[ 3.  4.]
  [ 2. 55.]]]

[[[-0.70710678 -0.64906302]
  [ 0.46291005 -1.13191668]]

 [[ 1.41421356  1.41266656]
  [ 0.9258201  -0.16825789]]

 [[-0.70710678 -0.76360355]
  [-1.38873015  1.30017457]]]
But beware: the decorated methods no longer handle 2-d arrays, since the decorator assumes a 3-d input. So use the normal preprocessing module for 2-d arrays!
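An alternative that avoids editing scikit-learn's own data.py is a small wrapper transformer that does the same flatten / delegate / reshape dance. The class below is only a sketch (the name TensorScaler is made up), but it drops into a Pipeline like any other step.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class TensorScaler(BaseEstimator, TransformerMixin):
    # Scales a (n_samples, n_timesteps, n_feat) tensor by flattening the last
    # two axes, delegating to StandardScaler, and reshaping back.
    def __init__(self):
        self.scaler = StandardScaler()

    def fit(self, X, y=None):
        a, b, c = X.shape
        self.scaler.fit(X.reshape(a, b * c))
        return self

    def transform(self, X):
        a, b, c = X.shape
        return self.scaler.transform(X.reshape(a, b * c)).reshape(a, b, c)

X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
print(TensorScaler().fit_transform(X))   # should match the decorator output above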
I'm trying out the scikit-learn LinearRegression model on a simple dataset (it comes from the Andrew Ng Coursera course; it doesn't really matter, see the plot for reference).
This is my script:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
dataset = np.loadtxt('../mlclass-ex1-008/mlclass-ex1/ex1data1.txt', delimiter=',')
X = dataset[:, 0]
Y = dataset[:, 1]
plt.figure()
plt.ylabel('Profit in $10,000s')
plt.xlabel('Population of City in 10,000s')
plt.grid()
plt.plot(X, Y, 'rx')
model = LinearRegression()
model.fit(X[:, np.newaxis], Y)
plt.plot(X, model.predict(X[:, np.newaxis]), color='blue', linewidth=3)
print('Coefficients: \n', model.coef_)
plt.show()
My question is: I expect to get 2 coefficients for this linear model, the intercept term and the x coefficient; how come I only get one?
OOOPS
I didn't notice that the intercept is a separate attribute of the model!
print('Intercept: \n', model.intercept_)
See the documentation here:
intercept_ : array
    Independent term in the linear model.
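For comparison with the course's formulation, where the intercept is just another parameter theta_0, the small sketch below (reusing X and Y from the script above; X_aug and model2 are made-up names) puts both numbers into coef_ by adding an explicit column of ones and disabling the built-in intercept.
# Explicit bias column, so coef_ holds [intercept, slope] like the course's theta vector.
X_aug = np.column_stack([np.ones_like(X), X])
model2 = LinearRegression(fit_intercept=False)
model2.fit(X_aug, Y)
print('theta:', model2.coef_)   # coef_[0] equals model.intercept_, coef_[1] equals model.coef_[0]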