Linear regression on multidimensional dask arrays - dask

I am trying to perform linear regression on multidimensional A and b for arrays that are larger-than-memory. Using sklearn notation, my arrays have shapes:
A: (n_samples, n_features)
b: (n_samples, n_targets)
I have so far been using numpy arrays, and using sklearn.linear_model.ridge_regression.
Now I am swapping to dask, and need to find something equivalent. I've tried passing the dask arrays to ridge_regression, but I observe that I run out of memory when doing so. I've looked at dask_ml, but their LinearRegression model only takes 1D-arrays b/y with shape (n_samples), and requires the data to be chunked correctly, which I am not sure how to do. I can use a standard matrix inversion approach, but then I risk singular matrix errors.
A small worked example, which shows how the ridge_regression method ends up with a numpy array, not a dask array:
import sklearn.linear_model
X = da.random.random((1024, 30))
y = da.random.random((1024, 10000))
fit = sklearn.linear_model.ridge_regression(X, y, alpha=0.0)
type(fit)
Does anyone have any suggestions for models that can perform the above regression, but for large data?

Related

How to reverse GP-LVM Gaussian process latent variable and reconstruct original variables using python?

GP-LVM an be used for dimensionality reduction. After such dimensionality reduction is performed, how can one approximately reconstruct the original variables/features?
Using GPy, you can use the predict function, which returns the predictive mean and variance:
import numpy as np
from GPy.models import GPLVM
m = GPLVM(Y, input_dim=2)
m.optimize()
mean, var = m.predict(m.X)
print(np.linalg.norm(Y - mean, 2))

Error while predicting a single value using a linear regression model

I'm a beginner and making a linear regression model, when I make predictions on the basis of test sets, it works fine. But when I try to predict something for a specific value. It gives an error. The tutorial I'm watching, they don't have any errors.
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
According to the Scikit-learn documentation, the input array should have shape (n_samples, n_features). As such, if you want a single example with a single value, you should expect the shape of your input to be (1,1).
This can be done by doing:
import numpy as np
test_X = np.array(6.5).reshape(-1, 1)
lin_reg.predict(test_X)
You can check the shape by doing:
test_X.shape
The reason for this is because the input can have many samples (i.e. you want to predict for multiple data points at once), or/and each sample can have many features.
Note: Numpy is a Python library to support large arrays and matrices. When scikit-learn is installed, Numpy should be installed as well.

error while passing data-frame through k-means

Although my data-frame as all the float values everywhere. While passing the data frame through k-means it shows that couldn't convert the string to float.
How to convert nan values if any to float values in the entire data-frame?
This would do your job and convert all the columns in string format to categorical codes or use one hot encoding of the variables in these columns.
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
df[col_name]= df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto').fit(df)
print (kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data X that is clean (i.e. no strings etc), let's call it X
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
now clusters has the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you want to later get clusters on data, you should use kmeans.predict(new_data) rather than fit_predict() so that KMeans uses the learning from X, and applies it to your new_data (or depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is a string for your new column name, you can of course call it whatever you want

Is kernel regression the same as linear kernel regression?

i wanted to code the linear kernel regression in sklearn so i made this code :
model = LinearRegression()
weights = rbf_kernel(X_train,X_test)
for i in range(weights.shape[1]):
model.fit(X_train,y_train,weights[:,i])
model.predict(X_test[i])
then i found that there is KernelRidge in sklearn :
model = KernelRidge(kernel='rbf')
model.fit(X_train,y_train)
pred = model.predict(X_train)
my question is:
1-what is the difference between these 2 codes?
2-in model.fit() that come after KernelRidge(), i found in the documentation that i can add a third argument "weight" to fit() function, i would i do that if i already applied a kernel function to the model?
What is the difference between these two code snippets?
Basically, they have nothing in common. Your first code snippet implements linear regression, with arbitrary set weights of samples. (How did you even come up with calling rbf_kernel this way?) This is still just a linear model, nothing more. You simply assigned (a bit randomly) which samples are important and then looped over features (?). This makes no sense at all. In general: what you have done with rbf_kernel is simply wrong; this is completely not how it is supposed to be used (and why it gave you errors when you tried to pass it to the fit method and you ended up doing a loop and passing each column separately).
Example of fitting such a model to data which is a cosine (thus 0 in mean):
I found in the documentation for the model.fit() function that comes after KernelRidge() that I can add a third argument, weight. Would I do that if I had already applied a kernel function to the model?
This is actual kernel method, kernel is not samples weighting. (One might use kernel function to assign weights, but this is not the meaning of kernel in "linear kernel regression" or in general "kernel methods".) Kernel is a method of introducing nonlinearity to the classifier, which comes from the fact that many methods (including linear regression) can be expressed as dot products between vectors, which can be substituted by kernel function leading to solving the problem in different space (Reproducing Hilbert Kernel Space), which might have very high complexity (like the infinite dimensional space of continuous functions induced by the RBF kernel).
Example of fitting to the same data as above:
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
import numpy as np
from matplotlib import pyplot as plt
X = np.linspace(-10, 10, 100).reshape(100, 1)
y = np.cos(X)
for model in [LinearRegression(), KernelRidge(kernel='rbf')]:
model.fit(X, y)
p = model.predict(X)
plt.figure()
plt.title(model.__class__.__name__)
plt.scatter(X[:, 0], y)
plt.plot(X, p)
plt.show()

Translating a TensorFlow LSTM into synapticjs

I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as a one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library does the similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.

Resources