Memory error while doing Hierarchical Clustering - machine-learning

I have a large dataset of shape (207989, 23), and I am trying to apply hierarchical clustering to just one column for now, to test whether it is suitable for the task at hand.
What I have tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
data = pd.read_csv('gpmd.csv', header = 0)
X = data.loc[:, ['ContextID', 'BacksGas_Flow_sccm']]
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X.values[:,[1]])
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
After doing this, I get the following error:
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
Traceback (most recent call last):
File "<ipython-input-4-429f42b68112>", line 1, in <module>
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
y = distance.pdist(y, metric)
File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
MemoryError
Can someone explain what exactly is the problem here?
Thanks in advance

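Hierarchical clustering in most variants needs O(n²) memory, because it works from the pairwise distances between all instances. Your traceback shows exactly this: pdist tries to allocate m * (m - 1) / 2 doubles.
Because of this, many implementations fail at around 65535 instances, where the n² entries no longer fit in a 32-bit index (some may fail at 32k already). But just do the math: n * n * 8 bytes for a full double-precision matrix; how much memory would you need?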
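To make "do the math" concrete, here is a minimal sketch of the arithmetic for the allocation shown in the traceback. The helper name condensed_distance_bytes is made up for this illustration; the formula m * (m - 1) // 2 and the 8-byte double are taken from the pdist line in the traceback.

```python
def condensed_distance_bytes(n, itemsize=8):
    """Bytes needed for the condensed pairwise distance array.

    Mirrors the allocation in the traceback:
    np.empty((m * (m - 1)) // 2, dtype=np.double)
    """
    return (n * (n - 1)) // 2 * itemsize


n_rows = 207989  # number of rows in the question's dataset
needed = condensed_distance_bytes(n_rows)

# Report the requirement in decimal gigabytes.
print(f"{needed} bytes, about {needed / 1e9:.0f} GB")
```

For 207989 rows this comes out at roughly 173 GB for the distances alone, so no amount of tweaking the call will make it fit on a typical machine. Clustering a sample of the rows is one way to stay within memory.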


scv.pl.proportions(): numpy.AxisError in `Cellrank` workflow

I am new to using Python to analyze scRNA-seq data. I ran the CellRank workflow and always hit this error.
Here is my code for Cellrank:
import scvelo as scv
import scanpy as sc
import cellrank
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cellrank.settings.verbosity = 2
import warnings
warnings.simplefilter("ignore", category=UserWarning)
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=DeprecationWarning)
adata = sc.read_h5ad('./my.h5ad') # my data
scv.pl.proportions(adata)
The error:
Traceback (most recent call last):
File "test_cellrank.py", line 25, in <module>
scv.pl.proportions(adata)
...........
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
I tried using SeuratDisk or loom to get an h5ad file from a Seurat object. I think there must be some problem in this conversion process.
Here is the anndata object from tutorial:
>>> adata
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
Here is mine:
>>> adata
AnnData object with n_obs × n_vars = 5443 × 18489
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.8', 'seurat_clusters', 'SCT_snn_res.0.5', 'SCT_snn_res.0.6',
'SCT_snn_res.0.7', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'new.ident', 'nCount_MAGIC_RNA', 'nFeature_MAGIC_RNA'
var: 'SCT_features', '_index', 'features'
obsm: 'X_tsne', 'X_umap'
layers: 'SCT'
So, what packages or protocols should I follow to convert a Seurat object into an h5ad file?
Thank you for your help!
scv.pl.proportions gives the proportion of spliced and unspliced reads in your dataset. These count tables must be added to your adata layers (as 'spliced' and 'unspliced', like in the tutorial object) before you can call this function.
Your adata object only has an 'SCT' layer and does not have these layers. I think that is why you are seeing this error.
Conversion from Seurat to h5ad can be accomplished using the two-step process given here
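A quick way to check for this before calling the plotting function is to look at the layer names. The sketch below is only an illustration: missing_velocity_layers is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for the AnnData objects printed in the question, so it runs without scvelo or anndata installed. It assumes that adata.layers behaves like a mapping keyed by layer name.

```python
from types import SimpleNamespace

# Layer names present in the tutorial object shown in the question.
REQUIRED_LAYERS = ("spliced", "unspliced")


def missing_velocity_layers(adata, required=REQUIRED_LAYERS):
    """Return the required layer names that are absent from adata.layers."""
    return [name for name in required if name not in adata.layers]


# Stand-in for the converted Seurat object, which only has an 'SCT' layer.
converted = SimpleNamespace(layers={"SCT": None})
# Stand-in for the tutorial object, which has both count layers.
tutorial = SimpleNamespace(layers={"spliced": None, "unspliced": None})

print(missing_velocity_layers(converted))
print(missing_velocity_layers(tutorial))
```

If the first list is non-empty for your own object, the spliced and unspliced counts still have to be added before scv.pl.proportions can work.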

Why does using X[0] in MNIST classifier code give me an error?

I was learning to do classification with the MNIST dataset, and I got an error which I am not able to figure out. I have done a lot of Google searches without success; maybe you are an expert and can help me. Here is the code:
>>> from sklearn.datasets import fetch_openml
>>> mnist = fetch_openml('mnist_784', version=1)
>>> mnist.keys()
output:
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
output:(70000, 784)
>>> y.shape
output:(70000,)
>>> X[0]
output:KeyError Traceback (most recent call last)
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-10-19c40ecbd036> in <module>
----> 1 X[0]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2904 if self.columns.nlevels > 1:
2905 return self._getitem_multilevel(key)
-> 2906 indexer = self.columns.get_loc(key)
2907 if is_integer(indexer):
2908 indexer = [indexer]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 0
Please answer; it may be a silly mistake because I am a beginner in ML. It would be really helpful if you could give me a hint as well.
The API of fetch_openml changed between versions. In earlier versions, it returned a numpy.ndarray. Since 0.24.0 (December 2020), the as_frame argument of fetch_openml defaults to 'auto' (instead of False), which gives you a pandas.DataFrame for the MNIST data. On a DataFrame, X[0] looks up a column named 0, which does not exist, hence the KeyError. You can force the data to be read as a numpy.ndarray by setting as_frame=False. See the fetch_openml reference.
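To see the difference without downloading MNIST, here is a small sketch with a stand-in DataFrame. The zero-filled data and the column names pixel1 to pixel784 are assumptions made for the illustration; the point is only that the column labels are strings, so the label 0 does not exist.

```python
import numpy as np
import pandas as pd

# Stand-in for the MNIST frame: 3 rows, 784 pixel columns (names assumed).
columns = [f"pixel{i}" for i in range(1, 785)]
X = pd.DataFrame(np.zeros((3, 784)), columns=columns)

# X[0] on a DataFrame is a column lookup by label, not a row lookup.
try:
    X[0]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# Positional row access works and gives a flat array of 784 values.
first_row = X.iloc[0].to_numpy()
image = first_row.reshape(28, 28)

print(column_lookup_failed, image.shape)
```

With as_frame=False the data is a plain array and X[0] selects the first row directly, as it did in older versions.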
I was also facing the same problem.
scikit-learn: 0.24.0
matplotlib: 3.3.3
Python: 3.9.1
I used the code below to resolve the issue.
import matplotlib as mpl
import matplotlib.pyplot as plt
# instead of some_digit = X[0]
some_digit = X.to_numpy()[0]
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()
You don't need to downgrade your scikit-learn library if you use the code below:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version= 1, as_frame= False)
mnist.keys()
If you load the dataset as a DataFrame, you have two ways to access the images:
Transform the DataFrame into an array
# Transform the dataframe into an array. Check the first value
some_digit = X.to_numpy()[0]
# Reshape it to (28,28). Note: 28 x 28 = 784; if the row does not have
# exactly 784 values, the reshape fails and you cannot show the image
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()
Transform the row
# Transform the row of your choosing into an array
some_digit = X.iloc[0,:].values
# Reshape it to (28,28). Note: 28 x 28 = 784; if the row does not have
# exactly 784 values, the reshape fails and you cannot show the image
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()

Is np.linalg.solve() not working for AutoDiff?

Does np.linalg.solve() not work for AutoDiff? I use it to solve the manipulator equations. The error message is shown below.
I tried a similar "double" version of the code and it has no issue. Please tell me how to fix it, thanks!
### here is the error message
vdot_ad = np.linalg.solve(M_,ggg_ad)
File "<__array_function__ internals>", line 5, in solve
File "/usr/local/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 394, in solve
r = gufunc(a, b, signature=signature, extobj=extobj)
TypeError: No loop matching the specified signature and casting was found for ufunc solve1
### here is the code
plant = MultibodyPlant(time_step= 0.01)
parser = Parser(plant)
parser.AddModelFromFile("double_pendulum.sdf")
plant.Finalize()
plant_autodiff = plant.ToAutoDiffXd()
# AutoDiff version: this is where I get the error message
xu = np.hstack((x, u))
xu_ad = initializeAutoDiff(xu)[:,0]
x_ad = xu_ad[:4]
q_ad = x_ad[:2]
v_ad = x_ad[2:4]
u_ad = xu_ad[4:]
(M_, Cv_, tauG_, B_, tauExt_) = ManipulatorDynamics(plant_autodiff, q_ad, v_ad)
vdot_ad = np.linalg.solve(M_,tauG_ + np.dot(B_,u_ad) - np.dot(Cv_,v_ad))
Note that in pydrake, AutoDiffXd scalars are exposed to NumPy using dtype=object.
There are some drawbacks to this approach, like the one you have run into now.
This is not necessarily an issue with Drake, but a limitation of NumPy itself, given the ufuncs that are implemented in the (very old) version that ships with Ubuntu 18.04.
To illustrate, here is what I see on Ubuntu 18.04, CPython 3.6.9, NumPy 1.13.3:
>>> import numpy as np
>>> A = np.eye(2)
>>> b = np.array([1, 2])
>>> np.linalg.solve(A, b)
array([ 1., 2.])
>>> A = A.astype(object)
>>> b = b.astype(object)
>>> np.linalg.solve(A, b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/numpy/linalg/linalg.py", line 375, in solve
r = gufunc(a, b, signature=signature, extobj=extobj)
TypeError: No loop matching the specified signature and casting
was found for ufunc solve1
The most direct solution would be to expose an analogous routine in pydrake, and have users leverage that.
That is what we had to do for np.linalg.inv as well:
https://github.com/RobotLocomotion/drake/pull/11173/files
Not the best solution :( However, it's simple enough!
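Until such a routine is available, one workaround is to do the solve with plain Python arithmetic on the object array. The following is a minimal sketch and not Drake's routine: solve_object is a hypothetical helper that runs Gaussian elimination with partial pivoting, and it assumes the scalars support +, -, *, /, abs() and comparison. It is demonstrated with fractions.Fraction as a stand-in for AutoDiffXd; whether AutoDiffXd satisfies all of these assumptions is something you would need to check.

```python
from fractions import Fraction

import numpy as np


def solve_object(A, b):
    """Solve A x = b for a square object-dtype A and a 1-D b."""
    A = np.array(A, dtype=object)  # work on copies
    b = np.array(b, dtype=object)
    n = A.shape[0]
    if A.shape != (n, n) or b.shape != (n,):
        raise ValueError("expected A of shape (n, n) and b of shape (n,)")
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(A[r, col]))
        if A[pivot, col] == 0:
            raise np.linalg.LinAlgError("singular matrix")
        if pivot != col:
            A[[col, pivot]] = A[[pivot, col]]
            b[[col, pivot]] = b[[pivot, col]]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = A[row, col] / A[col, col]
            A[row, col:] = A[row, col:] - A[col, col:] * factor
            b[row] = b[row] - b[col] * factor
    # Back substitution on the upper-triangular system.
    x = np.empty(n, dtype=object)
    for row in range(n - 1, -1, -1):
        acc = b[row]
        for k in range(row + 1, n):
            acc = acc - A[row, k] * x[k]
        x[row] = acc / A[row, row]
    return x


A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(3)]]
b = [Fraction(3), Fraction(5)]
x = solve_object(A, b)
print(x)
```

This runs in pure Python, so it is slow for large systems, but a 2x2 mass matrix like the double pendulum's is small enough for that not to matter.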

Troubles using dask distributed with datashader: 'can't pickle weakref objects'

I'm working with datashader and dask but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
import numpy as np
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
from dask.distributed import Client, LocalCluster

# initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)

def datashade_plot():
    hv.extension('bokeh')
    # create some random data (in the actual code this is a parquet file
    # with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X':x, 'Y':y})
    # create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    # create plot
    points = hv.Curve(points_dd)
    return datashade(points)

dask_client.submit(datashade_plot,).result()
This raises a:
TypeError: can't pickle weakref objects
I have the theory that this happens because you can't distribute the datashade operations in the cluster. Sorry if this is a noob question, I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df = dask_df.persist()  # persist() returns a new collection
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html

I am not able to train models in sklearn (scikit-learn) using Python

I have a data file that contains data for predicting admission to an MS program.
It has 9 columns: 8 columns contain student data and the 9th column contains the student's chance of selection.
I am new to this and I don't understand the error that comes up when training the model.
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Traceback (most recent call last):
File "c.py", line 14, in <module>
classifier.fit(X,y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
hasattr(self, "classes_")))
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
X, y = self._validate_input(X, y, incremental)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
self._label_binarizer.fit(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
self.classes_ = unique_labels(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array
Try this:
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPRegressor
classifier = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Explanation:
In machine learning we may have two types of problems:
1) Classification:
Ex: Predict if a person is male or female. (discrete)
2) Regression:
Ex: Predict the age of the person. (continuous)
With this in mind, let's look at your problem: your label (chance of selection) is continuous, so this is a regression problem.
You are using the MLPClassifier, which expects discrete class labels, and that results in the 'Unknown label type' error.
Try using the MLPRegressor.
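If you are unsure which kind of label you have, scikit-learn can tell you. The sketch below uses random numbers as a stand-in for the admission file (the real column values are not shown in the question, so the data here is an assumption) and checks the label type with type_of_target before fitting a regressor.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-in: 40 students, 7 features, a chance value in [0, 1).
rng = np.random.default_rng(0)
X = rng.random((40, 7))
y = rng.random(40)

# Fractional float labels are reported as a continuous target.
target_kind = type_of_target(y)
print(target_kind)

# A regressor accepts continuous targets directly.
regressor = MLPRegressor(max_iter=500, random_state=0)
regressor.fit(X, y)
predictions = regressor.predict(X[:3])
print(predictions.shape)
```

On such a small random dataset the fit may print a convergence warning, which does not matter for the illustration.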
