scv.pl.proportions(): numpy.AxisError in `Cellrank` workflow - seurat

I am new to use python to anlyze scRNA-seq. I run the cellrank workflow and always found this error.
Here is my code for Cellrank:
import scvelo as scv
import scanpy as sc
import cellrank
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cellrank.settings.verbosity = 2
import warnings
warnings.simplefilter("ignore", category=UserWarning)
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=DeprecationWarning)
adata = sc.read_h5ad('./my.h5ad') # my data
**scv.pl.proportions(adata)**
The errorcode:
Traceback (most recent call last):
File "test_cellrank.py", line 25, in <module>
**scv.pl.proportions(adata)**
...........
**numpy.AxisError: axis 1 is out of bounds for array of dimension 1**
I tried to use SeuratDisk or loom to get h5ad from a seurat object. I thought that must be some problem in this progress.
Here is the anndata object from tutorial:
>>> adata
AnnData object with n_obs × n_vars = 2531 × 27998
obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime'
var: 'highly_variable_genes'
uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca'
obsm: 'X_pca', 'X_umap'
layers: 'spliced', 'unspliced'
obsp: 'connectivities', 'distances'
Here is mine:
>>> adata
AnnData object with n_obs × n_vars = 5443 × 18489
obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'nCount_SCT', 'nFeature_SCT', 'SCT_snn_res.0.8', 'seurat_clusters', 'SCT_snn_res.0.5', 'SCT_snn_res.0.6',
'SCT_snn_res.0.7', 'S.Score', 'G2M.Score', 'Phase', 'old.ident', 'new.ident', 'nCount_MAGIC_RNA', 'nFeature_MAGIC_RNA'
var: 'SCT_features', '_index', 'features'
obsm: 'X_tsne', 'X_umap'
layers: 'SCT'
So, What packages or protocols should I follow to convert a seurat into a h5ad?
Thank you for your help!

scv.pl.proportions gives the proportion of spliced and unspliced reads in your dataset. These count tables must be added to your adata layers before you can call this function.
Your adata object does not have these layers. I think that is why you are seeing this error.
Conversion from Seurat to h5ad can be accomplished using two step process given here

Related

Why does using X[0] in MNIST classifier code give me an error?

I was learning to do classification with the MNIST dataset. And I got an error with I am not able to figure out, I have done a lot of google searches and I am not able to do anything, maybe you are an expert and can help me. Here is the code--
>>> from sklearn.datasets import fetch_openml
>>> mnist = fetch_openml('mnist_784', version=1)
>>> mnist.keys()
output:
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
output:(70000, 784)
>>> y.shape
output:(70000)
>>> X[0]
output:KeyError Traceback (most recent call last)
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-10-19c40ecbd036> in <module>
----> 1 X[0]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2904 if self.columns.nlevels > 1:
2905 return self._getitem_multilevel(key)
-> 2906 indexer = self.columns.get_loc(key)
2907 if is_integer(indexer):
2908 indexer = [indexer]
c:\users\khush\appdata\local\programs\python\python39\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 0
Please answer, there can be a silly mistake because I am a beggineer in ML. It would be really helpful if you gave me some hint also.
The API of fetch_openml changed between versions. In earlier versions, it returns a numpy.ndarray array. Since 0.24.0 (December 2020), as_frame argument of fetch_openml is set to auto (instead of False as default option earlier) which gives you a pandas.DataFrame for the MNIST data. You can force the data read as a numpy.ndarray by setting as_frame = False. See fetch_openml reference .
I was also facing the same problem.
scikit-learn: 0.24.0
matplotlib: 3.3.3
Python: 3.9.1
I used to below code to resolve the issue.
import matplotlib as mpl
import matplotlib.pyplot as plt
# instead of some_digit = X[0]
some_digit = X.to_numpy()[0]
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()
You don't need to downgrade you scikit-learn library, if you follow the code below:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version= 1, as_frame= False)
mnist.keys()
You load the dataset as a dataframe for you to able to access the images, you have two ways to do this,
Transform the dataframe to an Array
# Transform the dataframe into an array. Check the first value
some_digit = X.to_numpy()[0]
# Reshape it to (28,28). Note: 28 x 28 = 7064, if the reshaping doesn't meet
# this you are not able to show the image
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()
Transform the row
# Transform the row of your choosing into an array
some_digit = X.iloc[0,:].values
# Reshape it to (28,28). Note: 28 x 28 = 7064, if the reshaping doesn't
# meet this you are not able to show the image
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()

Memory error while doing Hierarchical Clustering

I have a large dataset (207989, 23), and I am trying to apply Hierarchical clustering on just one column right now to test if it's suitable for the task at my hand.
What I have tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
data = pd.read_csv('gpmd.csv', header = 0)
X = data.loc[:, ['ContextID', 'BacksGas_Flow_sccm']]
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X.values[:,[1]])
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
after doing this, I am getting the following error:
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
Traceback (most recent call last):
File "<ipython-input-4-429f42b68112>", line 1, in <module>
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
y = distance.pdist(y, metric)
File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
MemoryError
Can someone explain what exactly is the problem here?
Thanks in advance
Hierarchical clustering in most variants needs O(n²) memory.
Because of this, most implementations will fail at around 65535 instances, when they hit the 32 bit mark (some may fail at 32k already). But just do the math: n * n * 8 bytes for double precision: how much memory would you need?

I am not able Training models in sklearn (scikit-learn) using python

i have data file it contain data to predict the admission in MS.
it contain 9 column 8 column contain student data and 9th column contain chance of selection of student.
i am new and i don't understand error come in training model
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Traceback (most recent call last):
File "c.py", line 14, in <module>
classifier.fit(X,y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
hasattr(self, "classes_")))
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
X, y = self._validate_input(X, y, incremental)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
self._label_binarizer.fit(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
self.classes_ = unique_labels(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array
Try this:
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPRegressor
classifier = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Explanation:
In machine learning we may have two types of problems:
1) Classification:
Ex: Predict if a person is male or female. (discrete)
2) Regression:
Ex: Predict the age of the person. (continuous)
With this in hand we are going to see your problem, your label (chance of selection) is continous, thus we have a regression problem.
See that you are using the MLPClassifier, resulting in the 'Unknown label error'.
Try using the MLPRegressor.

dask can not read the file that pandas can

I have a csv file that can be accessed using pandas but fails with dask dataframe.
I am using exact same parameters and still getting error with dask.
Pandas use case:
import pandas as pd
mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry', 'tran_date']
df = pd.read_csv('s3://some_bucket/abigd/hed4.csv',
sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
engine='python', dtype=str )
Pandas output:
df.retry.value_counts()
1 2792174
2 907081
3 116369
6 6475
4 5598
7 1314
5 1053
8 288
16 3
13 3
Name: retry, dtype: int64
dask code:
import dask.dataframe as dd
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')
df = dd.read_csv('s3://some_bucket/abigd/hed4.csv',
sep=',', header=None, names=mycols, skipinitialspace=True, escapechar='\\',
engine='python', dtype=str,
storage_options = {'anon':False, 'key': 'xxx' , 'secret':'xxx'} )
df_persisted = client.persist(df)
df_persisted.retry.value_counts().compute()
Dask Output:
ParserError: unexpected end of data
I have tried opening smaller (and bigger) files in dask and there was no issue with them. It is possible that this file may have unclosed quotations. I can not see any reason why dask is unable to read the file.
Dask splits your files by looking for the line separator character b"\n". It looks for this single byte in parts of the file, so that the whole thing does not need to be read beforehand. When it finds it is not aware of whether the byte is escaped or within a quoted scope.
Thus, the chunking up of a large file by Dask can fail, and it appears that this is happening for you: some block is finishing on a newline which is not really a line ending.

Probable issue with LSTM in lasagne

With a simple constructor for the LSTM, as given in the tutorial, and an input of dimension [,,1] one would expect to see an output of shape [,,num_units].
But regardless of the num_units passed during construction, the output has the same shape as the input.
Following is the min code to replicate this issue...
import lasagne
import theano
import theano.tensor as T
import numpy as np
num_batches= 20
sequence_length= 100
data_dim= 1
train_data_3= np.random.rand(num_batches,sequence_length,data_dim).astype(theano.config.floatX)
#As in the tutorial
forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
l_lstm = lasagne.layers.LSTMLayer(
(num_batches,sequence_length, data_dim),
num_units=8,
forgetgate=forget_gate
)
lstm_in= T.tensor3(name='x', dtype=theano.config.floatX)
lstm_out = lasagne.layers.get_output(l_lstm, {l_lstm:lstm_in})
f = theano.function([lstm_in], lstm_out)
lstm_output_np= f(train_data_3)
lstm_output_np.shape
#= (20, 100, 1)
An unqualified LSTM (I mean in its default mode) should produce one output for each unit right?
The code was run on kaixhin's cuda lasagne docker image docker image
What gives?
Thanks !
You can fix that by using a lasagne.layers.InputLayer
import lasagne
import theano
import theano.tensor as T
import numpy as np
num_batches= 20
sequence_length= 100
data_dim= 1
train_data_3= np.random.rand(num_batches,sequence_length,data_dim).astype(theano.config.floatX)
#As in the tutorial
forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
input_layer = lasagne.layers.InputLayer(shape=(num_batches, # <-- change
sequence_length, data_dim),) # <-- change
l_lstm = lasagne.layers.LSTMLayer(input_layer, # <-- change
num_units=8,
forgetgate=forget_gate
)
lstm_in= T.tensor3(name='x', dtype=theano.config.floatX)
lstm_out = lasagne.layers.get_output(l_lstm, lstm_in) # <-- change
f = theano.function([lstm_in], lstm_out)
lstm_output_np= f(train_data_3)
print lstm_output_np.shape
If you feed your input into the input_layer, it is not ambiguous anymore, so you do not even need to specify where the input is supposed to go. Directly specifying a shape and adding the tensor3 into the LSTM does not work.

Resources