Why does pandas DataFrame give a 1-Dimensional error? - machine-learning

I'm standardizing a dataset. After applying the standardization I convert the resulting array into a DataFrame, and then I get this 1-dimensional error. Please help me out.
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train_scaled)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test_scaled)
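The likely cause: columns= is being given the 2-D scaled array itself, while pandas expects a 1-D sequence of column names there, hence the 1-dimensional error. A minimal sketch of the usual fix, assuming X_train and X_test are the original DataFrames whose column names you want to keep:
import pandas as pd
from sklearn.preprocessing import StandardScaler
# assuming X_train and X_test are the original (unscaled) DataFrames
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# pass the original column names (a 1-D sequence), not the 2-D scaled array
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)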

Related

SVD on huge dataset with dask and xarray

I use xarray and dask to open multiple netcdf4 files that together are around 200 GB via
import xarray as xr
ds = xr.open_mfdataset('/path/files*.nc', parallel=True)
The dimensions of this dataset "ds" are (longitude, latitude, height, time).
The files are automatically concatenated along time, which is okay.
Now I would like to apply the "svd_compressed" function from the dask library.
I would like to reshape the longitude, latitude, and height dimension into one dimension, such that I have a 2-d matrix on which I can apply the svd.
I tried using the
dask.array.reshape
function, but I get "'Dataset' object has no attribute 'shape'".
I can convert the xarray dataset to an array and use stack, which makes it 2-d, but if I then use
Dataset.to_dask_dataframe
to convert my xarray to a dask dataframe, my memory runs out.
Does somebody have an idea how I can tackle this problem?
Should I chunk my data differently for the "to_dask_dataframe" function?
Or can I somehow use the "dask svd_compressed" function on the loaded netcdf4 dataset without a reshape?
Thanks for the help.
Edit:
Here is a code example that is not working. I have downloaded data from ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-pressure-levels?tab=overview), which I load from disk.
After that I take the temperature data and stack the longitude, latitude, and level values in one dimension to have a time-space 2d-array.
Then I would like to apply an SVD on the data.
from dask.distributed import Client, progress
import xarray as xr
import dask
import dask.array as da
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='9GB')
ds = xr.open_mfdataset('/home/user/Arbeit/ERA5/Data/era_5_m*.nc', parallel=True)
ds = ds['t']
ds = ds.stack(z=("longitude", "latitude", "level"))
u, s, v = da.linalg.svd_compressed(ds, k=5, compute=True)
I get an error "dot only operates on DataArrays."
I assume it's because I need to convert it to a dask array, so I do
da = ds.to_dask_dataframe()
which gives me "'DataArray' object has no attribute 'to_dask_dataframe'".
So I try
ds = ds.to_dataset(name="temperature")
da = ds.to_dask_dataframe()
which results in "Unable to allocate 89.4 GiB for an array with shape".
I guess I need to rechunk it?
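One route that may avoid the dataframe conversion entirely: take the dask array backing the stacked DataArray via its .data attribute, rechunk it, and feed that to svd_compressed. A minimal sketch under that assumption (chunk sizes are illustrative):
import dask
import dask.array as da
# assuming ds is the stacked DataArray from above, with dims (time, z)
arr = ds.data                          # the underlying dask array, no dataframe needed
arr = arr.rechunk({0: 'auto', 1: -1})  # chunk along time only; keep the stacked dim whole
u, s, v = da.linalg.svd_compressed(arr, k=5)
u, s, v = dask.compute(u, s, v)        # materialize the factors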

Normalize and de-normalizing data in prediction model

I have developed a Random Forest model which takes two inputs as X and one output as Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected a dataset of unseen data, coming from another source, as input for the model. I normalized its X values, fed them to the trained model, and got a normalized Y value as output. I wonder what the de-normalizing process should be. By which value do I have to multiply the output to get the denormalized value?
I'd appreciate it if someone can help me in this regard.
You need to apply the preprocessing inversely, but you need the mean and sd (standard deviation) values that were used for normalization.
For example, with scikit-learn you can do it easily, with one line of code.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = ...  # the data that was used to fit the scaler
scaled_data = scaler.fit_transform(data)         # normalize
inverse = scaler.inverse_transform(scaled_data)  # recover the original values
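Applied to the question's setup, the key is to keep the scaler that was fitted on Y and call its inverse_transform on the model's normalized predictions. A minimal sketch with separate scalers for X and Y (the names X_train, y_train, X_new and model are illustrative):
from sklearn.preprocessing import StandardScaler
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1))
# ... train the random forest on the scaled training data ...
X_new_scaled = x_scaler.transform(X_new)     # unseen data, scaled with the X scaler
y_pred_scaled = model.predict(X_new_scaled)
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))  # back to original units
Equivalently, for a StandardScaler the manual inverse is y_pred_scaled * y_scaler.scale_ + y_scaler.mean_, which is where the mean and sd mentioned above come in.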

GridSearchCV ranking not returning desired values

I am building a NearestNeighbors recommendation model that takes a list of words and recommends similar words, and I want to tune the value of n_neighbors. This is the code I typed out.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestNeighbors
gs_clf = GridSearchCV(NearestNeighbors(algorithm='brute'), {
    'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}, scoring='f1', cv=5)
gs_clf = gs_clf.fit(transformed_courses_new, np.array(courses.code))
transformed_courses_new is an array of shape (159, 120), np.array(courses.code) is (159,), and each value is a unique label. My understanding was that the grid search would test all the values of n_neighbors and rank the value of k for which the f1 score is highest. But when I ran the code, I got a warning that NearestNeighbors doesn't have .predict functionality.
Is there any workaround for this?
Any help is appreciated.
Use KNeighborsClassifier.
(Keep in mind that kNN suffers from the curse of dimensionality.)
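A sketch of how the grid search might look with KNeighborsClassifier instead (variable names are taken from the question; scoring='f1_macro' is an assumption, since plain 'f1' only applies to binary labels):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
gs_clf = GridSearchCV(
    KNeighborsClassifier(algorithm='brute'),
    {'n_neighbors': list(range(1, 11))},  # same candidate values as in the question
    scoring='f1_macro',
    cv=5,
)
gs_clf = gs_clf.fit(transformed_courses_new, np.array(courses.code))
print(gs_clf.best_params_)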

error while passing data-frame through k-means

Although my data frame has float values everywhere, while passing it through k-means it says that it couldn't convert the string to float.
How do I convert NaN values, if any, to float values in the entire data frame?
This should do the job: convert all string-format columns to categorical codes (or use one-hot encoding of the variables in those columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
# convert every string (object) column to categorical codes so KMeans sees only numbers
for col_name in df.select_dtypes(include='object').columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
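For the NaN part of the question, fill or drop the missing values before fitting; a minimal sketch (filling with the column mean is just one option):
# fill numeric NaNs with the column mean so every value is a float;
# df.dropna() is the other common option
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())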
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X.
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you want to later get clusters on data, you should use kmeans.predict(new_data) rather than fit_predict() so that KMeans uses the learning from X, and applies it to your new_data (or depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is a string for your new column name; you can of course call it whatever you want.

Issue with the results of PCA component values

I am performing PCA on a dataset of (28 features + 1 class label) and 11M rows (samples) using the following simple code:
from sklearn.decomposition import PCA
import pandas as pd
df = pd.read_csv('HIGGS.csv', sep=',', header=None)
df_labels = df[df.columns[0]]
df_features = df.drop(df.columns[0], axis=1)
pca = PCA()
pca.fit(df_features.values)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.shape)
transformed_data = pca.transform(df_features.values)
The pca.explained_variance_ratio_ values (the eigenvalues normalized to sum to 1) are the following:
[0.11581302 0.09659324 0.08451179 0.07000956 0.0641502 0.05651781
0.055588 0.05446682 0.05291956 0.04468113 0.04248516 0.04108151
0.03885671 0.03775394 0.0255504 0.02181292 0.01979832 0.0185323
0.0164828 0.01047363 0.00779365 0.00702242 0.00586635 0.00531234
0.00300572 0.00135565 0.00109707 0.00046801]
Based on the explained_variance_ratio_, I don't know if something is wrong here. The highest component is 11%, whereas I expected values starting at 99% or so. Does it imply that the dataset needs some preprocessing, such as ensuring the data follow a normal distribution?
Dude, 99% for the first component means that the axis associated with the largest eigenvalue encodes 99% of the variance in your dataset. It is quite uncommon for any dataset to have a situation like this. Otherwise, the problem shrinks to a 1-D classification/regression problem.
There is nothing wrong with this output. Retain the first axes that encode around 80% of the variance and build your model.
Note: the PCA transformation is usually used to decrease the dimensionality of your problem space. Since you have only 28 variables, I recommend abandoning PCA altogether.
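If you do keep PCA and follow the "~80% of the variance" suggestion above, here is a minimal sketch continuing from the question's pca object (standardizing the features first is a common extra step when they are on different scales):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# cumulative explained variance shows how many components reach ~80%
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.80)) + 1
# or let scikit-learn choose the number of components for a target variance
pca_80 = PCA(n_components=0.80)
transformed = pca_80.fit_transform(StandardScaler().fit_transform(df_features.values))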
