How can I do dask_ml preprocessing on a Dask distributed cluster? My dataset is about 200 GB, and every time I categorize the dataset in preparation for OneHotEncoding, it looks like Dask is ignoring the client and trying to load the dataset into the local machine's memory. Maybe I'm missing something:
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import pandas as pd
import dask.dataframe as dd
df = dd.read_csv('s3://some-bucket/files*.csv', dtype={'column': 'category'})
pipe = make_pipeline(
    Categorizer(),
    DummyEncoder(),
    LogisticRegression(solver='lbfgs')
)

pipe.fit(df, y)
Two immediate things to address:

- You have not instantiated a distributed scheduler in your code.
- You should probably use the LogisticRegression instance from dask-ml rather than scikit-learn.
Working Code Example
Below is a minimal code example that works.
Note that the preprocessing functions accept only Dask DataFrames, while the LogisticRegression estimator accepts only Dask arrays. You can split the pipeline or use a custom FunctionTransformer (from this answer). See this open Dask issue for more context.
from dask_ml.preprocessing import Categorizer, DummyEncoder
from dask_ml.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
from dask_ml.datasets import make_classification
X, y = make_classification(chunks=50)
# define custom transformers to include in pipeline
def trans_array(array):
    # wrap the Dask array in a Dask DataFrame for the DataFrame-only preprocessors
    return dd.from_array(array)

transform_array = FunctionTransformer(trans_array)

def trans_df(dataframe):
    # convert the DataFrame back to a Dask array for the array-only estimator
    return dataframe.to_dask_array(lengths=True)

transform_df = FunctionTransformer(trans_df)

pipe = make_pipeline(
    transform_array,
    Categorizer(),
    DummyEncoder(),
    transform_df,
    LogisticRegression(solver='lbfgs')
)

pipe.fit(X, y)
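Note that Client() with no arguments starts a local cluster. To submit work to an actual distributed cluster, pass the scheduler address instead (the address below is a placeholder for your own scheduler):

client = Client('tcp://scheduler-address:8786')  # placeholder: address of your cluster's scheduler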
Related
I am trying to apply unsupervised learning to a dataset with 97 features and around 6500 rows/samples. All features have discrete values (mostly from 1-10), with some being binary (0/1). What are some of the best clustering algorithms to apply to this data? Thank you!
It's impossible to say which clustering algorithm will perform best on your dataset. You just have to try several methodologies and inspect the results you get. Here are several clustering algorithms that you can try:
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']].copy()
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans']=yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
# plot X & Y coordinates and color by cluster number
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df_cars, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
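For instance, here is a small sketch of trying a second algorithm (AgglomerativeClustering, picked arbitrarily) on the same two features, so you can compare the groupings it produces with the KMeans result above:

from sklearn.cluster import AgglomerativeClustering

# fit a hierarchical clustering model on the same features and plot its labels
agg = AgglomerativeClustering(n_clusters=8)
X['agglomerative'] = agg.fit_predict(df_cars[['mpg', 'hp']])
pyplot.scatter(X['mpg'], X['hp'], c=X['agglomerative'], cmap='rainbow', s=50, alpha=0.8)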
I am using the credit card fraud dataset (link below), and it's highly imbalanced: the positive class has only 492 instances and the negative class has 284315 instances.
I was applying PU Bagging (link below) on it to extract hidden positives in the negative class, i.e. negative instances having similar properties/values to positive instances. For the max_samples hyperparameter I was passing sum(y), which worked, but just for testing purposes I set max_samples to 1000 to check whether it raises an error, and it does not. If I give max_samples=1000, that means it should take 1000 samples from both classes, yet it did not give me any error. I also tested with values less than 492, like 30, and it still worked, and I also tried with bootstrap and oob_score set to False, but still no error. I also tried giving max_samples as a list like [492, 492], but it does not accept a list like that.
I want the classifier to take 492 samples from each class, as in [492, 492], but I don't know whether it is doing that or not.
Link for the dataset: https://machinelearningmastery.com/standard-machine-learning-datasets-for-imbalanced-classification/
Link for pu_bagging code: https://github.com/roywright/pu_learning/blob/master/baggingPU.py
My code is:
#importing and preprocessing
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
df_bank=pd.read_csv('testingcreditcard.csv')
y_bank=df_bank['labels']
df_bank.drop(['labels'],axis=1,inplace=True)
#counter
unique, counts = np.unique(y_bank, return_counts=True)
dict(zip(unique, counts))
#Pu_bagging
from sklearn.ensemble import RandomForestClassifier
from baggingPU import BaggingClassifierPU
bc = BaggingClassifierPU(RandomForestClassifier(), n_estimators=200, n_jobs=-1, max_samples=30)
bc.fit(df_bank, y_bank)
#Predictions
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
rpredd=bc.predict(df_bank)
print(confusion_matrix(y_bank,rpredd))
print(accuracy_score(y_bank,rpredd))
print(classification_report(y_bank,rpredd))
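As an aside, if BaggingClassifierPU keeps the estimators_samples_ attribute of the scikit-learn BaggingClassifier it is adapted from (an assumption about that code, not something its repository documents), one rough way to see how many rows each base estimator was actually fitted on is:

import numpy as np
# Assumption: BaggingClassifierPU exposes estimators_samples_ like sklearn's BaggingClassifier,
# with one entry per base estimator describing the rows drawn for it.
for i, drawn in enumerate(bc.estimators_samples_[:5]):
    drawn = np.asarray(drawn)
    n_rows = drawn.sum() if drawn.dtype == bool else len(drawn)  # mask vs. index representation
    print('estimator', i, 'was fitted on', n_rows, 'rows')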
"I have scaled my dataset by using Standard Scaler , Now how to know it has been scaled, I am sure it has been scaled but how to see it"
As #Coderji said, you can always find out the mean and standard deviation, which should be equal to 0 and 1 respectively.
However, there is another method to visualize it.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using the iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
import matplotlib.pyplot as plt
import seaborn as sns
sns.distplot(x[:,1])
See the resulting distribution plot for sepal width (column index 1 of the iris data).
Similarly, you can check all the other variables, or a simple pairplot will do the job.
This gives a visual indication that the data has been standardised.
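If you prefer a numeric check over a plot, the per-column mean and standard deviation of the transformed array should be approximately 0 and 1:

# after StandardScaler, each column should have mean ~ 0 and standard deviation ~ 1
print(x.mean(axis=0))
print(x.std(axis=0))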
I am trying some code, but it shows me this error:
NameError: name 'cross_validation' is not defined
when I run this line:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
My sklearn version is 0.19.1.
Use cross_val_score and train_test_split separately. Import them using:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
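With those imports in place, the line that raised the NameError becomes simply:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)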
Then, before applying the cross-validation score, you need to pass the data through some model. Follow the code below as an example and adapt it accordingly:
x_train, x_test, y_train, y_test = train_test_split(balancedData.iloc[:,0:29], balancedData['Left'], test_size=0.25, random_state=123)
rf = RandomForestClassifier(max_depth=8, n_estimators=5)
rf_cv_score = cross_val_score(estimator=rf, X=x_train, y=y_train, cv=5)
print(rf_cv_score)
Import RandomForestClassifier from sklearn.ensemble before using it.
I'm trying to use Keras' Functional API, as I need to add additional input into my recurrent neural net between two layers. This would be fine; however, I also have input I want to forecast, so I'm using a masking layer to keep the out-of-sample data from affecting the model. The masking layer throws an error when I try to pass it an Input(...), as it expects a list of integers rather than tensor variables. Is there a specific masking layer for the Functional API? Here is my code:
import numpy
import pandas
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import GRU
from keras.layers import SimpleRNN
from keras.layers.core import Masking
from keras.optimizers import Adam
from keras.optimizers import RMSprop
from keras.optimizers import Nadam
from keras.layers import TimeDistributed
from keras import initializations
from keras.layers import Input, merge
from keras.models import Model
dataframe = pandas.read_csv('C:/Users/RNCZF01/Documents/Cameron-Fen/Economic Ideas/LSTM/LSTM-data/GDP+stockmarket.csv', usecols=[1,3], header=None, engine='python')
dataset = dataframe.values
dataset = dataset.astype('float32')
dataframe1 = pandas.read_csv('C:/Users/RNCZF01/Documents/Cameron-Fen/Economic Ideas/LSTM/LSTM-data/GDP+stockmarket.csv', usecols=[0], header=None, engine='python')
dataset1 = dataframe1.values
dataset1 = dataset1.astype('float32')
train, test = dataset[start:train_size+test_size,:]*mult, dataset1[start:train_size+test_size,:]*mult
#set the masking to 0.0
for each in range(test_size):
    train[train_size + each, :] = [0.0, 0.0]
train, test = train[:259], test[:259]
validx, validy = dataset[start:train_size+test_size,:]*mult, dataset1[start:train_size+test_size,:]
main_input = Input(shape=(259,2), name='main_input')
m = Masking(mask_value=0.0)(main_input)  # error is here because masking expects indices to be integers, not a tensor variable
Here is the data. Use column 0 as the test data and columns 1 and 3 as the training data in the code.
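For reference, here is a minimal sketch of a Masking layer applied to an Input in the Functional API, using the tf.keras API of recent TensorFlow releases rather than the older Keras 1 API in the question; in these versions Masking accepts the symbolic input tensor directly:

from tensorflow.keras.layers import Input, Masking, LSTM, Dense
from tensorflow.keras.models import Model

# 259 timesteps with 2 features per step, as in the question
main_input = Input(shape=(259, 2), name='main_input')
m = Masking(mask_value=0.0)(main_input)  # timesteps that are all 0.0 are skipped by downstream layers
h = LSTM(32)(m)
out = Dense(1)(h)

model = Model(inputs=main_input, outputs=out)
model.compile(optimizer='adam', loss='mse')
model.summary()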