How to use array features in RandomForest without flattening the input?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
array_feature = np.array([0, 0, 1])
train_x = np.matrix([[1, 2, array_feature], [3, 4, array_feature], [1, 1, array_feature]])
train_y = np.array([1, 0, 1])
clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)
This raises:
ValueError: setting an array element with a sequence.
You can't.
In sklearn, most models can only use numerical data, and preprocessing is done separately. Tree models (in sklearn) in particular can only split on whether a given feature is less than or greater than a given value. You can either flatten the arrays or provide some encoding for them, depending on what those arrays represent (see the sketch below).
(Tree models in other packages, and perhaps soon in sklearn, can treat categorical variables directly. Ordinal variables get treated just like continuous ones, and unordered categorical variables can be split into arbitrary bipartitions in CART or cause multiple-arity splits in Quinlan-family trees. But even then you would still need to inform the model whether your arrays should be treated as ordinal, unordered categorical, or ...)
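For illustration, here is a minimal sketch of the flattening option, using the arrays from the question: each element of array_feature becomes its own numeric column, so the matrix passed to fit contains only scalars. (This is just one possible encoding, not the only reasonable one.)

import numpy as np
from sklearn.ensemble import RandomForestClassifier

array_feature = np.array([0, 0, 1])
# concatenate the scalar features with the flattened array feature
train_x = np.array([
    np.concatenate(([1, 2], array_feature)),
    np.concatenate(([3, 4], array_feature)),
    np.concatenate(([1, 1], array_feature)),
])  # shape (3, 5), purely numeric
train_y = np.array([1, 0, 1])

clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)  # fits without the ValueError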
Related
So I have the following train data (no header, explanation below):
[1.3264,1.3264,1.3263,1.32632]
[2.32598,2.3256,2.3257,2.326,2.3256,2.3257,2.32566]
[10.3215,10.3215,10.3214,10.3214,10.3214,10.32124]
It does not have a header because all elements, with the exception of the last one in each array, are inputs, and the last one is the result/output.
So taking the first example: 1.3264, 1.3264, 1.3263 are the inputs/feed data that I want to give to the algorithm, and 1.32632 is the outcome/result.
All of these are historical values that should lead to a recognizable pattern.
I would like to give some test data to the algorithm, and it would give me an outcome/result based on the pattern it identified.
From all the examples I looked into with ML and sklearn, I have never seen one where you have (for the same type of data) a varying number of entries. They all seem to have the same number of columns and different types of inputs, whereas mine always has the same type of input.
You can try two different approaches:
Extract features from your variable-length data so that every row has a fixed number of features. After that you can use any algorithm from sklearn or other packages. Feature extraction is a highly domain-specific process that requires context about what the data actually is. For example, you can try features like these:
import numpy as np
def extract_features_one_row(arr):
    arr = np.array(arr)
    y = arr[-1]     # the last element of the row is the target
    arr = arr[:-1]  # the remaining elements are the inputs
    features = [
        np.mean(arr),
        np.sum(arr),
        np.median(arr),
        np.std(arr),
        np.percentile(arr, 5),
        np.percentile(arr, 95),
        np.percentile(arr, 25),
        np.percentile(arr, 75),
        (arr[1:] > arr[:-1]).sum(),  # number of increasing pairs
        (arr > arr.mean()).sum(),    # number of elements > mean value
        # extract trends, number of modes, etc.
    ]
    return features, y
data = [
[1.3264, 1.3264, 1.3263, 1.32632],
[2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
[10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]
X, y = zip(*[extract_features_one_row(row) for row in data])
X = np.array(X) # (3, 10)
print(X.shape, y)
So now every row of X has the same number of columns.
Use an ML algorithm that supports variable-length data: recurrent neural networks, transformers, or convolutional networks with padding (a minimal padding sketch is shown below).
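As a rough illustration of the padding idea only (the model itself, e.g. an RNN or CNN, is omitted), here is a sketch that pads the variable-length input rows to a common length and keeps a mask of which positions are real; the pad value of 0.0 and the mask are assumptions made for illustration:

import numpy as np

data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]

# split each row into inputs (all but the last value) and target (the last value)
inputs = [row[:-1] for row in data]
targets = np.array([row[-1] for row in data])

max_len = max(len(seq) for seq in inputs)
X = np.zeros((len(inputs), max_len))                 # padded inputs, 0.0 as the pad value
mask = np.zeros((len(inputs), max_len), dtype=bool)  # True where a real value is present
for i, seq in enumerate(inputs):
    X[i, :len(seq)] = seq
    mask[i, :len(seq)] = True

print(X.shape, targets)  # (3, 6) and the three target values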
I am trying to perform linear regression on multidimensional A and b for arrays that are larger-than-memory. Using sklearn notation, my arrays have shapes:
A: (n_samples, n_features)
b: (n_samples, n_targets)
I have so far been using numpy arrays and sklearn.linear_model.ridge_regression.
Now I am swapping to dask and need to find something equivalent. I've tried passing the dask arrays to ridge_regression, but I run out of memory when doing so. I've looked at dask_ml, but their LinearRegression model only takes 1D arrays b/y with shape (n_samples,), and requires the data to be chunked correctly, which I am not sure how to do. I can use a standard matrix-inversion approach, but then I risk singular-matrix errors.
A small worked example, which shows how the ridge_regression method ends up with a numpy array, not a dask array:
import dask.array as da
import sklearn.linear_model

X = da.random.random((1024, 30))
y = da.random.random((1024, 10000))
fit = sklearn.linear_model.ridge_regression(X, y, alpha=0.0)
print(type(fit))  # numpy.ndarray, not a dask array
Does anyone have any suggestions for models that can perform the above regression, but for large data?
In GridSearchCV, when I fit something like the following:
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
and after that,
when I execute this:
grid_search.best_estimator_.feature_importances_
it gives an array of values,
so my question is: what values does grid_search.best_estimator_.feature_importances_ return?
In your case, grid_search.best_estimator_ is the RandomForestRegressor refit with the best hyperparameters found by the search, and its feature_importances_ attribute returns an array.
Therefore, according to the RandomForestRegressor documentation:
feature_importances_ : array of shape = [n_features]
Return the feature importances (the higher, the more important the feature).
In other words, it returns the importance of each feature, computed from your training set X_train. Each element of feature_importances_ corresponds to one feature of X_train (e.g. the first element of feature_importances_ refers to the first feature/column of X_train).
The higher the value of an element in feature_importances_, the more important that feature is in X_train.
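For example, here is a small self-contained sketch that pairs each importance score with its column and sorts them; the data and param_grid below are hypothetical stand-ins for X_train, y_train and the grid from the question:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical stand-ins for X_train / y_train and param_grid
X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=0)
param_grid = {'n_estimators': [10, 50]}

grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                           cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

importances = grid_search.best_estimator_.feature_importances_
print(importances.shape)  # (5,) -- one score per column of X_train

# pair each score with its column index and sort from most to least important
for idx, score in sorted(enumerate(importances), key=lambda t: t[1], reverse=True):
    print(f'feature {idx}: {score:.4f}')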
My data frame has float values everywhere, yet when I pass it through k-means it complains that it couldn't convert a string to float.
How do I convert NaN values, if any, to float values in the entire data frame?
The following will do the job: it converts every string column to categorical codes. Alternatively, you can one-hot encode the variables in those columns (see the sketch after the code).
import numpy as np
from sklearn.cluster import KMeans
import pandas
df = pandas.read_csv('zipIncome.csv')
print(df)
# convert every string (object) column to integer category codes
for col_name in df.select_dtypes(include=['object']).columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
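As the one-hot alternative mentioned above, a minimal sketch using pandas.get_dummies (the column names and values here are hypothetical, standing in for the CSV contents):

import pandas

# hypothetical frame standing in for the data read from the CSV
df = pandas.DataFrame({'income': [52000.0, 61000.0, 48000.0],
                       'state': ['NY', 'CA', 'NY']})

# one-hot encode every string column; numeric columns pass through unchanged
df_encoded = pandas.get_dummies(df)
print(df_encoded)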
Based on your code, it would seem that you have only instantiated KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X:
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto')
clusters = kmeans.fit_predict(X)
now clusters has the cluster number for each sample in X.
(alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is the predict that will output the cluster labels that you will need)
If you later want to get clusters for new data, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans applies what it learned from X to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is a string for your new column name; you can of course call it whatever you want.
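Putting the pieces together, a minimal end-to-end sketch on synthetic data (the data, column names, and hyperparameters here are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# synthetic, purely numeric data standing in for a cleaned DataFrame
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, n_init=10)
clusters = kmeans.fit_predict(X)   # cluster label for each row of X

X['cluster'] = clusters            # attach the labels as a new column
print(X['cluster'].value_counts())

# later, assign the already-learned clusters to new data
new_data = pd.DataFrame(rng.normal(size=(5, 3)), columns=['a', 'b', 'c'])
print(kmeans.predict(new_data))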
I have a handwritten dataset for classification purposes where the classes are from a-z. If I want to use MLPClassifier, I think I cannot use such categorical classes directly because the MLP implementation in scikit-learn only handles numerical classes. Thus, what is the appropriate action to take here? How about converting these classes to be numbered from 1-28; does that make sense? If not, does scikit-learn provide a special encoding mechanism for class labels to handle this case (I guess one-hot encoding is not an option here)?
Thank you
You may need to preprocess the data, as scikit-learn only handles numeric values. In my case I wanted to predict the currency of a transaction. The currency is expressed as an ISO code, so LabelEncoder was used to transform it into numeric categories (i.e. 1, 2, 3, ...):
# Import numpy and the LabelEncoder object
import numpy as np
from sklearn.preprocessing import LabelEncoder

# encode the class column (currency ISO codes) as integer categories
my_encoder = LabelEncoder()
my_class_currency = np.array(my_encoder.fit_transform(my_data['currency'])).reshape(-1, 1)

# Create a "dictionary" to translate the categories back into the actual values once you have the output
my_class_decoder = list(np.unique(my_data['currency']))
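A small self-contained sketch of the same idea (my_data below is a hypothetical DataFrame; for the handwriting question you would use the a-z labels instead of currencies). Note that the encoder's classes_ attribute and inverse_transform give you the same decoding as the list above:

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical data standing in for my_data / the a-z class labels
my_data = pd.DataFrame({'currency': ['USD', 'EUR', 'USD', 'GBP']})

my_encoder = LabelEncoder()
codes = my_encoder.fit_transform(my_data['currency'])

print(codes)                                 # [2 0 2 1]
print(my_encoder.classes_)                   # ['EUR' 'GBP' 'USD'] -- sorted unique labels
print(my_encoder.inverse_transform(codes))   # back to the original strings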