sklearn - ValueError: could not convert string to float: 'yes' - machine-learning

I have a dataset which has a categorical column, colname.
When performing one-hot encoding using sklearn I get an error.
from sklearn.preprocessing import OneHotEncoder

def ohe_encode(train, test, index):
    Onehot = OneHotEncoder(categorical_features='all', handle_unknown='error')
    x_train_1 = train
    x_test_1 = test
    colname = df.columns[index]
    Onehot.fit(train[colname].astype(str))
    x_trans = Onehot.transform(train[colname].astype(str))
    new_features = Onehot.transform(test[colname].astype(str))
    return (x_trans, new_features)
The following error appeared:
ValueError: could not convert string to float: 'yes'
I am not able to find the cause of the error.
Thanks in advance.

Taken from sklearn's OneHotEncoder documentation (emphasis mine):
Encode categorical integer features using a one-hot aka one-of-K scheme.
The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features.
You, however, feed in the raw categorical values, e.g. strings like "yes" and "no". Therefore you get the ValueError.
You need to factorize your data first, which means you convert your strings into categorical numbers (integers). Then you can do the one hot encoding.
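A minimal sketch of that two-step approach, assuming a pandas DataFrame with a single categorical column named 'answer' (the column name is invented for illustration):
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'answer': ['yes', 'no', 'yes', 'no']})

# Step 1: factorize the strings into integer codes
le = LabelEncoder()
codes = le.fit_transform(df['answer'])        # e.g. array([1, 0, 1, 0])

# Step 2: one-hot encode the integer codes (the old-style API expects a 2-D array)
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1))
print(onehot.toarray())
(In scikit-learn 0.20 and later, OneHotEncoder can encode string columns directly, so the intermediate integer step is no longer needed.)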

Related

Use array feature in RandomForest without flatten

How can I use array features in RandomForest without flattening the input?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
array_feature = np.array([0,0,1])
train_x = np.matrix([[1, 2, array_feature], [3, 4, array_feature], [1, 1, array_feature]])
train_y = np.array([1,0,1])
clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)
ValueError: setting an array element with a sequence.
You can't.
In sklearn, most models can only use numerical data, and preprocessing is done separately. Tree models (in sklearn) in particular can only make splits on whether a given feature is less or greater than a given value. You can either flatten the arrays, or provide some encoding for them, depending on what those arrays represent.
(Tree models in other packages, and perhaps soon in sklearn, can treat categorical variables directly. Ordinal variables are treated just like continuous ones, and unordered categorical variables can be split into arbitrary bipartitions in CART or cause multiple-arity splits in Quinlan-family trees. But even then you would still need to inform the model whether your arrays should be treated as ordinal, unordered categorical, or ...)
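As a sketch of the flattening option, assuming the array feature simply holds three numeric values that can stand as their own columns:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

array_feature = np.array([0, 0, 1])

# Flatten: each element of the array feature becomes its own numeric column
train_x = np.array([
    [1, 2, *array_feature],
    [3, 4, *array_feature],
    [1, 1, *array_feature],
])
train_y = np.array([1, 0, 1])

clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)  # no ValueError now: every cell is a scalar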

error while passing data-frame through k-means

Although my data frame has float values everywhere, when passing it through k-means I get an error saying it couldn't convert a string to float.
How do I convert NaN values, if any, to float values across the entire data frame?
This would do the job: it converts all the string-format columns to categorical codes (alternatively, you could one-hot encode the variables in these columns).
import numpy as np
from sklearn.cluster import KMeans
import pandas

df = pandas.read_csv('zipIncome.csv')
print(df)

# Repeat for every string-typed column, with col_name set to that column's name
df[col_name] = df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600, algorithm='auto').fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
Based on your code, it would seem that you only instantiated the KMeans but haven't used it.
You'll need input data that is clean (i.e. no strings etc.); let's call it X.
kmeans = KMeans(n_clusters=4,init='k-means++', max_iter=600, algorithm = 'auto')
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(Alternatively, you can do the fit(X) and then later predict(X) separately, but ultimately it is predict that outputs the cluster labels you need.)
If you later want to get clusters for new data, you should use kmeans.predict(new_data) rather than fit_predict(), so that KMeans applies what it learned from X to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is the string for your new column name; you can of course call it whatever you want.

Auto-fill for testing data after DictVectorizer and fit_transform

In sklearn, for the training data's dictionaries, I get the full-length vectors transformed from the original dicts, like [1,0,0,0,0] (using DictVectorizer and fit_transform).
vec = DictVectorizer(sparse=True)
data = vec.fit_transform(dict_list)
However, in the testing data, due to the limited sample number, I do not get the full-length vectors transformed from the dicts, but something like [0,0,0,1].
Is there anything that can automatically fill in the missing vector columns (with 0) for the testing data?
On the testing side as well, all you do is:
test_data = vec.transform(test_dict_list)
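A small sketch of the full flow with made-up dictionaries, showing that the fitted vectorizer keeps the training columns and fills missing keys in the test data with 0:
from sklearn.feature_extraction import DictVectorizer

train_dict_list = [{'a': 1, 'b': 1}, {'c': 1, 'd': 1, 'e': 1}]  # hypothetical training dicts
test_dict_list = [{'e': 1}]                                     # test dict with fewer keys

vec = DictVectorizer(sparse=True)
data = vec.fit_transform(train_dict_list)   # learns the 5 columns a..e
test_data = vec.transform(test_dict_list)   # same 5 columns; missing keys become 0

print(vec.get_feature_names())   # ['a', 'b', 'c', 'd', 'e'] (get_feature_names_out() on newer scikit-learn)
print(test_data.toarray())       # [[0. 0. 0. 0. 1.]]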

Handle categorical class labels for scikit-learn MLPClassifier

I have a handwritten-character dataset for classification where the classes are from a-z. If I want to use MLPClassifier, I think I cannot use such categorical classes directly because the MLP implementation in scikit-learn only handles numerical classes. So what is the appropriate thing to do here? Does it make sense to convert these classes to numbers 1-28? If not, does scikit-learn provide a special encoding mechanism for class labels to handle this case (I guess one-hot encoding is not an option here)?
Thank you
You may need to preprocess the data, as scikit-learn only handles numeric values. In my case I wanted to predict the currency of a transaction. The currency is expressed as an ISO code, so LabelEncoder was used to transform it into numeric categories (i.e. 1, 2, 3, ...):
#Import the required libraries
import numpy as np
from sklearn.preprocessing import LabelEncoder

#Encode the class column (my_data is the DataFrame with a 'currency' column)
my_encoder = LabelEncoder()
my_class_currency = np.array(my_encoder.fit_transform(my_data['currency'])).reshape(-1, 1)

#Create a "dictionary" to translate the categories back into the actual values once you have the output
my_class_decoder = list(np.unique(my_data['currency']))
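As a usage note, continuing the snippet above: the integer codes a classifier predicts can be translated back with the decoder list or with the encoder's inverse_transform (pred_codes below is a hypothetical stand-in for whatever clf.predict would return):
# Hypothetical predicted codes; in practice these would come from clf.predict(X_test)
pred_codes = np.array([2, 0, 1])

# Translate the integer codes back to the original currency strings
pred_labels = [my_class_decoder[c] for c in pred_codes]
# equivalently: my_encoder.inverse_transform(pred_codes)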

tf.nn.embedding_lookup with float input?

I would like to implement an embedding table with float inputs instead of int32 or int64.
The reason is that instead of words, as in a simple RNN, I would like to use percentages.
For example, in the case of a recipe, I may have 1000 or 3000 ingredients, but any single recipe may have a maximum of 80.
The ingredients will be represented as percentages, for example: ingredient1=0.2, ingredient2=0.8, etc.
My problem is that TensorFlow forces me to use integers for my embedding table:
TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: int32, int64
Any suggestions?
I appreciate your feedback.
Example of the embedding lookup:
inputs = tf.placeholder(tf.float32, shape=[None, ninp], name="x")
n_vocab = len(int_to_vocab)
n_embedding = 200  # Number of embedding features
with train_graph.as_default():
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
The error is caused by the tf.float32 dtype in:
inputs = tf.placeholder(tf.float32, shape=[None, ninp], name="x")
I have thought of an algorithm that could work using loops, but I was wondering if there is a more direct solution.
Thanks!
tf.nn.embedding_lookup can't accept float input, because the point of this function is to select the embeddings at the specified rows.
Example:
Here there are 5 words and 5 three-dimensional embedding vectors, and the operation returns the 3rd row (with 0-indexing). This is equivalent to this line in tensorflow:
embed = tf.nn.embedding_lookup(embed_matrix, [3])
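A concrete sketch of that example, with a made-up 5x3 embedding matrix (only the shape matters):
import tensorflow as tf

# 5 words, each with a 3-dimensional embedding vector (values are arbitrary)
embed_matrix = tf.constant([[0.1, 0.2, 0.3],
                            [0.4, 0.5, 0.6],
                            [0.7, 0.8, 0.9],
                            [1.0, 1.1, 1.2],
                            [1.3, 1.4, 1.5]])

embed = tf.nn.embedding_lookup(embed_matrix, [3])  # selects the row at index 3: [[1.0, 1.1, 1.2]]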
You can't possibly look up a floating point index, such as 0.2 or 0.8, because there is no 0.2 and 0.8 row index in the matrix. Highly recommend this post by Chris McCormick about word2vec.
What you describe sounds more like a softmax loss function, which outputs a probability distribution over the target classes.
