Calculate Confusion Matrix of a FastText Classifier model - machine-learning

I'm calculating for a Facebook FastText classifier model the confusion matrix in this way:
#!/usr/local/bin/python3
import argparse
import numpy as np
from sklearn.metrics import confusion_matrix
def parse_labels(path):
with open(path, 'r') as f:
return np.array(list(map(lambda x: int(x[9:]), f.read().split())))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Display confusion matrix.')
parser.add_argument('test', help='Path to test labels')
parser.add_argument('predict', help='Path to predictions')
args = parser.parse_args()
test_labels = parse_labels(args.test)
pred_labels = parse_labels(args.predict)
print(test_labels)
print(pred_labels)
eq = test_labels == pred_labels
print("Accuracy: " + str(eq.sum() / len(test_labels)))
print(confusion_matrix(test_labels, pred_labels))
My predictions and test set are like
$ head -n10 /root/pexp
__label__spam
__label__verified
__label__verified
__label__spam
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
$ head -n10 /root/dataset_test.csv
__label__spam
__label__verified
__label__verified
__label__spam
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
__label__verified
Predictions of the model has been calculated over the test set in this way:
./fasttext predict /root/my_model.bin /root/dataset_test.csv > /root/pexp
I'm then going the calculate the FastText Confusion Matrix:
$ ./confusion.py /root/dataset_test.csv /root/pexp
but I'm stuck with this error:
Traceback (most recent call last):
File "./confusion.py", line 18, in <module>
test_labels = parse_labels(args.test)
File "./confusion.py", line 10, in parse_labels
return np.array(list(map(lambda x: int(x[9:]), f.read().split())))
File "./confusion.py", line 10, in <lambda>
return np.array(list(map(lambda x: int(x[9:]), f.read().split())))
ValueError: invalid literal for int() with base 10: 'spam'
I have fixed the script as suggested to handle non numeric labels:
def parse_labels(path):
with open(path, 'r') as f:
return np.array(list(map(lambda x: x[9:], f.read().split())))
Also, in the case of FastText it's possibile that the test set will have normalized labels (without the prefix __label__) at some point, so to convert back to the prefix you can do like:
awk 'BEGIN{FS=OFS="\t"}{ $1 = "__label__" tolower($1) }1' /root/dataset_test.csv > /root/dataset_test_norm.csv
See here about this.
Also, the input test file must be cut of the other columns than the label column:
cut -f 1 -d$'\t' /root/dataset_test_norm.csv > /root/dataset_test_norm_label.csv
So finally we get the Confusion Matrix:
$ ./confusion.py /root/dataset_test_norm_label.csv /root/pexp
Accuracy: 0.998852852227
[[9432 21]
[ 3 14543]]
My final solution is here.
[UPDATE]
The script is now working fine. I have added the Confusion Matrix calculation script directly in my FastText Node.js implementation, FastText.js here.

from sklearn.metrics import confusion_matrix
# predict the data
df["predicted"] = df["text"].apply(lambda x: model.predict(x)[0][0])
# Create the confusion matrix
confusion_matrix(df["labeled"], df["predicted"])
## OutPut:
# array([[5823, 8, 155, 1],
# [ 199, 51, 22, 0],
# [ 561, 2, 764, 0],
# [ 48, 0, 4, 4]], dtype=int64)

Related

ValueError: 'logits' and 'labels' must have the same shape for NLP sentiment multi-class classifier

I am trying to make a NLP multi-class sentiment classifier where it takes in sentences as input and classifies them into three classes (negative, neutral and positive). However, when training the model, I run into the error where my logits (None, 3) are not the same size as my labels (None, 1) and the model can't begin training.
My model is a multi-class classifier and not a multi-label classifier since it is only predicting one label per object. I made sure that my last layer had an output of 3 and had the activation = 'softmax'. This should be correct from what I have searched online so I think that the problem lies with my labels.
Currently, my labels have a dimension of (None, 1) since I mapped each class to a unique integer and passed this as my test and train y values (which are in the form of one dimensional numpy array.
Right now I am confused if I have change the dimensions of this array to match the output dimensions and how to go about doing it.
import os
import sys
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.optimizers import SGD
device_name = tf.test.gpu_device_name()
if len(device_name) > 0:
print("Found GPU at: {}".format(device_name))
else:
device_name = "/device:CPU:0"
print("No GPU, using {}.".format(device_name))
# Load dataset into a dataframe
train_data_path = "/content/drive/MyDrive/ML Datasets/tweet_sentiment_analysis/train.csv"
test_data_path = "/content/drive/MyDrive/ML Datasets/tweet_sentiment_analysis/test.csv"
train_df = pd.read_csv(train_data_path, encoding='unicode_escape')
test_df = pd.read_csv(test_data_path, encoding='unicode_escape').dropna()
sentiment_types = ('neutral', 'negative', 'positive')
train_df['sentiment'] = train_df['sentiment'].astype('category')
test_df['sentiment'] = test_df['sentiment'].astype('category')
train_df['sentiment_cat'] = train_df['sentiment'].cat.codes
test_df['sentiment_cat'] = test_df['sentiment'].cat.codes
train_y = np.array(train_df['sentiment_cat'])
test_y = np.array(test_df['sentiment_cat'])
# Function to convert df into a list of strings
def convert_to_list(df, x):
selected_text_list = []
labels = []
for index, row in df.iterrows():
selected_text_list.append(str(row[x]))
labels.append(str(row['sentiment']))
return np.array(selected_text_list), np.array(labels)
train_sentences, train_labels = convert_to_list(train_df, 'selected_text')
test_sentences, test_labels = convert_to_list(test_df, 'text')
# Instantiate tokenizer and create word_index
tokenizer = Tokenizer(num_words=1000, oov_token='<oov>')
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# Convert sentences into a sequence
train_sequence = tokenizer.texts_to_sequences(train_sentences)
test_sequence = tokenizer.texts_to_sequences(test_sentences)
# Padding sequences
pad_test_seq = pad_sequences(test_sequence, padding='post')
max_len = pad_test_seq[0].size
pad_train_seq = pad_sequences(train_sequence, padding='post', maxlen=max_len)
model = tf.keras.Sequential([
tf.keras.layers.Embedding(10000, 64, input_length=max_len),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
tf.keras.layers.GlobalAveragePooling1D(),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(3, activation='softmax')
])
with tf.device(device_name):
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
num_epochs = 10
with tf.device(device_name):
history = model.fit(pad_train_seq, train_y, epochs=num_epochs, validation_data=(pad_test_seq, test_y), verbose=2)
Here is the error:
ValueError Traceback (most recent call last)
<ipython-input-28-62f3c6445887> in <module>
2
3 with tf.device(device_name):
----> 4 history = model.fit(pad_train_seq, train_y, epochs=num_epochs, validation_data=(pad_test_seq, test_y), verbose=2)
1 frames
/usr/local/lib/python3.8/dist-packages/keras/engine/training.py in tf__train_function(iterator)
13 try:
14 do_return = True
---> 15 retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
16 except:
17 do_return = False
ValueError: in user code:
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1051, in train_function *
return step_function(self, iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1040, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1030, in run_step **
outputs = model.train_step(data)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 890, in train_step
loss = self.compute_loss(x, y, y_pred, sample_weight)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 948, in compute_loss
return self.compiled_loss(
File "/usr/local/lib/python3.8/dist-packages/keras/engine/compile_utils.py", line 201, in __call__
loss_value = loss_obj(y_t, y_p, sample_weight=sw)
File "/usr/local/lib/python3.8/dist-packages/keras/losses.py", line 139, in __call__
losses = call_fn(y_true, y_pred)
File "/usr/local/lib/python3.8/dist-packages/keras/losses.py", line 243, in call **
return ag_fn(y_true, y_pred, **self._fn_kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/losses.py", line 1930, in binary_crossentropy
backend.binary_crossentropy(y_true, y_pred, from_logits=from_logits),
File "/usr/local/lib/python3.8/dist-packages/keras/backend.py", line 5283, in binary_crossentropy
return tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=output)
ValueError: `logits` and `labels` must have the same shape, received ((None, 3) vs (None, 1)).
my logits (None, 3) are not the same size as my labels (None, 1)
I made sure that my last layer had an output of 3 and had the activation = 'softmax'
my labels have a dimension of (None, 1) since I mapped each class to a unique integer
The key concept you are missing is that you need to one-hot encode your labels (after assigning integers to them - see below).
So your model, after the softmax, is spitting out three values: how probable each of your labels is. E.g. it might say A is 0.6, B is 0.1, and C is 0.3. If the correct answer is C, then it needs to see that correct answer as 0, 0, 1. It can then say that its prediction for A is 0.6 - 0 = +0.6 wrong, B is 0.1 - 0 = +0.1 wrong, and C is 0.3 - 1 = -0.7 wrong.
Theoretically you can go from a string label directly to a one-hot encoding. But it seems Tensorflow needs the labels to first be encoded as integers, and then that is one-hot encoded.
https://www.tensorflow.org/api_docs/python/tf/keras/layers/CategoryEncoding#examples says to use:
tf.keras.layers.CategoryEncoding(num_tokens=3, output_mode="one_hot")
Also see https://stackoverflow.com/a/69791457/841830 (the higher-voted answer there is from 2019, so applies to TensorFlow v1 I think). And searching for "tensorflow one-hot encoding" will bring up plenty of tutorials and examples.
The issue here was indeed due to the shape of my labels not being the same as logits. Logits were of shape (3) since they contained a float for the probability of each of the three classes that I wanted to predict. Labels were originally of shape (1) since it only contained one int.
To solve this, I used one-hot encoding which turned all labels into a shape of (3) and this solved the problem. Used the keras.utils.to_categorical() function to do so.
sentiment_types = ('negative', 'neutral', 'positive')
train_df['sentiment'] = train_df['sentiment'].astype('category')
test_df['sentiment'] = test_df['sentiment'].astype('category')
# Turning labels from strings to int
train_sentiment_cat = train_df['sentiment'].cat.codes
test_sentiment_cat = test_df['sentiment'].cat.codes
# One-hot encoding
train_y = to_categorical(train_sentiment_cat)
test_y = to_categorical(test_sentiment_cat)

RandomForestRegressor in Julia

I'm trying to train a RandomForestRegressor using DecisionTree.jl
and RandomizedSearchCV (contained in ScikitLearn.jl) in Julia. Primary datasets like x_train and y_train etc. are provided in my google drive as well, So you can test it on your machine. The code is as follows:
using CSV
using DataFrames
using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor()
param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(x_test)
This throws a MethodError like this:
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
[1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
[2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
[3] top-level scope
# c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17
If you're curious about the shape of the data:
julia> size(x)
(1550, 71)
julia> size(y_train)
(1550, 10)
How can I solve this problem?
PS: Also I tried:
julia> fit!(model, Matrix{Any}(x), Matrix{Any}(DataFrames.dropmissing(y_train)))
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Any}, ::Matrix{Any})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
[1] _fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any}, parameter_iterable::Vector{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
[2] fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
[3] top-level scope
# c:\Users\Shayan\Desktop\AUT\Thesis\MyWork\Thesis.jl:327
Looking at Random Forest Regression example docs in DecisionTree.jl, the example doesn't follow the fit!() / predict() design pattern. The error confirms that fit!() doesn't support RandomForestRegression. Alternatively, you might look at RandomForest.jl package which does follow fit!() / predict() pattern.
As stated here, DecisionTree.jl doesn't support Multi-output RF yet. So I gave up on using DecisionTree.jl, And ScikitLearn.jl is adequate in my case:
using ScikitLearn: #sk_import, fit!, predict
#sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
x_test = reshape(x_test, 1,length(x_test))
mod = RandomForestRegressor()
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(model, x_test)
This works fine for me, But it's super slow! Much slower than Python. I'll add the benchmarking with the same data sets across these two languages.
Benchmarking
Here I report the result of benchmarking with the same action, the same values, and the same data. All the data and code files are available in my Google Drive. So feel free to test it by yourself. First, I start with Julia.
Julia
using CSV
using DataFrames
using ScikitLearn: #sk_import, fit!, predict
#sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using BenchmarkTools
x = CSV.read("x.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor(max_leaf_nodes=2)
param_dist = Dict("n_estimators"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5, n_jobs=1)
#btime fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
# 52.123 s (6965 allocations: 44.34 MiB)
Python
>>> import cProfile, pstats
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> x = pd.read_csv("x.csv")
>>> y_train = pd.read_csv("y_train.csv")
>>> mod = RandomForestRegressor(max_leaf_nodes=2)
>>> parameters = {
'n_estimators': [50 , 100, 200, 300],
'max_depth': [3, 5, 6 ,8 , 9 ,10]}
>>> model = RandomizedSearchCV(mod, param_distributions=parameters, cv=5, n_iter=10, n_jobs=1)
>>> pr = cProfile.Profile()
>>> pr.enable()
>>> model.fit(x , y_train)
>>> pr.disable()
>>> stats = pstats.Stats(pr).strip_dirs().sort_stats("cumtime")
>>> stats.print_stats(5)
12097437 function calls (11936452 primitive calls) in 73.452 seconds
Ordered by: cumulative time
List reduced from 736 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 73.445 73.445 _search.py:738(fit)
102/2 0.027 0.000 73.370 36.685 parallel.py:960(__call__)
12252/152 0.171 0.000 73.364 0.483 parallel.py:798(dispatch_one_batch)
12150/150 0.058 0.000 73.324 0.489 parallel.py:761(_dispatch)
12150/150 0.025 0.000 73.323 0.489 _parallel_backends.py:206(apply_async)
So I conclude that Julia performs better than Python in this specific problem in case of speed.

Tensorflow Shape Error (Latest Version) Multivariate Linear Regression

Keep in mind I am extremely new to tf.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
np.set_printoptions(suppress=True)
boston = load_boston()
m = boston.data.shape[0] - 1
bt_unfixed = np.transpose(boston.data)
bt = np.insert(bt_unfixed, 0, 1)
Y = tf.placeholder(tf.float64)
X = tf.placeholder(tf.float64, [None, 13])
print X.shape
W = tf.Variable(np.transpose(np.array([0.00,0,0,0,0,0,0,0,0,0,0,0,0]).flatten()), name='weights')
b = tf.Variable(0.5, name='bias')
hypothesis = tf.add(tf.matmul(X, W), tf.cast(b, tf.float64))
loss = tf.reduce_sum(tf.square(hypothesis - Y)) / (2 * m)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_op = optimizer.minimize(loss)
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
for i in range(0, 1200):
for (x, y) in zip(boston.data, boston.target.data):
sess.run(train_op, feed_dict={X:x, Y:y})
print "Done!\n"
print "Running test...\n"
t = sess.run(cost, feed_dict={X:boston.data[504], Y:boston.target.data[504]})
print "loss =" + str(t) + "Real value" + str(boston.target.data[504]) + "Pred " +str(sess.run(hypothesis, feed_dict={X:boston.data[504]}))
Please feel free to correct any other errors in my code!
The error is thrown when I define hypothesis, I need to transpose my vector to multiply them together, but it is not working.
Traceback (most recent call last):
File "multihouse.py", line 20, in <module>
hypothesis = tf.add(tf.matmul(X, W), tf.cast(b, tf.float64))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1398, in matmul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 1348, in _mat_mul
transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2382, in create_op
set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1783, in set_shapes_for_outputs
shapes = shape_func(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.py", line 596, in call_cpp_shape_fn
raise ValueError(err.message)
ValueError: Shape must be rank 2 but is rank 1
The problem is there is no such thing as transposition of a vector in numpy. you have to define a matrix with one trivial dimension to work with such things, for example
np.array([0.00,0,0,0,0,0,0,0,0,0,0,0,0]) # this cannot be transposed
np.array([[0.00,0,0,0,0,0,0,0,0,0,0,0,0]]) # this can
just like in this example
>>> import numpy as np
>>> np.transpose(np.array([1,1,1]))
array([1, 1, 1])
>>> np.transpose(np.array([[1,1,1]]))
array([[1],
[1],
[1]])
Once you fix this, your hypothesis will be correctly defined (right now you try to multiply matrix with a vector, which is not defined in TF).

Dimension mismatch error with scikit pipeline FeatureUnion

This is my first post. I've been trying to combine features with FeatureUnion and Pipeline, but when I add a tf-idf + svd piepline the test fails with a 'dimension mismatch' error. My simple task is to create a regression model to predict search relevance. Code and errors are reported below. Is there something wrong in my code?
df = read_tsv_data(input_file)
df = tokenize(df)
df_train, df_test = train_test_split(df, test_size = 0.2, random_state=2016)
x_train = df_train['sq'].values
y_train = df_train['relevance'].values
x_test = df_test['sq'].values
y_test = df_test['relevance'].values
# char ngrams
char_ngrams = CountVectorizer(ngram_range=(2,5), analyzer='char_wb', encoding='utf-8')
# TFIDF word ngrams
tfidf_word_ngrams = TfidfVectorizer(ngram_range=(1, 4), analyzer='word', encoding='utf-8')
# SVD
svd = TruncatedSVD(n_components=100, random_state = 2016)
# SVR
svr_lin = SVR(kernel='linear', C=0.01)
pipeline = Pipeline([
('feature_union',
FeatureUnion(
transformer_list = [
('char_ngrams', char_ngrams),
('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
('tfidf_word_ngrams', tfidf_word_ngrams),
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
]
)
),
('svr_lin', svr_lin)
])
model = pipeline.fit(x_train, y_train)
y_pred = model.predict(x_test)
When adding the pipeline below to the FeatureUnion list:
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, svd))
The exception below is generated:
2016-07-31 10:34:08,712 : Testing ... Test Shape: (400,) - Training Shape: (1600,)
Traceback (most recent call last):
File "src/model/end_to_end_pipeline.py", line 236, in <module>
main()
File "src/model/end_to_end_pipeline.py", line 233, in main
process_data(input_file, output_file)
File "src/model/end_to_end_pipeline.py", line 175, in process_data
y_pred = model.predict(x_test)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 203, in predict
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 523, in transform
for name, trans in self.transformer_list)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 180, in __init__
self.results = batch()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 399, in _transform_one
return transformer.transform(X)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/metaestimators.py", line 37, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 291, in transform
Xt = transform.transform(Xt)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/decomposition/truncated_svd.py", line 201, in transform
return safe_sparse_dot(X, self.components_.T)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 179, in safe_sparse_dot
ret = a * b
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/sparse/base.py", line 389, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
What if you change second svd usage to new svd?
transformer_list = [
('char_ngrams', char_ngrams),
('char_ngrams_svd_pipeline', make_pipeline(char_ngrams, svd)),
('tfidf_word_ngrams', tfidf_word_ngrams),
('tfidf_word_ngrams_svd', make_pipeline(tfidf_word_ngrams, clone(svd)))
]
Seems your problem occurs because you're using same object 2 times. I is fitted first time on CountVectorizer, and second time on TfidfVectorizer (Or vice versa), and after you call predict of whole pipeline this svd object cannot understand output of CountVectorizer, because it was fitted on or TfidfVectorizer's output (Or again, vice versa).

Python : Gaussian Process Regression and GridSearchCV

I am working on Gaussian Process Regression with Python on NIR spectrum data. I can get some results with GPR and would like to optimize parameters for GPR. I am trying to use GridSearchCV to optimize parameters, but I keep getting an error and could not find any examples that people used GridSearchCV for Gaussian Process (from sklearn.gaussian_process). My quick question is if I can use GridSearchCV for GPR. If not, what would you recommend to use to optimize parameters.
This is my error:
---------------------------------------------------
-# Tuning hyper-parameters for precision
Traceback (most recent call last):
File "", line 1, in runfile('C:/Users/hkim.N04485/Desktop/Python/untitled14.py', wdir='C:/Users/hkim.N04485/Desktop/Python')
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile execfile(filename, namespace)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/hkim.N04485/Desktop/Python/untitled14.py", line 39, in gp.fit(X1, y1_glucose)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit for parameters in parameter_iterable
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 804, in call while self.dispatch_one_batch(iterator):
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 662, in dispatch_one_batch self._dispatch(tasks)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 570, in _dispatch job = ImmediateComputeBatch(batch)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 183, in init self.results = batch()
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in call return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1550, in _fit_and_score test_score = _score(estimator, X_test, y_test, scorer)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1606, in _score score = scorer(estimator, X_test, y_test)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\metrics\scorer.py", line 90, in call **self._kwargs)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 1203, in precision_score sample_weight=sample_weight)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 956, in precision_recall_fscore_support y_type, y_true, y_pred = _check_targets(y_true, y_pred)
File "C:\Users\hkim.N04485\Anaconda2\lib\site-packages\sklearn\metrics\classification.py", line 82, in _check_targets "".format(type_true, type_pred))
ValueError: Can't handle mix of multiclass and continuous
How do I fix this?
Here is my code.
tuned_parameters = [{'corr':['squared_exponential'], 'theta0': [0.01, 0.2, 0.8, 1.]},
{'corr':['cubic'], 'theta0': [0.01, 0.2, 0.8, 1.]}]
scores = ['precision', 'recall']
xy_line=(0,1200)
for score in scores:
print("# Tuning hyper-parameters for %s" % score)
print()
gp = GridSearchCV(GaussianProcess(normalize=False), tuned_parameters, cv=5,
scoring='%s_weighted' % score)
gp.fit(X1, y1_glucose)
print("Best parameters set found on development set:")
print()
print(gp.best_params_)
print()
print("Grid scores on development set:")
print()
for params, mean_score, scores in gp.grid_scores_:
print("%0.3f (+/-%0.03f) for %r"
% (mean_score, scores.std() * 2, params))
y_true, y_pred = y2_glucose, gp.predict(X2)
# Scatter plot (reference vs predicted )
fig, ax = plt.subplots(figsize=(11,13))
ax.scatter(y2_glucose,y_pred)
ax.plot(xy_line, xy_line, 'r--')
major_ticks = np.arange(-300,2000,100)
minor_ticks = np.arange(0,1201,100)
ax.set_xticks(minor_ticks)
ax.set_yticks(major_ticks)
ax.grid()
plt.title('1')
ax.set_xlabel('Reference')
ax.set_ylabel('Predicted')

Resources