Machine Learning - Dividing data into test and train sets

How can I divide a given dataset into train and test sets while keeping each data point paired with its correct label?
The sklearn library provides an implementation for the split:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
train, test = train_test_split(df, test_size = 0.2)
where df is the original dataset, e.g. a list of strings.
The problem is that this call doesn't take the targets/labels along with the data, so we cannot track which label belongs to which data point.
Is there any way to bind data points to their labels and then split the dataset into train and test sets?

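train_test_split (sklearn.cross_validation.train_test_split in old releases, sklearn.model_selection.train_test_split since scikit-learn 0.18) takes a variable number of arrays and splits all of them with the same shuffled indices. From its documentation: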
*arrays : sequence of arrays or scipy.sparse matrices with same shape[0]
Returns:
splitting : list of arrays, length=2 * len(arrays)
List containing train-test split of input array.
So you can simply pass the labels list alongside the data:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in releases before 0.20
df = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]
>>> train_test_split(df, labels, test_size=0.2)
[['quick', 'fox', 'the'], ['brown'], [1, 0, 0], [0]]
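To check that the pairing really survives the split, a minimal sketch along the following lines may help. It assumes a current scikit-learn release, where the function lives in sklearn.model_selection. The names texts, lookup and aligned, and the fixed random_state, are illustrative and not part of the original answer.

```python
from sklearn.model_selection import train_test_split

texts = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]

# Passing both sequences makes the function shuffle them with the same indices.
# random_state is fixed only to make the illustration repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Verify that every text still sits next to its original label.
lookup = dict(zip(texts, labels))
aligned = all(
    lookup[text] == label
    for text, label in zip(X_train + X_test, y_train + y_test)
)
print(X_train, y_train)
print(X_test, y_test)
print("labels still aligned:", aligned)
```

The call returns the pieces in the order train/test for the first array, then train/test for the second, which is why the unpacking above reads X_train, X_test, y_train, y_test.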

Related

Sklearn: Found input variables with inconsistent numbers of samples:

I have built a model.
est1_pre = ColumnTransformer([('catONEHOT', OneHotEncoder(dtype='int',handle_unknown='ignore'),['Var1'])],remainder='drop')
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),['Var2'])],remainder='drop')
m1 = Pipeline([('FeaturePreprocessing', est1_pre),
               ('clf', alternative)])
m2 = Pipeline([('FeaturePreprocessing', est2_pre),
               ('clf', alternative)])
model_combo = StackingClassifier(
    estimators=[('cate', m1), ('text', m2)],
    final_estimator=RandomForestClassifier(n_estimators=10,
                                           random_state=42)
)
I can successfully fit and predict using m1 and m2.
However, any attempt to call .fit/.predict on the combined model_combo results in ValueError: Found input variables with inconsistent numbers of samples:
model_fitted=model_combo.fit(x_train,y_train)
x_train contains Var1 and Var2.
How can I fit model_combo?
The problem is that sklearn's text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional data like most other preprocessors. When the column is given as the list ['Var2'], ColumnTransformer hands the vectorizer a two-dimensional, single-column frame. The vectorizer iterates over it, gets the column names instead of the rows, and so sees only one "document". This can be fixed in the ColumnTransformer by specifying the column as a plain string rather than a list:
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),'Var2')],remainder='drop')
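The difference between the two selections can be seen on a small frame. The sketch below is only an illustration: x_toy and its contents are made up, and the vectorizer uses default settings rather than the ngram_range and max_features from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame standing in for x_train
x_toy = pd.DataFrame({
    "Var1": ["a", "b", "a"],
    "Var2": ["red apple pie", "green apple tart", "red cherry pie"],
})

# 2-D selection: iterating a DataFrame yields its column names,
# so the vectorizer sees a single "document" (the string "Var2").
as_frame = TfidfVectorizer().fit_transform(x_toy[["Var2"]])

# 1-D selection: iterating a Series yields one string per row.
as_series = TfidfVectorizer().fit_transform(x_toy["Var2"])

# Inside ColumnTransformer, a plain string selects the column as 1-D data.
bow = ColumnTransformer([("BOW", TfidfVectorizer(), "Var2")], remainder="drop")
features = bow.fit_transform(x_toy)

print(as_frame.shape)   # one row only
print(as_series.shape)  # one row per sample
print(features.shape)   # one row per sample
```

With the list form the text branch returns a single row while y_train has one entry per sample, which is the mismatch the ValueError reports.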

What do the 'normalize' parameters mean in sklearn's confusion_matrix?

I am using sklearn's confusion_matrix to plot the results alongside the accuracy, recall and precision scores, and the graph renders as it should. However, I am slightly confused about what the different values of the normalize parameter mean. Why do we normalize, and what are the differences between the three options? Quoting from the documentation:
normalize{‘true’, ‘pred’, ‘all’}, default=None
Normalizes confusion matrix over the true (rows), predicted (columns) conditions or all the population.
If None, confusion matrix will not be normalized.
Does it convert the counts to a percentage format so that the matrix is easier to read when datasets are large? Or am I missing the point altogether? I have searched, but the existing questions all appear to explain how to do it rather than what it means.
A normalized version makes it easier to visually interpret how the labels are being predicted and also to perform comparisons. You can also pass values_format='.0%' to display the values as percentages. The normalize parameter specifies what the denominator should be:
'true': sum of each row (true label), so every row sums to 1
'pred': sum of each column (predicted label), so every column sums to 1
'all': sum of all cells, so the whole matrix sums to 1
Example:
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons
from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
# Generate some example data
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)
# Train the classifier on the training split only
clf = LogisticRegression()
clf.fit(X_train, y_train)
# One figure per normalization mode, all evaluated on the held-out test split
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test); plt.title("Not normalized")
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='true'); plt.title("normalize='true'")
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='pred'); plt.title("normalize='pred'")
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, values_format='.0%', normalize='all'); plt.title("normalize='all'")
plt.show()
Yes, you can think of it as a percentage. The default is to just show the absolute count in each cell of the confusion matrix, i.e. how often each combination of true and predicted category levels occurs.
But if you choose e.g. normalize='all', every count value will be divided by the sum of all count values, so that you have relative frequencies whose sum over the whole matrix is 1. Similarly, if you pick normalize='true', you will have relative frequencies per row.
If you repeat an experiment with different sample sizes, you may want to compare confusion matrices across experiments. Raw counts are not comparable in that case, so you would look at normalized values instead. You then need to decide whether to normalize by the total number of samples ("all"), the predicted class counts ("pred"), or the true class counts ("true"). For example:
In [30]: yt
Out[30]: array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
In [31]: yp
Out[31]: array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
In [32]: confusion_matrix(yt, yp)
Out[32]:
array([[4, 3],
       [3, 0]])
In [33]: confusion_matrix(yt, yp, normalize='pred')
Out[33]:
array([[0.57142857, 1.        ],
       [0.42857143, 0.        ]])
In [34]: confusion_matrix(yt, yp, normalize='true')
Out[34]:
array([[0.57142857, 0.42857143],
       [1.        , 0.        ]])
In [35]: confusion_matrix(yt, yp, normalize='all')
Out[35]:
array([[0.4, 0.3],
       [0.3, 0. ]])
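The three denominators can also be reproduced by hand from the raw counts, which may make the options less opaque. This is a sketch using the same yt and yp as in the session above. The names by_true, by_pred and by_all are illustrative, and the plain division assumes no row or column sums to zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

yt = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])
yp = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

cm = confusion_matrix(yt, yp)  # raw counts, rows = true labels, columns = predictions

by_true = cm / cm.sum(axis=1, keepdims=True)  # divide each row by its row sum
by_pred = cm / cm.sum(axis=0, keepdims=True)  # divide each column by its column sum
by_all = cm / cm.sum()                        # divide everything by the grand total

# Each manual result should match the corresponding normalize option.
print(np.allclose(by_true, confusion_matrix(yt, yp, normalize='true')))
print(np.allclose(by_pred, confusion_matrix(yt, yp, normalize='pred')))
print(np.allclose(by_all, confusion_matrix(yt, yp, normalize='all')))
```

Read this way, the rows of the 'true' matrix give per-class recall on the diagonal, and the columns of the 'pred' matrix give per-class precision.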

Transfer learning with CNTK and pre-trained ONNX model fails

I'm trying to use the ResNet-50 model from the ONNX model zoo and load and train it in CNTK for an image classification task. The first thing that confuses me is that the batch axis (I'm not sure what the official name for it is, dynamic axis?) is set to 1 in this model:
Why is that? Couldn't it simply be [3x224x224]? In this model for example, the input looks like this:
To load the model and use my own Dense layer, I use the following code:
def create_model(num_classes, input_features, freeze=False):
    base_model = load_model("restnet-50.onnx", format=ModelFormat.ONNX)
    feature_node = find_by_name(base_model, "gpu_0/data_0")
    last_node = find_by_uid(base_model, "Reshape2959")
    substitutions = {
        feature_node : placeholder(name='new_input')
    }
    cloned_layers = last_node.clone(CloneMethod.clone, substitutions)
    cloned_out = cloned_layers(input_features)
    z = Dense(num_classes, activation=softmax, name="prediction") (cloned_out)
    return z
For training I use (shortened):
# datasets = list of classes
feature = input_variable(shape=(1, 3, 224, 224))
label = input_variable(shape=(1,3))
model = create_model(len(datasets), feature)
loss = cross_entropy_with_softmax(model, label)
# some definitions for learner, epochs, ProgressPrinters missing
for epoch in range(epochs):
    loss.train((X_current,y_current), parameter_learners=[learner], callbacks=[progress_printer])
X_current is a single image and y_current the corresponding class label, both encoded as numpy arrays with the following shapes:
X_current.shape
(1, 3, 224, 224)
y_current.shape
(1, 3)
When I try to train the model, I get
"ValueError: ToBatchAxis7504 ToBatchAxisNode operation can only operate on tensor without minibatch data (no layout)"
What's wrong here?

Found input variables with inconsistent numbers of samples: [2, 144]

I have a training data set consisting of 144 feedback entries, 72 positive and 72 negative, so there are two target labels: positive and negative. Consider the following code segment:
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
   data                                             target
0  facilitates good student teacher communication.  positive
1  lectures are very lengthy.                       negative
2  the teacher is very good at interaction.         positive
3  good at clearing the concepts.                   positive
4  good at clearing the concepts.                   positive
5  good at teaching.                                positive
6  does not shows test copies.                      negative
7  good subjective knowledge.                       positive
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
I do not understand what the problem is. Please help.
You are not using the count vectorizer correctly. This is what you have now:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
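So you see that you don't achieve what you want: the individual lines are not transformed. Iterating over a DataFrame yields its column names, so the vectorizer sees just two "documents" (data and target), which is where the 2 in the error message comes from. The count vectorizer is not even trained correctly, because it is fit on the entire DataFrame and not just on the corpus of comments.
To solve the issue, we need to make sure the counting is done properly.
If you fit on the right corpus: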
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
with 0 stored elements in Compressed Sparse Row format>
you see that we are getting closer to what we want. We just have to transform it correctly (transform each line):
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
...
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
We have a more suitable X! Now we just need to check whether we can split:
target = [1 if i<72 else 0 for i in range(8)] # The dataset here has size 8
# the line below no longer raises the error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
And it works!
Make sure you understand what CountVectorizer does so that you can use it correctly.
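As an alternative to transforming row by row, and offered here as a sketch rather than as part of the answer above, the text column can be passed to the vectorizer directly. That yields one sparse matrix with a row per comment instead of an object array of 1-row matrices. The toy frame below copies the eight rows printed in the question. Deriving y from the target column and fixing random_state are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

# Passing the whole DataFrame iterates over the column names: 2 "documents".
wrong = CountVectorizer(binary=True).fit_transform(df)

# Passing the text column iterates over the comments: one row per comment.
cv = CountVectorizer(binary=True)
X = cv.fit_transform(df["data"])

# Assumption: labels are derived from the target column instead of row positions.
y = (df["target"] == "positive").astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.50, random_state=0
)
print(wrong.shape, X.shape, X_train.shape, X_val.shape)
```

Because X is a single matrix whose number of rows equals len(y), the split no longer complains about inconsistent numbers of samples, and the result can be passed to an estimator such as svm.SVC without further reshaping.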

TensorFlow - Classification with thousands of labels

I'm very new to TensorFlow. I've been trying to use TensorFlow to create a function where I give it a vector with 6 features and get back a label.
I have a training data set in the form of 6 features and 1 label. The label is in the first column:
309,3,0,2,4,0,6
309,12,0,2,4,0,6
309,0,4,17,2,0,6
318,0,660,414,58,3,12
311,0,0,414,58,0,2
298,0,53,355,5,0,2
60,16,14,381,30,4,2
312,0,8,8,13,0,3
...
I have the index for the labels, which is a list of thousands and thousands of names:
309,Joe
318,Joey
311,Bruce
...
How do I create a model and train it using TensorFlow to be able to predict the label, given a vector without the first column?
--
This is what I tried:
from __future__ import print_function
import tflearn
name_count = sum(1 for line in open('../../names.csv')) # this comes out to 24260
# Load CSV file, indicate that the first column represents labels
from tflearn.data_utils import load_csv
data, labels = load_csv('../../data.csv', target_column=0,
categorical_labels=True, n_classes=name_count)
# Build neural network
net = tflearn.input_data(shape=[None, 6])
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
# Define model
model = tflearn.DNN(net)
# Start training (apply gradient descent algorithm)
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
# Predict
pred = model.predict([[218,5,124,26,0,3]]) # 326
print("Name:", pred[0][1])
It's based on https://github.com/tflearn/tflearn/blob/master/tutorials/intro/quickstart.md
I get the error:
ValueError: Cannot feed value of shape (16, 24260) for Tensor u'TargetsData/Y:0', which has shape '(?, 2)'
24260 is the number of lines in names.csv
Thank you!
net = tflearn.fully_connected(net, 2, activation='softmax')
says that you have 2 output classes, but in reality you have 24260. 16 is the size of your minibatch, so each batch of labels has 16 rows of 24260 columns (in each row, one of the 24260 entries is 1 and the others are all 0). The width of the final layer has to match that, so it should be name_count rather than 2.
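The shape mismatch in the error message can be reproduced without TensorFlow. The following NumPy sketch is only an illustration of the shapes involved. The helper output_shape and the hidden width of 32 mirror the question's network, but they are stand-ins and not tflearn code.

```python
import numpy as np

n_classes = 24260  # number of lines in names.csv, as reported in the question
batch_size = 16    # batch_size passed to model.fit

# One-hot labels for one minibatch, matching the fed shape in the error message
rng = np.random.default_rng(0)
label_ids = rng.integers(0, n_classes, size=batch_size)
targets = np.zeros((batch_size, n_classes), dtype=np.float32)
targets[np.arange(batch_size), label_ids] = 1.0

def output_shape(n_units, n_hidden=32):
    """Shape produced by a dense layer with n_units outputs on one minibatch."""
    hidden = np.zeros((batch_size, n_hidden), dtype=np.float32)
    weights = np.zeros((n_hidden, n_units), dtype=np.float32)
    return (hidden @ weights).shape

print(targets.shape)            # (16, 24260)
print(output_shape(2))          # (16, 2)
print(output_shape(n_classes))  # (16, 24260)
```

Only the last shape lines up with the labels, which is why the final fully connected layer needs as many units as there are classes.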
