Train/test split for object detection - machine-learning

Is there any script/function to split the data by counting the number of class appearances in each image and balancing them?
I've tried sklearn's train_test_split this way:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('train_labels.csv')
data.head()
class is the column I want to predict; in one image I can have 0..n rectangles, and each rectangle has a class.
data = data.drop_duplicates(subset="filename")
y = data['class']
X = data.drop('class',axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
But when I delete duplicate filenames I'm losing information, and I may send a file to train or test while ignoring its other classes; if I don't delete them, the same file can end up in both train and test.
Thanks for your help.

There is a library, scikit-multilearn, which will help to split multi-label data.
Installation: pip install scikit-multilearn
Documentation: http://scikit.ml/stratification.html
Implementation:
Let's suppose that in the dataframe df, X1 and X2 are the feature columns and y is the target column.
Our data can be classified into the following classes: class1, class2, class3.
import pandas as pd

df = pd.DataFrame({
    "X1": [1, 2, 3, 4, 5, 6, 7, 8],
    "X2": [6, 7, 8, 9, 10, 11, 12, 13],
    "y": ["class1", "class1", "class2", "class2", "class3", "class3", "class1", "class2"]})
After running the above code we have this dataframe:
X1 X2 y
0 1 6 class1
1 2 7 class1
2 3 8 class2
3 4 9 class2
4 5 10 class3
5 6 11 class3
6 7 12 class1
7 8 13 class2
But scikit-multilearn works with one-hot encoded target columns, so we have to transform our target column.
one_hot_classes = pd.get_dummies(df["y"])
Which will output:
class1 class2 class3
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
We will drop the y column and concatenate one_hot_classes:
df.drop("y", axis = 1, inplace=True)
df = pd.concat([df, one_hot_classes], axis=1)
After running the above code:
X1 X2 class1 class2 class3
0 1 6 1 0 0
1 2 7 1 0 0
2 3 8 0 1 0
3 4 9 0 1 0
4 5 10 0 0 1
5 6 11 0 0 1
6 7 12 1 0 0
7 8 13 0 1 0
Now we put the feature and target columns into the X and y variables:
X = df[["X1", "X2"]]
y = df[["class1", "class2", "class3"]]
Now we will get splits:
from skmultilearn.model_selection import iterative_train_test_split
X_train, y_train, X_test, y_test = iterative_train_test_split(X.values, y.values, test_size = 0.5)
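Applied to the original object-detection labels, one possible approach (a rough sketch; the filename and class column names come from the question, everything else is my assumption) is to build one row per image with a multi-hot class vector, split on that, and then map the chosen filenames back to the full annotation table, so each file lands in exactly one split:
import pandas as pd
from skmultilearn.model_selection import iterative_train_test_split

data = pd.read_csv('train_labels.csv')

# One row per image, one column per class, 1 if that class appears in the image.
per_image = pd.crosstab(data['filename'], data['class']).clip(upper=1)

filenames = per_image.index.to_numpy().reshape(-1, 1)   # "features" are just the filenames
labels = per_image.to_numpy()

train_files, _, test_files, _ = iterative_train_test_split(filenames, labels, test_size=0.2)

# Recover all rectangles belonging to the selected images.
train_df = data[data['filename'].isin(train_files.ravel())]
test_df = data[data['filename'].isin(test_files.ravel())]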

Related

Ramp least squares estimation with CVXPY and DCCP

This is a ramp least-squares estimation problem, described more fully in mathematical form here:
https://scicomp.stackexchange.com/questions/33524/ramp-least-squares-estimation
I used disciplined convex-concave programming via the DCCP package, which is built on CVXPY. The code follows:
import cvxpy as cp
import numpy as np
import dccp
from dccp.problem import is_dccp

# Generate data.
m = 20
n = 15
np.random.seed(1)
X = np.random.randn(m, n)
Y = np.random.randn(m)

# Define and solve the DCCP problem.
def loss_fn(X, Y, beta):
    return cp.norm2(cp.matmul(X, beta) - Y)**2

def obj_g(X, Y, beta, sval):
    return cp.pos(loss_fn(X, Y, beta) - sval)

beta = cp.Variable(n)
s = 10000000000000
constr = obj_g(X, Y, beta, s)
t = cp.Variable(1)
t.value = [1]
cost = loss_fn(X, Y, beta) - t
problem = cp.Problem(cp.Minimize(cost), [constr >= t])
print("problem is DCP:", problem.is_dcp())   # false
print("problem is DCCP:", is_dccp(problem))  # true
problem.solve(verbose=True, solver=cp.ECOS, method='dccp')

# Print result.
print("\nThe optimal value is", problem.value)
print("The optimal beta is")
print(beta.value)
print("The norm of the residual is ", cp.norm(cp.matmul(X, beta) - Y, p=2).value)
Because of the large value of s, I would expect a solution similar to the least-squares estimate. But no solution is returned, as the output shows (and the same happens with different solvers, problem dimensions, etc.):
problem is DCP: False
problem is DCCP: True
ECOS 2.0.7 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
It pcost dcost gap pres dres k/t mu step sigma IR | BT
0 +0.000e+00 -0.000e+00 +2e+01 9e-02 1e-04 1e+00 9e+00 --- --- 1 1 - | - -
1 -7.422e-04 +2.695e-09 +2e-01 1e-03 1e-06 1e-02 9e-02 0.9890 1e-04 2 1 1 | 0 0
2 -1.638e-05 +5.963e-11 +2e-03 1e-05 2e-08 1e-04 1e-03 0.9890 1e-04 2 1 1 | 0 0
3 -2.711e-07 +9.888e-13 +2e-05 1e-07 2e-10 2e-06 1e-05 0.9890 1e-04 4 1 1 | 0 0
4 -3.991e-09 +1.379e-14 +2e-07 1e-09 2e-12 2e-08 1e-07 0.9890 1e-04 1 0 0 | 0 0
5 -5.507e-11 +1.872e-16 +3e-09 2e-11 2e-14 2e-10 1e-09 0.9890 1e-04 1 0 0 | 0 0
OPTIMAL (within feastol=1.6e-11, reltol=4.8e+01, abstol=2.6e-09).
Runtime: 0.001112 seconds.
ECOS 2.0.7 - (C) embotech GmbH, Zurich Switzerland, 2012-15. Web: www.embotech.com/ECOS
It pcost dcost gap pres dres k/t mu step sigma IR | BT
0 +0.000e+00 -5.811e-01 +1e+01 6e-01 6e-01 1e+00 2e+00 --- --- 1 1 - | - -
1 -7.758e+00 -2.575e+00 +1e+00 2e-01 7e-01 6e+00 3e-01 0.9890 1e-01 1 1 1 | 0 0
2 -3.104e+02 -9.419e+01 +4e-02 2e-01 8e-01 2e+02 8e-03 0.9725 8e-04 2 1 1 | 0 0
3 -2.409e+03 -9.556e+02 +5e-03 2e-01 8e-01 1e+03 1e-03 0.8968 5e-02 3 2 2 | 0 0
4 -1.103e+04 -5.209e+03 +2e-03 2e-01 7e-01 6e+03 4e-04 0.9347 3e-01 2 2 2 | 0 0
5 -1.268e+04 -1.592e+03 +8e-04 1e-01 1e+00 1e+04 2e-04 0.7916 4e-01 3 2 2 | 0 0
6 -1.236e+05 -2.099e+04 +9e-05 1e-01 1e+00 1e+05 2e-05 0.8979 9e-03 1 1 1 | 0 0
7 -4.261e+05 -1.850e+05 +4e-05 2e-01 7e-01 2e+05 1e-05 0.7182 3e-01 2 1 1 | 0 0
8 -2.492e+07 -1.078e+07 +7e-07 1e-01 7e-01 1e+07 2e-07 0.9838 1e-04 3 2 2 | 0 0
9 -2.226e+08 -9.836e+07 +5e-08 9e-02 5e-01 1e+08 1e-08 0.9339 2e-03 2 3 2 | 0 0
UNBOUNDED (within feastol=1.0e-09).
Runtime: 0.001949 seconds.
The optimal value is None
The optimal beta is
None
The norm of the residual is None
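For reference, the plain least-squares fit that the ramp estimator should approach when s is this large can be computed directly with CVXPY. This is a small sanity-check sketch of my own, not part of the original question, and it assumes the same X, Y, and n as in the snippet above:
# Plain least-squares baseline for comparison (my own addition).
beta_ls = cp.Variable(n)
ls_problem = cp.Problem(cp.Minimize(cp.sum_squares(cp.matmul(X, beta_ls) - Y)))
ls_problem.solve()
print("least-squares objective:", ls_problem.value)
print("least-squares beta:", beta_ls.value)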

How does Weka evaluate classifier model

I used the random forest algorithm and got this result:
=== Summary ===
Correctly Classified Instances 10547 97.0464 %
Incorrectly Classified Instances 321 2.9536 %
Kappa statistic 0.9642
Mean absolute error 0.0333
Root mean squared error 0.0952
Relative absolute error 18.1436 %
Root relative squared error 31.4285 %
Total Number of Instances 10868
=== Confusion Matrix ===
a b c d e f g h i <-- classified as
1518 1 3 1 0 14 0 0 4 | a = a
3 2446 0 0 0 1 1 27 0 | b = b
0 0 2942 0 0 0 0 0 0 | c = c
0 0 0 470 0 1 1 2 1 | d = d
9 0 0 9 2 19 0 3 0 | e = e
23 1 2 19 0 677 1 22 6 | f = f
4 0 2 0 0 13 379 0 0 | g = g
63 2 6 17 0 15 0 1122 3 | h = h
9 0 0 0 0 9 0 4 991 | i = i
I wonder how Weka evaluates errors (mean absolute error, root mean squared error, ...) for non-numerical class values ('a', 'b', ...).
I mapped each class to a number from 0 to 8 and computed the errors manually, but my results differ from Weka's.
How can I re-implement Weka's evaluation steps?
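As far as I understand Weka's Evaluation class, it does not map nominal classes to numbers at all: for every instance it compares the predicted class-probability distribution with the 0/1 indicator vector of the true class and averages the per-class differences over all instances. A rough sketch of that computation (my reading of Weka, not code from the question):
import numpy as np

def weka_style_errors(pred_probs, true_idx):
    """pred_probs: (n_instances, n_classes) predicted class distributions;
    true_idx: (n_instances,) index of the true class of each instance."""
    n, k = pred_probs.shape
    actual = np.zeros((n, k))
    actual[np.arange(n), true_idx] = 1.0           # 0/1 indicator of the true class
    diff = pred_probs - actual
    mae = np.abs(diff).sum(axis=1).mean() / k      # per-class absolute error, averaged
    rmse = np.sqrt((diff ** 2).sum(axis=1).mean() / k)
    return mae, rmse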

Torch: Concatenating tensors of different dimensions

I have an x_at_i = torch.Tensor(1, i) that grows at every iteration, where i = 0 to n. I would like to concatenate all of these tensors of different sizes into a matrix and fill the remaining cells with zeroes. What is the most idiomatic way to do this? For example:
x_at_1 = 1
x_at_2 = 1 2
x_at_3 = 1 2 3
x_at_4 = 1 2 3 4
X = torch.cat(x_at_1, x_at_2, x_at_3, x_at_4)
X = [ 1 0 0 0
1 2 0 0
1 2 3 0
1 2 3 4 ]
If you know n, and assuming you can access each x_at_i easily at every iteration, I would try something like:
X = torch.Tensor(n, n):zero()
for i = 1, n do
    -- copy the i-element row into the first i cells of row i
    X[i]:narrow(1, 1, i):copy(x_at[i])
end
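If you are on PyTorch rather than Lua Torch, a minimal sketch of the same idea (my own addition, assuming the growing row tensors are collected in a Python list) is torch.nn.utils.rnn.pad_sequence:
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical list of growing 1-D tensors, as in the example above.
rows = [torch.tensor([1.]), torch.tensor([1., 2.]),
        torch.tensor([1., 2., 3.]), torch.tensor([1., 2., 3., 4.])]

# Pads every row with zeros up to the length of the longest one.
X = pad_sequence(rows, batch_first=True, padding_value=0.0)
# tensor([[1., 0., 0., 0.],
#         [1., 2., 0., 0.],
#         [1., 2., 3., 0.],
#         [1., 2., 3., 4.]])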

OneVsRestClassifier(svm.SVC()).predict() gives continuous values

I am trying to use y_scores = OneVsRestClassifier(svm.SVC()).predict() on datasets like iris and titanic. The trouble is that I am getting y_scores as continuous values. For the iris dataset, for example, I am getting:
[[ -3.70047231 -0.74209097 2.29720159]
[ -1.93190155 0.69106231 -2.24974856]
.....
I am using OneVsRestClassifier with other classifier models like knn, random forest, and naive bayes, and they give appropriate results in the form of
[[ 0 1 0]
[ 1 0 1]...
etc. on the iris dataset. Please help.
Well this is simply not true.
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn.svm import SVC
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> clf = OneVsRestClassifier(SVC())
>>> clf.fit(iris['data'], iris['target'])
OneVsRestClassifier(estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
n_jobs=1)
>>> print clf.predict(iris['data'])
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
Maybe you called decision_function instead (which would match your output dimensions; predict is supposed to return a vector, not a matrix). In that case the SVM returns signed distances to each hyperplane, which is its decision function from a mathematical perspective.
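To see the difference side by side, here is a short sketch of my own that continues the session above:
# decision_function returns one signed distance per class, while predict
# returns a single hard label per sample.
scores = clf.decision_function(iris['data'])   # shape (n_samples, n_classes), real-valued
labels = clf.predict(iris['data'])             # shape (n_samples,), integer class labels
print(scores.shape, labels.shape)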

Is there a way to use machine learning to classify discrete, infinite-scale data?

The data looks like this:
x y
7773 0
9805 4
7145 0
7645 1
2529 1
4814 2
6027 2
7499 2
3367 1
8861 5
9776 2
8009 5
3844 2
1218 2
1120 1
4553 0
3017 1
2582 2
1691 2
5342 0
...
The real function f(x) is the following (it returns the count of enclosed circles in the digits of a decimal integer):
# Circles per digit:  0  1  2  3  4  5  6  7  8  9
_f_map = [1, 0, 0, 0, 0, 0, 1, 0, 2, 1]

def f(x):
    x = int(x)
    assert x >= 0
    if x == 0:
        return 1
    r = 0
    while x:
        r += _f_map[x % 10]
        x //= 10   # integer division, so this also works under Python 3
    return r
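As a quick check of my own against the sample rows above:
# f(7773) = 0+0+0+0, f(9805) = 1+2+1+0, f(8861) = 2+2+1+0
assert f(7773) == 0 and f(9805) == 4 and f(8861) == 5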
The training data and test data can be produced at random:
import random

data = []
target = []
for i in range(3000):
    x = random.randint(0, 999999)  # hard-code a scale
    data.append([x])
    target.append(f(x))
The real function is discrete and has an unbounded input range.
Is there a way or a model that can classify this data?
I tried an SVM (support vector machine) and got about a 20% accuracy rate.
This looks like a typical use case for sequential models. You can train an LSTM or another recurrent neural network to do this by treating each number as a sequence of digit tokens fed to the network. At that point it only has to learn a summation and a simple per-digit mapping (your _f_map).
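A minimal sketch of that idea in PyTorch (my own illustration, not from the answer; the fixed width of 6 digits matches the random range above, and 13 output classes covers the largest possible circle count for six digits):
import torch
import torch.nn as nn

def digit_sequence(x, width=6):
    """Encode a number as a fixed-width sequence of digit tokens, e.g. 7645 -> [0,0,7,6,4,5]."""
    return torch.tensor([int(c) for c in str(x).zfill(width)])

class DigitLSTM(nn.Module):
    def __init__(self, hidden=32, n_classes=13):
        super().__init__()
        self.embed = nn.Embedding(10, 8)                 # one vector per digit 0-9
        self.lstm = nn.LSTM(8, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)          # predict the circle count as a class
    def forward(self, x):                                # x: (batch, width) of digit tokens
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1])                        # logits from the last time step

# Usage sketch: a batch of encoded numbers -> per-class logits.
batch = torch.stack([digit_sequence(v) for v in (7773, 9805, 8861)])
logits = DigitLSTM()(batch)   # shape (3, 13); train with nn.CrossEntropyLoss against f(x)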
