I am trying to separate background from signal where it is known that the quantity x^2 - y^2 is the physical reason why the background and signal are different. If I provide x and y as input variables, the BDT has a hard time figuring out how to achieve the separation. Is a BDT unable to do squares?
Correct: a binary decision tree cannot take squares of input features. Given input features x, y, it will try to approximate the desired function by subdividing the x,y plane along vertical and horizontal lines. Let us look at an example: I fit a decision tree classifier to a square grid of points and plot the decision boundary.
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(-5.5, 5.5, 1)
y = np.arange(-5.0, 6.0, 1)
xx, yy = np.meshgrid(x,y)
#the function we want to learn:
target = xx.ravel()**2 - yy.ravel()**2 > 0
data = np.c_[xx.ravel(), yy.ravel()]
#Fit a decision tree:
clf = DecisionTreeClassifier()
clf.fit(data, target)
#Plot the decision boundary:
xxplot, yyplot = np.meshgrid(np.arange(-7, 7, 0.1),
np.arange(-7, 7, 0.1))
Z = clf.predict(np.c_[xxplot.ravel(), yyplot.ravel()])
# Put the result into a color plot
Z = Z.reshape(xxplot.shape)
plt.contourf(xxplot, yyplot, Z, cmap=plt.cm.hot)
# Plot also the training points
plt.scatter(xx.ravel(), yy.ravel(), c=target, cmap=plt.cm.flag)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Decision boundary for a binary decision tree learning a function x**2 - y**2 > 0")
plt.show()
Here you can see what kind of boundaries a decision tree can learn: piecewise-rectangular. They are not going to approximate your function well, especially in the area where there are few training points. Since you know that x^2 - y^2 is the quantity that determines the answer, you can just add it as a new feature instead of trying to learn it.
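The last suggestion can be sketched as follows. This is a minimal example that assumes the same training grid as above; the names make_features, clf_raw and clf_feat are made up for illustration, and the accuracy is measured on a denser grid than the one used for training.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_features(x, y):
    """Stack x, y and the engineered feature x**2 - y**2 as columns."""
    return np.c_[x, y, x**2 - y**2]

# Same training grid as in the example above
xx, yy = np.meshgrid(np.arange(-5.5, 5.5, 1), np.arange(-5.0, 6.0, 1))
x_train, y_train = xx.ravel(), yy.ravel()
target = x_train**2 - y_train**2 > 0

# One tree sees only (x, y); the other also sees x**2 - y**2
clf_raw = DecisionTreeClassifier(random_state=0).fit(np.c_[x_train, y_train], target)
clf_feat = DecisionTreeClassifier(random_state=0).fit(make_features(x_train, y_train), target)

# Evaluate both on a dense grid that extends beyond the training points
gx, gy = np.meshgrid(np.arange(-7, 7, 0.1), np.arange(-7, 7, 0.1))
gx, gy = gx.ravel(), gy.ravel()
truth = gx**2 - gy**2 > 0

acc_raw = np.mean(clf_raw.predict(np.c_[gx, gy]) == truth)
acc_feat = np.mean(clf_feat.predict(make_features(gx, gy)) == truth)

print("accuracy with x, y only:       %.3f" % acc_raw)
print("accuracy with x**2 - y**2 too: %.3f" % acc_feat)
print("depth of the tree with the extra feature:", clf_feat.get_depth())
```

With the engineered feature available, a single threshold on x**2 - y**2 separates the training points, so the tree needs only one split.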
I'm in a situation where I need to train a model to predict a scalar value, and it's important that the predicted value has the same direction (sign) as the true value, while keeping the squared error as small as possible.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared error of (3, 1) and (-1, 1) is equal.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss function to be built entirely from differentiable operations, so that you can backpropagate through it. This means that any old arbitrary logic won't necessarily do. To restate your problem: you want to find a differentiable function of two variables that increases sharply when the two variables take on values of different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply these values increase, relative to one another. Thus, we want something with two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high school geometry textbook: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign agreement symmetry, so I had to introduce a rotation. The plot above is the result. Note that it increases more sharply when the signs don't agree, and less sharply when they do, and that the input constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Compute a rotated elliptic paraboloid.
    t = np.pi / 4
    x_rot = (x * np.cos(t)) + (y * np.sin(t))
    y_rot = (x * -np.sin(t)) + (y * np.cos(t))
    z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
    return z
c_diff_sign = 4
c_same_sign = 2
a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
    for j, b_j in enumerate(b):
        loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)
fig = plt.figure()
# fig.gca(projection='3d') is no longer accepted by recent Matplotlib versions
ax = fig.add_subplot(projection='3d')
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
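Before using this function it may be worth evaluating it on the example from the question. The check below reuses the function and constants from the code above; treating x as the prediction and y as the true value is an assumption (the function is symmetric in the two, so the result does not depend on it).

```python
import numpy as np

def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Same rotated elliptic paraboloid as in the answer above
    t = np.pi / 4
    x_rot = x * np.cos(t) + y * np.sin(t)
    y_rot = -x * np.sin(t) + y * np.cos(t)
    return x_rot**2 / c_diff_sign + y_rot**2 / c_same_sign

c_diff_sign, c_same_sign = 4, 2

# The three cases of interest: sign flip, overshoot, exact prediction
loss_flipped = elliptic_paraboloid_loss(-1.0, 1.0, c_diff_sign, c_same_sign)
loss_overshoot = elliptic_paraboloid_loss(3.0, 1.0, c_diff_sign, c_same_sign)
loss_exact = elliptic_paraboloid_loss(1.0, 1.0, c_diff_sign, c_same_sign)

print("prediction -1, truth 1: %.3f" % loss_flipped)    # about 1.0
print("prediction  3, truth 1: %.3f" % loss_overshoot)  # about 3.0
print("prediction  1, truth 1: %.3f" % loss_exact)      # about 0.5
```

With these constants the overshoot (3, 1) scores higher than the sign flip (-1, 1), and an exact prediction does not score zero. The surface therefore seems to describe the magnitudes of the two values rather than the error between them, so it would probably need to be combined with an error term before it ranks the question's example as requested.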
From what I understand, your current loss function is something like:
loss = mean_squared_error(y, y_pred)
What you could do is add one more component to your loss: a component that penalizes negative predictions and does nothing with positive ones. You can then choose a coefficient for how much you want to penalize them. For that, we can use something like a negated ReLU. Something like this:
Let's call this component "Neg_ReLU". Then, your loss function will be:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your result is -1, then the total error would be:
mean_squared_error(1, -1) + 1
And if your result is 3, then the total error would be:
mean_squared_error(1, 3) + 0
(See in the above function how Neg_ReLU(3) = 0, and Neg_ReLU(-1) = 1.)
If you want to penalize negative values more, then you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU(y_pred)
Now negative values are penalized more heavily.
We can build the negative ReLU function like this:
tf.nn.relu(tf.math.negative(value))
So to summarize, in the end your total loss will be:
coeff = 1
# Penalize negative predictions (this assumes the true values are positive)
Neg_ReLU = tf.reduce_mean(tf.nn.relu(tf.math.negative(y_pred)))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU
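To check the arithmetic on the example from the question, here is a small NumPy sketch of the same idea. The function names neg_relu and sign_penalized_mse are made up for illustration, and like the answer it assumes the true values are positive, so only negative predictions are penalized.

```python
import numpy as np

def neg_relu(values):
    """Return -v for negative v and 0 for non-negative v."""
    return np.maximum(0.0, -values)

def sign_penalized_mse(y_true, y_pred, coeff=1.0):
    """Mean squared error plus a weighted penalty on negative predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    penalty = np.mean(neg_relu(y_pred))
    return mse + coeff * penalty

loss_flipped = sign_penalized_mse([1.0], [-1.0])    # 4 + 1 = 5
loss_overshoot = sign_penalized_mse([1.0], [3.0])   # 4 + 0 = 4
loss_flipped_x2 = sign_penalized_mse([1.0], [-1.0], coeff=2.0)  # 4 + 2 = 6

print(loss_flipped, loss_overshoot, loss_flipped_x2)
```

If the targets can be negative as well, one possible variant (not part of the answer) is to penalize np.maximum(0, -y_true * y_pred), which is non-zero only when the two signs disagree.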
How can I use sklearn scaler / imputer to impute a tensor? I want to scale / impute within a pipeline. My input is a 3-d numpy array.
I have a tensor of shape (n_samples, n_timesteps, n_feat) a la Keras. This is a sequence that can be learned by an LSTM. I want to scale / impute first, however. In particular, I want to scale on the fly inside a scikit-learn pipeline, since scaling the full dataset, which would be easy, leads to leakage. Keras already integrates with sklearn (see here), but there does not appear to be an easy way to scale and impute the tensors that Keras time series models process.
Unfortunately, the following gives an error
import numpy as np
X = np.array([[[3,5],[6,2]],[[8.,23.],[7.,23]],[[3, 4],[2, 55]]])
print(X)
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X = s.fit_transform(X)  # fails here: X is 3-d
print(X)
The error says, in effect, "the scaler only works on 2-d numpy arrays".
My solution was to add a decorator to the sklearn preprocessing data.py file
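def flat(func):
    def wrapper(*args, **kwargs):
        self, X = args
        a, b, c = X.shape
        # Flatten (n_samples, n_timesteps, n_feat) to 2-d for sklearn
        X = X.reshape(a, b*c)
        r = func(self, X, **kwargs)
        if hasattr(r,'ndim'):
            # Array result (e.g. from transform): restore the 3-d shape
            X = r.reshape(a, b, c)
            return X
        else:
            # Non-array result (e.g. fit returns self)
            return r
    return wrapper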
Then use it on the functions, e.g. fit:
@flat
def fit(self, X, y=None):
    """Compute the mean and std to be used for later scaling.
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape [n_samples, n_features]
        The data used to compute the mean and standard deviation
        used for later scaling along the features axis.
    y : Passthrough for ``Pipeline`` compatibility.
    """
    # Reset internal state before fitting
    self._reset()
    return self.partial_fit(X, y)
This works well; with the same script as above, I get
[[[ 3. 5.]
[ 6. 2.]]
[[ 8. 23.]
[ 7. 23.]]
[[ 3. 4.]
[ 2. 55.]]]
[[[-0.70710678 -0.64906302]
[ 0.46291005 -1.13191668]]
[[ 1.41421356 1.41266656]
[ 0.9258201 -0.16825789]]
[[-0.70710678 -0.76360355]
[-1.38873015 1.30017457]]]
But beware, it doesn't check for 2d arrays, which it can't process. So, use the normal preprocessing module for 2d arrays!
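If editing the installed sklearn source is not an option, the same reshape trick can be sketched as a standalone transformer. The class name Flatten3D and its transformer parameter are hypothetical (they are not part of sklearn); the flattening to (n_samples, n_timesteps * n_feat) matches the decorator above, so the result should agree with the output shown.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler

class Flatten3D(BaseEstimator, TransformerMixin):
    """Apply a 2-d sklearn transformer to a 3-d array by flattening it.

    Hypothetical helper: reshapes (n_samples, n_timesteps, n_feat) to
    (n_samples, n_timesteps * n_feat), delegates, then restores the shape.
    """

    def __init__(self, transformer=None):
        self.transformer = transformer

    def fit(self, X, y=None):
        X = np.asarray(X)
        base = StandardScaler() if self.transformer is None else self.transformer
        # Fit a fresh copy so the wrapped object passed in is left untouched
        self.transformer_ = clone(base)
        self.transformer_.fit(X.reshape(X.shape[0], -1), y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        flat = self.transformer_.transform(X.reshape(X.shape[0], -1))
        # Assumes the wrapped transformer keeps the number of columns
        return flat.reshape(X.shape)

X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
X_scaled = Flatten3D(StandardScaler()).fit_transform(X)
print(X_scaled)
```

Because it only implements fit and transform, an object like this can be placed as a step in a Pipeline, so the scaling statistics are computed on the training folds only.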
Given a kernel in Gaussian Process, is it possible to know the shape of functions being drawn from the prior distribution without sampling at first?
I think the best way to know the shape of prior functions is to draw them. Here's 1-dimensional example:
These are samples from the GP prior (the mean is 0 and the covariance matrix is induced by the squared exponential kernel). As you can see, they are smooth, and in general the plot gives a feeling for how "wiggly" they are. Also note that in the multi-dimensional case each dimension will look somewhat like this.
Here's the full code I used; feel free to write your own kernel or tweak the parameters to see how it affects the samples:
import numpy as np
import matplotlib.pyplot as pl
def kernel(a, b, gamma=0.1):
    """ GP squared exponential kernel """
    sq_dist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-0.5 * (1 / gamma) * sq_dist)
n = 300 # number of points.
m = 10 # number of functions to draw.
s = 1e-6 # noise variance.
X = np.linspace(-5, 5, n).reshape(-1, 1)
K = kernel(X, X)
L = np.linalg.cholesky(K + s * np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, m)))
pl.figure(1)
pl.clf()
pl.plot(X, f_prior)
pl.title('%d samples from the GP prior' % m)
pl.axis([-5, 5, -3, 3])
pl.show()
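As a complement to sampling, one rough way to anticipate how "wiggly" the draws will be is to look at how quickly the kernel decays with distance. The sketch below assumes the same squared exponential kernel and gamma as above; the helper se_kernel_1d is made up for illustration.

```python
import numpy as np

def se_kernel_1d(distance, gamma=0.1):
    """Squared exponential kernel as a function of the distance |a - b|."""
    return np.exp(-0.5 * distance**2 / gamma)

gamma = 0.1
# sqrt(gamma) plays the role of a length scale for this parameterization
distances = np.array([0.0, np.sqrt(gamma), 2 * np.sqrt(gamma), 1.0])
correlations = se_kernel_1d(distances, gamma)

for d, c in zip(distances, correlations):
    print("distance %.3f -> prior correlation %.4f" % (d, c))
```

Function values at points closer than about sqrt(gamma) are strongly correlated under the prior, while values roughly one unit apart are nearly independent, which is consistent with the several bumps visible across the interval [-5, 5].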
I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the same dimensions like the docs say?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset which is (150, 4) where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
Se_targets = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
index = DF_data.index,
columns = DF_data.columns)
# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = m
# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)
#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
But when I use the exact same approach on my actual dataset, which is (76, 1989), i.e. 76 samples and 1989 attributes/dimensions, I get a (76, 76) array instead of (76, 1989):
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger or equal than n?)
PCA (as most typically run) creates a new coordinate system by:
1. shifting the origin to the centroid of your data,
2. squeezing and/or stretching the axes to make them equal in length, and
3. rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by @amoeba in a comment). Here is a data matrix with two points in a three dimensional space:
X = [ 1 1 1
2 2 2 ]
Let's view these points in a (pseudo) three dimensional scatterplot:
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data, we can fit (at most) N−1=1 principal components.
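The same count can be checked numerically. The sketch below uses a plain SVD of the centered data instead of sklearn's PCA; the helper n_nonzero_components and the random (76, 1989) matrix are assumptions standing in for the asker's real data.

```python
import numpy as np

def n_nonzero_components(X, tol=1e-10):
    """Count directions with non-negligible variance after centering."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    # Treat singular values far below the largest one as numerical zeros
    return int(np.sum(s > tol * s.max()))

# The two-point example from above: only one direction carries variance
X_small = np.array([[1.0, 1.0, 1.0],
                    [2.0, 2.0, 2.0]])
n_small = n_nonzero_components(X_small)

# A random stand-in with the same shape as the asker's data
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(76, 1989))
n_wide = n_nonzero_components(X_wide)

print(n_small)  # 1
print(n_wide)   # 75
```

With 76 samples, centering removes one degree of freedom, so at most 75 components carry variance no matter how many of the 1989 dimensions are requested.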
I read this thread about the difference between SVC() and LinearSVC() in scikit-learn.
Now I have a data set for a binary classification problem. (For such a problem, the one-vs-one/one-vs-rest strategy difference between the two functions can be ignored.)
I want to find out under what parameters these two functions would give me the same result. First of all, of course, we should set kernel='linear' for SVC().
However, I just could not get the same result from both functions. I could not find the answer in the documentation. Could anybody help me find the equivalent parameter set I am looking for?
Updated:
I modified the following code from an example of the scikit-learn website, and apparently they are not the same:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
for i in range(len(y)):
    if (y[i]==2):
        y[i] = 1
h = .02 # step size in the mesh
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C, dual = True, loss = 'hinge').fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'LinearSVC (linear kernel)']
for i, clf in enumerate((svc, lin_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(1, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
Result:
Output Figure from previous code
In a mathematical sense you need to set:
SVC(kernel='linear', **kwargs) # by default it uses RBF kernel
and
LinearSVC(loss='hinge', **kwargs) # by default it uses squared hinge loss
Another element, which cannot be easily fixed, is the intercept. You can increase intercept_scaling in LinearSVC, but in this implementation the bias is regularized (which is not true in SVC, nor should it be true in an SVM, thus this is not an SVM). Consequently they will never be exactly equal (unless bias=0 for your problem), as they assume two different models:
SVC : 1/2||w||^2 + C SUM xi_i
LinearSVC: 1/2||[w b]||^2 + C SUM xi_i
Personally I consider LinearSVC one of the mistakes of sklearn developers - this class is simply not a linear SVM.
After increasing intercept scaling (to 10.0)
However, if you scale it up too much it will also fail, as the tolerance and the number of iterations become crucial.
To sum up: LinearSVC is not a linear SVM; do not use it if you do not have to.
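The effect of intercept_scaling can be inspected numerically rather than from plots. The sketch below assumes the same two-feature, two-class iris setup as in the question; the large max_iter value and the results dictionary are choices made here for illustration, and the exact numbers will depend on the sklearn version and solver tolerance.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data[:, :2]                 # first two features, as in the question
y = (iris.target > 0).astype(int)    # merge classes 1 and 2 into one class

C = 1.0
svc = svm.SVC(kernel="linear", C=C).fit(X, y)
print("SVC              coef:", svc.coef_.ravel(), "intercept:", svc.intercept_[0])

results = {}
for scaling in (1.0, 10.0):
    # hinge loss requires the dual formulation in LinearSVC
    lin = svm.LinearSVC(C=C, loss="hinge", dual=True,
                        intercept_scaling=scaling, max_iter=100000).fit(X, y)
    # Fraction of training points on which both models predict the same class
    agreement = np.mean(lin.predict(X) == svc.predict(X))
    results[scaling] = (lin.coef_.ravel(), lin.intercept_[0], agreement)
    print("LinearSVC (intercept_scaling=%.0f) coef:" % scaling, lin.coef_.ravel(),
          "intercept:", lin.intercept_[0], "agreement: %.3f" % agreement)
```

Comparing coef_ and intercept_ directly makes the remaining gap visible even when the decision regions look almost identical in a plot.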