I have recently been studying SVM models, and I understand the mathematics of SVC and SVR.
But I still have some confusion about their loss functions.
For example, the SVC loss function is:
SVCLoss = \sum_{i} a_i - 0.5 \sum_{i,j} a_i a_j y_i y_j K(x_i, x_j)
# Train a quantum support vector classifier
svc = SVC()
svc.fit(samples, labels)
# Get dual coefficients
dual_coefs = svc.dual_coef_[0]
# Get support vectors
support_vecs = svc.support_
# Prune kernel matrix of non-support-vector entries
kmatrix = kmatrix[support_vecs, :][:, support_vecs]
# Calculate loss
loss = np.sum(np.abs(dual_coefs)) - 0.5 * (dual_coefs.T @ kmatrix @ dual_coefs)
Now I am trying to write out the loss function code for SVR in Python, but I have run into a problem I cannot solve.
SVR loss function:
# Train a quantum support vector regressor
svr = SVR(kernel="rbf")
svr.fit(samples, labels)
# Get dual coefficients
dual_coefs = svr.dual_coef_[0]
# Get support vectors
support_vecs = svr.support_
# Prune kernel matrix of non-support-vector entries
kmatrix = kmatrix[support_vecs, :][:, support_vecs]
# Calculate loss
loss = ???
How should dual_coefs and support_vecs enter the SVR loss?
And how can I write the SVR loss in Python, by analogy with the SVC code above?
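My best guess so far is the sketch below, based on the standard SVR dual objective. It assumes, as in scikit-learn, that svr.dual_coef_ stores the differences a_i - a_i^* for the support vectors (so that a_i + a_i^* = |a_i - a_i^*|, since at most one of each pair is nonzero), but I have not been able to confirm it is correct:
# Tentative SVR dual loss (unconfirmed):
#   SVRLoss = 0.5 \sum_{i,j} (a_i - a_i^*)(a_j - a_j^*) K(x_i, x_j)
#             + \epsilon \sum_{i} (a_i + a_i^*) - \sum_{i} y_i (a_i - a_i^*)
epsilon = svr.epsilon
loss = (0.5 * (dual_coefs.T @ kmatrix @ dual_coefs)
        + epsilon * np.sum(np.abs(dual_coefs))
        - labels[support_vecs] @ dual_coefs)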
I have been searching the net for a long time, but to no avail. Please help, or give some ideas on how to achieve this.
Thanks in advance.
Related
I'm learning about Neural Networks and I recently had this idea: trying to give a NN training data of the function $f(x) = 2x$. The question is, can the NN accurately predict that it has to double the input number to give the correct output?
This is just a "mental exercise", to better my understanding of how NNs work.
My Python code doesn't work; here's what I've tried:
Neural Network class:
import numpy as np

class NeuralNetwork:
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes
        self.lr = learningrate
        # weight matrices, drawn from a normal distribution
        self.wih = np.random.normal(0.0, pow(self.inodes, -0.5), (self.hnodes, self.inodes))
        self.who = np.random.normal(0.0, pow(self.hnodes, -0.5), (self.onodes, self.hnodes))

    def train(self, inputs_list, targets_list):
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
        # forward pass
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        # backpropagate the errors
        output_errors = targets - final_outputs
        hidden_errors = np.dot(self.who.T, output_errors)
        # weight updates
        self.who += self.lr * np.dot(
            (output_errors * final_outputs * (1.0 - final_outputs)),
            np.transpose(hidden_outputs)
        )
        self.wih += self.lr * np.dot(
            (hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
            np.transpose(inputs)
        )

    def query(self, inputs_list):
        inputs = np.array(inputs_list, ndmin=2).T
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        return final_outputs
Training the network and predicting a value:
input_nodes = 1
hidden_nodes = 20
output_nodes = 1
learning_rate = 0.3
nn = NeuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)
for i in range(1, 11):
    inputs = np.log(i)
    targets = np.log(2 * i)
    nn.train(inputs, targets)
print(nn.query(np.asfarray([4])))
Here's the output I'm getting trying to run this code:
x.py:26: RuntimeWarning: overflow encountered in multiply
(output_errors * final_outputs * (1.0 - final_outputs)),
x.py:31: RuntimeWarning: overflow encountered in multiply
(hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
[[nan]]
I don't really know how to interpret this, and if my design is correct for this application. Any help would be appreciated.
Thanks.
Some suggestions:
Since the function of interest (f(x)=2x) is linear and requires only one weight, we can vastly simplify the network to a single weight and no hidden layers. We're trying to debug a problem, so we should simplify as much as possible to eliminate sources of error. Using a hidden layer with multiple hidden nodes means we need to find matrices such that W1.dot(W2)=2, because we seek the function x.dot(W1).dot(W2); this is harder, since changing one weight changes the entire product, and finding the correct answer requires aligning all of those weights. A minimal sketch of this simplification follows these suggestions.
Because the function of interest is linear, we know that any use of nonlinear functions is a distraction. Also, saturation of sigmoid and tanh functions, or the dying-ReLU phenomenon, could introduce additional problems into the optimization dynamics which could prevent us from making progress. See: https://stats.stackexchange.com/questions/301285/what-is-vanishing-gradient
The learning rate is probably too large. I believe this is the problem because you're having numerical overflow; this can happen when the optimizer consistently overshoots the minimum. See: https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive
Scaling the inputs and the targets of a regression problem can dramatically improve the optimizer dynamics. For an example, see https://stats.stackexchange.com/questions/432707/alternating-negative-and-positive-value-of-slope-and-y-intercept-in-gradient-des/432714#432714
Additional tips for training neural networks are here: https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037
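To make the first suggestion concrete, here is a minimal sketch (not your class, just the simplification it proposes): one weight, no hidden layer, no activation, plain gradient descent on mean squared error, and a small learning rate.
import numpy as np

# Minimal model for f(x) = 2x: a single weight, no hidden layer.
rng = np.random.default_rng(0)
w = rng.normal()
lr = 0.01
x = np.arange(1.0, 11.0)  # training inputs 1..10
y = 2.0 * x               # targets

for epoch in range(200):
    pred = w * x
    grad = np.mean(2.0 * (pred - y) * x)  # gradient of mean squared error
    w -= lr * grad

print(w)  # converges very close to 2.0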
I think you are missing a very important building block of artificial neural network architectures: the activation function, which squashes each layer's output into a range such as [0,1] or [-1,1].
So I think attaching an activation function (which is very important) after computing each hidden layer's outputs may solve this problem: as data propagates through the network, the values stay normalized, for example within [0,1], so the overflow should not happen.
Notes:
Sigmoid and tanh activations are the most popular and are suitable for your problem.
Your learning rate may be slightly too large; try 0.01.
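As a rough sketch of what attaching the activation looks like (stand-in weights with the same shapes as self.wih and self.who in the class above; the same change would go into both train and query):
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes values into (0, 1)

# stand-in weights, same shapes as the class above
wih = np.random.normal(0.0, 1.0, (20, 1))
who = np.random.normal(0.0, 1.0, (1, 20))
inputs = np.array([[0.5]])

# forward pass with an activation applied after each layer
hidden_outputs = sigmoid(np.dot(wih, inputs))
final_outputs = sigmoid(np.dot(who, hidden_outputs))
print(final_outputs)  # stays bounded, so no overflow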
When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (a neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable), assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
To clarify: I'm asking for a way to get a continuous proxy function that's closer to the accuracy.
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them; it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to whether a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such that if p[i] > 0.5, then class[i] = "1". Now, we can find many cases where this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
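As a tiny illustration of that insulation layer (the probabilities and labels are made up), moving the threshold changes the accuracy while the loss does not move at all:
import numpy as np

# Made-up predicted probabilities and true labels
p = np.array([0.3, 0.6, 0.55, 0.8, 0.9])
y = np.array([0,   0,   1,    1,   1])

# The log loss depends only on p and y, not on any threshold
log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

for t in (0.5, 0.58):
    acc = np.mean((p > t) == y)
    print(f"threshold={t}: accuracy={acc:.1f}, loss={log_loss:.3f}")
# threshold=0.5  -> accuracy 0.8; threshold=0.58 -> accuracy 0.6; same loss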
Enlarging somewhat an already broad discussion: can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descent?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
I think you are forgetting to pass your output through a sigmoid. Fixed below:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
What Google told me is:
For Keras, the ImageDataGenerator function seems to have a zca_whitening option which can be used out of the box. But if this option is set, ImageDataGenerator.fit has to be called on the whole dataset X, so this is not an option.
For sklearn, IncrementalPCA seems to work with a huge dataset, but I don't know how to rotate PCA to ZCA in a generator style.
Thanks for the help!
I have defined a function that might be helpful; it performs the ZCA transformation:
def ZCAtransform(X, IPCA_model):
    # get the eigenvectors and the square roots of the eigenvalues
    U = IPCA_model.components_.transpose()
    S = np.sqrt(IPCA_model.explained_variance_)
    Xdemeaned = (X - np.mean(X, 0)).transpose()
    # get the transformed data:
    # Xproj' = U * diag(1/(S + I*epsilon)) * U' * X_demeaned
    return (U.dot(np.diag(1/(S + IPCA_model.noise_variance_))).dot(U.transpose()).dot(Xdemeaned)).transpose()

Xproj = ZCAtransform(X, ipca)
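To cover the generator-style part of the question, here is a minimal sketch, with random data standing in for a real image generator, of fitting IncrementalPCA batch by batch via partial_fit; afterwards, ZCAtransform above can be applied to any batch:
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)

def batch_generator(n_batches=10, batch_size=64, n_features=32):
    # stand-in for a real data generator yielding flattened images
    for _ in range(n_batches):
        yield rng.normal(size=(batch_size, n_features))

ipca = IncrementalPCA()
for X_batch in batch_generator():
    ipca.partial_fit(X_batch)  # never needs the whole dataset in memory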
Following the given example at Scikit-learn, I was able to generate the ZCA of the Iris dataset, as shown below:
[Image: ZCA Whitened PCA]
I am training a neural network for multilabel classification with a large number of classes (1000), which means more than one output can be active for each input. On average, I have two active classes per output frame. When training with a cross-entropy loss, the neural network resorts to outputting only zeros, because it gets the least loss with that output, since 99.8% of my labels are zeros. Any suggestions on how I can push the network to give more weight to the positive classes?
Tensorflow has a loss function weighted_cross_entropy_with_logits, which can be used to give more weight to the 1's. So it should be applicable to a sparse multi-label classification setting like yours.
From the documentation:
This is like sigmoid_cross_entropy_with_logits() except that pos_weight allows one to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
The argument pos_weight is used as a multiplier for the positive targets
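As a quick sketch of what pos_weight does to the per-element loss (following the formula in the TensorFlow documentation; the numbers are made up):
import numpy as np

def weighted_ce(y, z, pos_weight):
    # per-element weighted cross-entropy:
    # pos_weight * y * -log(sigmoid(z)) + (1 - y) * -log(1 - sigmoid(z))
    p = 1.0 / (1.0 + np.exp(-z))
    return pos_weight * y * -np.log(p) + (1 - y) * -np.log(1 - p)

print(weighted_ce(1.0, 0.0, 1))   # ~0.693
print(weighted_ce(1.0, 0.0, 10))  # ~6.93: a missed positive costs 10x more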
If you use the tensorflow backend in Keras, you can use the loss function like this (Keras 2.1.1):
import tensorflow as tf
import keras.backend.tensorflow_backend as tfb
POS_WEIGHT = 10 # multiplier for positive targets, needs to be tuned
def weighted_binary_crossentropy(target, output):
"""
Weighted binary crossentropy between an output tensor
and a target tensor. POS_WEIGHT is used as a multiplier
for the positive targets.
Combination of the following functions:
* keras.losses.binary_crossentropy
* keras.backend.tensorflow_backend.binary_crossentropy
* tf.nn.weighted_cross_entropy_with_logits
"""
# transform back to logits
_epsilon = tfb._to_tensor(tfb.epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.log(output / (1 - output))
# compute weighted loss
loss = tf.nn.weighted_cross_entropy_with_logits(targets=target,
logits=output,
pos_weight=POS_WEIGHT)
return tf.reduce_mean(loss, axis=-1)
Then in your model:
model.compile(loss=weighted_binary_crossentropy, ...)
I have not yet found many resources that report well-working values for pos_weight in relation to the number of classes, the average number of active classes, etc.
Many thanks to tobigue for the great solution.
The TensorFlow and Keras APIs have changed since that answer, so an updated version of weighted_binary_crossentropy for TensorFlow 2.7.0 is below.
import tensorflow as tf
POS_WEIGHT = 10
def weighted_binary_crossentropy(target, output):
"""
Weighted binary crossentropy between an output tensor
and a target tensor. POS_WEIGHT is used as a multiplier
for the positive targets.
Combination of the following functions:
* keras.losses.binary_crossentropy
* keras.backend.tensorflow_backend.binary_crossentropy
* tf.nn.weighted_cross_entropy_with_logits
"""
# transform back to logits
_epsilon = tf.convert_to_tensor(tf.keras.backend.epsilon(), output.dtype.base_dtype)
output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
output = tf.math.log(output / (1 - output))
loss = tf.nn.weighted_cross_entropy_with_logits(labels=target, logits=output, pos_weight=POS_WEIGHT)
return tf.reduce_mean(loss, axis=-1)
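As with the original version, the function can then be passed to Keras when compiling; the optimizer here is only an example:
model.compile(loss=weighted_binary_crossentropy, optimizer="adam")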
I am trying to use an RBFNN for point-cloud-to-surface reconstruction, but I can't understand what my feature vectors in the RBFNN would be.
Can anyone please help me understand this?
[Images in the original post: the goal surface to reconstruct, and the input point cloud.]
An RBF network essentially involves fitting data with a linear combination of functions that obey a set of core properties -- chief among these is radial symmetry. The parameters of each of these functions are learned by incremental adjustment based on errors generated through repeated presentation of inputs.
If I understand (it's been a very long time since I used one of these networks), your question pertains to preprocessing of the data in the point cloud. I believe that each of the points in your point cloud should serve as one input. If I understand properly, the features are your three dimensions, and as such each point can already be considered a "feature vector."
You have other choices that remain, namely the number of radial basis neurons in your hidden layer, and the radial basis functions to use (a Gaussian is a popular first choice). The training of the network and the surface reconstruction can be done in a number of ways but I believe this is beyond the scope of the question.
I don't know if it will help, but here's a simple python implementation of an RBF network performing function approximation, with one-dimensional inputs:
import numpy as np
import matplotlib.pyplot as plt
def fit_me(x):
return (x-2) * (2*x+1) / (1+x**2)
def rbf(x, mu, sigma=1.5):
return np.exp(-(x - mu)**2 / (2 * sigma**2))
# Core parameters including number of training
# and testing points, minimum and maximum x values
# for training and testing points, and the number
# of rbf (hidden) nodes to use
num_points = 100 # number of inputs (each 1D)
num_rbfs = 20 # number of centers (must be an int for np.linspace)
x_min = -5
x_max = 10
# Training data, evenly spaced points
x_train = np.linspace(x_min, x_max, num_points)
y_train = fit_me(x_train)
# Testing data, more evenly spaced points
x_test = np.linspace(x_min, x_max, num_points*3)
y_test = fit_me(x_test)
# Centers of each of the rbf nodes
centers = np.linspace(-5, 10, num_rbfs)
# Everything is in place to train the network
# and attempt to approximate the function 'fit_me'.
# Start by creating a matrix G in which each row
# corresponds to an x value within the domain and each
# column i contains the values of rbf_i(x).
center_cols, x_rows = np.meshgrid(centers, x_train)
G = rbf(center_cols, x_rows)
plt.plot(G)
plt.title('Radial Basis Functions')
plt.show()
# Simple training in this case: use pseudoinverse to get weights
weights = np.dot(np.linalg.pinv(G), y_train)
# To test, create meshgrid for test points
center_cols, x_rows = np.meshgrid(centers, x_test)
G_test = rbf(center_cols, x_rows)
# apply weights to G_test
y_predict = np.dot(G_test, weights)
plt.plot(y_predict)
plt.title('Predicted function')
plt.show()
error = y_predict - y_test
plt.plot(error)
plt.title('Function approximation error')
plt.show()
First, you can explore the way in which inputs are provided to the network and how the RBF nodes are used. This should extend to 2D inputs in a straightforward way (a small sketch follows at the end of this answer), though training may get a bit more involved.
To do proper surface reconstruction you'll likely need a representation of the surface that is altogether different than the representation of the function that's learned here. Not sure how to take this last step.
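For the 2D extension mentioned above, here is a minimal sketch, with made-up data, of building the same kind of design matrix where each RBF center is a 2D point and distances are Euclidean:
import numpy as np

def rbf2d(points, centers, sigma=1.5):
    # points: (N, 2), centers: (M, 2) -> design matrix G of shape (N, M)
    d2 = ((points[:, None, :] - centers[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma**2))

# made-up 2D inputs and a small grid of centers
rng = np.random.default_rng(0)
points = rng.uniform(-5, 10, size=(100, 2))
gx, gy = np.meshgrid(np.linspace(-5, 10, 5), np.linspace(-5, 10, 5))
centers = np.column_stack([gx.ravel(), gy.ravel()])

G = rbf2d(points, centers)                       # (100, 25)
targets = np.sin(points[:, 0]) * points[:, 1]    # made-up surface heights
weights = np.dot(np.linalg.pinv(G), targets)     # same pseudoinverse training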