How to predict results with no labels but a specific loss function - machine-learning

I've run into a problem recently. The target is several columns (3 columns that contain only 1s and 0s).
The target leads to a penalty function, which could be used as the loss function for my model.
I've looked into MLPs, FFNNs and SVMs for this kind of problem, which seems to be unsupervised learning, but I'm still a little confused about how to apply these algorithms to it,
because most examples of these algorithms seem to require a label for training.
So how can I tackle this problem? Any suggestions, please?
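One common pattern for this situation, sketched below under the assumption that the penalty is differentiable and that TensorFlow/Keras is acceptable, is to write the penalty as a custom loss that ignores the (dummy) labels and is computed on the model's 3-column output alone. The penalty used here is purely hypothetical, just to show the plumbing.

import numpy as np
import tensorflow as tf

def penalty_loss(y_true, y_pred):
    # y_true is a dummy placeholder; the "loss" is just the penalty on the prediction.
    # Hypothetical penalty: push each of the 3 outputs towards 0 or 1,
    # while discouraging the all-zeros solution.
    binariness = tf.reduce_mean(y_pred * (1.0 - y_pred), axis=-1)
    at_least_one = tf.square(1.0 - tf.reduce_max(y_pred, axis=-1))
    return binariness + at_least_one

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(3, activation="sigmoid"),  # the 3 binary-valued target columns
])
model.compile(optimizer="adam", loss=penalty_loss)

x = np.random.rand(256, 8).astype("float32")   # toy inputs
dummy_y = np.zeros((256, 3), dtype="float32")  # never used by the loss
model.fit(x, dummy_y, epochs=5, verbose=0)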

Related

Backpropagation of filter coefficients in Convolutional Neural Networks

I'm starting to learn how convolutional neural networks work, and I have a question about the filters. Apparently these are randomly generated when the model is created, and then, as data is fed in, they are corrected by backpropagation just like the weights.
However, how does this work for filters? To my understanding, backpropagation works by calculating how much each weight contributed to the total error after an output has been predicted, and then correcting it accordingly. I've been thinking about how this might work with filters, and what bothers me is that all coefficients in a single filter would seem to have contributed equally to the final error. So suppose you start with a matrix (filter) of random numbers, then backpropagation happens; if all the coefficients get corrected in the same way, wouldn't you end up with a matrix that is just the original times a constant?
Any help or insight you can provide on this matter is greatly appreciated!
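For what it's worth, here is a minimal sketch (assuming TensorFlow) of why the coefficients do not all receive the same correction: each filter entry multiplies a different set of input pixels, so the gradient with respect to the filter is a full matrix of generally different values.

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 1])                    # one 8x8 single-channel input
kernel = tf.Variable(tf.random.normal([3, 3, 1, 1]))  # one 3x3 filter

with tf.GradientTape() as tape:
    y = tf.nn.conv2d(x, kernel, strides=1, padding="SAME")
    loss = tf.reduce_sum(tf.square(y))                # toy loss

grad = tape.gradient(loss, kernel)
print(grad[:, :, 0, 0])                               # a 3x3 matrix of distinct gradient values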

Is the Bias unit in neural networks always one?

I have been studying neural networks for a couple of weeks and noticed that all guides and documentation either never mention the bias unit and/or always assume it to be 1.
Is there any reason or case where we would want a bias unit not to be 1?
Or have it as an adjustable parameter in the network?
Edit: I'm sorry, I'm new to Stack Overflow; I found similar questions here, so I thought this was a good place to ask. Thank you for correcting me.
Edit: When people refer to the bias they are, in most cases, referring to the bias weight:
Bias & Bias Unit
The bias unit is also the reason we get the equation for the bias delta Δb in back-propagation as:
Δb = ΔY * 1 (the "* 1" is normally left out since it has no effect on the equation)
Hope that clears things up.
This question is better suited for Cross Validated or maybe Data Science Stack Exchange (it is not about code at all).
I think you have a misunderstanding: the bias term is a trainable parameter that is also learned and updated during training.
I think I know the source of your confusion (correct me if I'm wrong): in many places, the bias term is incorporated into the input vector x as a constant 1 element.
So if we have the following input:
x = [x_1, x_2, ..., x_n]
the output of some operation can be written as:
y = w^T x + b
where the trained parameters are:
w = [w_1, w_2, ..., w_n] and b
But it can also be written in the following way:
y = w'^T x', with x' = [1, x_1, x_2, ..., x_n] and w' = [b, w_1, w_2, ..., w_n]
So, despite the fact that we have the constant 1 in the input, since b is still one of the trainable parameters, the bias can still be anything.
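Here is a minimal numeric sketch (numpy) of the "constant 1" trick described above: the bias b is just another trained weight attached to a constant input of 1, not something fixed at 1 itself.

import numpy as np

x = np.array([2.0, -1.0, 0.5])       # input vector
w = np.array([0.3, 0.7, -0.2])       # trained weights
b = 1.5                              # trained bias -- learned, not fixed at 1

y_plain = w @ x + b                  # y = w^T x + b

x_aug = np.concatenate(([1.0], x))   # prepend the constant 1 to the input
w_aug = np.concatenate(([b], w))     # prepend the bias as just another weight
y_aug = w_aug @ x_aug                # same result, bias folded into the weights

assert np.isclose(y_plain, y_aug)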

Prediction of Stock Returns with ML Algorithms

I am working on a prediction model for stock returns over a fixed period of time (say n days), and I was hoping to gather a few ideas ahead of time. My questions are:
1) Would it be best to turn this into a classification problem, say by creating a dummy variable for returns larger than x%? Then I could try the entire arsenal of ML algorithms.
2) If I don't turn it into a classification problem but use, say, a regression model, would it make sense or be necessary to transform the returns into logs?
Any thoughts are appreciated.
EDIT: My goal here is defined relatively broadly, in the sense that I would simply like to improve the performance of the selection process (pick positive returns and avoid negative ones).
Best by what measure of quality? Turning it into a thresholding problem simply translates the problem space into a much simpler one. Your problem definition is your own: you can turn it into a binary classification problem (> x or not), a multi-class classification problem (binning into ranges), or simply keep it as a prediction task. If you do the latter, you can still apply binning or classification as a post-processing step.
Classification is just a subclass of prediction. The log transformation employed by logistic regression is no more than a neat trick to turn the outputs into something that resembles a probability distribution; don't put too much thought into it. That said, applying transformations to your output is not necessarily bad (you could, for instance, apply some normalization to keep your output within the range of some activation function).
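As a concrete illustration of option 1), here is a minimal sketch (numpy + scikit-learn, with toy random data standing in for real features and returns) of turning the n-day return into the binary "> x% or not" target so that standard classifiers apply.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.02, size=1000)       # toy n-day returns
features = rng.normal(size=(1000, 5))            # toy predictors

x_threshold = 0.01                               # "x%" = 1%
labels = (returns > x_threshold).astype(int)     # 1 if return > x%, else 0

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
picks = clf.predict_proba(features)[:, 1] > 0.5  # candidate positive-return picks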

Regression: optimize median instead of mean for a skewed distribution

Let's say I'm doing a DNN regression task on some skewed data distribution, and I'm currently using mean absolute error as the loss function.
Typical approaches in machine learning minimize the mean loss, but for skewed data that is inappropriate; from a practical point of view it is better to minimize the median loss. I think one way is to penalize big losses with some coefficient, so that the mean gets closer to the median. But how do I calculate that coefficient for an unknown distribution type? Are there other approaches? What can you advise?
(I am using TensorFlow/Keras)
Just use the mean absolute error loss function in Keras instead of the mean squared error.
Minimizing the mean absolute error is pretty much equivalent to optimizing for the median, and in any case it is more robust to outliers and skewed data. You should have a look at all of the possible Keras losses:
https://keras.io/losses/
and obviously you can create your own too.
But for most data sets it just empirically turns out that mean squared error gets you better accuracy, so I would recommend at least trying both methods before settling on the mean absolute one.
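A minimal sketch of that suggestion, assuming a tf.keras model: build the same architecture twice, compile once with MAE and once with MSE, and compare them on a held-out set before settling on one.

import tensorflow as tf

def build(loss):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss=loss, metrics=["mae", "mse"])
    return model

model_mae = build("mean_absolute_error")  # roughly optimizes for the median
model_mse = build("mean_squared_error")   # optimizes for the mean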
If you have skewed error distributions, you can use tfp.stats.percentile as your Keras loss function, with something like:
import tensorflow as tf
import tensorflow_probability as tfp

def loss_fn(y_true, y_pred):
    # median (50th percentile) of the absolute errors, rather than their mean
    return tfp.stats.percentile(tf.abs(y_true - y_pred), q=50)

model.compile(loss=loss_fn)
It gives gradients, so it works with Keras, although it isn't as fast as MAE/MSE.
https://www.tensorflow.org/probability/api_docs/python/tfp/stats/percentile
Customizing loss (/objective) functions is tough. Keras does theoretically allow you to do this, though they seem to have removed the documentation specifically describing it in their 2.0 release.
You can check their docs on loss functions for ideas, and then head over to the source code to see what kind of API you should implement.
However, there are a number of issues filed by people who are having trouble with this, and the fact that they've removed the documentation on it is not inspiring.
Just remember that you have to use Keras' own backend to compute your loss function. If you get it working, please write a blog post or update with an answer here, because this is something quite a few other people have struggled with or are still struggling with!
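For reference, here is a minimal sketch of a custom loss written entirely with Keras backend ops, as the answer above recommends; the asymmetric absolute error used here is just a hypothetical example.

from tensorflow.keras import backend as K

def asymmetric_mae(y_true, y_pred):
    diff = y_true - y_pred
    # penalize under-prediction (diff > 0) twice as hard as over-prediction
    return K.mean(K.switch(diff > 0, 2.0 * K.abs(diff), K.abs(diff)), axis=-1)

# model.compile(optimizer="adam", loss=asymmetric_mae)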

Things to try when a Neural Network is not converging

One of the most popular questions regarding neural networks seems to be:
Help!! My neural network is not converging!!
See here, here, here, here and here.
So, after eliminating any error in the implementation of the network, what are the most common things one should try?
I know that the things to try will vary widely depending on the network architecture.
But by tweaking which parameters (learning rate, momentum, initial weights, etc.) and implementing which new features (windowed momentum?) were you able to overcome similar problems while building your own neural net?
Please give answers which are language agnostic if possible. This question is intended to give some pointers to people stuck with neural nets which are not converging.
If you are using ReLU activations, you may have a "dying ReLU" problem. In short, under certain conditions, any neuron with a ReLU activation can be subject to a (bias) adjustment that leads to it never being activated again. It can be fixed with a "Leaky ReLU" activation, which is well explained in that article.
For example, I built a simple MLP (3-layer) network with ReLU outputs which failed. I provided data it could not possibly fail on, and it still failed. I turned the learning rate way down, and it failed more slowly. It always converged to predicting each class with equal probability. It was all fixed by using a Leaky ReLU instead of a standard ReLU.
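A minimal sketch (tf.keras) of that fix: replacing the plain ReLU activations with LeakyReLU layers, so that units pushed into the negative region still receive a small gradient.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(20,)),
    tf.keras.layers.LeakyReLU(0.01),  # small negative slope instead of a hard zero
    tf.keras.layers.Dense(64),
    tf.keras.layers.LeakyReLU(0.01),
    tf.keras.layers.Dense(3, activation="softmax"),
])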
If we are talking about classification tasks, then you should shuffle the examples before training your net. That is, don't feed your net thousands of examples of class #1 followed by thousands of examples of class #2, etc. If you do that, your net most probably won't converge, but will tend to predict the last trained class.
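A minimal sketch (numpy) of that shuffling step: permute the features and labels with the same random order before training, so the classes are interleaved.

import numpy as np

x_train = np.random.rand(1000, 10)
y_train = np.repeat([0, 1], 500)                 # sorted by class -- the bad case described above

perm = np.random.permutation(len(x_train))
x_train, y_train = x_train[perm], y_train[perm]  # classes now interleaved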
I faced this problem while implementing my own backprop neural network. I tried the following:
Implemented momentum (and kept the value at 0.5)
Kept the learning rate at 0.1
Charted the error, the weights, and the input as well as output of each and every neuron; seeing the data as a graph is very helpful for figuring out what is going wrong
Tried out different activation functions (all sigmoid). This did not help me much.
Initialized all weights to random values between -0.5 and 0.5 (my network's output was in the range -1 to 1)
I did not try this, but gradient checking can be helpful as well (see the sketch after this list)
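A minimal sketch of such a gradient check on a toy loss: compare the analytic gradient against a central finite-difference estimate and make sure they agree.

import numpy as np

def loss(w):
    return 0.5 * np.sum(w ** 2)  # toy loss; its gradient is simply w

w = np.random.randn(5)
analytic = w
eps = 1e-5
numeric = np.array([
    (loss(w + eps * np.eye(5)[i]) - loss(w - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])
assert np.allclose(analytic, numeric, atol=1e-6)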
If the problem is only convergence (not the actual "well-trained network", which is way too broad a problem for SO), then once the code is OK the only thing that can be the problem is the training method's parameters. If one uses naive backpropagation, these parameters are the learning rate and momentum. Nothing else matters, because for any initialization and any architecture, a correctly implemented neural network should converge for a good choice of these two parameters (in fact, for momentum = 0 it should converge to some solution too, for a small enough learning rate).
In particular, there is a good heuristic approach called "resilient backprop" (Rprop), which is in fact a parameterless approach that should (almost) always converge (assuming a correct implementation).
After you've tried different meta-parameters (optimization/architecture), the most probable place to look is THE DATA.
As for myself, to minimize fiddling with meta-parameters I keep my optimizer automated: Adam is my optimizer of choice.
There are some rules of thumb regarding application vs. architecture, but it's really best to work those out on your own.
To the point:
In my experience, after you've debugged the net (the easy debugging) and it still doesn't converge, or it gets stuck in an undesired local minimum, the usual suspect is the data.
Whether you have contradictory samples or just incorrect ones (outliers), a small amount can make the difference between, say, 0.6 accuracy and (after cleaning) 0.9 accuracy.
A smaller but golden (clean) dataset is much better than a big, slightly dirty one.
With augmentation you can push results even further.
