I created my first simple neural net on paper. It has 5 inputs (data: float numbers from 0.0 to 10.0) and one output, with no hidden layers. For example, at the start my weights are [0.2, 0.2, 0.15, 0.15, 0.3]. The result should be in the same range as the input data (0.0-10.0). For example, the network returned 8 when the right answer is 8.5. How will backprop change the weights? I know how gradient descent works, but I can't understand how I should compute the partial derivatives with respect to each weight. Help, please. I can elaborate on something if you need. I would also appreciate any literature recommendations (if possible, in simple English).
Start with the first resource below, then continue to the second and third; I believe you will then have a pretty strong understanding of how neural networks work.
1. Andrew Ng's Coursera videos, especially Lectures 9.1, 9.2, and 9.4.
2. Chapter 4 of Tom Mitchell's Machine Learning book.
3. Raul Rojas' Neural Networks: A Systematic Introduction, Chapters 4, 6, and 7. This is long, but very easy to follow and understand, and it is a very nice and complete book (also freely available from the author's website).
It's essential to start by understanding how a single perceptron is trained (which is what you have done). Once that is done, the rest will not be too difficult.
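To make the single-perceptron case concrete, here is a minimal Python sketch of one gradient-descent update for a linear neuron with squared error. Only the initial weights come from the question; the inputs, target, and learning rate are made-up values:

    # One gradient-descent step for a linear neuron y = w . x with squared
    # error E = 0.5 * (y - target)**2, so dE/dw_i = (y - target) * x_i.
    x = [5.0, 8.0, 9.0, 9.0, 9.0]     # hypothetical inputs in 0.0-10.0
    w = [0.2, 0.2, 0.15, 0.15, 0.3]   # initial weights from the question
    target = 8.5
    lr = 0.01                         # hypothetical learning rate

    y = sum(wi * xi for wi, xi in zip(w, x))  # forward pass; here y = 8.0
    error = y - target                        # -0.5, as in the question
    w = [wi - lr * error * xi for wi, xi in zip(w, x)]

Each weight moves in proportion to its own input, which is exactly the partial derivative the question is asking about: dE/dw_i = (y - target) * x_i.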
Here is my data (simplified):
Athlete   Age    Competition   Result (m)
--------------------------------------------
Alex 10.2 CompA 3.2
Alex 11.5 CompB 4.3
...
Bob 9.9 CompC 3.5
Bob 10.7 CompD 5.6
...
Dave 10.3 CompB 5.2
Dave 11.6 CompD 6.3
....
So my data is about a set of children at different ages (8-28) and their long jump results in different competitions.
What I want to know:
Given a new child, Paul, if we know his history (ages 8-16, for example), how do we forecast his future results (say at ages 18, 20, and 24)?
If we can group jumpers into classes A-E based on their best results, how do we predict which group Paul will be in in the future (say, when he is 18)?
I recently learned a bit about machine learning and deep learning, and I know this is a problem that can be solved using those models, but I'm confused about which models I am supposed to use.
Am I supposed to do the forecasting for Paul (the new child) based ONLY on Paul's history? Or am I supposed to do it using other children's data, like Alex's, Bob's, and Dave's?
Is this a time series forecasting problem, where I am supposed to use models like ARIMA, ARCH, or LSTM (RNN)?
Or is this a "normal" supervised or unsupervised regression or classification problem, where I am supposed to use textbook models like Linear Regression, Logistic Regression, KNN, NB, DT, SVM, Random Forest, ANN, DNN, CNN?
Any direction will be greatly appreciated.
The answer is both. Regression here just means there is no sigmoid activation on the output layer of the model, so it outputs a continuous value. You could use a time series model like an LSTM or GRU (though such a complex model may lead to overfitting) and have it perform regression. This way, the model will learn how other children perform, then use Paul's data to predict how well he will perform. This is not a classification problem: you are predicting continuous values, not classes, which means it has to be regression.
I would suggest reading books or taking tutorials; I love Deep Learning with Python.
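As a rough illustration of that suggestion, here is a hedged Keras sketch; the library choice, shapes, and toy data are my assumptions, not part of the question:

    # Sketch: LSTM regression (assumed: TensorFlow/Keras; random toy data).
    # Each training sequence is one child's history of (age, result) pairs;
    # the target is a later result, so the output layer is linear (no sigmoid).
    import numpy as np
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(4, 2)),   # 4 observations of (age, result)
        keras.layers.LSTM(16),       # kept small to limit overfitting
        keras.layers.Dense(1),       # linear output -> regression
    ])
    model.compile(optimizer="adam", loss="mse")

    X = np.random.rand(3, 4, 2)      # 3 children, 4 observations each (toy)
    y = np.random.rand(3, 1)         # their later results (toy)
    model.fit(X, y, epochs=10, verbose=0)

Training on all children and then calling model.predict on Paul's sequence is what lets the model transfer what it learned from Alex, Bob, and Dave.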
The problem you're trying to solve is usually called panel (or supervised) forecasting.
Whether or not to use data from other children is a practical question. You can compare models that use the data against models that use only Paul's data.
There is no need to use deep learning, though of course you can try it. Other standard machine learning algorithms (random forests, etc.) or statistical forecasting algorithms (ARIMA, etc.) can also be adapted to solve this kind of problem.
There are a few libraries that solve this problem off-the-shelf. One is pysf, which has a tutorial on weather data (https://github.com/alan-turing-institute/pysf/blob/master/examples/Walkthrough.ipynb); another one is gluon-ts (mostly deep learning methods).
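As one concrete illustration of that adaptation, here is a minimal sketch that reduces the panel problem to plain tabular regression; the toy data, the fixed observation ages, and the choice of RandomForestRegressor are my assumptions:

    # Sketch: reduce panel forecasting to tabular regression.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Toy panel data in the shape of the question's table.
    df = pd.DataFrame({
        "athlete": ["Alex", "Alex", "Alex", "Bob", "Bob", "Bob"],
        "age":     [10,     12,     14,     10,    12,    14],
        "result":  [3.2,    4.3,    5.1,    3.5,   5.6,   6.0],
    })

    # One row per child, one column per age.
    wide = df.pivot(index="athlete", columns="age", values="result")

    X = wide[[10, 12]]   # results at ages 10 and 12 as features
    y = wide[14]         # result at age 14 as the target

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    # For a new child like Paul, predict his age-14 result from ages 10 and 12.
    print(model.predict([[3.0, 4.5]]))

Note how this framing also answers the question about whose data to use: the model is trained on every child's history and only then applied to Paul's.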
I'm new to neural networks. I started off with this video for a general introduction to the subject from Martin Gorner (https://youtu.be/vq2nnJ4g6N0). At around 5:23 (https://youtu.be/vq2nnJ4g6N0?t=3230), he says that the image will be flattened out from 28x28 to 1x784. Why is this step necessary? Is it because the weights (W) that X will be multiplied with (a matrix product) have to match it in length? Is it just for the sake of the matrix product, or is it something else? Thanks in advance :)
Nothing like that; the aim is purely to make things easier for newbies. The fully connected neural network presented to you is the basic building block. Later you will learn more advanced architectures.
Images contain spatial structure, so it would be more suitable to use an architecture that is structure-aware, e.g. a Convolutional Neural Network (1, 2). In fact, if you continue watching the lecture, you will find exactly that.
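To make the shape argument concrete, here is a small numpy sketch (random weights, purely illustrative) of the matrix product a fully connected layer performs; the 784-long vector exists so that the product is defined, whereas a CNN would keep the 28x28 structure:

    # Sketch: why a fully connected layer needs a flattened input.
    # A dense layer computes y = xW + b, which requires x to be a vector
    # so the shapes of the matrix product line up.
    import numpy as np

    image = np.random.rand(28, 28)   # a single MNIST-style image
    x = image.reshape(1, 784)        # flatten: 28 * 28 = 784 features

    W = np.random.rand(784, 10)      # one weight column per output class
    b = np.zeros(10)

    logits = x @ W + b               # (1, 784) @ (784, 10) -> (1, 10)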
I am trying to learn Natural Language Processing and am stuck on an open-ended question: how do I group together sentences that mean the same thing? There can be a finite set of sentences that all have the same meaning. What kind of algorithms should I use to group them?
For example: Consider the following sentences:
There is a man. There is a lion. The lion will chase the man on seeing him. If the lion catches the man he dies.
There is a man and a lion. If the lion catches the man he dies. The lion will chase the man if he sees him.
You have a lion that chases men on seeing them. There is one man. If the lion catches the man he dies.
Basically what all these sentences say is this:
1 Lion. 1 Man. Lions chase men. If the lion catches the man, the man dies.
I am unable to zero in on one category of machine learning or deep learning algorithms that would help me achieve something like this. Please guide me in the right direction or point me to some algorithms that are good enough to achieve it.
Another important factor is having a scalable solution: there could be lots of such sentences out there. What happens then?
One possible solution is to use the parts of speech and the relations between words in a sentence as features for some machine learning algorithm. But will this be practical on a large set of sentences? Do we need to consider more things?
One deep-learning-based solution is to use word embeddings, which represent a word by a fixed-dimensional vector such that similar words lie close together in the embedding space (and even vector operations like Germany - Berlin ~= Italy - Rome may hold). Two famous word embedding techniques are Word2Vec and GloVe. Another option is to represent a whole sentence by a fixed-dimensional vector such that similar sentences lie close together in that embedding space; see Skip-Thought vectors.
So far we have only represented text (words/sentences) in a more semantic, numerical way. The next step is to capture the meaning of the current context (paragraphs, documents). A very naive approach is to simply average the word/sentence embeddings (you have to try this to see whether it works). A better way is to use some kind of sequence model, such as an RNN (in practice an LSTM or GRU), to capture whatever has been said before. The problem with sequence models is that they need supervision (labelled data), which I guess you don't have; in that case, train the sequence model in a language-modelling setting and take the hidden representation of the RNN/GRU/LSTM at the last time step (i.e., after reading the last word), or the aggregated word embeddings if you are using the naive approach.
Once you have these representations, you can apply any clustering technique to cluster the different paragraphs (you have to find an appropriate distance metric), or you can apply a distance metric directly and define or learn a threshold for when two paragraphs are similar enough to be categorized as one.
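Here is a minimal sketch of the naive averaging approach; the 3-dimensional vectors below are made up, and in practice you would load pre-trained Word2Vec or GloVe vectors instead:

    # Sketch: naive sentence similarity via averaged word embeddings.
    import numpy as np

    # Made-up toy embeddings; real ones would have 100-300 dimensions.
    embeddings = {
        "lion":  np.array([0.9, 0.1, 0.0]),
        "man":   np.array([0.1, 0.9, 0.0]),
        "chase": np.array([0.4, 0.4, 0.2]),
        "dies":  np.array([0.2, 0.2, 0.6]),
    }

    def sentence_vector(tokens):
        # Average the embeddings of the tokens we know (the naive approach).
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    s1 = sentence_vector(["lion", "chase", "man"])
    s2 = sentence_vector(["man", "chase", "lion", "dies"])
    print(cosine(s1, s2))  # near 1.0 -> candidates for the same cluster

The threshold on the cosine similarity (or the distance metric fed to a clustering algorithm) is exactly the part you would have to define or learn, as described above.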
I am trying to use a neural network to solve a problem. I learned about them from the Machine Learning course offered on Coursera, and was happy to find ruby-fann, Ruby bindings for the FANN neural network library, so I didn't have to re-invent the airplane.
However, based on what I learned from the class, I'm not really understanding why FANN is giving me such strange output.
I have a set of training data consisting of the results of matches. The player is given a number, their opponent is given a number, and the result is 1 for a win and 0 for a loss. The data is a little noisy because of upsets, but not terribly so. My goal is to find which rating gaps are more prone to upsets; for instance, my intuition tells me that lower-rated matches tend to entail more upsets because the ratings are less accurate.
So I got a training set of about 100 examples. Each example is (rating, delta) => 1/0. So it's a classification problem, but not really one that I think lends itself to logistic regression, and a neural network seemed more appropriate.
My code begins with:
training_data = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
I then set up the neural network with
network = RubyFann::Standard.new(
:num_inputs=>2,
:hidden_neurons=>[8, 8, 8, 8],
:num_outputs=>1)
In the class, I learned that a reasonable default is to give each hidden layer the same number of units. Since I don't really know how to work this or what I'm doing yet, I went with the default.
network.train_on_data(training_data, 1000, 1, 0.15)
(In ruby-fann, those arguments are max_epochs, epochs_between_reports, and desired_error, in that order.)
And then finally, I went through a set of sample input ratings in increments and, at each increment, increased delta until the result switched from being > 0.5 to < 0.5, which I took to be about 0 and about 1, although really they were more like 0.45 and 0.55.
When I ran this once, it gave me 0 for every input. I ran it again twice with the same data and got a decreasing trend of negative numbers and an increasing trend of positive numbers, completely opposite predictions.
I thought maybe I wasn't including enough features, so I added (rating**2 and delta**2). Unfortunately, then I started getting either my starting delta or my maximum delta for every input every time.
I don't really understand why I'm getting such divergent results or what Ruby-FANN is telling me, partly because I don't understand the library but also, I suspect, because I just started learning about neural networks and am missing something big and obvious. Do I not have enough training data, do I need to include more features, what is the problem and how can I either fix it or learn how to do things better?
What about playing a little with the parameters? First, I would highly recommend only two hidden layers; there should be a mathematical proof somewhere that this is enough for many problems. If you have too many neurons, your NN will not have enough epochs to really learn anything, so you can also play with the number of epochs as well as the learning rate. If you use a somewhat larger learning rate, your NN should learn a little faster (don't be afraid to try 0.3 or even 0.7); the right value usually depends on the weights' intervals and on input normalization.
Your NN most probably shows such different results because each run starts from a new random initialization, so each time you get a totally different network that learns differently from the previous one (different weights end up with higher values, so different parts of the NN learn the same things).
I am not familiar with this library; I am just sharing some experience with NNs. I hope something from this helps.
I'm attempting to make a classifier that chooses a rating (1-5) for an item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold-standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deals with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale require special handling. 'Machine mode' (i.e., returning a class label) seems insufficient because the class labels ignore the ordering relationship among them ('1st, 2nd, 3rd'); likewise, 'regression mode' (i.e., treating the ordinal labels as floats, {1, 2, 3}) is insufficient because it assumes a metric distance between the response variables (e.g., that the distance from 3 to 2 equals the distance from 2 to 1), which an ordinal scale does not guarantee.
R has (at least) several packages directed at ordinal regression. One of these is actually called ordinal, but I haven't used it. I have used the Design package in R for ordinal regression, and I can certainly recommend it. Design contains a complete set of functions for the solution, diagnostics, testing, and presentation of results of ordinal regression problems via the ordinal logistic model. Both packages are available from CRAN. A step-by-step solution of an ordinal regression problem using the Design package is presented on the UCLA Stats site.
Also, I recently looked at a paper by a group at Yahoo working on ordinal classification using Support Vector Machines. I have not attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, the latter two of which might work well enough for your purposes.
Furthermore, it looks like one of the available classifiers is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
Wei Chu and Zoubin Ghahramani, Gaussian processes for ordinal regression. Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 145-152.
The problems that doug has raised are all valid. Let me add another one: you didn't say how you would like to measure the agreement between the classification and the "gold standard". You have to answer that question as soon as possible, as it will have a huge impact on your next steps. In my experience, the most problematic part of most optimization tasks is the score function. Ask yourself: are all errors equal? Does misclassifying a "3" as a "4" have the same impact as classifying a "4" as a "3"? What about "1" vs. "5"? Can mistakenly missing one case have disastrous consequences (missing an HIV diagnosis, activating pilot ejection in a plane)?
The simplest way to measure the agreement between categorical classifiers is Cohen's kappa. More complicated methods exist as well.
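For example, here is a short sklearn sketch (the labels are made up); the quadratically weighted variant penalizes a 1-vs-5 confusion far more than a 3-vs-4 confusion, which suits ordinal ratings:

    # Sketch: Cohen's kappa for agreement between predicted and gold ratings.
    from sklearn.metrics import cohen_kappa_score

    gold = [1, 2, 3, 4, 5, 3, 2, 4]   # made-up gold-standard ratings
    pred = [1, 2, 4, 4, 5, 3, 1, 3]   # made-up classifier output

    print(cohen_kappa_score(gold, pred))                       # unweighted
    print(cohen_kappa_score(gold, pred, weights="quadratic"))  # ordinal-aware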
Having said that, sometimes picking a solution that "just works" instead of "the right one" is faster and easier. If I were you, I would pick a machine learning library (R, Weka; I personally love Orange) and see what I get. Only if you don't get reasonably good results with that should you look for more complex solutions.
If you are not interested in fancy statistics, a one-hidden-layer backpropagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error, which is not always what you want. The Support Vector Machines mentioned earlier are a good alternative.
FANN is a good library for backpropagation NNs; it also has some tools to assist in training the network.
There are two packages in R that might help in taming ordinal data:
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and works well with sklearn workflows such as pipelines, cross-validation, and scoring.
Through testing, I'm finding that it performs very well versus standard non-ordinal multiclass classification using SVC, and it gives much greater control over optimizing for precision and recall on the positive class. (In my testing, I used sklearn's diabetes dataset and transformed the disease-progression target (y) into low, medium, and high class labels.) Testing via cross-validation is in my repo along with attribution; scoring is based on weighted F1.
https://github.com/leeprevost/OrdinalClassifier
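For intuition, here is a minimal sketch of the Frank and Hall (2001) decomposition that OrdinalClassClassifier and classifiers like this build on: train K-1 binary models for P(rating > k) and recover per-class probabilities by differencing. The base estimator and toy data are my assumptions:

    # Sketch of the Frank & Hall ordinal decomposition: K-1 binary
    # classifiers, where classifier k estimates P(rating > k).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))                 # 200 items, 4 features (toy)
    y = np.clip((X[:, 0] * 2 + 3).round(), 1, 5)  # toy ordinal ratings 1-5

    classes = np.arange(1, 6)
    models = [LogisticRegression().fit(X, (y > k).astype(int))
              for k in classes[:-1]]              # thresholds 1..4

    def predict(X):
        # P(y > k) for each threshold, then P(y = k) by differencing.
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
        probs = np.column_stack([1 - gt[:, 0],
                                 gt[:, :-1] - gt[:, 1:],
                                 gt[:, -1]])
        return classes[np.argmax(probs, axis=1)]

    print(predict(X[:5]))
    print(y[:5])

The differenced "probabilities" can occasionally come out slightly negative because the K-1 models are trained independently; the argmax still gives a usable ordinal prediction, which is the simplicity the Frank and Hall paper advertises.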