I want to implement a neural network that guesses whether a user will pick true or false.
What would be most appropriate design? I'm a newbie on such things, but I think of the following:
have 128 inputs, for 128 previous guesses (memory if you will).
The first input is the latest guess, the second is the previous one, the third is 2 turns back, and so on.
I'm thinking of having hidden fully connected layer of 256 nodes and the final single layer for the output.
This is how I'd represent my inputs: 0 - not yet guessed, 0.5 - the guess was false, 1 - the guess was true
And this is the value of the outputs: 0 is false and 1 is true (I'd convert this to input format when output becomes the input)
Now, the main fear I have with this design is that inputs are moving. That is, the first input in first guess will be second in another guess, third in another and I fear there will be no reusability and logic will be duplicated and guesses will be rock solid.
Is my fear baseless? Should I just let neural network do the thinking for me and be done with it?
I remember seeing something on YouTube once about memory in neural networks where inputs are fed one by one. Am I doing this the wrong way, and is there a special way of having memory in a neural net?
If my design is wrong what design would you suggest?
First things first, you should be asking yourself why you want to use neural networks. Have you tried simpler models (with fewer parameters) first, for example Hidden Markov Models? Do you have enough data to train a network beyond a couple of layers?
Assuming that you're sure that you want to use a neural network, we should start at the beginning, i.e. the input. Your input is a categorical feature, so it should be encoded in a suitable way. One common way to encode such features is one-hot encoding. So in your case your inputs would look like:
NYG = [1, 0, 0]
T = [0, 1, 0]
F = [0, 0, 1]
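As a rough illustration of the encoding above, here is a minimal Python sketch that turns a history of guesses into a fixed-length list of one-hot vectors. The function name `encode_history`, the string labels and the padding scheme are assumptions made for this example, not part of the question.

```python
# Hypothetical labels for the three categories listed above.
ONE_HOT = {
    "NYG": [1, 0, 0],  # not yet guessed
    "T": [0, 1, 0],    # the guess was true
    "F": [0, 0, 1],    # the guess was false
}


def encode_history(history, length):
    """Return `length` one-hot vectors, newest guess first, padded with NYG."""
    recent = list(reversed(history))[:length]
    padded = recent + ["NYG"] * (length - len(recent))
    return [ONE_HOT[label] for label in padded]


# History is given oldest first: the user guessed T, then F, then F.
encoded = encode_history(["T", "F", "F"], length=5)
print(encoded)
```

With a recurrent network you would feed these vectors one time step at a time instead of concatenating them into a single wide input.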
Once you have your input data formatted you can think about a network architecture. While there are a few ways to tackle every problem, it sounds like you might want to use recurrent neural networks, which can output a classification based on a sequence of inputs. I'd suggest starting with a couple of smallish layers (64 nodes) of LSTM cells and seeing how you get on; then you can think about how to grow from there (if you have enough data).
I really can't stress enough that you shouldn't just jump into neural networks. They can be very useful, or they can add massive confusion where it's just not needed.
I am totally new in machine learning and I have a problem, which I want to solve using some AI or so. I would appreciate, if you will recommend me some concrete algorithms, neural network architectures or some related reading.
I am doing research on predicting users' intent based on mouse movement. Currently I am in an analysis phase without a concrete dataset. The goal is to predict the target of the user's intention (e.g. the button the user will click) by predicting the mouse trajectory.
Let me introduce the problem
I have a lot of sequences. The length of each sequence may vary. As an input I will pass some smaller sequence, for which I want to predict next x values. So I want to know next possible sequence (or more possible sequences). The length of output sequence (x) could also be variable. Maybe sequence ends here? Prediction should be done in “real time”.
So what are those sequences?
Sequence represents directions of movement in 2-dimensional space after some preprocessing. Each value is integer of the interval <0,8>. Algorithm should be capable of increasing upper limit of the interval (16, 32, ...). Actually, the value is interpolated angle.
Three example sequences. Real sequences will be much bigger.
How do I imagine the solution?
Sequences will be clustered based on some similarities. When a dataset of sequences is made, some neural network will be trained to retrieve sequences, which contain the input sequence as a subsequence, as quick as possible.
Clustering
Matching subsequence should have some tolerance. Sequence [3, 3, 3, 3, 2] is similar to [3, 3, 4, 3, 2] = deviation tolerance*. Or sequence [4, 3, 3, 2] is also similar to [4, 3, 3, 3, 3, 2] = tolerance on values repeated continuously.
*I can tell the difference between two values as relative number - 0% the same direction => 100% opposite direction.
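The two kinds of tolerance described above could be sketched roughly as follows. This is only one possible reading: it assumes 8 evenly spaced directions that wrap around, and the helper names (`collapse_runs`, `direction_gap`, `similar`) and the 25% threshold are made up for the example.

```python
def collapse_runs(seq):
    """Drop consecutive repeats, e.g. [4, 3, 3, 2] -> [4, 3, 2]."""
    out = []
    for value in seq:
        if not out or out[-1] != value:
            out.append(value)
    return out


def direction_gap(a, b, n_dirs=8):
    """Relative difference of two directions: 0.0 = same, 1.0 = opposite.

    Assumes n_dirs evenly spaced directions that wrap around.
    """
    diff = abs(a - b) % n_dirs
    diff = min(diff, n_dirs - diff)
    return diff / (n_dirs / 2)


def similar(seq_a, seq_b, max_gap=0.25):
    """True for equal-length sequences whose elements differ by at most max_gap."""
    if len(seq_a) != len(seq_b):
        return False
    return all(direction_gap(a, b) <= max_gap for a, b in zip(seq_a, seq_b))


# Deviation tolerance: one element is off by a single direction step.
print(similar([3, 3, 3, 3, 2], [3, 3, 4, 3, 2]))
# Tolerance on repeated values: both collapse to [4, 3, 2].
print(collapse_runs([4, 3, 3, 2]) == collapse_runs([4, 3, 3, 3, 3, 2]))
```

Raising the upper limit of the interval (16, 32, ...) would only change `n_dirs` in this sketch.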
If input is [ 1,2,2,2 ] - red - the output should be [ 4,3,2,2 ].
If input is [ 3,3,3,2 ] - blue - the output should be [ 2 ].
Neural network
After some research I found the Hopfield network, which should give the most similar sequence. But then I realized that my sequences lengths are variable and Hopfield network architecture expects binary values.
I could somehow create the binary representation of sequence but I have no idea how to manage the lengths which may vary.
Let’s make it to another level
What if every value in sequence is not a scalar but velocity vector (d, s), where d is direction and s is speed?
Related questions
Can neural networks be trained “online”? So no need to know previous train dataset, just give new dataset.
Can neural networks be trained on server side (e.g. python) but used for prediction on client side (javascript)?
Can neural networks have some kind of “short term memory” - prediction will be affected by 2-3 previous predictions?
Most important - should I use neural networks or some another approach?
Thanks to everyone.
Feel free to correct my English.
Can neural networks be trained “online”? So no need to know previous
train dataset, just give new dataset.
Typically, you don't train an ANN continuously. You train it until your error is within tolerance, then use that model going forward to make predictions. You could store the information and retrain the network every night if you want to periodically adjust the model, but odds are that's not going to offer much improvement, and it runs the risk of a prolonged run of bad data skewing your model.
Can neural networks be trained on server side (e.g. python) but used
for prediction on client side (javascript)?
It depends. Do you intend to use a trained model for the client prediction, or do you intend for the user actions to live-train the model, which is then immediately used for prediction?
If the model is already trained, you can use it for prediction of user events.
If the model is not trained, you run the risk of bad data corrupting the model.
Live-training like that would also require a constant update of the model settings on the client side with the new model generated by the server.
Can neural networks have some kind of “short term memory” - prediction
will be affected by 2-3 previous predictions?
Using previous predictions as input is not recommended. It introduces entropy to the system which can allow the model to drastically deviate from reliable predictions if it makes a few bad predictions in a row. You can try it, in which case you'll need n*k additional nodes on your input layer, where n is the number of previous predictions you want to use, and k is the number of output values in a prediction.
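To make the n*k bookkeeping concrete, here is a small Python sketch that appends the last n predictions (k values each) to a feature vector. The function name `build_input` and the zero padding for missing history are assumptions of this example.

```python
def build_input(features, previous_predictions, n, k):
    """Append the last n predictions (k values each) to the feature vector.

    Missing history is padded with zeros (an assumption of this sketch).
    """
    recent = previous_predictions[-n:] if n > 0 else []
    padding = [[0.0] * k] * (n - len(recent))
    extra = [value for pred in padding + recent for value in pred]
    return list(features) + extra


# Two ordinary features, two stored predictions, room for n = 3 of them.
x = build_input([0.2, 0.7], [[1.0, 0.0], [0.0, 1.0]], n=3, k=2)
print(len(x))  # 2 features + n*k extra inputs
```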
Most important - should I use neural networks or some another approach?
ANNs are very useful for predicting things. The biggest problem is defining the scope and the relevant, reliable data necessary to make a prediction. I've made ANNs which predict market volatility in video games, with thousands of input values, but predicting mouse movements is going to be a challenge. Nothing is stopping a user from moving the mouse in a circle for hours, or leaving the cursor in one spot. Each time you sample such an action, it's going to make your model more likely to predict that type of behavior. Good training data and a controlled environment are essential. Video games would make for a bad environment for predicting mouse movement, as user behavior is dependent on more than previous mouse movements. Websites would be a favorable environment though, as during a session a user navigates in predictable ways through a finite space.
I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series by using Recurrent Neural Networks.
Data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical, though an attribute is day (Sunday: 0, Monday:1 etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using RNN. I don't know much, but I read some stuff and also this. I think about a structure like this.
I will have 10 input neurons, a hidden layer and 1 output neuron. I don't know how to decide on how many neurons the hidden layer will have.
I guess that I need a matrix to connect input layer to hidden layer, a matrix to connect hidden layer to output layer, and a matrix to connect hidden layers in neighbouring time-steps, t-1 to t, t to t+1. That's total of 3 matrices.
In one tutorial, activation function was sigmoid, although I'm not sure exactly, if I use sigmoid function, I will only get output between 0 and 1. What should I use as activation function? My plan is to repeat this for n times:
For each training data:
Forward propagate
Propagate the input to hidden layer, add it to propagation of previous hidden layer to current hidden layer. And pass this to activation function.
Propagate the hidden layer to output.
Find error and its derivative, store it in a list
Back propagate
Find current layers and errors from list
Find current hidden layer error
Store weight updates
Update weights (matrices) by multiplying them by learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
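For reference, the forward pass with the three matrices described above might look like the following plain-Python sketch. It is a simplification under stated assumptions: biases are omitted, the hidden size of 16 and the random weights are illustrative, the hidden layer uses tanh, and the output neuron is linear so that it can produce real values outside 0 to 1. Backpropagation through time is not shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random starting weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_forward(inputs, w_in, w_rec, w_out):
    """Run a simple recurrent net over a sequence; return one output per step."""
    hidden = [0.0] * len(w_rec)
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]       # squashing hidden activation
        outputs.append(matvec(w_out, hidden)[0])   # linear output, not limited to 0..1
    return outputs


rng = random.Random(0)
n_inputs, n_hidden = 10, 16  # sizes are illustrative
w_in = random_matrix(n_hidden, n_inputs, rng)   # input -> hidden
w_rec = random_matrix(n_hidden, n_hidden, rng)  # hidden(t-1) -> hidden(t)
w_out = random_matrix(1, n_hidden, rng)         # hidden -> output

days = [[rng.random() for _ in range(n_inputs)] for _ in range(7)]
predictions = rnn_forward(days, w_in, w_rec, w_out)
print(len(predictions))
```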
It seems to be the correct way to do it, if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach and, as Marcin's comment says, almost everyone who constructs neural nets for practical use does so by using packages that provide a ready-made neural network implementation. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. There are many empirical rules people have established out of experience, and the right number of neurons is decided by trying out various combinations and comparing the output. A good starting point would be 3/2 times the number of input plus output neurons, i.e. (10+1)*(3/2) = 16.5, so you could start with 16 or 17 neurons in the hidden layer and then reduce the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many activation functions to choose from, such as hyperbolic tangent, logistic, RBF, etc. A good starting point would be the logistic function, but again you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
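Squashing activation functions such as the logistic sigmoid give an output between 0 and 1 (hyperbolic tangent gives -1 to 1), so if the output neuron uses one you will have to use a multiplier to convert its output to real values, or have some kind of encoding with multiple output neurons. Alternatively, a linear (identity) activation on the output neuron produces unbounded real values directly. Coding this manually will be complicated.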
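One common way to do the conversion to and from real values is min-max scaling of the targets. The sketch below assumes the range is taken from the training targets only; the rental counts are made-up numbers and the function names are hypothetical.

```python
def fit_min_max(values):
    """Remember the range of the training targets."""
    return min(values), max(values)


def scale(value, lo, hi):
    """Map a real target into [0, 1] for a sigmoid output neuron."""
    return (value - lo) / (hi - lo)


def unscale(value, lo, hi):
    """Map a network output in [0, 1] back to the original units."""
    return lo + value * (hi - lo)


rentals = [431, 1600, 8714, 22]  # made-up daily rental counts
lo, hi = fit_min_max(rentals)
scaled = [scale(r, lo, hi) for r in rentals]
restored = [unscale(s, lo, hi) for s in scaled]
print(scaled)
```

Values outside the training range (for example a record day in 2012) would map outside 0 to 1, which a sigmoid output cannot reach, so leaving some headroom in the range may be worthwhile.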
Another aspect to consider would be your training iterations. Doing it 'n' times doesn't help. You need to find the optimal training iterations with trial and error as well to avoid both under-fitting and over-fitting.
The correct way to do it would be to use packages in Python or R, which will allow you to train neural nets with large amount of customization quickly, where you can train and test multiple nets with different activation functions (and even different training algorithms) and network architecture without too much hassle. With some amount of trial and error, you will eventually find the net that gives you desirable output.
So one of the standard things to do with the data is normalize it and standardize it to have data that's normally distributed with a mean 0 and standard deviation of 1, right? But, what if the data is NOT normally distributed?
Also, does the desired output have to be normally distributed too? What if I want my feedforward net to classify between two classes (-1 and 1)? That would be impossible to standardize into a normal distribution with mean 0 and std 1, right?
Feedforward nets are non-parametric, right? So if they are, is it still important to standardize data? And why do people standardize it?
Standardizing the features isn't to make the data fit a normal distribution, it's to put the feature values in a known range that makes it easier for algorithms to learn from the data. This is because most algorithms are not scale/shift invariant. Decision Trees, for example, are both scale and shift invariant, and so doing the normalization has no impact on the performance of the tree.
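A short sketch of what standardization does, using only the Python standard library (the sample values are made up). It shifts and rescales the feature; it does not change the shape of its distribution, so skewed data stays skewed.

```python
import statistics


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


skewed = [1, 1, 1, 2, 2, 3, 50]  # clearly not normally distributed
z = standardize(skewed)
print(z)  # the last value is still a far outlier after standardizing
```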
Also, does the desired output have to be normally distributed too?
No. That's not a thing. The output is whatever the output is. You do have to make sure the activation function of the final layer of your network can make the predictions you want (i.e.: Sigmoid activation can't output negative values or values > 1).
Feedforward nets are non-parametric right?
No, they would generally be considered parametric. Parametric / non-parametric doesn't really have a hard definition. People may mean slightly different things when talking about this.
So if they are, is it still important to standardize data?
Those things have nothing to do with each other at all.
And why do people standardize it?
That's the very first thing I mention, it's to make learning easier/possible.
First, let me apologize for cramming three questions in that title. I'm not sure what better way is there.
I'll get right to it. I think I understand feedforward neural networks pretty well.
But LSTM really escapes me, and I feel maybe this is because I don't have a very good grasp of recurrent neural networks in general. I have gone through Hinton's and Andrew Ng's courses on Coursera. A lot of it still doesn't make sense to me.
From what I understood, recurrent neural networks are different from feedforward neural networks in that past values influence the next prediction. Recurrent neural networks are generally used for sequences.
The example I saw of recurrent neural network was binary addition.
010
+ 011
A recurrent neural network would take the right most 0 and 1 first, output a 1. Then take the 1,1 next, output a zero, and carry the 1. Take the next 0,0 and output a 1 because it carried the 1 from last calculation. Where does it store this 1? In feed forward networks the result is basically:
y = a(w*x + b)
where w = weights of connections to previous layer
and x = activation values of previous layer or inputs
How is a recurrent neural network calculated? I am probably wrong, but from what I understood, recurrent neural networks are pretty much feedforward neural networks with T hidden layers, T being the number of timesteps. Each hidden layer takes the X input at its timestep, and its outputs are then added to the next respective hidden layer's inputs.
a(l) = a(w*x + b + pa)
where l = current timestep
and x = value at current timestep
and w = weights of connections to input layer
and pa = past activation values of hidden layer
such that neuron i in layer l uses the output value of neuron i in layer l-1
y = o(w*a(l-1) + b)
where w = weights of connections to last hidden layer
But even if I understood this correctly, I don't see the advantage of doing this over simply using past values as inputs to a normal feedforward network (sliding window or whatever it's called).
For example, what is the advantage of using a recurrent neural network for binary addition instead of training a feedforward network with two output neurons, one for the binary result and the other for the carry, and then taking the carry output and plugging it back into the feedforward network?
However, I'm not sure how is this different than simply having past values as inputs in a feedforward model.
It seems to me that the more timesteps there are, recurrent neural networks are only a disadvantage over feedforward networks because of vanishing gradient. Which brings me to my second question, from what I understood, LSTM is a solution to the problem of vanishing gradient. But I have no actual grasp of how they work. Furthermore, are they simply better than recurrent neural networks, or are there sacrifices to using a LSTM?
What is a Recurrent neural network?
The basic idea is that recurrent networks have loops. These loops allow the network to use information from previous passes, which acts as memory. The length of this memory depends on a number of factors but it is important to note that it is not indefinite. You can think of the memory as degrading, with older information being less and less usable.
For example, let's say we just want the network to do one thing: Remember whether an input from earlier was 1, or 0. It's not difficult to imagine a network which just continually passes the 1 around in a loop. However every time you send in a 0, the output going into the loop gets a little lower (This is a simplification, but displays the idea). After some number of passes the loop input will be arbitrarily low, making the output of the network 0. As you are aware, the vanishing gradient problem is essentially the same, but in reverse.
Why not just use a window of time inputs?
You offer an alternative: a sliding window of past inputs being provided as current inputs. That is not a bad idea, but consider this: while the RNN's memory may erode over time, with a window you will always lose the entirety of your time information once your window ends. And while you would remove the vanishing gradient problem, you would have to increase the number of weights of your network several times over. Having to train all those additional weights will hurt you just as badly as (if not worse than) the vanishing gradient.
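As a back-of-the-envelope comparison of the weight counts, here is a tiny Python sketch. The sizes (10 features, a 50-step window, 16 hidden units) are illustrative assumptions, and biases and output weights are ignored.

```python
def window_weights(n_features, window, n_hidden):
    """Input-to-hidden weights for a feedforward net fed `window` time steps."""
    return n_features * window * n_hidden


def recurrent_weights(n_features, n_hidden):
    """Input-to-hidden plus hidden-to-hidden weights of a simple RNN."""
    return n_features * n_hidden + n_hidden * n_hidden


print(window_weights(10, 50, 16))   # 10 * 50 * 16
print(recurrent_weights(10, 16))    # 10 * 16 + 16 * 16
```

The windowed count grows linearly with the window length, while the recurrent count does not depend on how far back the sequence goes.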
What is an LSTM network?
You can think of LSTM as a special type of RNN. The difference is that LSTM is able to actively maintain self-connecting loops without them degrading. This is accomplished through a somewhat fancy activation, involving an additional "memory" output for the self-looping connection. The network must then be trained to select what data gets put onto this bus. By training the network to explicitly select what to remember, we don't have to worry about new inputs destroying important information, and the vanishing gradient doesn't affect the information we decided to keep.
There are two main drawbacks:
It is more expensive to calculate the network output and apply back propagation. You simply have more math to do because of the complex activation. However this is not as important as the second point.
The explicit memory adds several more weights to each node, all of which must be trained. This increases the dimensionality of the problem, and potentially makes it harder to find an optimal solution.
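To give a feel for the gating and for the extra weights, here is a sketch of one step of a single-unit LSTM in the common formulation (input, forget and output gates plus a candidate update, no peephole connections). The parameter values are set by hand for illustration, not trained, and the names are assumptions of this example. Note the four sets of weights per unit, compared with one for a plain recurrent unit.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM; p maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)        # how much new information to write
    f = gate("forget", sigmoid)       # how much old memory to keep
    o = gate("output", sigmoid)       # how much memory to expose
    g = gate("candidate", math.tanh)  # the new information itself
    c = f * c_prev + i * g            # updated memory cell
    h = o * math.tanh(c)              # updated output
    return h, c


# Hand-picked parameters: keep the memory, ignore incoming inputs.
params = {
    "input": (0.0, 0.0, -10.0),
    "forget": (0.0, 0.0, 10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0  # start with a stored value of 1.0
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)
print(c)  # still close to the stored value after 100 steps
```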
Is it always better?
Which structure is better depends on a number of factors, like the number of nodes you need for your problem, the amount of available data, and how far back you want your network's memory to reach. However, if you only want the theoretical answer, I would say that given infinite data and computing speed, an LSTM is the better choice. One should not take this as practical advice, though.
A feed forward neural network has connections from layer n to layer n+1.
A recurrent neural network allows connections from layer n to layer n as well.
These loops allow the network to perform computations on data from previous cycles, which creates a network memory. The length of this memory depends on a number of factors and is an area of active research, but could be anywhere from tens to hundreds of time steps.
To make it a bit more clear, the carried 1 in your example is stored in the same way as the inputs: in a pattern of activation of a neural layer. It's just the recurrent (same layer) connections that allow the 1 to persist through time.
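The idea of the carry living in state that persists between time steps can be shown without any learning at all. The sketch below is a hand-written update rule, not a trained network: the variable `carry` plays the role of the recurrent activation, and one pair of bits is fed per time step, rightmost first. The function name is hypothetical.

```python
def add_binary(a_bits, b_bits):
    """Add two equal-length bit strings, feeding one bit pair per time step."""
    carry = 0  # the recurrent state, persisting from one step to the next
    out = []
    for a, b in zip(reversed(a_bits), reversed(b_bits)):  # rightmost first
        total = int(a) + int(b) + carry
        out.append(str(total % 2))  # output for this time step
        carry = total // 2          # state handed to the next time step
    return "".join(reversed(out))   # a final carry, if any, is dropped


print(add_binary("010", "011"))
```

A trained RNN has to discover an equivalent of `carry` in its hidden activations on its own.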
Obviously it would be infeasible to replicate every input stream for more than a few past time steps, and choosing which historical streams are important would be very difficult (and lead to reduced flexibility).
LSTM is a very different model which I'm only familiar with by comparison to the PBWM model, but in that review LSTM was able to actively maintain neural representations indefinitely, so I believe it is more intended for explicit storage. RNNs are more suited to non-linear time series learning, not storage. I don't know if there are drawbacks to using LSTM rather than RNNs.
Both RNN and LSTM can be sequence learners. RNN suffers from the vanishing gradient problem. This problem causes the RNN to have trouble remembering values of past inputs after more than approximately 10 timesteps (an RNN can remember previously seen inputs for only a few time steps).
LSTM is designed to solve the vanishing gradient problem in RNN. LSTM has the capability of bridging long time lags between inputs. In other words, it is able to remember inputs from up to 1000 time steps in the past (some papers even claim it can go further than this). This capability makes LSTM advantageous for learning long sequences with long time lags. Refer to Alex Graves's Ph.D. thesis, Supervised Sequence Labelling with Recurrent Neural Networks, for some details. If you are new to LSTM, I recommend Colah's blog for a super simple and easy explanation.
However, recent advances in RNN also claim that with careful initialization, RNN can also learn long sequences with performance comparable to LSTM; see the paper A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.
I am trying to use a neural network to solve a problem. I learned about them from the Machine Learning course offered on Coursera, and was happy to find that FANN is a Ruby implementation of neural networks, so I didn't have to re-invent the airplane.
However, I'm not really understanding why FANN is giving me such strange output. Based on what I learned from the class,
I have a set of training data that's results of matches. The player is given a number, their opponent is given a number, and the result is 1 for a win and 0 for a loss. The data is a little noisy because of upsets, but not terribly so. My goal is to find which rating gaps are more prone to upsets - for instance, my intuition tells me that lower-rated matches tend to entail more upsets because the ratings are less accurate.
So I got a training set of about 100 examples. Each example is (rating, delta) => 1/0. So it's a classification problem, but not really one that I think lends itself to a logistic regression-type chart, and a neural network seemed more correct.
My code begins
training_data = RubyFann::TrainData.new(:inputs => inputs, :desired_outputs => outputs)
I then set up the neural network with
network = RubyFann::Standard.new(
  :num_inputs => 2,
  :hidden_neurons => [8, 8, 8, 8],
  :num_outputs => 1)
In the class, I learned that a reasonable default is to have each hidden layer with the same number of units. Since I don't really know how to work this or what I'm doing yet, I went with the default.
network.train_on_data(training_data, 1000, 1, 0.15) # max epochs, epochs between reports, desired error (MSE)
And then finally, I went through a set of sample input ratings in increments and, at each increment, increased delta until the result switched from being > 0.5 to < 0.5, which I took to be about 0 and about 1, although really they were more like 0.45 and 0.55.
When I ran this once, it gave me 0 for every input. I ran it again twice with the same data and got a decreasing trend of negative numbers and an increasing trend of positive numbers, completely opposite predictions.
I thought maybe I wasn't including enough features, so I added (rating**2 and delta**2). Unfortunately, then I started getting either my starting delta or my maximum delta for every input every time.
I don't really understand why I'm getting such divergent results or what Ruby-FANN is telling me, partly because I don't understand the library but also, I suspect, because I just started learning about neural networks and am missing something big and obvious. Do I not have enough training data, do I need to include more features, what is the problem and how can I either fix it or learn how to do things better?
What about playing a little with the parameters? First, I would highly recommend only two layers; there should be a mathematical proof somewhere that this is enough for many problems. If you have too many neurons, your NN will not have enough epochs to really learn something, so you can also play with the number of epochs as well as gamma (the learning rate). I think that in your case it's 0.15, though it is worth checking the library documentation to confirm what that argument means. If you use a slightly bigger value your NN should learn a little faster (don't be afraid to try 0.3 or even 0.7); the right value of gamma usually depends on the weight intervals or input normalization.
Your NN most probably shows such different results because each run starts from a new random initialization, so it is a totally different network that learns in a different way from the previous one (different weights will have higher values, so different parts of the NN will learn the same things).
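The effect of random initialization can be illustrated in a few lines of plain Python. This is not the Ruby-FANN API, just a sketch of the general idea: without a fixed seed every run starts from different weights, while a fixed seed makes the starting point repeatable. The function name and the weight range are assumptions.

```python
import random


def initial_weights(n, seed=None):
    """Draw n small random starting weights; a fixed seed makes them repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


run_a = initial_weights(4)            # unseeded: differs from run to run
run_b = initial_weights(4)
seeded_a = initial_weights(4, seed=42)
seeded_b = initial_weights(4, seed=42)
print(run_a, run_b)
print(seeded_a == seeded_b)
```

With only about 100 noisy training examples, different starting weights can easily end up in quite different solutions, so it may help to train several times and compare the results.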
I am not familiar with this library; I am just sharing some experience with NNs. I hope some of this helps.