I'm writing a multilayer perceptron neural network for playing two-player card games. I'd like to know if there is a better way to optimize weights than testing neural nets with randomly regenerated weights against each other.
Here's the way I implemented the neural net.
The neurons in the first layer output values representing the states of cards in the deck. For each of these neurons there is an array of constant weights. For example, if the card is in the AI's hand, the neuron outputs a value equal to the first weight in the array; if the card is on the table, the second; and so forth. These constant input weights need to be optimized in the training process.
Next, there are several hidden layers of neurons. The topology is fixed. All neurons in the preceding layer are connected to every neuron in the following layer. The connections' weights need to be optimized.
The last layer of neurons represents the player's actions. These correspond to the cards that can be played, plus a couple of non-card-specific actions, like taking cards from the table or ending the turn. The largest output value among the legal actions determines the action to play.
There is a caveat. I want the neural net to find the optimum strategy, so I cannot train it on individual turns. Rather, I have to let it play until it wins or loses, which takes approximately 50 turns.
I'm wondering what is the best approach to training in this scenario, where one does not know the proper response for every turn, but only knows whether the problem was solved correctly (i.e. the game was won) after multiple NN evaluations.
For now, I've only thought of a simple evolutionary approach, in which a group of randomly generated NNs play against each other multiple times, the few most successful ones remain for the next round, and the NNs which didn't pass are replaced by other random ones. The problem I see is that in this approach it's going to take a long time for the weights to start converging. But since the fraction of wins is a function of many weights (I'm expecting to need several hundred to properly model the problem) which have a highly non-linear effect on the NN output, I don't see how I could use a function minimization technique.
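To make the setup concrete, here is a rough sketch of that tournament loop (random_weights and play_game stand in for my actual implementation):

    import numpy as np

    # random_weights() -> flat array of all of one net's weights (placeholder)
    # play_game(w_a, w_b) -> 1 if the net with weights w_a wins, else 0 (placeholder)

    def evolve(pop_size=20, survivors=5, generations=100):
        population = [random_weights() for _ in range(pop_size)]
        for _ in range(generations):
            wins = np.zeros(pop_size)
            for i in range(pop_size):           # round-robin tournament
                for j in range(pop_size):
                    if i != j:
                        wins[i] += play_game(population[i], population[j])
            best = np.argsort(wins)[-survivors:]
            # keep the best, replace the rest with fresh random nets
            population = [population[k] for k in best] + \
                         [random_weights() for _ in range(pop_size - survivors)]
        return population[:survivors]           # the current best candidates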
Does anyone know if this weight optimization problem would lend itself better to anything other than a Monte Carlo technique?
I think this depends on what your card game is. In general, I think this statement of yours is false:
There is a caveat. I want the neural net to find the optimum strategy, so I cannot train it on individual turns.
It should be possible to find a way to train your network on individual turns. For example, if both players can make the same exact set of moves at each turn, you can train the loser network according to what the winner did at each of the turns. Admittedly, this might not be the case for most card games, where the set of moves at a given turn is usually determined by the cards each player is holding.
If you're playing something like poker, look at this. The idea there is to train your network based on the history of a player you consider good enough to learn from. For example, if you have a lot of data about your favorite (poker) player's games, you can train a neural network to learn their moves. Then, at each turn of a new game, do what the neural network tells you to do given its previous training and the data you have available up to that turn: what cards you're holding, what cards are on the table, what cards you know your opponents to be holding etc.
You could also consider reinforcement learning, which can make use of neural nets, but is based on a different idea. This might help you deal with your "cannot train on individual turns" problem, without needing training data.
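As a very rough illustration of that idea (my own sketch, not tied to your game; the env wrapper, layer sizes, and action count are placeholders), a REINFORCE-style policy-gradient update only needs the final win/loss, not per-turn labels:

    import torch
    import torch.nn as nn

    policy = nn.Sequential(nn.Linear(208, 128), nn.ReLU(), nn.Linear(128, 10))  # sizes made up
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def play_one_game(env):
        # env is a hypothetical game wrapper; step() returns (state, done, reward),
        # where reward is +1/-1 only when the game ends
        log_probs, state, done, reward = [], env.reset(), False, 0.0
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, done, reward = env.step(action.item())
        return torch.stack(log_probs), reward

    for episode in range(10000):
        log_probs, reward = play_one_game(env)      # env assumed to exist
        loss = -(log_probs.sum() * reward)          # push up the probability of every move in a win
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()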
Suppose we want to make a neural network to predict the outcome of a race between some number of participants.
Each participant in the race has various statistics: Engine Power, Max Speed, Driver Experience, etc.
Now imagine we have been asked to build a system which can handle any number of participants from 2 to 400 participants (just to pick a concrete number).
From what I have learned about "traditional" Neural Nets so far, our choices are:
Build many different neural nets for each number of participants: n = 2, 3, 4, 5, ... , 400.
Train one neural network taking input from 400 participants. When a piece of data refers to a race with fewer than 400 participants (which will be a large percentage of the data), just set all remaining statistic inputs to 0.
Assuming this would work, is there any reason to expect one method to perform better than the other?
The former is more specialized, but you have much less training data per net, so my guess is that it would work out roughly the same?
Is there a standard way to approach problems similar to this?
We could imagine (simplistically) that the neural network first classifies the strength of each participant, and therefore, each time a new participant is added, it needs to apply this same analysis to these new inputs, potentially hinting that there might be a "smart" way to reduce the total amount of work required.
Is this just screaming for a convolutional neural network?
Between your two options, option 1 would involve repeating a lot of effort to train for different sizes, and would probably be very slow to train as a result.
Option 2 is a bit more workable, but the network would need extra training on different sized inputs.
Another option, which I think would be the most likely to work, would be to only train a neural net to choose a winner between two participants, and use this to create a ranking via many comparisons between pairs. Such an approach is described here.
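For concreteness, a pairwise comparator could look something like this (my own sketch, not taken from the linked write-up; the feature count and layer sizes are placeholders):

    import torch
    import torch.nn as nn

    # the same scorer maps one participant's statistics to a single strength score
    scorer = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

    def prob_a_beats_b(a_stats, b_stats):
        # trainable with binary cross-entropy on the outcomes of past two-way match-ups
        return torch.sigmoid(scorer(a_stats) - scorer(b_stats))

    # to rank a full field, sort participants by how many pairwise match-ups
    # the network predicts they would win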
We could imagine (simplistically) that the neural network first classifies the strength of each participant, and therefore, each time a new participant is added, it needs to apply this same analysis to these new inputs, potentially hinting that there might be a "smart" way to reduce the total amount of work required.
I think you've got the key idea here. Since we want to perform exactly the same analysis on each participant (assuming it makes no difference whether they're participant 1 or participant 400), this is an ideal problem for Weight Sharing. This means that the weights on the neurons doing the initial analysis of a participant are identical for each participant. When these weights change for one participant, they change for all participants.
While CNNs do use weight sharing, we don't need to use a CNN to use this technique. The details of how you'd go about doing this would depend on your framework.
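As a rough sketch of what weight sharing could look like here (the framework, layer sizes, and softmax read-out are my own choices for illustration, not a prescribed design):

    import torch
    import torch.nn as nn

    class RacePredictor(nn.Module):
        def __init__(self, n_features=10, hidden=32):
            super().__init__()
            # the SAME weights are applied to every participant
            self.per_participant = nn.Sequential(
                nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, participants):              # shape: (n_participants, n_features)
            strengths = self.per_participant(participants).squeeze(-1)
            return torch.softmax(strengths, dim=0)    # win probability per participant

    model = RacePredictor()
    probs = model(torch.randn(7, 10))                 # any field size from 2 to 400 works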
Definitely a noob NN question, but here it is:
I understand that neurons in a layer of a neural network all initialize with different (essentially random) input-feature weights as a means to vary their back-propagation results so they can converge to different functions describing the input data. However, I do not understand when or how these neurons generating unique functions to describe the input data "communicate" their results with each other, as is done in ensemble ML methods (e.g. by growing a forest of trees with randomized initial-decision criteria and then determining the most discriminative models in the forest). In the trees ensemble example, all of the trees are working together to generalize the rules each model learns.
How, where, and when do neurons communicate their prediction functions? I know individual neurons use gradient descent to converge to their respective functions, but they are unique since they started with unique weights. How do they communicate these differences? I imagine there's some subtle behavior in how the neurons' results are combined in the output layer where this communication is occurring. Also, is this communication part of the iterative training process?
Someone in the comments section (https://datascience.stackexchange.com/questions/14028/what-is-the-purpose-of-multiple-neurons-in-a-hidden-layer) asked a similar question, but I didn't see it answered.
Help would be greatly appreciated!
During forward propagation, each neuron typically participates in forming the values of multiple neurons in the next layer. In back-propagation, each of those next-layer neurons will try to push the participating neurons' weights around in order to minimise the error. That's pretty much it.
For example, let's say you're trying to get a NN to recognise digits. Let's say that one neuron in a hidden layer starts getting ideas on recognising vertical lines, another starts finding horizontal lines, and so on. The next-layer neuron that is responsible for finding 1 will see that if it wants to be more accurate, it should pay lots of attention to the vertical line guy; and also, the more the horizontal line guy yells, the more it's not a 1. That's what weights are: telling each neuron how strongly it should care about each of its inputs. In turn, the vertical line guy will learn how to recognise vertical lines better, by adjusting weights for its input layer (e.g. individual pixels).
(This is quite abstract though. No-one told the vertical line guy that he should be recognising vertical lines. Different neurons just train for different things, and by virtue of the mathematics involved, they end up picking different features. One of them might or might not end up being the vertical line detector.)
There is no "communication" between neurons on the same layer (in the base case, where layers flow linearly from one to the next). It's all about neurons on one layer getting better at predicting features that the next layer finds useful.
At the output layer, the 1 guy might be saying "I'm 72% certain it's a 1", while the 7 guy might be saying "I give that 7 a B+", and the third one might be saying "A horrible 3, wouldn't look at it twice". We usually either take the loudest one's word for it, or we normalise the output layer (divide by the sum of all outputs) so that we have actual comparable probabilities. However, this normalisation is not actually part of the neural network itself.
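As a tiny illustration of that last point (the softmax line is the variant most libraries actually use; it is applied to the raw outputs, outside the network's own weights):

    import numpy as np

    outputs = np.array([0.2, 3.1, 0.7])       # raw output-layer activations for the "1", "7", "3" guys

    prediction = int(np.argmax(outputs))                        # take whoever is loudest
    naive_probs = outputs / outputs.sum()                       # the simple normalisation above
    softmax_probs = np.exp(outputs) / np.exp(outputs).sum()     # the softmax variant most libraries use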
I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series by using Recurrent Neural Networks.
The data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical, though one attribute is the day of the week (Sunday: 0, Monday: 1, etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using an RNN. I don't know much, but I read some stuff, and also this. I'm thinking about a structure like this.
I will have 10 input neurons, a hidden layer and 1 output neuron. I don't know how to decide on how many neurons the hidden layer will have.
I guess that I need a matrix to connect the input layer to the hidden layer, a matrix to connect the hidden layer to the output layer, and a matrix to connect hidden layers in neighbouring time-steps, t-1 to t, t to t+1. That's a total of 3 matrices.
In one tutorial, the activation function was sigmoid, though I'm not sure exactly; if I use a sigmoid function, I will only get output between 0 and 1. What should I use as the activation function? My plan is to repeat this n times:
For each training example:
    Forward propagate:
        Propagate the input to the hidden layer, add to it the propagation of the previous hidden layer to the current hidden layer, and pass this through the activation function.
        Propagate the hidden layer to the output.
        Find the error and its derivative, and store it in a list.
    Back propagate:
        Find the current layer's errors from the list.
        Find the current hidden layer's error.
        Store the weight updates.
    Update the weights (matrices) by applying the stored updates scaled by the learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
It seems to be the correct way to do it, if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach, and as Marcin's comment says, almost everyone who constructs neural nets for practical use does so by using packages which already provide ready-made neural network implementations. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. There are many empirical rules people have established from experience, and the right number of neurons is decided by trying out various combinations and comparing the output. A good starting point would be 3/2 times the number of input plus output neurons, i.e. (10+1)*(3/2), so you could start with around 16 neurons in the hidden layer, and then go on reducing the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many candidate functions, like the hyperbolic tangent, the logistic function, RBFs, etc. A good starting point would be the logistic function, but again, you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
Sigmoid-type activation functions (including the one assigned to the output neuron) will give you an output between 0 and 1, so you will have to use a multiplier to convert it to real values, or have some kind of encoding with multiple output neurons. Coding this manually will be complicated.
Another aspect to consider would be your training iterations. Doing it 'n' times doesn't help. You need to find the optimal training iterations with trial and error as well to avoid both under-fitting and over-fitting.
The correct way to do it would be to use packages in Python or R, which will allow you to train neural nets with a large amount of customization quickly, and where you can train and test multiple nets with different activation functions (and even different training algorithms) and network architectures without too much hassle. With some amount of trial and error, you will eventually find the net that gives you the desired output.
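For example, a minimal sketch with a package like Keras might look like this (the window length, layer size, and training call are placeholders; note that a linear output unit also sidesteps the 0-to-1 rescaling issue mentioned above):

    import numpy as np
    from tensorflow import keras

    timesteps, n_features = 7, 10          # e.g. a one-week window of the 10 daily attributes
    model = keras.Sequential([
        keras.layers.SimpleRNN(16, activation='tanh', input_shape=(timesteps, n_features)),
        keras.layers.Dense(1, activation='linear')   # linear output -> real-valued rental counts
    ])
    model.compile(optimizer='adam', loss='mse')

    # X has shape (n_samples, timesteps, n_features); y holds the daily rental counts
    # model.fit(X, y, epochs=50, validation_data=(X_val, y_val))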
I created a simple game of Pacman (no power pills) and trained it using the Q-learning algorithm. Now I am thinking about training it using some supervised learning algorithm. I could create a dataset by collecting state information, storing it against the action taken by some human player, and then training a classifier on it. My question is: am I going in the right direction, and is this the right approach to get Pacman to move along the maze perfectly, given that it doesn't have any reward system?
What would you use as state? Supervised learning is all about generalization. You define some parametrized model (e.g. a neural network) and then learn/estimate the parameters (e.g. the weights) from your data. Then you can use this model to predict something.
If all you have is a finite list of states (as you probably had with Q-learning) and there is only a single "right" choice for each state (whatever the human teacher says), then there is nothing to predict. There is no kind of "axis along which you can generalize". You only need a simple look-up table and a very patient human to fill it all up.
If you want to apply supervised learning, you need to put in some prior knowledge. You need to have some kind of similarity measure (e.g. real-valued inputs/outputs, which have an inherent similarity for near-identical values) or create multiple instances of something.
For example, you could use a 3x3 grid around the player as input and predict the probability that a human player would move up/down/left/right in this situation. You could then try to mimic the human by choosing random moves with the predicted probabilities. Obviously, this approach will not move Pac-Man perfectly, unless you use a very large grid (e.g. 20x20), at which point you are practically back to filling ones and zeroes into a simple look-up table.
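A rough sketch of that 3x3-grid idea (the cell encoding and the recorded human data here are hypothetical, and the classifier choice is arbitrary):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # each training row = the 9 cells around Pac-Man, flattened;
    # the label = the move the human made in that situation
    X = np.array(grids).reshape(len(grids), 9)   # grids: hypothetical recorded 3x3 views of the maze
    y = np.array(human_moves)                    # 0 = up, 1 = down, 2 = left, 3 = right

    clf = LogisticRegression(max_iter=1000).fit(X, y)

    probs = clf.predict_proba(current_grid.reshape(1, 9))[0]
    move = np.random.choice(4, p=probs)          # mimic the human by sampling the predicted probabilities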
First, let me apologize for cramming three questions in that title. I'm not sure what better way is there.
I'll get right to it. I think I understand feedforward neural networks pretty well.
But LSTM really escapes me, and I feel maybe this is because I don't have a very good grasp of recurrent neural networks in general. I have gone through Hinton's and Andrew Ng's courses on Coursera. A lot of it still doesn't make sense to me.
From what I understood, recurrent neural networks are different from feedforward neural networks in that past values influence the next prediction. Recurrent neural networks are generally used for sequences.
The example I saw of recurrent neural network was binary addition.
010
+ 011
A recurrent neural network would take the rightmost 0 and 1 first and output a 1. Then take the 1,1 next, output a zero, and carry the 1. Take the next 0,0 and output a 1 because it carried the 1 from the last calculation. Where does it store this 1? In feedforward networks the result is basically:
y = a(w*x + b)
where w = weights of connections to previous layer
and x = activation values of previous layer or inputs
How is a recurrent neural network calculated? I am probably wrong, but from what I understood, recurrent neural networks are pretty much a feedforward neural network with T hidden layers, T being the number of timesteps. Each hidden layer takes the X input at its timestep, and its outputs are then added to the next hidden layer's inputs.
h(t) = a(w*x(t) + b + pa)
where t = current timestep
and x(t) = input values at the current timestep
and w = weights of connections to the input layer
and pa = past activation values of the hidden layer, i.e. h(t-1)
such that neuron i at timestep t uses the output value of neuron i at timestep t-1
y = o(w_out*h(t) + b)
where w_out = weights of connections to the last hidden layer
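In code, I picture one forward pass through the unrolled network looking roughly like this (my own sketch; I use a full hidden-to-hidden weight matrix, and the sizes and inputs array are made up):

    import numpy as np

    n_in, n_hidden, T = 2, 4, 3                      # e.g. 2 input bits per timestep, 3 timesteps
    W_xh = np.random.randn(n_hidden, n_in)           # input -> hidden weights ("w")
    W_hh = np.random.randn(n_hidden, n_hidden)       # hidden -> hidden weights (the recurrence)
    W_hy = np.random.randn(1, n_hidden)              # hidden -> output weights ("w_out")
    b_h, b_y = np.zeros(n_hidden), np.zeros(1)

    h = np.zeros(n_hidden)                           # "pa": the past activation, zero at the start
    for t in range(T):
        x = inputs[t]                                # inputs: hypothetical array of bit pairs, shape (T, 2)
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)       # h(t) = a(w*x(t) + b + pa)
        y = W_hy @ h + b_y                           # output at this timestep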
But even if I understood this correctly, I don't see the advantage of doing this over simply using past values as inputs to a normal feedforward network (sliding window or whatever it's called).
For example, what is the advantage of using a recurrent neural network for binary addition instead of training a feedforward network with two output neurons, one for the binary result and the other for the carry? And then take the carry output and plug it back into the feedforward network.
However, I'm not sure how is this different than simply having past values as inputs in a feedforward model.
It seems to me that the more timesteps there are, the more recurrent neural networks are at a disadvantage compared to feedforward networks because of the vanishing gradient. Which brings me to my second question: from what I understood, LSTM is a solution to the problem of vanishing gradients. But I have no actual grasp of how they work. Furthermore, are they simply better than recurrent neural networks, or are there sacrifices to using an LSTM?
What is a Recurrent neural network?
The basic idea is that recurrent networks have loops. These loops allow the network to use information from previous passes, which acts as memory. The length of this memory depends on a number of factors but it is important to note that it is not indefinite. You can think of the memory as degrading, with older information being less and less usable.
For example, let's say we just want the network to do one thing: remember whether an input from earlier was 1 or 0. It's not difficult to imagine a network which just continually passes the 1 around in a loop. However, every time you send in a 0, the output going into the loop gets a little lower (this is a simplification, but it displays the idea). After some number of passes the loop input will be arbitrarily low, making the output of the network 0. As you are aware, the vanishing gradient problem is essentially the same, but in reverse.
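A toy illustration of that degrading loop, with a single neuron feeding its own output back with a made-up recurrent weight of 0.9:

    h = 1.0                          # the network "saw a 1"
    for step in range(20):
        h = 0.9 * h + 0.0            # every later input is 0, so the looped value decays
    print(h)                         # roughly 0.12: the remembered 1 has nearly faded away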
Why not just use a window of time inputs?
You offer an alternative: a sliding window of past inputs being provided as current inputs. That is not a bad idea, but consider this: while the RNN's memory may have eroded over time, you will always lose the entirety of your time information once your window ends. And while you would remove the vanishing gradient problem, you would have to increase the number of weights of your network several times over. Having to train all those additional weights will hurt you just as badly as (if not worse than) the vanishing gradient.
What is an LSTM network?
You can think of LSTM as a special type of RNN. The difference is that LSTM is able to actively maintain self-connecting loops without them degrading. This is accomplished through a somewhat fancy activation, involving an additional "memory" output for the self-looping connection. The network must then be trained to select what data gets put onto this bus. By training the network to explicitly select what to remember, we don't have to worry about new inputs destroying important information, and the vanishing gradient doesn't affect the information we decided to keep.
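For reference, here is one step of a standard LSTM cell in numpy (my own sketch of that "fancy activation", following the usual gate formulation; W and b are the stacked gate weights and biases, assumed to be provided):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # x: current input, h: previous output, c: previous cell state (the "memory bus")
    def lstm_step(x, h, c, W, b):
        z = W @ np.concatenate([x, h]) + b      # all four gates computed from [x, h]
        n = len(c)
        f = sigmoid(z[0*n:1*n])                 # forget gate: what to erase from the memory
        i = sigmoid(z[1*n:2*n])                 # input gate: what new information to write
        g = np.tanh(z[2*n:3*n])                 # candidate values to write
        o = sigmoid(z[3*n:4*n])                 # output gate: what part of the memory to expose
        c_new = f * c + i * g                   # the memory is updated additively, so it need not decay
        h_new = o * np.tanh(c_new)
        return h_new, c_new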
There are two main drawbacks:
It is more expensive to calculate the network output and apply back propagation. You simply have more math to do because of the complex activation. However this is not as important as the second point.
The explicit memory adds several more weights to each node, all of which must be trained. This increases the dimensionality of the problem, and potentially makes it harder to find an optimal solution.
Is it always better?
Which structure is better depends on a number of factors, like the number of nodes you need for your problem, the amount of available data, and how far back you want your network's memory to reach. However, if you only want the theoretical answer, I would say that given infinite data and computing speed, an LSTM is the better choice; one should not take this as practical advice, though.
A feed forward neural network has connections from layer n to layer n+1.
A recurrent neural network allows connections from layer n to layer n as well.
These loops allow the network to perform computations on data from previous cycles, which creates a network memory. The length of this memory depends on a number of factors and is an area of active research, but could be anywhere from tens to hundreds of time steps.
To make it a bit more clear, the carried 1 in your example is stored in the same way as the inputs: in a pattern of activation of a neural layer. It's just the recurrent (same layer) connections that allow the 1 to persist through time.
Obviously it would be infeasible to replicate every input stream for more than a few past time steps, and choosing which historical streams are important would be very difficult (and lead to reduced flexibility).
LSTM is a very different model, which I'm only familiar with by comparison to the PBWM model, but in that review LSTM was able to actively maintain neural representations indefinitely, so I believe it is more intended for explicit storage. RNNs are more suited to non-linear time series learning, not storage. I don't know if there are drawbacks to using LSTM rather than RNNs.
Both RNNs and LSTMs can be sequence learners. The RNN suffers from the vanishing gradient problem. This problem causes the RNN to have trouble remembering the values of past inputs after more than approximately 10 timesteps (the RNN can remember previously seen inputs for only a few time steps).
LSTM is designed to solve the vanishing gradient problem in RNNs. LSTM has the capability of bridging long time lags between inputs. In other words, it is able to remember inputs from up to 1000 time steps in the past (some papers even claim it can go beyond this). This capability gives LSTM an advantage for learning long sequences with long time lags. Refer to Alex Graves' Ph.D. thesis, Supervised Sequence Labelling with Recurrent Neural Networks, for some details. If you are new to LSTM, I recommend Colah's blog for a super simple and easy explanation.
However, recent advances in RNNs also claim that with careful initialization, an RNN can learn long sequences with performance comparable to LSTM: A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.