Based on my last question, I built a 3D game in which two robot arms play table tennis against each other. Each robot has six degrees of freedom.
The state is composed of:
the x, y and z position of the ball
the 6 joint angles of the robot
All values are normalized so that they lie in the range [-1, 1].
With 4 consecutive frames, I get a total input of 37 parameters.
Rewards
+0.3 when the player hits the ball
+0.7 when the player wins the match
-0.7 when the player loses the match
Output
Each of the six robot joints moves at a certain speed, so every joint has three possibilities: move in the positive direction, stay still, or move in the negative direction.
This results in 3^6=729 outputs.
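For illustration, here is a small sketch (not my actual game code) of how such an action index can be decoded into per-joint directions, assuming the 729 actions are simply enumerated in base 3:

```python
# Illustrative sketch only: decode a discrete action index (0..728) into a
# direction (-1, 0 or +1) for each of the six joints, assuming the actions
# are enumerated as base-3 digits.
def decode_action(action_index, num_joints=6):
    directions = []
    for _ in range(num_joints):
        directions.append(action_index % 3 - 1)  # maps digit 0,1,2 -> -1,0,+1
        action_index //= 3
    return directions

print(decode_action(0))    # [-1, -1, -1, -1, -1, -1]
print(decode_action(364))  # [0, 0, 0, 0, 0, 0] (all joints stay still)
```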
With these settings, the neural network should learn the robot's inverse kinematics and how to play table tennis.
My problem is that the network converges but seems to get stuck in a local minimum and, depending on the configuration, afterwards begins to diverge.
I first tried networks with two and three hidden layers of 1000 nodes each, and after a few epochs the network began to converge. I realized that 1000 nodes are far too many and lowered the count to 100, with the result that the network behaves as described: it converges first and then slightly diverges. So I decided to add hidden layers. Currently, I am trying a network with 6 hidden layers of 80 nodes each. The current loss looks like this: [plot of the average loss]
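For concreteness, here is a rough PyTorch sketch of the kind of network I am describing (illustrative only, not my actual training code; the layer sizes just mirror the description above):

```python
import torch
import torch.nn as nn

# Rough sketch of the described MLP: 37 normalized inputs (4 stacked frames),
# 6 hidden layers of 80 units, and 729 outputs (one per joint-direction combo).
class QNetwork(nn.Module):
    def __init__(self, n_inputs=37, n_hidden=80, n_layers=6, n_actions=729):
        super().__init__()
        layers, last = [], n_inputs
        for _ in range(n_layers):
            layers += [nn.Linear(last, n_hidden), nn.ReLU()]
            last = n_hidden
        layers.append(nn.Linear(last, n_actions))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # raw action values; argmax selects the joint combination

net = QNetwork()
print(net(torch.zeros(1, 37)).shape)  # torch.Size([1, 729])
```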
So what do you think, experienced machine learning experts? Do you see any problems with my configuration? Which type of network would you choose?
I'm grateful for any suggestions.
I had a similar problem in the past. The goal was to learn the inverse kinematics of a robot arm with the neuroevolution framework NEAT. The figure on the left shows the error plot: at the beginning everything works fine and the network improves, but at a certain point the error value stays the same, and even after 30 minutes of computation there was no change. I don't think that your neural network is wrong, or that the number of neurons is wrong. I think that neural networks are in general not capable of learning the inverse kinematics problem. I also think that the famous DeepMind paper ("Playing Atari with Deep Reinforcement Learning") is bogus.
But back to the facts. The plot in the OP (average loss) and my plot (population fitness) both show an improvement at the beginning and, after a certain time, a flat curve that cannot be improved, despite the fact that the CPU is running at 100% looking for a better solution. It is unclear how long the neural network would have to be optimized before a significant improvement becomes visible; perhaps even after days or years of constant computation no better solution would be found. A look into the literature shows that this result is normal for any medium or hard problem, and that so far no better neural networks or learning algorithms have been invented. The underlying problem is called combinatorial explosion: there are many millions of possible solutions for the network weights, and a computer can only scan a small fraction of them. If the problem is really easy, like the XOR problem, a learning algorithm such as backpropagation or RPropMinus will find a solution. On slightly more difficult problems, such as navigating a maze, finding inverse kinematics, or a peg-in-hole task, no current neural network will find a solution.
Related
I would like to analyze a problem similar to the following.
Problem:
You are given N dice.
You are given a lot of data about each die (e.g. surface information, material information, location of the center of gravity, etc.).
The characteristics of the dice are randomly generated every game, and the dice are thrown with the same speed, angle and initial position.
As a result of rolling a die, you get 1 point if it comes up 6 and 0 points otherwise.
There is training data from 100,000 games (dice data and match results).
I would like to learn a rule for selecting only dice whose probability of rolling a 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, it was my mistake to say "N dice".
The dice can be considered one at a time.
A single die with random characteristics is given.
When it is rolled, it is recorded whether a 6 came up or not.
It may be easier to think of it as 100,000 [characteristics, result] data points.
If you get something other than 6, you will get -1 points.
If you get 6, you will get +5 points.
Example:
X: feature vector of a die
f: function I want to know
f: X -> [0, 1]
(if the result is > 0.5, I pick this die)
For example, a die with a 1/5 chance of rolling a 6 will still come up non-6 four times out of five, so I wondered whether it would be better to give an immediate reward for each roll.
Or is it better to base the reward on the total number of points after 100,000 games?
I have read about some general reinforcement learning methods, but they involve the concept of state transitions. There are no state transitions in this game (each game ends in one step, and each game is independent).
I am a student just learning neural networks from scratch. Any hints would help. Thank you.
By the way,
I suspect the result of this learning can be summed up as: "it is good to choose the die whose face farthest from the center of gravity is the 6."
Let's first talk about Reinforcement-Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all three is that you want to maximize the sum of rewards over time, and that there is an exploration vs. exploitation trade-off. You are not simply given a large dataset: if you want to know what the best action is, you have to try it a few times and observe the reward, which may cost you some reward you could have earned otherwise.
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match your goals. It would go like this: given some properties of the dice (the context), select the best die from a group of possible choices (the action, e.g. the output of your network), such that you get the highest expected reward. At the same time you are also training your network, so you sometimes have to pick bad or unknown dice in order to explore them.
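Just to illustrate the selection step of such a bandit loop (a toy sketch; the names and the epsilon value are made up, and as explained below you probably don't need this at all):

```python
import numpy as np

# Toy epsilon-greedy choice for the contextual-bandit framing.
# estimated_rewards would come from your network, given each die's features.
rng = np.random.default_rng(0)
epsilon = 0.1  # fraction of rounds spent exploring

def choose_die(estimated_rewards):
    if rng.random() < epsilon:                      # explore: pick a random die
        return int(rng.integers(len(estimated_rewards)))
    return int(np.argmax(estimated_rewards))        # exploit: best predicted die

print(choose_die(np.array([0.05, 0.21, 0.12])))     # usually picks index 1
```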
However, there are two reasons why it doesn't match:
You already have data from 100,000 games and do not seem to be interested in minimizing the cost of trial and error to acquire more. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want to classify the dice into "good for rolling 6" vs. "bad". This piece of information could be used later to decide between different choices if you know the cost of making a wrong decision. If you are just learning f() because you are curious about the properties of a die, then it is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards, and you don't have to worry about the selection or consequences of any actions.
Because of this, you actually only have a supervised learning problem. You could still solve it with reinforcement learning because RL is more general. But your RL algorithm would be wasting a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Each of your dice actually behaves like a biased coin: rolling it is a Bernoulli trial with roughly 1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a die will lead to a good match result.
It seems that your "match results" can easily be converted into the number of rolls and the number of positive outcomes (rolled a six) for the same die. If you have a large number of rolls for each die, you can simply classify that die and use this class (together with the physical properties) as one data point to train your network.
You can do more fancy things if you have fewer rolls but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)
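To make the supervised framing concrete, here is a minimal sketch with stand-in data and made-up array names, since I don't know your exact data format:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one row per die, plus how often it was rolled and how many
# sixes came up. Replace with your real 100,000-game records.
rng = np.random.default_rng(0)
n_dice, n_features = 1000, 8
features = rng.normal(size=(n_dice, n_features))      # physical properties
rolls = np.full(n_dice, 100)
sixes = rng.binomial(rolls, 1 / 6)                    # fake match results

# Label a die "good" if its empirical six-rate beats the fair-die rate of 1/6.
labels = (sixes / rolls > 1 / 6).astype(int)

clf = LogisticRegression(max_iter=1000).fit(features, labels)
p_good = clf.predict_proba(features[:5])[:, 1]        # f(X) in [0, 1]
print(p_good > 0.5)                                   # pick the die if True
```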
I'm using a neural network to control the movement of a character in a game. I currently have a huge number of input dimensions, and in the interest of trimming them to improve storage and code manageability, I'm considering removing all derived variables, i.e. any variable which can be calculated from data already sent into the network.
An example of this would be the relationship between a) position, b) velocity, and c) acceleration along a path. Currently, I send the last 50 data points of all three to the NN to help it decide its next movement. However, I wonder if system control / error could be minimized just as easily by sending only position. Theoretically, the neural network should be able to derive the velocity and acceleration at a point in time entirely on its own, given the position history.
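For context, the derived quantities I mean are literally just finite differences of the position history (sketch below with stand-in data; the frame time is an assumption):

```python
import numpy as np

# With a fixed frame time, velocity and acceleration are just finite
# differences of the position history (stand-in data below).
dt = 1.0 / 60.0                                           # assumed frame time
rng = np.random.default_rng(0)
positions = np.cumsum(rng.normal(size=(50, 3)), axis=0)   # last 50 3D positions

velocity = np.diff(positions, axis=0) / dt                # shape (49, 3)
acceleration = np.diff(velocity, axis=0) / dt             # shape (48, 3)
```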
Generally, is dimension reduction in this capacity recommended? Why or why not?
I know the usual recommendation in this scenario is just to test it and see what happens, but there are so many variables here that it would take days to test, so I was hoping to hear about anyone's experience with this type of situation and what they surmise the general rule to be.
Bonus question: would this assessment/decision be different for a neural network (which fits a function to the data) as opposed to a random forest (which behaves more like a nearest-neighbour approach)?
Thanks!!
Implement PCA to reduce the number of features. The reduced features will have unusual mixed units (linear combinations of position, velocity and acceleration), but if you do PCA correctly you can retain a feature set that captures 99% of the variance of the original set.
Then use the new feature set in your NN.
Reducing dimensions is recommended to speed up algorithms because, as you observed, there is a lot of similarity between your features.
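A minimal scikit-learn sketch of this, with stand-in data (the shapes and the low-rank structure are assumptions, just to show the 99%-variance option):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data with correlated features, e.g. 50 time steps of 3 related signals.
rng = np.random.default_rng(0)
latent = rng.normal(size=(10000, 10))
X = latent @ rng.normal(size=(10, 150)) + 0.01 * rng.normal(size=(10000, 150))

# A float n_components keeps just enough components to explain that
# fraction of the variance.
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```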
I know how the GNG algorithm works, but I'm not sure how it determines the clusters. Based on the images I've seen, I guess it treats all the neurons that are connected by edges as one cluster, so you might end up with, say, two clusters, each a group of neurons that are all connected. But is that really it?
I also wonder: is GNG (Growing Neural Gas) really a neural network? It doesn't have a propagation function, an activation function or weighted edges; isn't it just a graph? I guess that depends a bit on personal opinion, but I would like to hear yours.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG clustering, and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I have put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean, I have to re-add them at some point: when a new neuron r is inserted between two old neurons u and v, some data points formerly assigned to u should now belong to r because it is closer. But the algorithm does not include re-assigning these data points. And even if I remove them after one iteration and add them all again, the wrong assignment of the points during the rest of that first iteration still affects how the network is adapted, doesn't it?
NG and GNG are a form of self-organizing maps (SOM), which are also referred to as "Kohonen neural networks".
These are based on an older, much broader view of neural networks, from when they were still inspired by nature rather than driven by GPU matrix-operation capabilities. Back then, when massive-SIMD architectures did not exist yet, there was nothing wrong with having neurons self-organize rather than being pre-organized into strict layers.
I would not call this clustering, although that term is commonly (ab)used in related work, because I don't see any strong properties guaranteed for these "clusters".
SOMs are literally maps, as in geography. A SOM is a set of nodes ("neurons"), usually arranged in a 2d rectangular or hexagonal grid (that is the map). The nodes' positions in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely; think of wrapping a net around a tree, where the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible node structure. A nice property of SOMs, though, is precisely the 2d map that you get.
The only approach I remember for clustering was to project the input data onto the discrete 2d space of the SOM grid, then run k-means on this projection. It will probably work okayish (as in: it will perform similarly to k-means), but I'm not convinced that it's theoretically well supported.
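For what it's worth, a minimal sketch of that projection-plus-k-means idea, assuming you already have a trained SOM weight grid (all shapes and the cluster count are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume a trained SOM whose weights form a (rows, cols, dim) grid.
rng = np.random.default_rng(0)
som_weights = rng.normal(size=(10, 10, 4))   # stand-in for the trained SOM
data = rng.normal(size=(500, 4))             # stand-in input data

# Project every data point to the grid coordinates of its best-matching unit.
flat = som_weights.reshape(-1, som_weights.shape[-1])
bmu = np.argmin(((data[:, None, :] - flat[None]) ** 2).sum(axis=-1), axis=1)
grid_coords = np.column_stack(np.unravel_index(bmu, som_weights.shape[:2]))

# Then cluster the 2d projections with k-means.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(grid_coords.astype(float))
```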
I'm writing a multilayer perceptron neural network for playing two-player card games. I'd like to know if there is a better way to optimize weights than testing neural nets with randomly regenerated weights against each other.
Here's the way I implemented the neural net.
The neurons in the first layer output values representing the states of the cards in the deck. For each of these neurons there is an array of constant weights: for example, if the card is in the AI's hand, the neuron outputs a value equal to the first weight in the array; if the card is on the table, the second; and so forth. These constant input weights need to be optimized in the training process.
Next, there are several hidden layers of neurons. The topology is fixed. All neurons in the preceding layer are connected to every neuron in the following layer. The connections' weights need to be optimized.
The last layer of neurons represents the player's actions. These correspond to the cards that can be played, plus a couple of non-card-specific actions, like taking cards from the table or ending the turn. The largest output value corresponding to a legal action determines the action to play.
There is a caveat. I want the neural net to find the optimum strategy, so I cannot train it on individual turns. Rather, I have to let it play until it wins or loses, and that's approximately 50 turns.
I'm wondering what the best approach to training is in this scenario, where one does not know the proper response for every turn, but only knows whether the problem was solved correctly after many NN evaluations, i.e. whether it won the game.
For now, I've only thought of a simple evolutionary approach, in which a group of randomly generated NNs play against each other multiple times, a few of the most successful ones remain for the next round, and the NNs which didn't make the cut are replaced by new random ones. The problem I see is that with this approach it's going to take a long time for the weights to start converging. And since the fraction of wins is a function of many weights (I'm expecting to need several hundred to properly model the problem) which have a highly non-linear effect on the NN output, I don't see how I could use a function minimization technique.
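The bare-bones version of the loop I have in mind looks roughly like this (play_match and the weight layout are placeholders, not my real engine):

```python
import numpy as np

# Placeholder tournament loop; play_match and the flat weight vectors stand in
# for real games between two networks.
rng = np.random.default_rng(0)
POP, KEEP, N_WEIGHTS = 20, 5, 400

def play_match(weights_a, weights_b):
    # Placeholder: return 1 if the net with weights_a wins the game, else 0.
    return int(rng.random() < 0.5)

population = [rng.normal(size=N_WEIGHTS) for _ in range(POP)]
for generation in range(100):
    wins = np.zeros(POP)
    for i in range(POP):
        for j in range(POP):
            if i != j:
                wins[i] += play_match(population[i], population[j])
    survivors = [population[i] for i in np.argsort(wins)[-KEEP:]]
    population = survivors + [rng.normal(size=N_WEIGHTS) for _ in range(POP - KEEP)]
```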
Does anyone know whether this weight optimization problem would lend itself better to anything other than a Monte Carlo technique?
I think this depends on what your card game is. In general, I think this statement of yours is false:
There is a caveat. I want the neural net to find the optimum strategy, so I cannot train it on individual turns.
It should be possible to find a way to train your network on individual turns. For example, if both players can make the same exact set of moves at each turn, you can train the loser network according to what the winner did at each of the turns. Admittedly, this might not be the case for most card games, where the set of moves at a given turn is usually determined by the cards each player is holding.
If you're playing something like poker, look at this. The idea there is to train your network based on the history of a player you consider good enough to learn from. For example, if you have a lot of data about your favorite (poker) player's games, you can train a neural network to learn their moves. Then, at each turn of a new game, do what the neural network tells you to do given its previous training and the data you have available up to that turn: what cards you're holding, what cards are on the table, what cards you know your opponents to be holding etc.
You could also consider reinforcement learning, which can make use of neural nets, but is based on a different idea. This might help you deal with your "cannot train on individual turns" problem, without needing training data.
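As a rough illustration of that reinforcement learning idea, here is a REINFORCE-style sketch in PyTorch where the only learning signal is the final win/loss; the environment API, layer sizes and card/action counts are placeholders, not your actual game:

```python
import torch
import torch.nn as nn

# Placeholder sizes: 52 input features for the card states, 30 possible actions.
policy = nn.Sequential(nn.Linear(52, 128), nn.ReLU(), nn.Linear(128, 30))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def play_episode(env):
    """Play one full game (~50 turns) and return the taken actions' log-probs
    plus the final outcome. env.reset()/env.step() are placeholders."""
    log_probs = []
    state, done, won = env.reset(), False, False
    while not done:
        logits = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, done, won = env.step(action.item())
    return torch.stack(log_probs), 1.0 if won else -1.0

def train_on_episode(env):
    log_probs, outcome = play_episode(env)
    loss = -(outcome * log_probs).sum()   # push up winning moves, down losing ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```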
I need some help with solving a problem that uses the Q-learning algorithm.
Problem description:
I have a rocket simulator where the rocket takes random paths and sometimes crashes. The rocket has 3 different engines that can each be either on or off. Depending on which engine(s) are activated, the rocket flies in different directions.
Functions for turning the engines on/off are available.
The task:
Construct a Q-learning controller that turns the rocket so it faces up all the time.
A sensor that reads the angle of the rocket is available as input.
My solution:
I have the following 16 states: the rocket's angle, read from the sensor, discretized into 16 intervals.
I also have the following actions:
all engines off
left engine on
right engine on
middle engine on
left and right on
left and middle on
right and middle on
And the following rewards:
Angle = 0, Reward = 100
All other angles, reward = 0
Question:
Now to the question: is this a good choice of rewards and states? Can I improve my solution? Would it be better to have rewards for other angles as well?
Thanks in advance
16 states x 7 actions is a very small problem.
Rewards for other angles will help you learn faster, but can create odd behaviors later depending on your dynamics.
If you don't have momentum, you may decrease the number of states, which will speed up learning and reduce memory usage (which is already tiny). To find the optimal number of states, try decreasing the number of states while analyzing a metric such as reward per timestep over multiple games, or mean error (normalized by starting angle) over multiple games. Some state representations may perform much better than others; if not, choose the one which converges fastest. This should be relatively cheap with your small Q-table.
If you want to learn quickly, you may also try Q(λ) or another modified reinforcement learning algorithm that uses eligibility traces to speed up temporal-difference learning.
Edit: Depending on your dynamics, this problem may not actually be well modelled as a Markov decision process with the angle alone; for example, you may need to include the current rotation rate in the state.
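To make the tabular setup concrete, here is a minimal Q-learning sketch for 16 states x 7 actions; the simulator step is a placeholder and the index of the "facing up" state is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 16, 7
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    # Placeholder dynamics: replace with the real simulator plus angle sensor.
    next_state = int(rng.integers(N_STATES))
    reward = 100.0 if next_state == 0 else 0.0   # assume state 0 == facing up
    return next_state, reward

state = int(rng.integers(N_STATES))
for t in range(10000):
    if rng.random() < epsilon:
        action = int(rng.integers(N_ACTIONS))            # explore
    else:
        action = int(np.argmax(Q[state]))                # exploit
    next_state, reward = step(state, action)
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```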
Try putting smaller rewards on the states next to the desired state. This will help your agent learn to point up more quickly.