How to get out of 'sticky' states? [closed] - machine-learning

The problem:
I've trained an agent to perform a simple task in a grid world (go to the top of the grid while not hitting obstacles), but the following situation always seems to occur. It finds itself in an easy part of the state space (no obstacles), and so continually gets a strong positive reinforcement signal. Then, when it does find itself in a difficult part of the state space (wedged next to two obstacles), it simply chooses the same action as before, to no effect (it goes up and hits the obstacle). Eventually the Q value for that action matches the negative reward, but by this time the other actions have even lower Q values from being useless in the easy part of the state space, so the error signal drops to zero and the incorrect action is still always chosen.
How can I prevent this from happening? I've thought of a few solutions, but none seem viable:
Use a policy that is always exploration heavy. As the obstacles take ~5 actions to get around, a single random action every now and then seems ineffective.
Make the reward function such that bad actions are worse when they are repeated. This makes the reward function break the Markov property. Maybe this isn't a bad thing, but I simply don't have a clue.
Only reward the agent for completing the task. The task takes over a thousand actions to complete, so the training signal would be way too weak.
Some background on the task:
So I've made a little testbed for trying out RL algorithms -- something like a more complex version of the grid-world described in the Sutton book. The world is a large binary grid (300 by 1000) populated by 1's in the form of randomly sized rectangles on a backdrop of 0's. A band of 1's surrounds the edges of the world.
An agent occupies a single space in this world and observes only a fixed window around it (41 by 41, with the agent in the center). The agent's actions consist of moving by 1 space in any of the four cardinal directions. The agent can only move through spaces marked by a 0; 1's are impassable.
The current task to be performed in this environment is to make it to the top of the grid world starting from a random position along the bottom. A reward of +1 is given for successfully moving upwards. A reward of -1 is given for any move that would hit an obstacle or the edge of the world. All other states receive a reward of 0.
The agent uses the basic SARSA algorithm with a neural net value function approximator (as discussed in the Sutton book). For policy decisions I've tried both e-greedy and softmax.

The typical way of teaching such tasks is to give the agent a small negative reward each step and then a big payout on completion. You can compensate for the long delay by using eligibility traces and by initially placing the agent close to the goal, then close to the area it has already explored.
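For illustration, here is a tabular SARSA(lambda) sketch showing how eligibility traces spread a delayed payout back over recently visited state-action pairs. The environment interface below is an assumption, and the questioner uses a neural-net approximator rather than a table; this is only the tabular analogue.

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Run one episode of tabular SARSA(lambda).

    Q is an (n_states, n_actions) array; env.reset()/env.step() are assumed
    to return an integer state, a scalar reward, and a done flag.
    """
    rng = np.random.default_rng()
    E = np.zeros_like(Q)                          # eligibility traces

    def pick(s):                                  # epsilon-greedy action choice
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s2, r, done = env.step(a)                 # assumed interface
        a2 = pick(s2)
        delta = r + gamma * Q[s2, a2] * (not done) - Q[s, a]
        E[s, a] += 1.0                            # mark this pair as eligible
        Q += alpha * delta * E                    # credit all eligible pairs at once
        E *= gamma * lam                          # decay traces each step
        s, a = s2, a2
    return Q
```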

Related

Position Controller is able to translate the joint, but not to rotate [duplicate]

An issue arises with contact modeling in the robotics simulation package Drake.
I am trying to position-control an iiwa manipulator to affect a body attached to a screw joint.
What I expect is the nut progressing downwards. What I see is the end-effector sliding around the nut, unable to induce the nut's rotation along the bolt.
note: This is a continued investigation into this question.
The simplifying experiment is to use ExternallyAppliedSpatialForce, instead of the manipulator.
I apply a torque of 1 Nm around the bolt's Z axis, at the centre of the bolt's axis. And this produces the desired effect:
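For reference, the force-application setup might look roughly like this in pydrake (the body name, the context wiring, and the exact port usage are assumptions from memory; check against your Drake version):

```python
import numpy as np
from pydrake.multibody.math import SpatialForce
from pydrake.multibody.plant import ExternallyAppliedSpatialForce

# Build a single external wrench: 1 Nm about Z, no translational force.
wrench = ExternallyAppliedSpatialForce()
wrench.body_index = plant.GetBodyByName("nut").index()   # hypothetical body name
wrench.p_BoBq_B = np.zeros(3)                            # apply at the body origin
wrench.F_Bq_W = SpatialForce(np.array([0.0, 0.0, 1.0]),  # torque (tau) in world frame
                             np.zeros(3))                # force (f)

# Fix the plant's applied-spatial-force input port to this list of wrenches.
plant_context = plant.GetMyContextFromRoot(diagram_context)
plant.get_applied_spatial_force_input_port().FixValue(plant_context, [wrench])
```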
Then, I stop artificial force application, and try manipulation again, combined with the contact visualization by ContactResultsToMeshcat:
I also print the contact force magnitudes. Forces are in the range of 50-200 N.
But the nut does not move.
My question is: where should I take the investigation next, when a 1 Nm torque is able to turn the screw joint while 200 N of contact force cannot?
So far, after watching 10 repeats of the visualization, I notice that only one finger of the end-effector is in contact at any given moment. So I think I must prepare less sloppy trajectories for the fingers of the end effector [and for the whole arm, for that matter].
It could be the inverse dynamics controller used in ManipulationStation, which is different from (and softer than) iiwa's real controller, that's making the robot's joints too soft and unable to turn the nut.
Specifically, the inverse dynamics controller scales the joint stiffness matrix by the mass matrix of the robot, whereas iiwa's real controller does not do that.
The problem most likely was that the actuation port of the bolt-and-nut model was not connected to any input port and hence was not simulated properly (more details here).
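As a minimal sketch of what explicitly wiring (here: fixing to zero) that actuation port might look like in pydrake; the model-instance name is hypothetical and the right fix depends on your Drake version:

```python
import numpy as np

# The point is that the model's actuation input port must be connected or
# explicitly fixed, rather than left dangling.
nut_model = plant.GetModelInstanceByName("bolt_and_nut")   # hypothetical name
plant_context = plant.GetMyContextFromRoot(diagram_context)
plant.get_actuation_input_port(nut_model).FixValue(
    plant_context, np.zeros(plant.num_actuated_dofs(nut_model)))
```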
After the fix, the simulation looks like this:

Impact of epsilon in the Q-learning process? [closed]

I am in the process of learning Reinforcement Learning. I understood the general idea of RL.
However, so far I have yet to understand the impact of ε in RL. What is the recommended value?
When ε = 1, does it imply exploring randomly? If this intuition is correct, then the agent will not learn anything with ε = 1.
On the other hand, if we set ε = 0, that implies never exploring, and in this case the agent will not learn anything either.
I am wondering whether my intuition is correct or not.
What does epsilon control?
From https://en.wikipedia.org/wiki/Reinforcement_learning
with probability epsilon, exploration is chosen, and the action is chosen uniformly at random.
This means that with a probability equal to epsilon the system will ignore what it has learnt so far and effectively stumble or jerk blindly into the next state. That kind of sounds ridiculous as a strategy, and certainly if I saw a pedestrian occasionally flailing randomly as they walked down the street, I wouldn't say they were 'exploring'.
So how is this 'exploring' in something like Q-learning or RL? If, instead of small actions like the flailing in the pedestrian example, we are talking about larger logical moves, then trying something new once in a while, like walking north first then east instead of always east first then north, may take us past a nice ice cream shop, or a game reward.
Depending on what kinds of actions your q-learning or RL system controls, adding various levels (epsilon) of noisy actions will help.
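For concreteness, here is what the epsilon-greedy choice itself usually looks like; a minimal sketch where the Q-values could come from a table or a network:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the action with the highest estimated value."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```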
Let's take a look at two scenarios:
An 8x8 grid with some reward squares, a final goal square, and some 'evil' squares.
A Call-of-Duty-like game: an FPS in a huge open world with few rewards other than after several minutes of play, and with each second of game play involving several controller movements.
In the first one, if we have an epsilon of 0.1, that means 10% of the time we just move at random. If we are right beside an 'evil' square, then even if the RL has learnt that it needs to avoid the 'evil' square, it has a 10% chance of moving randomly, and if it does, a 1/4 chance of moving onto the 'evil' square... so a 2.5% chance of just dying for no reason whenever it is beside an 'evil' square. It will have to navigate across the 8x8 grid, and if the grid is set up like a maze between 'evil' squares with the start opposite the end, then there will be about 20 or so moves. With a 10% error rate, that will be about 20 to 24 moves once it has reached mastery. When it is just starting out, its informed moves are no better than random, and it will have a 1/4^20 chance of making it the first time. Alternatively, if it learns some path that is sub-optimal, the only way it has to learn the optimal path is for its random (epsilon) moves to happen at the right time and in the right direction. That can take a really, really long time if we need 5 correct random moves to happen in a row (1/10^5 x 1/4^5).
Now let's look at the FPS game with millions of moves. Each move isn't so critical, so an epsilon of 0.1 (10%) isn't so bad. On the flip side, the number of random moves that need to be chained together to make a meaningful new move or follow a path is astronomically large. Setting a higher epsilon (like 0.5) will certainly increase the chance, but we still have the issue of branching factor. If, in order to go all the way down a new alleyway, we need 12 seconds of game play at 5 actions per second, then that's 60 actions in a row. There is a 1/2^60 chance of getting 60 random new moves in a row to go down this alleyway when the RL has learnt not to. That doesn't really sound like an exploration factor to me.
But it does work for toy problems.
Lots of possibilities to improve it
If we frame the RL problem just a little differently, we can get really different results.
Let's say that instead of epsilon meaning we do something random for one time step, it is instead the chance of switching between two modes: using RL to guide our actions, or doing one thing continuously (go north, chop wood). In either mode, we switch to the other with probability epsilon.
In the FPS game example, if we had an epsilon of 1/15, then for about 15 steps (at 5 steps per second, that's 3 seconds on average) RL would control our moves, then for about 3 seconds it would do just one thing (run north, shoot the sky, ...), then back to RL. This is a lot less like a pedestrian twitching as they walk, and more like once in a while running north and then west.
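A minimal sketch of that mode-switching idea; the two-mode framing and names are my own, not a standard API:

```python
import numpy as np

def sticky_epsilon_step(q_values, state, epsilon, rng):
    """state is a (mode, persistent_action) pair.

    With probability epsilon we toggle between following the greedy policy
    and repeating one fixed exploratory action for a stretch of steps."""
    mode, persistent_action = state
    if rng.random() < epsilon:                              # switch modes
        if mode == "greedy":
            mode = "persist"
            persistent_action = int(rng.integers(len(q_values)))
        else:
            mode = "greedy"
    action = persistent_action if mode == "persist" else int(np.argmax(q_values))
    return action, (mode, persistent_action)
```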

What is momentum in machine learning? [closed]

I'm new to the field of machine learning and recently I heard this term. I tried to read some articles on the internet, but I still don't understand the idea behind it. Can someone give me some examples?
During back-propagation, we're adjusting the weights of the model to adapt to the most recent training results. On a nicely-behaved surface, we would simply use Newton's method and converge to the optimum solution with no problem. However, reality is rarely well-behaved, especially in the initial chaos of a randomly-initialized model. We need to traverse the space with something less haphazard than a full-scale attempt to hit the optimum on the next iteration (as Newton's method does).
Instead, we make two amendments to Newton's approach. The first is the learning rate: Newton adjusted weights by using the local gradient to compute where the solution should be, and going straight to that new input value for the next iteration. Learning rate scales this down quite a bit, taking smaller steps in the indicated direction. For instance, a learning rate of 0.1 says to go only 10% of the computed distance. From that new value, we again compute the gradient, "sneaking up" on the solution. This gives us a better chance of finding the optimum on a varied surface, rather than overshooting or oscillating past it in all directions.
Momentum (see here) is a similar attempt to maintain a consistent direction. If we're taking smaller steps, it also makes sense to maintain a somewhat consistent heading through our space. We take a linear combination of the previous heading vector and the newly computed gradient vector, and adjust in that direction. For instance, if we have a momentum of 0.90, we will take 90% of the previous direction plus 10% of the new direction, and adjust weights accordingly -- multiplying that direction vector by the learning rate.
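A rough sketch of that update, using the convex-combination form described above (classical SGD momentum implementations often use v = mu*v + grad instead, but the idea is the same):

```python
import numpy as np

def momentum_update(weights, grad, velocity, lr=0.1, mu=0.9):
    """One gradient step where the direction blends the previous heading
    (velocity) with the freshly computed gradient."""
    velocity = mu * velocity + (1.0 - mu) * grad   # e.g. 90% old direction, 10% new
    weights = weights - lr * velocity              # scale the step by the learning rate
    return weights, velocity
```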
Does that help?
Momentum is a term used in the gradient descent algorithm.
Gradient descent is an optimization algorithm which works by finding the direction of steepest slope at its current point and updating its position by moving in that direction. As a result, for a small enough step, the value of the function being minimized decreases at each step. The problem is that this direction can change greatly at some points of the function, while the best path to the minimum usually does not contain many sharp turns. So it is desirable to make the algorithm keep the direction it has already been travelling along for some time before it changes direction. To do that, momentum is introduced.
A way of thinking about this is to imagine a stone rolling down a hill until it stops in a flat area (a local minimum). If, while rolling down, the stone happens to pass a point where the steepest direction changes for just a moment, we don't expect it to change its direction entirely (since its physical momentum keeps it going). But if the direction of the slope changes entirely, the stone will gradually turn towards the steepest descent again.
Here is a detailed link where you can check out the math behind it or just see the effect of momentum in action:
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d

Image processing surveillance camera [closed]

I was given this question in a job interview and I think I really messed it up. I was wondering how others would go about it, so I can learn from the experience.
You have one image from a surveillance video at an airport which includes a line of people waiting for check-in. You have to assess whether the line is big/crowded and therefore additional clerks are necessary. You can assume anything that may help your answer. What would you do?
I told them I would try to
segment the area containing people from the rest by edge detection
use assumptions on body contour such as relative height/width to denoise unwanted edges
use color knowledge; but then they asked how to do that and I didn't know
You failed to mention one of the things that makes it easy to identify people standing in a queue: the fact that they aren't going anywhere (at least, not very quickly). I'd do it something like this (Warning: contains lousy Blender graphics):
You said I could assume anything, so I'll assume that the airport's floor is a nice uniform green colour. Let's take a snapshot of the queue every 10 seconds:
We can use a colour range filter to identify the areas of floor that are empty in each image:
Then, by taking the pixel-wise maximum across these images, we can eliminate people who are just milling around and not part of the queue. Calculating the queue length from the resulting image should be very easy:
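A minimal OpenCV sketch of this, assuming a list of BGR snapshots taken every 10 seconds and a roughly uniform green floor (the HSV bounds are guesses, not calibrated values):

```python
import cv2
import numpy as np

def queue_mask(snapshots, lower=(40, 60, 60), upper=(80, 255, 255)):
    """snapshots: list of BGR frames of the same scene taken over time."""
    # For each frame, mark empty (green) floor as 255 and everything else as 0.
    floor = [cv2.inRange(cv2.cvtColor(f, cv2.COLOR_BGR2HSV),
                         np.array(lower), np.array(upper)) for f in snapshots]
    # Pixel-wise maximum across frames: any pixel that was ever empty floor
    # becomes 255, so only spots occupied in every snapshot (the queue) stay 0.
    ever_empty = np.maximum.reduce(floor)
    return cv2.bitwise_not(ever_empty)  # 255 where someone stood the whole time
```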
There are several ways of improving on this. For example, green might not be a good choice of colour in Dublin airport on St Patrick's day. Chequered tiles would be a little more difficult to segregate from foreground objects, but the results would be more reliable. Using an infrared camera to detect heat patterns is another alternative.
But the general approach should be fairly robust. There's absolutely no need to try and identify the outlines of individual people; this is really very difficult when people are standing close together.
I would just use a person detector, for example OpenCV's HOG people detection:
http://docs.opencv.org/modules/gpu/doc/object_detection.html
or latent svm with the person model:
http://docs.opencv.org/modules/objdetect/doc/latent_svm.html
I would count the number of people in the queue...
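A quick sketch using OpenCV's built-in HOG pedestrian detector (the file name and detection parameters are placeholders to tune for your camera):

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("queue_snapshot.jpg")  # hypothetical snapshot file
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
print(f"Detected roughly {len(rects)} people in the queue")
```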
I would estimate the color of the empty floor and convert to a normalized color space (like { R/(R+G+B), G/(R+G+B) }). Do the same for the image you want to check, and compare the two.
My assumption: where the difference is larger than a threshold T it is due to a person.
When this is happening for too much space it is crowded and you need more clerks for check-in.
This processing will be far more robust than trying to recognize and count individual persons, and will work with quite low resolution / a low number of pixels per person.
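A rough sketch of that comparison, assuming a reference frame of the empty floor area and a hand-picked threshold:

```python
import numpy as np

def chromaticity(img):
    """Map an RGB image (H, W, 3) to normalized {R/(R+G+B), G/(R+G+B)}."""
    s = img.sum(axis=2, keepdims=True).astype(np.float64) + 1e-6
    return img[..., :2].astype(np.float64) / s

def occupied_fraction(frame, empty_floor_frame, threshold=0.05):
    # Pixels whose chromaticity deviates from the empty floor by more than the
    # threshold are assumed to be people; a large fraction suggests a crowded queue.
    diff = np.linalg.norm(chromaticity(frame) - chromaticity(empty_floor_frame), axis=2)
    return float((diff > threshold).mean())
```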

Weights updating and estimating training example values in playing checkers

I am reading the Tom Mitchell's Machine Learning book, the first chapter.
What I want to do is to write a program that plays checkers against itself and learns to win in the end. My question is about the credit assignment for the non-terminal board positions it encounters. Maybe we can set each position's value using a linear combination of its features and (initially random) weights, but how do we update the weights with the LMS rule? Because we don't have training samples apart from the ending states.
I am not sure whether I have stated my question clearly, although I tried to.
I haven't read that specific book, but my approach would be the following. Suppose that White wins. Then, every position White passed through should receive positive credit, while every position Black passed through should receive negative credit. If you iterate this reasoning, whenever you have a set of moves making up a game, you should add some amount of score to all positions from the victor and remove some amount of score from all positions from the loser. You do this for a bunch of computer vs. computer games.
You now have a data set made up of a bunch of checker positions and respective scores. You can now compute features over those positions and train your favorite regressor, such as LMS.
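A minimal sketch of that LMS step, with the game-outcome scores described above used as training targets (the feature extraction for a board position is assumed):

```python
import numpy as np

def lms_update(weights, features, target, lr=0.01):
    """Nudge the linear estimate V_hat = w . x towards the assigned score."""
    v_hat = weights @ features
    return weights + lr * (target - v_hat) * features

def train_on_game(weights, position_features, scores, lr=0.01):
    # After each self-play game: positive scores for the winner's positions,
    # negative for the loser's, then one LMS pass over the collected pairs.
    for x, v_train in zip(position_features, scores):
        weights = lms_update(weights, np.asarray(x, dtype=float), v_train, lr)
    return weights
```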
An improvement of this approach would be to train the regressor, then make some more games where each move is randomly drawn according to the predicted score of that move (i.e. moves which lead to positions with higher scores have higher probability). Then you update those scores and re-train the regressor, etc.
