There is no API description for changing the friction coefficient or the moment of inertia of CompassGait at https://drake.mit.edu/pydrake/pydrake.examples.compass_gait.html. What's the best way to deal with this?
It's not a missing API -- those concepts are explicitly missing from the mathematical model. The assumption of infinite friction (no slip) allows us to capture the dynamics in minimal coordinates with a single mode. The point mass assumptions could be replaced with inertias without much additional complexity, but that's not how we have derived these equations.
The "searching for limit cycles" exercise in my course notes derives the equations with friction explicitly: (currently the first exercise in this chapter http://underactuated.csail.mit.edu/simple_legs.html)
Related
I am interested in using the Direct Collocation Method to generate a walking trajectory for a 2D 7-link biped robot (torso, plus left and right upper legs, lower legs, and feet).
Specifically,
input: torque at each joint
state: position of the waist and angle of each joint
I have the parameters of each link and the equations of motion.
However, I couldn't work out from the API documentation how to write the "system" (and "context").
Is there a good way to define a "system" from this information, or a similar example anywhere?
I'm going to use pydrake.
I have a number of relevant examples in my course notes. I would recommend the compass gait limit cycle exercise from the chapter on planning through contact (which uses a URDF to specify the dynamics), or the SLIP model example in the notebook associated with the "Simple models of legged robots" chapter for an example of writing the equations of motion out manually.
Please understand that DirectCollocation, by itself, is not ideal for planning through collisions. Those chapters describe the "hybrid trajectory optimization" approach that is likely what you will want.
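If you go the URDF route mentioned above, the basic setup looks roughly like the sketch below. Treat it only as a sketch: "biped.urdf" is a placeholder for your own robot description, the module paths and some names (e.g. pydrake.systems.trajectory_optimization vs. pydrake.planning, AddModels vs. AddModelFromFile) vary across Drake versions, and the contact/hybrid-mode constraints discussed in the course notes still need to be added on top of it.

```python
from pydrake.multibody.plant import MultibodyPlant
from pydrake.multibody.parsing import Parser
from pydrake.systems.trajectory_optimization import DirectCollocation  # pydrake.planning in newer Drake
from pydrake.solvers import Solve  # pydrake.solvers.mathematicalprogram in older Drake

# Continuous-time plant built from a (placeholder) URDF of the 7-link biped.
plant = MultibodyPlant(time_step=0.0)
Parser(plant).AddModels("biped.urdf")   # older versions: AddModelFromFile
plant.Finalize()
context = plant.CreateDefaultContext()

# Direct collocation over the actuation input port (one torque per joint).
dircol = DirectCollocation(
    plant, context, 41, 0.01, 0.1,      # num_time_samples, min/max time step
    input_port_index=plant.get_actuation_input_port().get_index())

u = dircol.input()
dircol.AddRunningCost(u.dot(u))         # simple effort cost
# ... add initial/final state, periodicity, and foot constraints here ...

result = Solve(dircol.prog())           # older versions: Solve(dircol)
x_traj = dircol.ReconstructStateTrajectory(result)
```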
I am working on a problem that we aim to solve with deep Q-learning. However, training takes too long for each episode, roughly 83 hours, and we envision solving the problem within, say, 100 episodes.
We are gradually learning a matrix (100 * 10), and within each episode we need to perform 100*10 iterations of certain operations. Basically, we select a candidate from a pool of 1000 candidates, put this candidate into the matrix, and compute a reward by feeding the whole matrix to a reward function.
The central hurdle is that the reward function computation at each step is costly, roughly 2 minutes, and each step updates only one entry of the matrix.
All the elements in the matrix depend on each other in the long term, so the whole procedure does not seem suitable for a "distributed" system, if I understand correctly.
Could anyone shed some light on potential optimization opportunities here, e.g. extra engineering effort? Any suggestions and comments would be appreciated very much. Thanks.
======================= update of some definitions =================
0. initial stage:
A 100 * 10 matrix, with every element empty.
1. action space:
At each step I select one element from a candidate pool of 1000 elements and insert it into the matrix, one entry at a time.
2. environment:
At each step I have an updated matrix to learn from.
An oracle function F returns a quantitative value in the range 5000 ~ 30000, the higher the better (one computation of F takes roughly 120 seconds).
This function F takes the matrix as the input, performs a very costly computation, and returns a quantitative value that indicates the quality of the synthesized matrix so far.
This function is essentially used to measure some performance of the system, so it does take a while to compute a reward value at each step.
3. episode:
By saying "we are envisioning to solve it within 100 episodes", that's just an empirical estimation. But it shouldn't be less than 100 episode, at least.
4. constraints
As I mentioned, all the elements in the matrix depend on each other in the long term, and that's why the reward function F computes the reward by taking the whole matrix as the input rather than just the latest selected element.
Indeed, as more and more elements are appended to the matrix, the reward could increase, but it could also decrease.
5. goal
The synthesized matrix should make the oracle function F return a value greater than 25000. Whenever it reaches this goal, I will terminate the learning. (A rough code restatement of this loop follows below.)
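To make the setup concrete, here is a rough restatement of the loop described above as code. Everything here is a placeholder: F, select_candidate, and candidate_pool stand in for the actual components, and the matrix entries are treated as plain numbers purely for illustration.

```python
import numpy as np

ROWS, COLS, TARGET = 100, 10, 25000          # matrix size and goal from above

def run_episode(F, select_candidate, candidate_pool):
    matrix = np.full((ROWS, COLS), np.nan)   # 0. start from an "empty" matrix
    reward = None
    for step in range(ROWS * COLS):          # 100 * 10 insertions per episode
        r, c = divmod(step, COLS)
        # 1. pick one of the 1000 candidates and insert it
        matrix[r, c] = select_candidate(matrix, candidate_pool)
        # 2. the expensive part: one call to F takes ~120 s
        reward = F(matrix)
        # 5. stop as soon as the oracle value exceeds the target
        if reward > TARGET:
            break
    return matrix, reward
```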
Honestly, there is no effective way to know how to optimize this system without knowing specifics, such as which computations are in the reward function or which programming design decisions you have made that we could help with.
You are probably right that the episodes are not suitable for distributed calculation, meaning we cannot parallelize this, as they depend on previous search steps. However, it might be possible to throw more computing power at the reward function evaluation, reducing the total time required to run.
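If F itself is hard to parallelize internally, another place to spend compute, assuming F is a pure function of the matrix and your method can make use of scores for several candidate insertions at the same step, is to evaluate those candidates in separate processes. This is only a sketch; F, current_matrix, position, and candidates are placeholders for your components.

```python
from concurrent.futures import ProcessPoolExecutor
from copy import deepcopy

def score_candidates(F, current_matrix, position, candidates, workers=8):
    # Build one trial matrix per candidate insertion at the same (row, col) position.
    trial_matrices = []
    for c in candidates:
        m = deepcopy(current_matrix)
        m[position] = c
        trial_matrices.append(m)
    # Evaluate the expensive oracle F on all trial matrices in parallel.
    # (F and the matrices must be picklable for ProcessPoolExecutor.)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        rewards = list(pool.map(F, trial_matrices))
    return list(zip(candidates, rewards))
```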
I would encourage you to share more details on the problem, for example by profiling the code to see which component takes up most of the time, by sharing a code excerpt or, as the standard for doing science gets higher, by sharing a reproducible code base.
Not a solution to your question, just some general thoughts that maybe are relevant:
One of the biggest obstacles to applying Reinforcement Learning to "real world" problems is the astoundingly large amount of data/experience required to achieve acceptable results. For example, OpenAI's Dota 2 agent collected experience equivalent to 900 years of play per day. In the original Deep Q-network paper, achieving performance close to that of a typical human required hundreds of millions of game frames, depending on the specific game. In other benchmarks where the inputs are not raw pixels, such as MuJoCo, the situation isn't a lot better. So, if you don't have a simulator that can generate samples (state, action, next state, reward) cheaply, maybe RL is not a good choice. On the other hand, if you have a ground-truth model, maybe other approaches can easily outperform RL, such as Monte Carlo Tree Search (e.g., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning) or simple random search (Simple random search provides a competitive approach to reinforcement learning). All these ideas and much more are discussed in this great blog post.
The previous point is especially true for deep RL. Approximating value functions or policies with a deep neural network that has millions of parameters usually implies that you'll need a huge quantity of data, or experience.
And regarding your specific question:
In the comments, I've asked a few questions about the specific features of your problem. I was trying to figure out whether you really need RL to solve the problem, since it's not the easiest technique to apply. On the other hand, if you really do need RL, it's not clear whether you should use a deep neural network as the approximator or whether a shallow model (e.g., random trees) would suffice. However, these questions and other potential optimizations require more domain knowledge. Here, it seems you are not able to share the domain of the problem, which could be due to numerous reasons, and I perfectly understand.
You have estimated the number of required episodes to solve the problem based on some empirical studies using a smaller, 20*10 version of the matrix. Just a cautionary note: due to the curse of dimensionality, the complexity of the problem (or the experience needed) could grow exponentially as the dimensionality of the state space grows, although maybe that is not the case for you.
That said, I'm looking forward to seeing an answer that really helps you solve your problem.
I've been reading a bit about feature hashing for dimensionality reduction. I understand that it's important to use a hash function that has a uniform output distribution (the chance of an input being mapped to a specific value is the same as for every other value in the range), as well as an avalanche/cascade effect (a small change in input produces a big change in output). These properties ensure that collisions between features will be independent of their frequency. However, I'm still unclear on how the avalanche effect (specifically) impacts this. Could anyone explain why/how it matters here? What constitutes a 'big change' in output?
References:
http://blog.someben.com/2013/01/hashing-lang/
http://metaoptimize.com/qa/questions/6943/what-is-the-hashing-trick#6945
The idea is that if you have a tight cluster of input data, you still want the hashing function to splatter the outputs all over the map. The effect is that a collision will be a uniformly random event, as opposed to that tight cluster giving you a spate of collisions -- or a spate of collisions with the mappings of another tight cluster.
"Big change" suggests that your hashing function, h, should show that h(a) - h(b) is stochastically independent of (a-b).
Is that enough? Follow up if you need more explanation.
The avalanche effect ensures that a tiny change in the input (e.g. words: cloud vs clouds) will produce a big change in the output, that is, that close input values will produce distant and unpredictable output values.
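A small illustration (not tied to any particular library) of why this matters for the hashing trick: a well-mixed hash sends "cloud" and "clouds" to unrelated buckets, while a weak hash such as summing character codes keeps their buckets correlated, so near-duplicate features would systematically collide with each other rather than uniformly at random.

```python
import hashlib

NUM_BUCKETS = 2 ** 10

def strong_bucket(token):
    # md5 has good avalanche behaviour for this purpose
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS

def weak_bucket(token):
    # no avalanche: a one-character change shifts the bucket by a small amount
    return sum(ord(c) for c in token) % NUM_BUCKETS

for token in ["cloud", "clouds"]:
    print(token, strong_bucket(token), weak_bucket(token))
# weak_bucket("clouds") - weak_bucket("cloud") is exactly ord("s"): the buckets
# of similar tokens are correlated, which is what the avalanche effect prevents.
```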
I need some help with solving a problem that uses the Q-learning algorithm.
Problem description:
I have a rocket simulator where the rocket takes random paths and sometimes crashes. The rocket has 3 different engines that can be either on or off. Depending on which engine(s) are activated, the rocket flies in different directions.
Functions for turning the engines on/off are available.
The task:
Construct a Q-learning controller that will keep the rocket facing up at all times.
A sensor that reads the angle of the rocket is available as input.
My solution:
I have the following states:
I also have the following actions:
all engines off
left engine on
right engine on
middle engine on
left and right on
left and middle on
right and middle on
And the following rewards:
Angle = 0, Reward = 100
All other angles, reward = 0
Question:
Now to the question: is this a good choice of rewards and states? Can I improve my solution? Is it better to have rewards for other angles as well?
Thanks in advance
16 states x 7 actions is a very small problem.
Rewards for other angles will help you learn faster, but can create odd behaviors later depending on your dynamics.
If you don't have momentum, you may decrease the number of states, which will speed up learning and reduce memory usage (which is already tiny). To find the optimal number of states, try decreasing the number of states while analyzing a metric such as reward per timestep over multiple games, or mean error (normalized by starting angle) over multiple games. Some state representations may perform much better than others; if not, choose the one that converges fastest. This should be relatively cheap with your small Q table.
If you want to learn quickly, you may also try Q(λ) or another modified reinforcement learning algorithm that uses eligibility traces to propagate rewards back through time faster.
Edit: Depending on your dynamics, the angle alone may not give you a Markov state. For example, you may need to include the current rotation rate in the state.
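For reference, a minimal tabular sketch of the setup being discussed (assuming the 16 angle bins and 7 engine combinations; the hook into your simulator, i.e. calling the engine on/off functions and discretizing the angle sensor reading, is left out):

```python
import numpy as np

NUM_STATES, NUM_ACTIONS = 16, 7      # 16 angle bins x 7 engine combinations
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((NUM_STATES, NUM_ACTIONS))

def reward(state):
    return 100 if state == 0 else 0  # 100 only in the upright bin, as above

def choose_action(state):
    # epsilon-greedy exploration over the 7 engine combinations
    if np.random.rand() < epsilon:
        return np.random.randint(NUM_ACTIONS)
    return int(np.argmax(Q[state]))

def q_update(state, action, next_state):
    # standard one-step Q-learning update
    Q[state, action] += alpha * (
        reward(next_state) + gamma * np.max(Q[next_state]) - Q[state, action])
```

At each simulator step you would call choose_action(state), switch the corresponding engines on/off, read and discretize the new angle, and then call q_update(state, action, next_state).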
Try putting smaller rewards on the states next to the desired state. This will help your agent learn to point up more quickly.
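A concrete, purely illustrative version of this suggestion, again assuming the 16-bin angle state with bin 0 as "facing up":

```python
def shaped_reward(state, num_states=16):
    # circular distance, in bins, from the upright bin
    dist = min(state, num_states - state)
    # 100 when upright, decaying to 0 a few bins away, instead of 100-or-0
    return max(0, 100 - 25 * dist)
```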
I have spent the last couple of days searching for curve reconstruction implementations and found none, neither as a library nor as a tool.
To describe my problem:
My main concern is contours with gaps:
From the papers I've read in the meantime, I guess the solution will require Delaunay triangulation, and the method referenced most often seems to be the one described in the 1997 paper "The Crust and the β-Skeleton: Combinatorial Curve Reconstruction".
Can someone point me to a curve reconstruction implementation, that can help me solve this problem?
The algorithm is implemented in CGAL. An example implementation in C++ can be seen in the CGAL ipelets demo package. Moreover, compiling the demo lets the user apply the algorithm in the Ipe GUI application:
In the above example I selected just part of my image, as the bottom lines did not meet the necessary requirements, so the crust can't be applied to that part until it is corrected. Further, as can be noticed, the image has to be sampled.
If no one provides another implementation example, I'll mark my answer as correct after a couple of days.
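For anyone who prefers to stay in Python, here is a rough illustration of the crust idea from the referenced paper (not the CGAL ipelets code) using scipy: compute the Voronoi vertices of the samples, re-triangulate the samples together with those vertices, and keep only the Delaunay edges whose endpoints are both original samples.

```python
import numpy as np
from scipy.spatial import Delaunay, Voronoi

def crust_edges(points):
    """points: (n, 2) array of samples along the contour.
    Returns the crust edges as index pairs into points."""
    n = len(points)
    vor = Voronoi(points)                      # Voronoi vertices of the samples
    augmented = np.vstack([points, vor.vertices])
    tri = Delaunay(augmented)                  # Delaunay of samples + Voronoi vertices
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            if b < n:                          # both endpoints are original samples
                edges.add((a, b))
    return edges
```

As noted above, this only works where the contour is sampled densely enough.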
Delaunay triangulation works on a discretized curve and therefore loses information. That can cause strange problems where you don't expect them. In your example, the middle part of the lower boundary would probably cause a problem.
In such situations it may be better to collect relevant information from the model and try to compute a matching.
Something like: for each end point, collect the contour derivative in a neighbourhood. Then find all end points to which that end point could be connected, such that the connection approximately follows the derivative direction and doesn't cross another line. Each possible connection can be weighted by the joint distance and the deviation from the local derivative. The weights define a weighted graph over the possible end point connections, and a maximum-weight matching in that graph would be a good solution to the problem; a sketch follows below.
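A sketch of that idea (everything here is hypothetical: endpoints is a list of (point, tangent) pairs collected from the contour ends, and crosses_existing_line is assumed to be supplied by the caller); networkx performs the maximum-weight matching step:

```python
import numpy as np
import networkx as nx

def match_endpoints(endpoints, crosses_existing_line, max_dist=50.0):
    """endpoints: list of (point, unit_tangent) pairs.
    crosses_existing_line(p, q) -> bool is assumed to be provided."""
    G = nx.Graph()
    for i, (p_i, t_i) in enumerate(endpoints):
        for j in range(i + 1, len(endpoints)):
            p_j, t_j = endpoints[j]
            d = np.linalg.norm(np.asarray(p_j) - np.asarray(p_i))
            if d == 0 or d > max_dist or crosses_existing_line(p_i, p_j):
                continue
            gap_dir = (np.asarray(p_j) - np.asarray(p_i)) / d
            # favour joints that follow the local tangent directions
            alignment = (abs(np.dot(gap_dir, np.asarray(t_i)))
                         + abs(np.dot(gap_dir, np.asarray(t_j))))
            G.add_edge(i, j, weight=alignment / (1.0 + d))  # higher is better
    # maximum-weight matching over the candidate end point connections
    return nx.max_weight_matching(G)
```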
There are quite a few ways to solve this:
You could simply write a "worm" that follows the curves; when you reach the end of one, take your current direction vector along with the gradient and extrapolate it forward. Find all the other endpoints that could fit, score them, and reconnect with the one that has the highest score. Simple, but prone to problems if it's more than a simple break.
A hierarchical waterfall method might also be interesting.
There are threshold methods in waterfall (and level-set methods) that can be used to detect these gaps and fill them in.