I have been playing around with the MIT DeepTraffic Challenge.
I have also been watching the lecture and reading the slides.
After getting a general understanding of the architecture, I was wondering what exactly the reward function given by the environment is.
Is it the same as the input of the grid cell (max. drivable speed)?
And are they using reward clipping, or not?
I also found this JavaScript codebase, which does not really help my understanding either.
The reward is the scaled average speed, and it lies within the interval [-3, 3].
The implementation of the DeepTraffic environment is located in this file:
https://selfdrivingcars.mit.edu/deeptraffic/gameopt.js
I'm trying to make it readable. Here's the work-in-progress version:
https://github.com/mljack/deeptraffic/blob/master/gameopt.js
var reward = (avgSpeedMeasurement - 60) / 20;
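As a quick sanity check of that scaling, here is a small Python sketch (not the game code; the constants 60 and 20 are taken from the line above):

# Sketch: reproduces the scaling quoted above, not the actual game code.
# Under this formula, the stated interval [-3, 3] corresponds to average
# speeds between 0 and 120 mph.
for avg_speed_mph in (0, 40, 60, 80, 120):
    reward = (avg_speed_mph - 60) / 20
    print(avg_speed_mph, "mph ->", reward)  # 0 -> -3.0, 60 -> 0.0, 120 -> 3.0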
A student asked me this and I can't find the answer. You can set the turtle's speed to 0-10, but what does that actually mean? X actions per second?
We are on Code.org, which translates its lesson code into JavaScript, but this command is found in the Play Lab, which provides no translation. I am assuming this is analogous to JS-turtle, but if you know the answer for Python Turtle, etc., I'd love to hear it.
What precisely is the turtle's speed? X actions/second? ... if you know the answer for Python Turtle, etc, I'd love to hear it.
In standard Python, the turtle's speed() method indirectly controls the speed of the turtle by dividing up the turtle's motion into smaller or larger steps, where each step has a defined delay.
By default, if we don't mess with setworldcoordinates(), or change the default screen update delay using delay(), or tracer(), then the motion of a turtle is broken up into a number of individual steps determined by:
steps = int(distance / (3 * 1.1**speed * speed))
At the default speed (3 or 'slow'), a 100px line would be drawn in 8 steps. At the slowest speed (1 or 'slowest'), 30 steps. At a fast speed (10 or 'fast'), 1 step. (Oddly, the default speed isn't the 'normal' (6) speed!) Each step incurs a screen update delay of 10ms by default.
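To see where those step counts come from, here is a small sketch (assuming the default tracer()/delay() settings) that plugs a 100px move into the formula above:

# Sketch: step counts for a 100px line at various turtle speed settings,
# using the formula quoted above (default tracer()/delay() assumed).
distance = 100
for speed in (1, 3, 6, 10):
    steps = int(distance / (3 * 1.1**speed * speed))
    print("speed", speed, "->", steps, "steps")
# speed 1 -> 30, speed 3 -> 8, speed 6 -> 3, speed 10 -> 1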
Using a speed of 0 ('fastest'), or turning off tracer(), avoids this process altogether and just draws lines in one step.
There's a similar logic for how the speed() setting affects the number of steps the turtle takes to rotate when you do right() or left().
https://docs.python.org/3/library/turtle.html#turtle.speed
From the docs you can see that it is just an arbitrary value.
I am trying my hand at reinforcement learning / deep Q-learning these days, and I started with a basic game of Snake.
I followed this article: https://towardsdatascience.com/how-to-teach-an-ai-to-play-games-deep-reinforcement-learning-28f9b920440a
With its help I successfully trained the agent to eat the food.
Now I want it to eat the food in a specific number of steps, say 20: not more, not less. How would the reward system and policy be changed for this?
I have tried many things, with little to no result.
For example, I tried this:
def set_reward(self, player, crash):
    self.reward = 0
    if crash:
        self.reward = -10
        return self.reward
    if player.eaten:
        self.reward = 20 - abs(player.steps - 20) - player.penalty
        if player.steps == 10:
            self.reward += 10  # -abs(player.steps - 20)
    else:
        player.penalty += 1
        print("Penalty:", player.penalty)
    return self.reward
Thank you.
Here is the program:
https://github.com/maurock/snake-ga
I would suggest this approach is problematic because, despite changing your reward function, you haven't included the number of steps in the observation space. The agent needs that information in the observation space to be able to tell at what point it should reach the goal. As it stands, if your agent is next to the goal and all it has to do is turn right, but all it has done so far is five moves, that is exactly the same observation as if it had done 19 moves. The point is that you can't feed the agent the same state and expect it to take different actions: the agent doesn't see your reward function, it only receives a reward based on the state. So you are demanding contradictory actions for the same state.
Think of when you come to testing the agent's performance. There is no longer a reward; all you are doing is passing the network a state, and you are expecting it to choose different actions for the same state.
I assume your state space is some kind of 2D array. It should be straightforward to alter the code to include the number of steps in the state. Then the reward function would be something like: if observation[num_steps] == 20: reward = 10.
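A minimal sketch of that idea (names such as base_state, num_steps and set_reward are illustrative, not the linked repo's actual API): append the step counter to the state vector, and give the full bonus only when the food is eaten on exactly the target step.

import numpy as np

# Sketch only: illustrative names, not the snake-ga repo's actual API.
def state_with_steps(base_state, num_steps, max_steps=100):
    # Append a (normalised) step counter so the agent can distinguish
    # "5 moves so far" from "19 moves so far".
    return np.append(base_state, num_steps / max_steps)

def set_reward(eaten, crashed, num_steps, target_steps=20):
    if crashed:
        return -10
    if eaten:
        # Full bonus only when the food is eaten on exactly the target step;
        # the bonus shrinks the further num_steps is from the target.
        return 10 - abs(num_steps - target_steps)
    return 0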
Ask if you need more help coding it.
I'm really hoping someone here can help.
I have performed a chi-square test of independence, looking at men/women and early/late dropout from therapy. I have a p-value of 0.047. Do I need to do any post hoc testing on this? Men drop out almost 50:50 early:late, whereas women drop out almost 25:75 early:late. Do I need post hoc testing for this and a Bonferroni correction, or is the answer simply:
Retention rates were compared across gender, and a significant association was found (χ²(1) = 3.94, p = 0.047), indicating that women were more likely than men to be retained past the third CBT session.
Any help would be greatly appreciated, stats hurt my head and I can't continue past this problem.
Since there's only one test performed, with a single degree of freedom, there's no way (or need) to do any multiple comparison correction.
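For completeness, a 2×2 table like this yields a single chi-square statistic with one degree of freedom, so there is nothing to correct for. Here is a hedged Python sketch with made-up cell counts (the question does not give the real ones) showing how such a test is run:

from scipy.stats import chi2_contingency

# Hypothetical counts only, chosen to mimic the described ~50:50 vs ~25:75
# split; they are NOT the asker's data.
#         early  late
table = [[20,    20],   # men
         [10,    30]]   # women

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # dof == 1: a single comparison, so no Bonferroni correction needed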
I'm trying to optimize my current EA, which has approximately 40 different inputs, with the MetaTrader genetic algorithm.
The inputs have constraints such as I1 < I2 < I3, I24 > 0, ..., for a total of about 20 constraints.
I tried to filter out the solutions that do not respect the constraints with the following code:
int OnInit() {
    if(I1 >= I2 || I2 >= I3) {
        return(INIT_FAILED);
    }
    ...
}
The problem is then the following: no viable solutions are found after the first 512 iterations and the optimization stops (the same happens with the non-genetic optimizer).
If I remove the constraints, the algorithm runs and optimizes the solutions, but those solutions do not respect the constraints.
Has anyone already faced similar issues? Currently I think I'll have to use an external tool to optimize, but this does not feel right.
If, as Daniel recommended yesterday, the constraint check short-circuits inside the OnInit(){...} handler, the genetic-mode optimiser will (and has to) give up, as it has not seen any progression on its evolutionary journey across some recent number of population modifications/mutations.
What has surprised me is that the fully-meshed mode (going across the whole Cartesian parameter-set space) refused to test each and every parameter-set vector, one after another. Having spent remarkable hundreds of machine-years on this very sort of testing, this sounds strange compared with my prior MT4 [ Strategy Tester ] experience.
One more trick: let the tested code pass through OnInit(){...}, but make the constraint conditions short-circuit the OnTick(){...} event handler, returning straight away upon entering it. This is a trick we invented so our code could simulate delayed starts (an internal time-based iterator, for a sliding-window location in the flow of time) of the actual trading under test. This way one can simulate an adverse effect for "wrong" parameter-set vectors, and the genetics can evolve further, even finding, as a side effect, which types of parametrisation get penalised :o)
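The same idea, sketched outside MQL in Python for illustration (the names fitness, backtest and the penalty value are assumptions, not MetaTrader's API): instead of rejecting infeasible parameter sets at initialisation, score them very poorly so the genetic search can still rank them and keep evolving.

# Rough sketch of the "penalise, don't reject" idea (illustrative names only,
# not MetaTrader's API).
def fitness(params, backtest):
    i1, i2, i3 = params["I1"], params["I2"], params["I3"]
    if not (i1 < i2 < i3):
        # Infeasible vector: return a very poor score instead of aborting,
        # so the optimiser still sees a ranking and can evolve away from it.
        return -1e9
    return backtest(params)  # normal evaluation for feasible vectors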
A search space with 40+ parameters? ... The performance!
If this is your concern, the next level of performance gets delivered once you start using a distributed-computing testing farm, where many machines perform tests on a centrally managed distribution of parameter-set vectors and report the results back.
This was indeed a performance booster for our Quant R&D.
After some time, we also implemented a "standalone" farm for (again, distributed-computing) off-platform Quant R&D prototyping and testing.
In my application I need to determine what plates a user can load on their barbell to achieve a desired weight.
For example, the user might specify that they are using a 45 lb bar and have 45, 35, 25, 10, 5, and 2.5 lb plates to use. For a weight like 115, this is an easy problem to solve, as the result neatly matches a common plate: (115 - 45) / 2 = 35.
So the objective here is to find the largest to smallest plate(s) (from a selection) the user needs to achieve the weight.
My starter method looks like this...
-(void)imperialNonOlympic:(float)barbellWeight workingWeight:(float)workingWeight {
    float realWeight = (workingWeight - barbellWeight);
    float perSide = realWeight / 2;
    .... // lots of inefficient mod and division ....
}
My thought process is to determine first what the weight per side would be: (total weight - weight of the barbell) / 2. Then determine the largest-to-smallest plates needed, and the number of each; e.g. 325 would be 45 * 3 + 5 per side, i.e. 45, 45, 45, 5.
Messing around with fmodf and a couple of other ideas, it occurred to me that there might be a known algorithm that solves this problem. I was looking into BFS, and admit that it is above my head, but I'm still willing to give it a shot.
Appreciate any tips on where to look in algorithms or code examples.
Your problem is a form of the knapsack problem. You will find a lot of solutions for it; there are several variants of the problem, and it is basically a dynamic programming (DP) problem.
One common approach is greedy: start by taking the largest weight that does not exceed your desired weight, then take the largest that fits in the remaining weight, and so on. It's easy. I am adding some more links (Link 1, Link 2, Link 3) so that it becomes clearer. Some of those problems may be hard to understand; skip them and try to focus on the basic knapsack problem. Good luck. :)
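The greedy approach described above, sketched in Python rather than Objective-C (the function name and plate list are illustrative; it assumes an unlimited supply of each plate size and that the per-side weight is exactly reachable with the available denominations):

def plates_per_side(target_weight, bar_weight, plates=(45, 35, 25, 10, 5, 2.5)):
    # Greedy sketch: always pick the largest plate that still fits, then repeat.
    per_side = (target_weight - bar_weight) / 2
    loadout = []
    for plate in sorted(plates, reverse=True):
        while per_side >= plate:
            loadout.append(plate)
            per_side -= plate
    return loadout, per_side  # non-zero remainder => weight not exactly reachable

# 325 lb with a 45 lb bar -> [45, 45, 45, 5] per side, as in the question above
print(plates_per_side(325, 45))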
Let me know if that helps. :)