How does one determine when the CartPole environment has been solved?

I was going through this tutorial and saw the following piece of code:
# Calculate score to determine when the environment has been solved
scores.append(time)
mean_score = np.mean(scores[-100:])
if episode % 50 == 0:
    print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
        episode, mean_score))
if mean_score > env.spec.reward_threshold:
    print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
          .format(episode, mean_score, time))
    break
however, it didn't really make sense to me. How does one define when an "RL environment has been solved"? I'm not sure what that even means. I guess in classification it would make sense to define it as when the loss is zero. In regression maybe when the total L2 loss is less than some value? Perhaps it would have made sense to define it as when the expected return (discounted rewards) is greater than some value.
But here it seems they are counting the # of time steps? This doesn't make any sense to me.
Note the original tutorial had this:
def main(episodes):
    running_reward = 10
    for episode in range(episodes):
        state = env.reset()  # Reset environment and record the starting state
        done = False
        for time in range(1000):
            action = select_action(state)
            # Step through environment using chosen action
            state, reward, done, _ = env.step(action.data[0])
            # Save reward
            policy.reward_episode.append(reward)
            if done:
                break
        # Used to determine when the environment is solved.
        running_reward = (running_reward * 0.99) + (time * 0.01)
        update_policy()
        if episode % 50 == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
            break
not sure if this makes much more sense...
Is this only a particular quirk of this environment/task? How does the task end in general?

The time used in the case of CartPole equals the return of the episode: the reward is +1 per time step, so the longer you balance the pole, the higher the score, capped at some maximum episode length.
So the environment is considered solved when the running average over the last episodes gets close enough to that maximum.
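For concreteness, here is a minimal sketch of that check using the classic Gym API (a random policy stands in for a learned one, so it will not actually reach the threshold): keep the returns of the last 100 episodes and stop once their mean reaches env.spec.reward_threshold.
import gym
import numpy as np
from collections import deque

env = gym.make("CartPole-v0")      # reward_threshold is 195.0 for v0
scores = deque(maxlen=100)         # returns of the last 100 episodes

for episode in range(10000):
    state = env.reset()
    episode_return, done = 0, False
    while not done:
        action = env.action_space.sample()        # stand-in for a trained policy
        state, reward, done, _ = env.step(action)
        episode_return += reward                  # +1 per step, so return == episode length
    scores.append(episode_return)
    if len(scores) == 100 and np.mean(scores) >= env.spec.reward_threshold:
        print("Solved after {} episodes".format(episode))  # a random policy essentially never gets here
        break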

is this only a particular quirk of this environment/task?
Yes. Episode termination depends entirely on the respective environment.
The CartPole challenge is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
The performance of your solution is measured by how quickly your algorithm is able to solve the problem.
For more information on the CartPole environment, refer to this wiki.
For information on any Gym environment, refer to this wiki.
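As a quick sanity check (assuming the classic Gym registry values), you can read both the threshold and the episode-length cap straight off the environment spec:
import gym

env = gym.make("CartPole-v0")
print(env.spec.reward_threshold)   # 195.0 for CartPole-v0 (475.0 for CartPole-v1)
print(env.spec.max_episode_steps)  # 200 for v0, 500 for v1 - the cap mentioned above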

Related

How do I setup the timestep when using DifferentialEquations.jl in Julia for an irregular time series?

Playing with the harmonic oscillator, the differential equation is driven by a regular time series w_i sampled in the millisecond range.
ζ = 1/4pi # damping ratio
function oscillator!(du,u,p,t)
    du[1] = u[2]                          # y'(t) = z(t)
    du[2] = -2*ζ*p(t)*u[2] - p(t)^2*u[1]  # z'(t) = -2ζw(t)z(t) - w(t)^2y(t)
end
y0 = 0.0 # initial position
z0 = 0.0002 # initial speed
u0 = [y0, z0] # initial state vector
tspan = (0.0,10) # time interval
dt = 0.001 # timestep
w = t -> freq[Int(floor(t/dt))+1] # time series
prob = ODEProblem(oscillator!,u0,tspan,w) # define ODEProblem
sol = solve(prob,DP5(),adaptive=false,dt=0.001)
How do I set up the timestep when the parameter w_i is an irregular time series in the millisecond range?
date │ w
────────────────────────┼───────
2022-09-26T00:00:00.023 │ 4.3354
2022-09-26T00:00:00.125 │ 2.34225
2022-09-26T00:00:00.383 │ -2.0312
2022-09-26T00:00:00.587 │ -0.280142
2022-09-26T00:00:00.590 │ 6.28319
2022-09-26T00:00:00.802 │ 9.82271
2022-09-26T00:00:00.906 │ -5.21289
....................... | ........
While it's possible to disable adaptivity, and even if it were possible to force arbitrary step sizes, this isn't in general what you want to do, as it greatly limits the accuracy of the solution.
Instead, interpolate the parameter to let it take any value of t.
Fortunately, it's really simple to do!
using Interpolations
...
ts = [0, 0.1, 0.4, 1.0]
ws = [1.0, 2.0, 3.0, 4.0]
w = linear_interpolation(ts, ws)
tspan = first(ts), last(ts)
prob = ODEProblem(oscillator!, u0, tspan, w)
sol = solve(prob, DP5(), dt=0.001)
Of course, it doesn't need to be a linear interpolation.
If you still need the solution saved at particular time points, have a look at saveat for solve. E.g. saving the solution using ts used in the interpolation:
sol = solve(prob, DP5(), dt=0.001, saveat=ts)
Edit: follow-up on a comment:
Mathematically, you are always making some assumption about w(t) over the entire domain tspan. There is no such thing as "driven by a time series".
For example, a standard Runge-Kutta method will require the ODE function to be evaluated at the half-step h/2; the DP5() you have chosen here evaluates it at several more sub-steps. This is of course unavoidable, regardless of whether adaptivity is used or not.
Try adding println(t) into your ODE function and you will see this.
In case someone comes from MATLAB's ode45, note that it too simply uses adaptivity, and just treats explicit time steps the same way saveat does. And, of course, it will evaluate the function at various t outside of the explicit steps as well.
So even in your first example, you are interpolating your w. You are making a strange type of constant interpolation (but with floor, which combined with floats will cause other issues, since floor(n*dt/dt) might evaluate to n or n-1).
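To see that floor pitfall in isolation (shown in Python only because it is quick to paste into a REPL; IEEE doubles behave the same way in Julia), take dt = 0.1, where the rounding error is easy to trigger:
import math

dt = 0.1
t = 0.3                    # nominally the 3rd sample time, n*dt with n = 3
print(t / dt)              # 2.9999999999999996
print(math.floor(t / dt))  # 2, i.e. index n-1 instead of n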
And even if you were to pick a method that only will try to evaluate at exactly the predetermined time steps, say e.g. ExplicitEuler(), you are still implicitly making the same assumption that w(t) is constant up until the next time step.
Only now, you are also getting a much worse solution from just the ODE integration.
If a constant-previous type interpolation really is how w(t) is defined over the entire domain (which is what you did with floor(t/dt) here), then what we have is:
w = extrapolate(interpolate((ts,), ws, Gridded(Constant{Previous}())), Flat())
There is simply no mathematical way we get to ignore what happens across the time step, and there is no reason to limit the time-stepping to the sample points of our "load" function; doing so is neither more natural nor more correct in any mathematical sense.
u'(t) has to be defined on the entire domain we integrate over.

How to properly reset the ContinuousState in a class derived from LeafSystem?

I want to write a continuous-time system derived from LeafSystem that can have its continuous state reset to other values when some conditions are met. However, the system does not work as I expected. To find out the reason, I implemented a simple multi-step integrator system as below:
class MultiStepIntegrator(LeafSystem):
    def __init__(self):
        LeafSystem.__init__(self)
        self.state_index = self.DeclareContinuousState(1)
        self.DeclareStateOutputPort("x", self.state_index)
        self.flag_1 = True
        self.flag_2 = True

    def reset_state(self, context, value):
        state = context.get_mutable_continuous_state_vector()
        state.SetFromVector(value)

    def DoCalcTimeDerivatives(self, context, derivatives):
        t = context.get_time()
        if t < 2.0:
            V = [1]
        elif t < 4.0:
            if self.flag_1:
                self.reset_state(context, [0])
                print("Have done the first reset")
                self.flag_1 = False
            V = [1]
        else:
            if self.flag_2:
                self.reset_state(context, [0])
                print("Have done the second reset")
                self.flag_2 = False
            V = [-1]
        derivatives.get_mutable_vector().SetFromVector(V)
What I expect from this system is that it will give me a piecewise, discontinuous trajectory. Given that I set the state initially to 0, the state should first go from 0 to 2 for $t \in [0,2]$, then again from 0 to 2 for $t \in [2,4]$, and then from 0 to -2 for $t \in [4,6]$.
Then I simulate this system, and plot the logging with
import matplotlib.pyplot as plt
from pydrake.all import (AddMultibodyPlantSceneGraph, DiagramBuilder,
                         LogVectorOutput, Simulator)

builder = DiagramBuilder()
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, 1e-4)
plant.Finalize()
integrator = builder.AddSystem(MultiStepIntegrator())
state_logger = LogVectorOutput(integrator.get_output_port(), builder, 1e-2)
diagram = builder.Build()
simulator = Simulator(diagram)
# (presumably omitted from the excerpt: grab the context and run the simulation)
context = simulator.get_mutable_context()
simulator.AdvanceTo(6.0)

log_state = state_logger.FindLog(context)
fig = plt.figure()
t = log_state.sample_times()
plt.plot(t, log_state.data()[0, :])
fig.set_size_inches(10, 6)
plt.tight_layout()
It seems that the resets never happen. However I do see the two logs indicating that the resets are done:
Have done the first reset
Have done the second reset
What happened here? Is there some checking done behind the scenes that prevents the ContinuousState from jumping (as the name indicates)? How can I reset the state value when some conditions are met?
Thank you very much for your help!
In DoCalcTimeDerivatives, the context is a const (input-only) argument. It cannot be modified. The only thing DoCalcTimeDerivatives can do is output the derivative to enable the integrator to integrate the continuous state.
Not all integrators use fixed-size time steps. Some might need to evaluate the gradients multiple times before deciding what step size(s) to use. Therefore, it's not reasonable for a dx/dt calculation to have any side effects. It must be a pure function, whose only consequence is to report a dx/dt.
To change a continuous state value other than through pure integration, the System needs to use an "unrestricted update" event. That event can mutate any and all elements of the State (including continuous state).
If the timing of the discontinuities is periodic (even if some events make no change to the state), you can use DeclarePeriodicUnrestrictedUpdateEvent to declare the update calculation.
If the discontinuities happen per a witness function, see bouncing_ball or rimless_wheel or compass_gait for an example.
If you need a generalized (bespoke) triggering schedule for the discontinuity events, you'll need to override DoCalcNextUpdateTime to manually inject the next event timing, something like the LcmSubscriberSystem. We don't have many good examples of this to my knowledge.
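For the periodic case, a minimal sketch of the integrator from the question rewritten around an unrestricted update event might look like this (assuming a reasonably recent pydrake; exact signatures can differ between versions):
from pydrake.systems.framework import LeafSystem

class ResettingIntegrator(LeafSystem):
    def __init__(self):
        LeafSystem.__init__(self)
        self.state_index = self.DeclareContinuousState(1)
        self.DeclareStateOutputPort("x", self.state_index)
        # Every 2 seconds (starting at t=2), jump the continuous state back to 0.
        self.DeclarePeriodicUnrestrictedUpdateEvent(
            period_sec=2.0, offset_sec=2.0, update=self._do_reset)

    def _do_reset(self, context, state):
        # Unlike DoCalcTimeDerivatives, this handler receives a mutable State.
        state.get_mutable_continuous_state().get_mutable_vector().SetFromVector([0.0])

    def DoCalcTimeDerivatives(self, context, derivatives):
        # Pure function of the context: dx/dt = +1 before t=4, -1 afterwards.
        V = [1.0] if context.get_time() < 4.0 else [-1.0]
        derivatives.get_mutable_vector().SetFromVector(V)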

Fast.Ai EarlyStoppingCallback does not work

callbacks = [EarlyStoppingCallback(learn, monitor='error_rate', min_delta=1e-5, patience=5)]
learn.fit_one_cycle(30, callbacks=callbacks, max_lr=slice(1e-5,1e-3))
As you can see, I use patience = 5 and min_delta=1e-5 and monitor='error_rate'
My understanding is: patience tells how many epochs it waits if improvement is less than min_delta on the monitored value, in this case it's error_rate.
So if my understanding was correct, then it would not stop at Epoch 6.
So is my understanding wrong, or is this a bug in the fastai lib?
It keeps track of the best error rate and compares the min_delta to the difference between this epoch and that value:
class EarlyStoppingCallback(TrackerCallback):
    ...
    if self.operator(current - self.min_delta, self.best):
        self.best, self.wait = current, 0
    else:
        self.wait += 1
        if self.wait > self.patience:
            print(f'Epoch {epoch}: early stopping')
            return {"stop_training": True}
    ...
So with np.greater as the operator, self.wait is only reset when the error rate increases past self.best by at least min_delta; an epoch where the error rate decreases increments self.wait instead. Once that has happened more than patience times, training stops. For example, with an error rate that just improved from 0.000729 to 0.000638:
np.greater(0.000638 - 1e-5, 0.000729)
False
There does seem to be an issue though, because clearly if the error rate jumped very high we would not want to assign that to self.best. And I believe the point of this callback is to stop training if the error rate starts to increase - whereas right now it is doing the opposite.
So in TrackerCallback there might need to be a change in:
mode_dict['auto'] = np.less if 'loss' in self.monitor else np.greater
to
mode_dict['auto'] = np.less if 'loss' in self.monitor or 'error' in self.monitor else np.greater
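To see the inversion concretely, here is a small standalone illustration (plain NumPy, not fastai code) of how the choice of operator decides whether an epoch counts as an improvement, using the same numbers as above:
import numpy as np

def improved(current, best, min_delta, operator):
    # Mirrors the tracker's test: does `current` beat `best` by at least min_delta?
    return operator(current - min_delta, best)

best = 0.000729      # previous best error_rate
current = 0.000638   # this epoch's error_rate: lower, i.e. genuinely better

print(improved(current, best, 1e-5, np.greater))  # False -> wait is incremented
print(improved(current, best, 1e-5, np.less))     # True  -> what 'error_rate' should use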

What is the correct way to set StopLoss and TakeProfit in OrderSend() in MetaTrader4 EA?

I'm trying to figure out if there is a correct way to set the Stop Loss (SL) and Take Profit (TP) levels, when sending an order in an Expert Advisor, in MQL4 (Metatrader4). The functional template is:
OrderSend( symbol, cmd, volume, price, slippage, stoploss, takeprofit, comment, magic, expiration, arrow_color);
So naturally I've tried to do the following:
double dSL = Point*MM_SL;
double dTP = Point*MM_TP;
if (buy) { cmd = OP_BUY; price = Ask; SL = ND(Bid - dSL); TP = ND(Ask + dTP); }
if (sell) { cmd = OP_SELL; price = Bid; SL = ND(Ask + dSL); TP = ND(Bid - dTP); }
ticket = OrderSend(SYM, cmd, LOTS, price, SLIP, SL, TP, comment, magic, 0, Blue);
However, there are as many variations as there are scripts and EA's. So far I have come across these.
In the MQL4 Reference in the MetaEditor, the documentation says to use:
OrderSend(Symbol(),OP_BUY,Lots,Ask,3,
NormalizeDouble(Bid - StopLoss*Point,Digits),
NormalizeDouble(Ask + TakeProfit*Point,Digits),
"My order #2",3,D'2005.10.10 12:30',Red);
While in the "same" documentation online, they use:
double stoploss = NormalizeDouble(Bid - minstoplevel*Point,Digits);
double takeprofit = NormalizeDouble(Bid + minstoplevel*Point,Digits);
int ticket=OrderSend(Symbol(),OP_BUY,1,price,3,stoploss,takeprofit,"My order",16384,0,clrGreen);
And so it goes on with various flavors, here, here and here...
Assuming we are interested in an OP_BUY and have the signs correct, we have these options for basing our SL and TP values on:
bid, bid
bid, ask
ask, ask
ask, bid
So what is the correct way to set the SL and TP for a BUY?
(What are the advantages or disadvantages of using the various variations?)
EDIT: 2018-06-12
Apart from a few details, the answer is actually quite simple, although not obvious. Perhaps because MT4 only shows Bid prices on the chart (by default) and not both Ask and Bid.
Because Ask > Bid and Ask - Bid = spread, it doesn't matter which one we base the levels on, as long as we account for the spread. Then, depending on which price you are following on the chart, you may wish to use one over the other, adding or subtracting the spread accordingly.
So when you use the measure tool to get the pip difference between the currently shown prices and your "exact" SL/TP settings, you need to keep this in mind.
So to avoid having to account for the spread in my code above, I used the following for OP_BUY: TP = ND(Bid + dTP); (and the opposite for OP_SELL).
If you buy, you OP_BUY at Ask and close (SL, TP) at Bid.
If you sell, the OP_SELL operation is made at the Bid price, and closes at Ask.
Both SL and TP should be placed at least STOP_LEVEL * Point() away from the current closing price ( Bid for a buy, Ask for a sell ).
It is possible that STOP_LEVEL is zero - in such cases ( while MT4 accepts the order ) the Broker may reject it, based on its own algorithms ( Terms and Conditions may call it a "floating Stoplevel" rule or some similar Marketing-wise "re-dressed" term ).
It is advised to send an OrderSend() request with zero values of SL and TP and modify them after you see that the order was sent successfully. Sometimes that is not required, sometimes it is even mandatory.
There is no difference between the two links you gave us: you may compute SL and TP and then pass them into the function or compute them based on OrderOpenPrice() +/- distance * Point().
So what is the correct way to set the SL and TP for a BUY ?
There is no such thing as "The Correct Way", there are rules to meet
Level 0: Syntax is to meet the call-signature ( the easiest one )
Level 1: all at Market XTO-s have to meet the right level of the current Price +/- slippage, make sure to repeat a RefreshRates()-test as close to the PriceDOMAIN-levels settings, otherwise they get rejected from the Broker side ( blocking one's trading engine at a non-deterministic add-on RTT-latency ) + GetLastError() == 129 | ERR_INVALID_PRICE
Level 2: yet more rules are set from the Broker side, in their respective Service / Product definition in [ Trading Terms and Conditions ]. If one's OrderSend()-request fails to meet any one of these, again, the XTO will get rejected, with the same adverse blocking effects as noted in Level 1.
Some Brokers do not allow some XTO situations due to their T&C, so re-read such conditions with due care. Any one of their rules, if violated, will lead to your XTO-instruction being legally rejected, with all the adverse effects noted above. Check all the rules, as you will not like to see any of the following error-states, or any others restricted by your Broker's T&C :
ERR_LONG_POSITIONS_ONLY_ALLOWED Buy orders only allowed
ERR_TRADE_TOO_MANY_ORDERS The amount of open and pending orders has reached the limit set by the broker
ERR_TRADE_HEDGE_PROHIBITED An attempt to open an order opposite to the existing one when hedging is disabled
ERR_TRADE_PROHIBITED_BY_FIFO An attempt to close an order contravening the FIFO rule
ERR_INVALID_STOPS Invalid stops
ERR_INVALID_TRADE_VOLUME Invalid trade volume
...
#ASSUME NOTHING ; Is the best & safest design-side (self)-directive

Project Euler #3 Ruby Solution - What is wrong with my code?

This is my code:
def is_prime(i)
  j = 2
  while j < i do
    if i % j == 0
      return false
    end
    j += 1
  end
  true
end

i = (600851475143 / 2)
while i >= 0 do
  if (600851475143 % i == 0) && (is_prime(i) == true)
    largest_prime = i
    break
  end
  i -= 1
end
puts largest_prime
Why is it not returning anything? Is the calculation just too large, going through all those numbers? Is there a simple way of doing it without using the Ruby prime library (which defeats the purpose)?
All the solutions I found online were too advanced for me, does anyone have a solution that a beginner would be able to understand?
"premature optimization is (the root of all) evil". :)
Here you go right away for the (1) biggest, (2) prime factor. How about first finding all the factors, prime or not, and then taking the last (biggest) of them that is prime? Once we solve that, we can start optimizing it.
A factor a of a number n is such that there exists some b (we assume a <= b to avoid duplication) with a * b = n. But that means that for a <= b we also have a*a <= a*b == n.
So, for each b = n/2, n/2-1, ... the potential corresponding factor is known automatically as a = n / b; there's no need to test a for divisibility at all ... and perhaps you can figure out which of the as don't have to be tested for primality either.
Lastly, if p is the smallest prime factor of n, then the prime factors of n are p and all the prime factors of n / p. Right?
Now you can complete the task.
update: you can find more discussion and a pseudocode of sorts here. Also, search for "600851475143" here on Stack Overflow.
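If it helps to see the whole idea in one place, here is a compact sketch of that factor-out approach (written in Python purely for illustration; the loop translates almost line-for-line to Ruby):
def largest_prime_factor(n):
    p, largest = 2, 1
    while p * p <= n:               # only need trial divisors up to sqrt(n)
        while n % p == 0:           # divide out p completely; every p found this way is prime
            largest = p
            n //= p
        p += 1
    return n if n > 1 else largest  # whatever remains (> 1) is itself the largest prime factor

print(largest_prime_factor(600851475143))   # => 6857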
I'll address not so much the answer, but how YOU can pursue the answer.
The most elegant troubleshooting approach is to use a debugger to get insight as to what is actually happening: How do I debug Ruby scripts?
That said, I rarely use a debugger -- I just stick in puts here and there to see what's going on.
Start with adding puts "testing #{i}" as the first line inside the loop. While the screen I/O will be a million times slower than a silent calculation, it will at least give you confidence that it's doing what you think it's doing, and perhaps some insight into how long the whole problem will take. Or it may reveal an error, such as the counter not changing, incrementing in the wrong direction, overshooting the break conditional, etc. Basic sanity check stuff.
If that doesn't set off a lightbulb, go deeper and puts inside the if statement. No revelations yet? Next puts inside is_prime(), then inside is_prime()'s loop. You get the idea.
Also, there's no reason in the world to start with 600851475143 during development! 17, 51, 100 and 1024 will work just as well. (And don't forget edge cases like 0, 1, 2, -1 and such, just for fun.) These will all complete before your finger is off the enter key -- or demonstrate that your algorithm truly never returns and send you back to the drawing board.
Use these two approaches and I'm sure you'll find your answers in a minute or two. Good luck!
Do you know you can solve this with one line of code in Ruby?
Prime.prime_division(600851475143).flatten.max
=> 6857
