Fast.Ai EarlyStoppingCallback does not work

callbacks = [EarlyStoppingCallback(learn, monitor='error_rate', min_delta=1e-5, patience=5)]
learn.fit_one_cycle(30, callbacks=callbacks, max_lr=slice(1e-5,1e-3))
As you can see, I use patience=5, min_delta=1e-5, and monitor='error_rate'.
My understanding is: patience tells how many epochs to wait when the improvement in the monitored value (here, error_rate) is less than min_delta.
So if my understanding were correct, it would not stop at epoch 6.
So is my understanding wrong, or is this a bug in the fastai library?

It keeps track of the best error rate and compares the min_delta to the difference between this epoch and that value:
class EarlyStoppingCallback(TrackerCallback):
    ...
    if self.operator(current - self.min_delta, self.best):
        self.best,self.wait = current,0
    else:
        self.wait += 1
        if self.wait > self.patience:
            print(f'Epoch {epoch}: early stopping')
            return {"stop_training":True}
    ...
With monitor='error_rate' the operator resolves to np.greater, so self.best is only updated (and self.wait reset) when the error rate increases by more than min_delta over self.best; otherwise self.wait goes up, and once it exceeds patience (5) training stops. That is exactly what happened here:
np.greater(0.000638 - 1e-5, 0.000729)
False
There does seem to be an issue though, because clearly if the error rate jumped very high we would not want to assign that to self.best. And I believe the point of this callback is to stop training if the error rate starts to increase - but right now it is doing the opposite.
So in TrackerCallback there might need to be a change in:
mode_dict['auto'] = np.less if 'loss' in self.monitor else np.greater
to
mode_dict['auto'] = np.less if 'loss' in self.monitor or 'error' in self.monitor else np.greater
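Until such a change lands, a possible workaround (assuming the fastai version you are on exposes the mode argument that TrackerCallback subclasses accept) is to bypass the 'auto' detection and force the minimizing operator yourself:
# Workaround sketch: force mode='min' so the callback uses np.less for error_rate
# (assumes your fastai version accepts the `mode` argument here)
callbacks = [EarlyStoppingCallback(learn, monitor='error_rate', mode='min',
                                   min_delta=1e-5, patience=5)]
learn.fit_one_cycle(30, callbacks=callbacks, max_lr=slice(1e-5, 1e-3))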

Related

Capture timeout exception with Mosek + Cvxpy

We are solving our large-scale MI optimization problems with Cvxpy and Mosek.
It often happens that Mosek's runtime exceeds our stipulated timeout of two hours.
Is there a way to systematically capture those timeout exceptions?
Minimum reproducible example:
import cvxpy as cp
import numpy as np
import mosek

m = 15
n = 10
np.random.seed(1)
s0 = np.random.randn(m)
lamb0 = np.maximum(-s0, 0)
s0 = np.maximum(s0, 0)
x0 = np.random.randn(n)
A = np.random.randn(m, n)
b = A @ x0 + s0
c = -A.T @ lamb0

# Define and solve the CVXPY problem.
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(c.T @ x),
                  [A @ x <= b])
# try:
prob.solve(cp.MOSEK, mosek_params={mosek.dparam.optimizer_max_time: 0.01})  # set verbose=True (to see actual error in solver logs)
# except Timeout exception
#     print('Timeout occured')
print(prob.value)

def execute_other_important_stuff():
    print("Hello world")

execute_other_important_stuff()  # Not executed currently
When MOSEK terminates because of a timeout there will never be any exception - you set a timeout, so terminating at that point is a normal, not an abnormal, situation. I am not sure what "Error" you are referring to.
If you mean something like
Cannot unpack invalid solution: Solution(status=UNKNOWN
then it is not related to the timeout itself, but to the fact that there is no solution available yet (so the only available "solution" has UNKNOWN status), and CVXPY handles this by throwing an exception. So the question is: has any solution at all to your problem been found after those 2 hours? If yes, it should be returned without problem. If not, you might see the above, and I would guess the only way is to catch the ValueError CVXPY happens to throw.
If you were using the native Mosek interface then you could find out why it terminated from the various response codes, but CVXPY does not propagate them.
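A minimal sketch of that approach, reusing prob from the question above and assuming the ValueError described here is indeed what your CVXPY version raises when only an UNKNOWN-status solution is available:
try:
    prob.solve(cp.MOSEK, mosek_params={mosek.dparam.optimizer_max_time: 0.01})
    print(prob.value)
except ValueError as err:
    # No usable solution was returned within the time limit
    print("Solver returned no solution:", err)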
I am not sure if this is a Cvxpy issue.
However, a general comment is that Mosek cannot check the time limit continuously so it will most likely go over time.
For instance, for an SDP it has to compute the eigenvalues of a potentially large matrix, and the time limit cannot be checked before that is completed.
I have no other insights (other than what ErlingMOSEK suggested) on why it happened, but you can use the signal library to force the .solve method to stop.
This is an independent way to make sure you can stop the function after whatever time you wish.
See example in Timeout a function call
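For reference, a rough sketch of that signal-based approach (Unix-only; the two-hour limit and the exception name are illustrative, and whether the alarm can interrupt the solver while it is deep inside native code depends on your platform):
import signal

class SolveTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise SolveTimeout()

signal.signal(signal.SIGALRM, _on_alarm)
signal.alarm(2 * 60 * 60)           # schedule an alarm after two hours
try:
    prob.solve(cp.MOSEK)
except SolveTimeout:
    print('Timeout occurred')
finally:
    signal.alarm(0)                 # cancel the alarm if solve returned in time

execute_other_important_stuff()     # now reached either way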

How does one determine when the CartPole environment has been solved?

I was going through this tutorial and saw the following piece of code:
# Calculate score to determine when the environment has been solved
scores.append(time)
mean_score = np.mean(scores[-100:])
if episode % 50 == 0:
    print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
        episode, mean_score))
if mean_score > env.spec.reward_threshold:
    print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
          .format(episode, mean_score, time))
    break
however, it didn't really make sense to me. How does one define when an "RL environment has been solved"? I am not sure what that even means. I guess in classification it would make sense to define it as when the loss is zero. In regression, maybe when the total l2 loss is less than some value? Perhaps it would have made sense to define it as when the expected return (discounted rewards) is greater than some value.
But here it seems they are counting the # of time steps? This doesn't make any sense to me.
Note the original tutorial had this:
def main(episodes):
    running_reward = 10
    for episode in range(episodes):
        state = env.reset()  # Reset environment and record the starting state
        done = False
        for time in range(1000):
            action = select_action(state)
            # Step through environment using chosen action
            state, reward, done, _ = env.step(action.data[0])
            # Save reward
            policy.reward_episode.append(reward)
            if done:
                break
        # Used to determine when the environment is solved.
        running_reward = (running_reward * 0.99) + (time * 0.01)
        update_policy()
        if episode % 50 == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
            break
not sure if this makes much more sense...
is this only a particular quirk of this environment/task? How does the task end in general?
The time in the case of CartPole equals the reward of the episode: every time step the pole stays balanced earns a reward of 1, so the longer you balance the pole the higher the score, stopping at some maximum time value.
So the task would be considered solved if the running average of the last episodes is near enough to that maximum time.
is this only a particular quirk of this environment/task?
Yes. Episode termination depends totally on the respective environment.
The CartPole challenge is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
Performance of your solution is measured by how quickly your algorithm was able to solve the problem.
For more information on Cartpole env refer to this wiki.
For information on any GYM environment refer to this wiki.
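As an illustration only (not the tutorial's code), checking that criterion yourself could look roughly like this; run_one_episode is a hypothetical helper returning the episode's total reward, which for CartPole equals the number of time steps survived:
import numpy as np
from collections import deque

SOLVED_THRESHOLD = 195.0              # CartPole-v0 reward threshold
recent_returns = deque(maxlen=100)    # keep only the last 100 episode returns

for episode in range(1, 1001):
    recent_returns.append(run_one_episode())   # hypothetical helper
    if len(recent_returns) == 100 and np.mean(recent_returns) >= SOLVED_THRESHOLD:
        print("Solved after {} episodes".format(episode))
        break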

Project Euler #3 Ruby Solution - What is wrong with my code?

This is my code:
def is_prime(i)
  j = 2
  while j < i do
    if i % j == 0
      return false
    end
    j += 1
  end
  true
end

i = (600851475143 / 2)
while i >= 0 do
  if (600851475143 % i == 0) && (is_prime(i) == true)
    largest_prime = i
    break
  end
  i -= 1
end
puts largest_prime
Why is it not returning anything? Is it too large a calculation to go through all the numbers? Is there a simple way of doing it without utilizing the Ruby Prime library (which would defeat the purpose)?
All the solutions I found online were too advanced for me; does anyone have a solution that a beginner would be able to understand?
"premature optimization is (the root of all) evil". :)
Here you go right away for the factor that is (1) the biggest and (2) prime. How about first finding all the factors, prime or not, and then taking the last (biggest) of them that is prime? When we solve that, we can start optimizing it.
A factor a of a number n is such that there exists some b (we assume a <= b to avoid duplication) with a * b = n. But that means that for a <= b we also have a*a <= a*b == n.
So, for each b = n/2, n/2-1, ... the potential corresponding factor is known automatically as a = n / b; there's no need to test a for divisibility at all ... and perhaps you can figure out which of the as don't have to be tested for primality as well.
Lastly, if p is the smallest prime factor of n, then the prime factors of n are p and all the prime factors of n / p. Right?
Now you can complete the task.
update: you can find more discussion and a pseudocode of sorts here. Also, search for "600851475143" here on Stack Overflow.
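To make that last idea concrete, here is a small sketch of the trial-division approach (written in Python purely to illustrate the algorithm, not as a drop-in Ruby answer):
def largest_prime_factor(n):
    # Divide out the smallest factor each time it divides n; whatever is left
    # once factor*factor exceeds n is the largest prime factor.
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            n //= factor
        else:
            factor += 1
    return n

print(largest_prime_factor(600851475143))   # 6857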
I'll address not so much the answer, but how YOU can pursue the answer.
The most elegant troubleshooting approach is to use a debugger to get insight as to what is actually happening: How do I debug Ruby scripts?
That said, I rarely use a debugger -- I just stick in puts here and there to see what's going on.
Start with adding puts "testing #{i}" as the first line inside the loop. While the screen I/O will be a million times slower than a silent calculation, it will at least give you confidence that it's doing what you think it's doing, and perhaps some insight into how long the whole problem will take. Or it may reveal an error, such as the counter not changing, incrementing in the wrong direction, overshooting the break conditional, etc. Basic sanity check stuff.
If that doesn't set off a lightbulb, go deeper and puts inside the if statement. No revelations yet? Next puts inside is_prime(), then inside is_prime()'s loop. You get the idea.
Also, there's no reason in the world to start with 600851475143 during development! 17, 51, 100 and 1024 will work just as well. (And don't forget edge cases like 0, 1, 2, -1 and such, just for fun.) These will all complete before your finger is off the enter key -- or demonstrate that your algorithm truly never returns and send you back to the drawing board.
Use these two approaches and I'm sure you'll find your answers in a minute or two. Good luck!
Do you know you can solve this with one line of code in Ruby?
require 'prime'

Prime.prime_division(600851475143).flatten.max
=> 6857

Gradual slowdown of h5write function in Julia HDF5 package

EDIT: Based on additional experimentation, I'm fairly confident the slow-down occurs in response to many calls to the file open and close routines (h5open and close). I'm a bit short on time right now, but will come back and add more code/detail in the next few days.
Using the HDF5 package for Julia, I've noticed that the h5write function starts to slow down if one performs many iterations over calls to h5write and h5read. Interestingly, it appears that for the behaviour to be really obvious, one should be reading and writing to a large number of locations in a small number of files. A demonstration of the behaviour I'm talking about can be obtained by starting a Julia session and running the following subroutine (note, you will need the HDF5 package):
#Set parameters
numFile = 10;
numLocation = 10000;
writeDir = "/home/colin/Temp/";
FloatOut = 5.5;

#Import functions
using HDF5

#Loop over read/writes.
c1 = 1;
timeMat = Array(Float64, numFile * 2, 2);
for i in 1:numFile
    filePath = string(writeDir, "f", string(i), ".h5");
    for j in 1:numLocation
        location = string("G1/L", string(j));
        if j == 1 || j == numLocation; tic(); end;
        h5write(filePath, location, FloatOut);
        if j == 1 || j == numLocation; timeMat[c1, 1] = toc(); end;
        if j == 1 || j == numLocation; tic(); end;
        FloatIn = h5read(filePath, location);
        if j == 1 || j == numLocation; timeMat[c1, 2] = toc(); end;
        if j == 1 || j == numLocation; c1 = c1+1; end;
    end
    rm(filePath);
end
This code writes the floating point number 5.5 (chosen for no particular reason) to 10,000 locations in each of 10 files using h5write. Immediately after performing each write operation, the number is read back in using h5read. For each file, I store in timeMat the time taken to perform the write and read operations at the first and the last location (note: initially I stored the timing for every call, but that level of detail is unnecessary to demonstrate the anomaly for the purposes of this question). The times are printed below:
h5write h5read
0.0007 0.0004
0.0020 0.0004
0.0020 0.0004
0.0031 0.0004
0.0034 0.0004
0.0049 0.0004
0.0050 0.0004
0.0064 0.0005
0.0068 0.0004
0.0082 0.0004
0.0084 0.0005
0.0106 0.0005
0.0114 0.0005
0.0114 0.0005
0.0120 0.0005
0.0131 0.0005
0.0135 0.0005
0.0146 0.0005
0.0151 0.0005
0.0163 0.0005
The timings for h5read are fairly consistent across the subroutine. However, the timings for h5write gradually get slower. By the end, a write is taking an order of magnitude longer than at the start. I understand that for each file, as we increase the number of locations, the time for a write (and read) might get slightly slower. But in this case, the slower performance persists even after we begin a new file. Perhaps strangest of all, we can run the subroutine a second time and the time taken for a write will pick up where it left off on the previous run. The only way to get the time taken for a write back to the fastest speed is to completely restart Julia.
Final disclaimer: I am brand new to both Julia and hdf5, so I may have done something stupid or be overlooking something obvious.
The slowdown is indeed curious; profiling shows that almost all the time is spent in the close function, which is basically just a ccall. This suggests it may be a problem with the HDF5 C-library itself.
I think you'll be rather happier with the performance if you don't open and close the file each time you write a variable; instead, access the file through a file object. Here's an example:
filePath = string(writeDir, "f", string(i), ".h5")
h5open(filePath, "w") do file
    global c1
    for j in 1:numLocation
        ...
        write(file, location, FloatOut)
        ...
        FloatIn = read(file, location)
        ...
    end
end
This way you're leaving the file open throughout the test. On my machine this is something like 100x faster.
If you want to pursue this further, please submit an issue.

getTickCount time unit confusion

In the answer to the question on Stack Overflow and in the book here on page 52, I found that the usual getTickCount/getTickFrequency combination to measure execution time gives the time in milliseconds. However, the OpenCV website says it is the time in seconds. I am confused. Please help...
There is no room for confusion, all the references you have given point to the same thing.
getTickCount gives you the number of clock cycles after a certain event, e.g., since the machine was switched on.
A = getTickCount() // A = no. of clock cycles from beginning, say 100
process(image) // do whatever process you want
B = getTickCount() // B = no. of clock cycles from beginning, say 150
C = B - A // C = no. of clock cycles for processing, 150-100 = 50,
// it is obvious, right?
Now you want to know how many seconds these clock cycles correspond to. For that, you need to know how long a single clock cycle takes, i.e. the clock_time_period. If you find that, simply multiply it by 50 to get the total time taken.
For that, OpenCV gives a second function, getTickFrequency(). It gives you the frequency, i.e. how many clock cycles occur per second. Take its reciprocal to get the time period of the clock.
time_period = 1/frequency.
Now you have the time_period of one clock cycle; multiply it by 50 to get the total time taken in seconds.
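A small illustration of that arithmetic, shown here with OpenCV's Python bindings (which expose the same two functions); process(image) is just a placeholder for whatever work you are timing:
import cv2

start = cv2.getTickCount()                 # clock cycles so far
process(image)                             # placeholder: the work being timed
end = cv2.getTickCount()

elapsed_seconds = (end - start) / cv2.getTickFrequency()   # cycles / (cycles per second)
print(elapsed_seconds)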
Now read all those references you have given once again, you will get it.
dwStartTimer = GetTickCount();
dwEndTimer = GetTickCount();
while ((dwEndTimer - dwStartTimer) < wDelay)  // delay is 5000 milliseconds
{
    Sleep(200);
    dwEndTimer = GetTickCount();
    if (PeekMessage(&uMsg, NULL, 0, 0, PM_REMOVE) > 0)
    {
        TranslateMessage(&uMsg);
        DispatchMessage(&uMsg);
    }
}
