Unexpected results of multiprocessing environments regarding time and rewards - machine-learning

I set up a custom gym environment, trained a Stable Baselines3 (SB3) PPO agent on a GPU, and it worked quite well.
Now I wanted to speed up the training with multiprocessing, following the example from the SB3 library. To find the ideal number of processes to use, I trained multiple models with varying numbers of workers. Because the result was surprising and I wanted to rule out the effect of "bad luck" for an agent at a given number of workers, I wrapped everything in a loop, ran the code 10 times, and averaged over the runs.
This is my code:
import time
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import SubprocVecEnv

max_num_processes = 16
n_timesteps = 100_000
iterations = 10

processes = range(max_num_processes)
processing_time = np.zeros(max_num_processes)
rewards = np.zeros(max_num_processes)

for l in range(iterations):
    for num_p in processes:
        env = SubprocVecEnv([make_env(env_parameter, rank=i) for i in range(num_p + 1)])
        model = PPO(policy, env, verbose=verbose, tensorboard_log=log_dir)
        # Multiprocessed RL training
        start_time = time.time()
        model.learn(n_timesteps)
        total_time_multi = time.time() - start_time
        mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=eval_steps)
        processing_time[num_p] += total_time_multi
        rewards[num_p] += mean_reward

processing_time = processing_time / iterations
rewards = rewards / iterations
I expected the runtime graph to drop from 1 process until a sweet spot (say 4 processes) is reached where the runtime is lowest, and then to rise again. But the results seem to be random. These are the plots:
I ran the code multiple times and the result is always the same: there is no sweet spot. But why? Can you only expect one when training on a CPU? When I run the example from the SB3 library in Colab on a GPU, there is a rapid decrease in duration, so why not with my code?

Related

Detectron2 - Same Code&Data // Different platforms // highly divergent results

I use different hardware to benchmark multiple possibilities. The code runs in a Jupyter notebook.
When I evaluate the different losses I get highly divergent results.
I also checked the full .cfg with cfg.dump() - it is completely consistent.
Detectron2 Parameters:
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_101_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("dataset_train",)
cfg.DATASETS.TEST = ("dataset_test",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_101_FPN_3x.yaml") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025 # 0.00125 pick a good LR
cfg.SOLVER.MAX_ITER = 1200 # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512 # faster, and good enough for this toy dataset (default: 512)
#cfg.MODEL.ROI_HEADS.NUM_CLASSES = 25 # only has one class (ballon). (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
cfg.MODEL.RETINANET.NUM_CLASSES = 3
# NOTE: this config means the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.
cfg.OUTPUT_DIR = "/content/drive/MyDrive/Colab_Notebooks/testrun/output"
cfg.TEST.EVAL_PERIOD = 25
cfg.SEED=5
1. Environment: Azure
Microsoft Azure - Machine Learning
STANDARD_NC6
Torch: 1.9.0+cu111
Results:
Training Log: Log Azure
2. Environment: Colab
GoogleColab free
Torch: 1.9.0+cu111
Results:
Training Log: Log Colab
EDIT:
3. Environment: Ubuntu
Ubuntu 22.04
RTX 3080
Torch: 1.9.0+cu111
Results:
Training Log: https://pastebin.com/PwXMz4hY
New dataset
The issue is not reproducible with a larger dataset.

Pytorch reshape to modify batch size in forward()

I have a tensor of shape, say: [4,10] where 4 is the batch size and 10 is the length of my input samples buffer. Now, I know that it is really [4,5+5] i.e. the input samples buffer consists of two windows of length 5 which can be processed independently and, best, in parallel. What I am doing is, inside forward() of my model I first reshape the tensor to [8,5], run my layers on it, and then reshape it back to [4,-1] and return. What I am hoping to get from this is Pytorch would run my model on each of the windows (kind of sub-batches) in parallel, effectively yielding a parallel-for loop. It runs OK, Pytorch does not complain or anything but I am getting weird results. I'd like to know if Pytorch can work this way before I dive into debugging my model.
Well, it doesn't. The reason is the ordering PyTorch uses for tensor reshaping. This can be seen by running the small repro code below.
It would be nice to have something like a 'rebatch' function in PyTorch that would take care of the proper memory layout as a foundation for parallel-for constructs (provided it can even be done memory-efficiently in the generic case).
import torch

conv = torch.nn.Conv1d(1, 3, 1)

def conv_batch(t, conv, window):
    batch = t.shape[0]
    t = t.view(-1, t.shape[1], window)   # fold the windows into the batch dimension
    t = conv(t)
    t = t.view(batch, t.shape[1], -1)    # unfold back to the original batch size
    return t

batch = 1
channels = 1
width = 4
window = 2

x = torch.arange(batch * channels * width)
x = x.view(batch, channels, width).float()

r1 = conv(x)
r2 = conv_batch(x, conv, window)
print(r1)
print(r2)
print(r1 == r2)
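For reference, here is a minimal sketch of what such a 'rebatch' could look like (my own illustration, not an existing PyTorch function): split the width axis into (num_windows, window), move the window axis next to the batch axis before folding it in, and reverse the process after the conv, so that channels stay intact.

import torch

def rebatch_conv(t, conv, window):
    # Fold windows into the batch axis while keeping channels separate.
    batch, channels, width = t.shape
    num_windows = width // window
    t = t.view(batch, channels, num_windows, window)               # [B, C, W, w]
    t = t.permute(0, 2, 1, 3).reshape(-1, channels, window)        # [B*W, C, w]
    t = conv(t)                                                    # [B*W, C_out, w]
    c_out = t.shape[1]
    t = t.view(batch, num_windows, c_out, -1).permute(0, 2, 1, 3)  # [B, C_out, W, w]
    return t.reshape(batch, c_out, -1)                             # [B, C_out, W*w]

conv = torch.nn.Conv1d(1, 3, 1)
x = torch.arange(4).view(1, 1, 4).float()
print(torch.allclose(conv(x), rebatch_conv(x, conv, window=2)))    # True

With a kernel size of 1 the result matches the plain conv exactly; with larger kernels each window still loses the context beyond its boundaries, so values near the window edges will differ.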

Torch linear model forward pass 4 times slower on GPU than CPU

I am working on one of the AWS GPU instances using Torch 7.
The following code benchmarks a simple forward pass of a linear model. The GPU execution seems to be about 4 times slower. What am I doing wrong?
require 'torch';
require 'nn';

cmd = torch.CmdLine()
cmd:option("-gpu", 0) -- gpu/cpu
cmd:option("-n_in", 100)
cmd:option("-n_out", 100)
cmd:option("-n_iter", 1000)
params = cmd:parse(arg)

A = torch.Tensor():randn(params.n_in);
model = nn.Sequential():add(nn.Linear(params.n_in, params.n_out))

if params.gpu > 0 then
    require 'cutorch';
    require 'cudnn';
    A = A:cuda()
    model = model:cuda()
end

timer = torch.Timer()
for i = 1, params.n_iter do
    A2 = model:forward(A)
end
print("Average time:" .. timer:time().real/params.n_iter)
You need a sufficiently large network to fully utilize the GPU. For a small network (< 500 x 500), overhead such as GPU kernel launches and data transfers over PCI-E takes up a large portion of the run time. In that case you may want to use the CPU instead.
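Below is a rough PyTorch sketch (an assumption on my part, not the asker's Torch 7 setup; the sizes and iteration count are arbitrary) that illustrates this: for a tiny layer the per-call overhead dominates and the CPU is competitive or faster, while for a large layer the GPU pulls ahead. Note the torch.cuda.synchronize() calls, without which GPU timings are misleading because kernel launches are asynchronous.

import time
import torch

def bench(n, device, n_iter=1000):
    layer = torch.nn.Linear(n, n).to(device)
    x = torch.randn(1, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(n_iter):
            layer(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.time() - start) / n_iter

for n in (100, 4096):
    cpu_t = bench(n, "cpu")
    gpu_t = bench(n, "cuda") if torch.cuda.is_available() else float("nan")
    print(f"n={n}: CPU {cpu_t * 1e6:.1f} us/iter, GPU {gpu_t * 1e6:.1f} us/iter")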

Cross validation is very slow in Grid search (libsvm)

I am using libsvm on 62 classes with 2000 samples each. The problem is that I want to optimize my parameters using grid search. I set the ranges to C = [0.0313, 0.125, 0.5, 2, 8] and gamma = [0.0313, 0.125, 0.5, 2, 8] with 5 folds. The cross-validation does not finish for the first two parameter values of each. Is there a faster way to do the optimization? Can I reduce the number of folds to 3, for instance? The number of iterations printed keeps staying in the (1629, 1630, 1627) range; I don't know if that is related:
optimization finished,
#iter = 1629 nu = 0.997175 obj = -81.734944, rho = -0.113838 nSV = 3250, nBSV = 3247
This is simply an expensive task: finding a good model costs a lot of computation. Let's do some calculations:
62 classes x 5 folds x 5 values of C x 5 values of gamma = 7750 SVMs (counting one binary SVM per class per model)
You can always reduce the number of folds, which will lower the quality of the search, but it will cut the total number of trained SVMs by about 40%.
The most expensive part is the fact that an SVM is not well suited to multiclass classification. It needs to train at least O(log n) models (in the error-correcting-code scenario), O(n) (one-vs-rest), or even O(n^2) (one-vs-one, which libsvm uses and which achieves the best results).
Maybe it would be more valuable to switch to some fast multiclass model, for example an ELM (Extreme Learning Machine)?
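As a hedged sketch of two of these cost-cutting ideas (fewer folds and a coarse search on a stratified subsample), here is what it could look like with scikit-learn's libsvm-based SVC rather than the libsvm command-line tools; the synthetic dataset is only a stand-in for your 62-class data.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data: 62 classes with 100 samples each.
X, y = make_classification(n_samples=6200, n_features=40, n_informative=30,
                           n_classes=62, random_state=0)

# Coarse search on ~20% of the data, with 3 folds instead of 5.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.2, stratify=y, random_state=0)

param_grid = {"C": [0.0313, 0.125, 0.5, 2, 8], "gamma": [0.0313, 0.125, 0.5, 2, 8]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
search.fit(X_sub, y_sub)
print(search.best_params_)  # refit on the full data with these parameters afterwards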

Time Series Ahead Prediction in Neural Network (N Point Ahead Prediction) Large Scale Iterative Training

(N=90) Point ahead Prediction using Neural Network:
I am trying to predict 3 minutes ahead, i.e. 180 points ahead. Because I compressed my time series data by taking the mean of every 2 points as one, I have to do an (N=90) step-ahead prediction.
My time series data is given in seconds. The values are between 30 and 90. They usually move from 30 to 90 and from 90 back to 30, as seen in the example below.
My data can be reached from: https://www.dropbox.com/s/uq4uix8067ti4i3/17HourTrace.mat
I am having trouble implementing a neural network to predict N points ahead. My only feature is the previous time values. I used an Elman recurrent neural network and also newff.
In my scenario I need to predict 90 points ahead. Here is how I first separated my input and target data manually:
For Example:
data_in = [1,2,3,4,5,6,7,8,9,10]; % imagine 1:10 only defines the array index values
N = 90; % predicted seconds ahead

P(:, :)        T(:)          or, with a step of 2:    P(:, :)         T(:)
[1,2,3,4,5]    [5+N]     |                            [1,3,5,7,9]     [9+N]
[2,3,4,5,6]    [6+N]     |                            [2,4,6,8,10]    [10+N]
...
until it reaches the end of the data
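Below is a small numpy sketch (my own illustration in Python rather than the asker's MATLAB, with 0-based indexing) of this windowing scheme: each row of P is a window of consecutive (or every-other, via step) samples, and the corresponding T is the value N steps after the last sample of that window.

import numpy as np

def make_windows(series, window=5, n_ahead=90, step=1):
    # Build (P, T): sliding windows and the target n_ahead steps after each window.
    P, T = [], []
    for start in range(len(series)):
        idx = np.arange(start, start + window * step, step)  # indices of one window
        target = idx[-1] + n_ahead                            # index of the target value
        if target >= len(series):
            break
        P.append(series[idx])
        T.append(series[target])
    return np.array(P), np.array(T)

series = np.arange(200)                    # stand-in for the real (normalized) data
P, T = make_windows(series, window=5, n_ahead=90)
print(P[0], T[0])                          # [0 1 2 3 4] 94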
I have 100 input points and 90 output points in my Elman recurrent neural network. What would be the most efficient number of hidden nodes?
input_layer_size = 90;
NodeNum1 =90;
net = newelm(threshold,[NodeNum1 ,prediction_ahead],{'tansig', 'purelin'});
net.trainParam.lr = 0.1;
net.trainParam.goal = 1e-3;
At the beginning of my training I filter the data with a Kalman filter, normalize it into the range [0,1], and after that I shuffle the data.
1) I wasn't able to train on my complete data. First I tried to train on the complete data, which is around 900,000 points, and that didn't give me a solution.
2) Secondly I tried training iteratively, but in each iteration the newly added data is merged with the already trained data. After 20,000 trained points the accuracy starts to decrease. The first 1000 trained points fit perfectly in training, but as I start to iteratively merge in new data and continue training, the training accuracy drops very rapidly, from 90 to 20.
For example:
P = P_test(1:1000); T = T_test(1:1000); counter = 1;
while(1)
    net = train(net, P, T, [], []);    % I train it until it reaches the minimum error
    [normTrainOutput] = sim(net, P, [], []);
    P = [P P(counter*1000:counter*2000)];    % iteratively a new portion of the training data is added
    counter = counter + 1;
end
This approach is very slow and after a point it doesn't give any good results.
My third approach was also iterative training; it was similar to the previous one, but in each iteration I only train on a 1000-point portion of the data, without merging it with the previously trained data. For example, I train the first 1000 points until it reaches the minimum error, with >95% accuracy. After that has been trained, when I do the same for the second 1000-point portion, the training overwrites the weights and the predictor mainly behaves like the latest trained portion of the data.
P = P_test(1:1000); T = T_test(1:1000); counter = 1;
while(1)
    net = train(net, P, T, [], []);    % I did also use adapt()
    [normTrainOutput] = sim(net, P, [], []);
    P = [P(counter*1000:counter*2000)];    % iteratively only a 1000-point portion of the data is added
    counter = counter + 1;
end
Trained data: this figure is a snapshot of my training set; the blue line is the original time series and the red line is the values predicted by the trained neural network. The MSE is around 50.
Tested data: in the picture below you can see my prediction for the test data with the neural network, which is trained on 20,000 input points while keeping the MSE below 50 for the training data set. It is able to catch a few patterns, but mostly it doesn't give really good accuracy.
I wasn't able to succeed with any of these approaches. In each iteration I also observe that a slight change of alpha completely overwrites the already trained data and focuses the network on the currently trained portion.
I wasn't able to come up with a solution to this problem. In iterative training, should I keep the learning rate small and the number of epochs small?
And I couldn't find an efficient way to predict 90 points ahead in a time series. Any suggestions on what I should do to predict N points ahead, or any tutorial or link with more information, would help.
What is the best way to do iterative training? With my second approach, when I reach 15,000 trained points, the training accuracy suddenly starts to drop. Should I change alpha at runtime during the iterations?
==========
Any suggestions, or anything I am doing wrong, would be very much appreciated.
I also implemented a recurrent neural network, but when training on large data I faced the same problems. Is it possible to do adaptive learning (online learning) in recurrent neural networks (newelm)? The weights don't update themselves and I didn't see any improvement.
If yes, how is it possible, and which functions should I use?
net = newelm(threshold,[6, 8, 90],{'tansig','tansig', 'purelin'});
net.trainFcn = 'trains';
batch_size = 10;
while(1)
    net = train(net, Pt(:, k:k+batch_size), Tt(:, k:k+batch_size));
end
Have a look at Echo State Networks (ESNs) or other forms of Reservoir Computing. They are perfect for time series prediction, very easy to use and converge fast. You don't need to worry about the structure of the network at all (every neuron in the mid-layer has random weights which do not change). You only learn the output weights.
If I understood the problem correctly, with Echo State Networks, I would just train the network to predict the next point AND 90 points ahead. This can be done by simply forcing the desired output in the output neurons and then performing ridge regression to learn the output weights.
When running the network after having trained it, at every step n, it would output the next point (n+1), which you would feed back to the network as input (to continue the iteration), and 90 points ahead (n+90), which you can do whatever you want with - i.e: you could also feed it back to the network so that it affects the next outputs.
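To make this concrete, here is a minimal numpy sketch of that scheme (my own illustration, not taken from the linked article): a fixed random reservoir driven by the series, states collected under teacher forcing, and ridge regression for two outputs, y(n+1) and y(n+90). The reservoir size, scalings, ridge parameter, and the sine-wave stand-in data are all arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_res, horizon = 300, 90

# Fixed random input and reservoir weights; the reservoir is rescaled so its
# spectral radius is below 1 (the usual stability heuristic).
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def run_reservoir(u):
    # Drive the reservoir with the input sequence u and collect its states.
    states = np.zeros((len(u), n_res))
    x = np.zeros(n_res)
    for t, ut in enumerate(u):
        x = np.tanh(W_in[:, 0] * ut + W @ x)
        states[t] = x
    return states

y = np.sin(np.linspace(0, 100, 5000))                  # stand-in for the real (normalized) series
X = run_reservoir(y[:-horizon - 1])                    # state after seeing y(0..n)
T = np.column_stack([y[1:-horizon], y[horizon:-1]])    # targets [y(n+1), y(n+90)]

# Ridge regression for the output weights (the only trained part of an ESN).
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ T)
pred_next, pred_90_ahead = X[-1] @ W_out               # predictions from the last state

In closed-loop operation you would feed pred_next back in as the next input, exactly as described above, and read off pred_90_ahead at every step.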
Sorry if the answer is not very clear. It's hard to explain how reservoir computing works in a short answer, but if you just read the article in the link, you will find it very easy to understand the principles.
If you do decide to use ESNs, also read this paper to understand the most important property of ESNs and really know what you're doing.
EDIT: Depending on how "predictable" your system is, predicting 90 points ahead may still be very difficult. For example if you're trying to predict a chaotic system, noise would introduce very large errors if you're predicting far ahead.
Use fuzzy logic with membership functions to predict the future data; it will be an efficient method.
