I have a tensor of shape, say, [4, 10], where 4 is the batch size and 10 is the length of my input-sample buffer. I know that it is really [4, 5+5], i.e. the buffer consists of two windows of length 5 which can be processed independently and, ideally, in parallel. Inside forward() of my model I first reshape the tensor to [8, 5], run my layers on it, then reshape it back to [4, -1] and return. What I am hoping for is that PyTorch will run my model on each of the windows (a kind of sub-batch) in parallel, effectively giving me a parallel-for loop. It runs OK and PyTorch does not complain, but I am getting weird results. I'd like to know whether PyTorch can work this way before I dive into debugging my model.
Well, it doesn't, the reason being the ordering PyTorch uses when reshaping tensors. This can be seen by running the small repro code below.
It would be nice to have something like a 'rebatch' function in PyTorch that would take care of the proper memory layout as a foundation for parallel-for constructs (provided it can even be done memory-efficiently in the generic case); a sketch of what such a function would have to do for this case follows the repro.
import torch
conv = torch.nn.Conv1d(1,3,1)
def conv_batch(t, conv, window):
    # Intended: fold the windows into the batch dimension, run the conv,
    # then fold them back out. (This is the version that gives wrong results.)
    batch = t.shape[0]
    t = t.view(-1, t.shape[1], window)
    t = conv(t)
    t = t.view(batch, t.shape[1], -1)
    return t
batch = 1
channels = 1
width = 4
window = 2
x = torch.arange(batch*channels*width)
x = x.view(batch,channels,width).float()
r1 = conv(x)
r2 = conv_batch(x, conv, window)
print(r1)
print(r2)
print(r1==r2)
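For reference, here is a minimal sketch of what such a rebatching would have to do in this Conv1d case: split the width into windows, move the window axis next to the batch axis, and only then flatten. The name rebatched_conv is mine, not a PyTorch API; it is just a sketch for the shapes used in the repro.
def rebatched_conv(t, conv, window):
    # t: [batch, channels, width] with width == n_windows * window
    batch, channels, width = t.shape
    n_windows = width // window
    # Split the width into windows and move the window axis next to the
    # batch axis *before* flattening, so each window becomes its own
    # batch element with the correct memory layout.
    t = t.view(batch, channels, n_windows, window)
    t = t.permute(0, 2, 1, 3).reshape(batch * n_windows, channels, window)
    t = conv(t)
    # Undo the folding: [batch*n_windows, out_ch, w_out] -> [batch, out_ch, n_windows*w_out]
    out_channels, w_out = t.shape[1], t.shape[2]
    t = t.view(batch, n_windows, out_channels, w_out)
    t = t.permute(0, 2, 1, 3).reshape(batch, out_channels, n_windows * w_out)
    return t
With the kernel-size-1 conv from the repro, rebatched_conv(x, conv, window) matches conv(x) exactly, whereas conv_batch(x, conv, window) does not.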
I am loading a YOLO model with OpenCV in Python using cv2.dnn_DetectionModel(cfg,weights) and then calling net.detect(img). I think I can get a per-image speed-up using batches, but I don't see any support for a batch size other than one.
Is it possible to set the batch size?
net.detect does not support batch size > 1.
However, it's possible to do inference with batch size > 1 on darknet models with some extra work. Here is some partial code:
import cv2

net = cv2.dnn.readNetFromDarknet(cfg, weights)
net.setInputNames(["input"])
net.setInputShape("input", (batch_size, 3, h, w))
# Darknet models expect 0-1 RGB input at the network size, hence the
# scale factor, size and swapRB arguments.
blob = cv2.dnn.blobFromImages(image_list, 1 / 255.0, (w, h), swapRB=True)
net.setInput(blob)
results = net.forward(net.getUnconnectedOutLayersNames())
Now loop over all the images and, for each layer output in results, extract the boxes, confidences and class ids for this image; having collected this information across every output layer, pass it through cv2.dnn.NMSBoxes. This part is non-trivial but doable, roughly as sketched below.
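A rough sketch of that loop. The per-layer output layout, (batch_size, num_candidates, 5 + num_classes) with rows of cx, cy, w, h, objectness and class scores, is my assumption about how OpenCV returns batched darknet outputs, and the thresholds are illustrative.
import cv2
import numpy as np

conf_threshold, nms_threshold = 0.5, 0.4

for img_idx, img in enumerate(image_list):
    ih, iw = img.shape[:2]
    boxes, confidences, class_ids = [], [], []
    for layer_out in results:                  # one array per YOLO output layer
        for det in layer_out[img_idx]:         # rows: cx, cy, w, h, objectness, class scores
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence < conf_threshold:
                continue
            cx, cy, bw, bh = det[0] * iw, det[1] * ih, det[2] * iw, det[3] * ih
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
            class_ids.append(class_id)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    # `keep` holds the indices of the surviving boxes for this image.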
One idea is to combine the images manually into a single canvas, pass that through the net, and split the result afterwards:
import numpy as np

h1, w1 = im1.shape[:2]
h2, w2 = im2.shape[:2]
# Create an empty canvas wide enough to hold both images side by side
vis = np.zeros((max(h1, h2), w1 + w2, 3), np.uint8)
# Combine the 2 images
vis[:h1, :w1, :3] = im1
vis[:h2, w1:w1 + w2, :3] = im2
After inference, you can separate them again:
result1=pred[:h1,:w1,:]
result2=pred[:h2, w1:w1+w2,:]
Since I’m a beginner in ML, this question or the design overall may sound silly, sorry about that. I’m open to any suggestions.
I have a simple network with three linear layers, one of which is the output layer.
self.fc1 = nn.Linear(in_features=2, out_features=12)
self.fc2 = nn.Linear(in_features=12, out_features=16)
self.out = nn.Linear(in_features=16, out_features=4)
My states consist of two values, the x and y coordinates. That's why the input layer has two features.
In main.py I sample and extract memories from the ReplayMemory class and pass them to the get_current function:
experiences = memory.sample(batch_size)
states, actions, rewards, next_states = qvalues.extract_tensors(experiences)
current_q_values = qvalues.QValues.get_current(policy_net, states, actions)
Since a single state consists of two values, the states tensor has shape batch_size x 2, while the actions tensor has length batch_size. (Maybe that's the problem?)
When I pass "states" to my network in the get_current function to obtain the predicted Q-values, I get this error:
size mismatch, m1: [1x16], m2: [2x12]
It looks like it is treating the states tensor as if it were a single state. I don't want that. In the tutorial I follow, they pass a states tensor which is a stack of multiple states, and there is no problem. What am I doing wrong? :)
This is how I store an experience:
memory.push(dqn.Experience(state, action, next_state, reward))
This is my extract_tensors function:
def extract_tensors(experiences):
    # Convert batch of Experiences to Experience of batches
    batch = dqn.Experience(*zip(*experiences))
    state_batch = torch.cat(tuple(d[0] for d in experiences))
    action_batch = torch.cat(tuple(d[1] for d in experiences))
    reward_batch = torch.cat(tuple(d[2] for d in experiences))
    nextState_batch = torch.cat(tuple(d[3] for d in experiences))
    print(action_batch)
    return (state_batch, action_batch, reward_batch, nextState_batch)
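As an aside, this is most likely where the shape goes wrong: if each stored state is a 1-D tensor of shape [2], torch.cat flattens the whole batch into one long vector instead of producing a [batch_size, 2] matrix. A quick illustration (the shapes are inferred from the error message, not from the actual ReplayMemory contents):
import torch

s1 = torch.tensor([1.0, 1.0])   # a single 1-D state, shape [2]
s2 = torch.tensor([2.0, 2.0])

print(torch.cat((s1, s2)).shape)   # torch.Size([4]) -> flattened, not the [batch_size, 2] that nn.Linear(2, 12) expects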
The tutorial I follow is this project's tutorial:
https://github.com/nevenp/dqn_flappy_bird/blob/master/dqn.py
Look between lines 148 and 169, especially line 169, where it passes the states batch to the network.
SOLVED. It turned out that I didn't know how to properly create a 2-D tensor.
A 2-D (batched) states tensor must be created like this:
states = torch.tensor([[1, 1], [2,2]], dtype=torch.float)
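For completeness, a minimal sketch of two ways to end up with a [batch_size, 2] states tensor (the values are illustrative):
import torch

# 1) Store each state with shape [1, 2]; extract_tensors' torch.cat then
#    concatenates along dim 0 and yields [batch_size, 2].
state = torch.tensor([[1.0, 2.0]], dtype=torch.float)    # x = 1.0, y = 2.0

# 2) Or keep states 1-D (shape [2]) and use torch.stack instead of torch.cat:
states_1d = [torch.tensor([1.0, 1.0]), torch.tensor([2.0, 2.0])]
states = torch.stack(states_1d)                           # shape [2, 2]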
I'm dealing with CIFAR10 and I use torchvision.datasets to create it. I need the GPU to accelerate the computation, but I can't find a way to put the whole dataset onto the GPU at once. My model needs to use mini-batches, and it is really time-consuming to deal with each batch separately.
I've tried moving each mini-batch onto the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from torch.utils.data import DataLoader

num_workers = 1  # Set this as needed
def time_gpu_cast(batch_size=1):
    # `dataset` is assumed to be defined earlier (e.g. the CIFAR10 dataset)
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time
# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]
# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))
plot(*zip(*cast_times)) # Plot the time taken
For num_workers = 1, this is what I got:
And if we try parallel loading (num_workers = 8), it becomes even clearer:
I've got an answer and I'm gonna try it later. It seems promising.
You can write a dataset class where, in the __init__ function, you read the entire dataset, apply all the transformations you need, and convert everything to tensors. Then send those tensors to the GPU (assuming there is enough memory). In __getitem__ you then simply use the index to retrieve elements of a tensor that is already on the GPU.
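A minimal sketch of that idea for CIFAR10. The class and argument names are mine; it assumes the whole dataset fits in GPU memory, and the DataLoader must use num_workers=0 because the data already lives on the GPU.
import torch
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchvision.transforms as T

class GPUCifar10(Dataset):
    """CIFAR10 loaded once, transformed once, and kept entirely on the GPU."""
    def __init__(self, root="./data", train=True, device="cuda"):
        base = torchvision.datasets.CIFAR10(root=root, train=train,
                                            download=True, transform=T.ToTensor())
        # Apply the transform to every image once and stack into big tensors.
        images = torch.stack([img for img, _ in base])        # [N, 3, 32, 32]
        labels = torch.tensor([label for _, label in base])   # [N]
        # Move everything to the GPU a single time.
        self.images = images.to(device)
        self.labels = labels.to(device)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Indexing returns tensors that are already on the GPU.
        return self.images[idx], self.labels[idx]

# Usage: batches come out already on the GPU, no per-batch .cuda() needed.
loader = DataLoader(GPUCifar10(), batch_size=128, shuffle=True, num_workers=0)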
I am trying to use ResNet50 for an image classification problem. However, it raises an error that I cannot fix.
RuntimeError: inconsistent tensor size, expected tensor [120 x 2048] and src [1000 x 2048] to have the same number of elements, but got 245760 and 2048000 elements respectively at /Users/soumith/code/builder/wheel/pytorch-src/torch/lib/TH/generic/THTensorCopy.c:86
The error happens at the line below:
self.resnet = models.resnet50(num_classes=num_breeds, pretrained='imagenet')
The model is below:
class Resnet(nn.Module):
    def __init__(self):
        super(Resnet, self).__init__()
        self.resnet = models.resnet50(num_classes=num_breeds, pretrained='imagenet')
        #self.resnet = nn.Sequential(*list(resnet.children())[:-2])
        #self.fc = nn.Linear(2048,num_breeds)

    def forward(self, x):
        x = self.resnet(x)
        return x
When you create your models.resnet50 with num_classes=num_breeds, the last layer is a fully connected layer from 2048 to num_classes (which is 120 in your case).
Having pretrained='imagenet' asks PyTorch to load all the corresponding weights into your network, but its last layer has 1000 classes, not 120. This is the source of the error: the 2048x120 tensor doesn't match the loaded 2048x1000 weights.
You should either create your network with 1000 classes, load the weights, and then "trim" it down to the classes you want to keep; or create the 120-class network you want, but load the pretrained weights manually. In the latter case, you only need to pay special attention to the last layer.
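For example, a rough sketch of both options with torchvision (this assumes the torchvision API where pretrained=True loads the ImageNet weights; num_breeds is the 120-class count from the question):
import torch.nn as nn
from torchvision import models

num_breeds = 120

# Option 1: create the 1000-class pretrained network, then replace ("trim") the head.
resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Linear(resnet.fc.in_features, num_breeds)  # new, randomly initialized last layer

# Option 2: create the 120-class network and load all pretrained weights except the last layer.
resnet120 = models.resnet50(num_classes=num_breeds)
pretrained_state = models.resnet50(pretrained=True).state_dict()
pretrained_state = {k: v for k, v in pretrained_state.items() if not k.startswith("fc.")}
resnet120.load_state_dict(pretrained_state, strict=False)  # strict=False skips the missing fc weights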
I wrote a PyMC model for fitting 3 Normals to data (similar to the one in this question).
import numpy as np
import pymc as mc
import matplotlib.pyplot as plt
n = 3
ndata = 500
# simulated data
v = np.random.randint( 0, n, ndata)
data = (v==0)*(10+ 1*np.random.randn(ndata)) \
+ (v==1)*(-10 + 2*np.random.randn(ndata)) \
+ (v==2)*3*np.random.randn(ndata)
# the model
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=ndata)
precs = mc.Gamma('precs', alpha=0.1, beta=0.1, size=n)
means = mc.Normal('means', 0, 0.001, size=n)
@mc.deterministic
def mean(category=category, means=means):
    return means[category]

@mc.deterministic
def prec(category=category, precs=precs):
    return precs[category]
obs = mc.Normal('obs', mean, prec, value=data, observed = True)
model = mc.Model({'dd': dd,
'category': category,
'precs': precs,
'means': means,
'obs': obs})
M = mc.MAP(model)
M.fit()
# mcmc sampling
mcmc = mc.MCMC(model)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.means)
mcmc.use_step_method(mc.AdaptiveMetropolis, model.precs)
mcmc.sample(100000,burn=0,thin=10)
tmeans = mcmc.trace('means').gettrace()
tsd = mcmc.trace('precs').gettrace()**-.5
plt.plot(tmeans)
#plt.errorbar(range(len(tmeans)), tmeans, yerr=tsd)
plt.show()
The distributions from which I sample my data clearly overlap, yet there are 3 well-separated peaks (see image below). Fitting 3 Normals to this kind of data should be trivial, and I would expect it to recover the means I sample from (-10, 0, 10) in 99% of the MCMC runs.
Example of an outcome I would expect. This happened in 2 out of 10 cases.
Example of an unexpected result that happened in 6 out of 10 cases. This is weird because there is no peak in the data at -5, so I can't really see a serious local minimum that the sampling could get stuck in (going from (-5, -5) to (-6, -4) should improve the fit, and so on).
What could be the reason that (adaptive Metropolis) MCMC sampling gets stuck in the majority of cases? What would be possible ways to improve the sampling procedure so that it doesn't?
So the runs do converge, but do not really explore the right range.
Update: Using different priors, I get the right convergence (approximately the first picture) in 5/10 runs and the wrong one (approximately the second picture) in the other 5/10. Basically, the changed lines are the ones below, plus removing the AdaptiveMetropolis step methods:
precs = mc.Gamma('precs', alpha=2.5, beta=1, size=n)
means = mc.Normal('means', [-5, 0, 5], 0.0001, size=n)
Is there a particular reason you would like to use AdaptiveMetropolis? I imagine that vanilla MCMC wasn't working, and you got something like this:
Yea, that's no good. There are a few comments I can make. Below I used vanilla MCMC.
1. Your means prior precision, 0.001, is too big. It corresponds to a standard deviation of about 31 (= 1/sqrt(0.001)), which is too small: you are really forcing your means to stay close to 0. You want a much larger standard deviation to help explore the area. I decreased the value to 0.00001 and got this:
Perfect. Of course, a priori I knew the true means were 50, 0, and -50. Usually we don't know this, so it's always a good idea to set that precision to be quite small.
2. Do you really think all the normals line up at 0, like your mean prior suggests? (You set the mean of all of them to 0) The point of this exercise is to find them to be different, so your priors should reflect that. Something like:
means = mc.Normal('means', [-5,0,5], 0.00001, size=n)
more accurately reflects your true belief. This actually also helps convergence by suggesting to the MCMC where the means should be. Of course, you'd have to use your best estimate to come up with these numbers (I've naively chosen -5,0,5 here).
The problem is caused by a low acceptance rate for the category variable. See the answer I gave to a similar question.
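If you want to verify that, PyMC 2.x Metropolis-family step methods keep accepted/rejected counters, so something like the following sketch (assuming the variable names from the model above) shows the acceptance behaviour for category after sampling:
# Inspect the step method handling `category` after mcmc.sample(...)
sm = mcmc.step_method_dict[category][0]
print(sm.accepted, sm.rejected)   # a very low accepted / (accepted + rejected) ratio confirms the diagnosis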