I am modifying YOLOv5 into a rotated object detection network and replacing its head with a decoupled head. I successfully replaced the head of the original YOLOv5 with a decoupled head, but when I tried the same replacement on my modified YOLOv5, it raised this error:
RuntimeError: Given groups=1, weight of size [256, 128, 1, 1], expected input[1, 63, 32, 32] to have 128 channels, but got 63 channels instead.
The only part I changed in the head of YOLOv5 is that I turned self.no = nc + 5 into self.no = nc + 5 + 180, to adapt it to rotated object detection.
This is the decoupled head I wrote:
class DecoupledHead(nn.Module):
    def __init__(self, ch=256, nc=80, width=1.0, anchors=()):
        super().__init__()
        self.nc = nc  # number of classes
        self.nl = len(anchors)  # number of detection layers
        self.na = len(anchors[0]) // 2  # number of anchors
        self.merge = Conv(ch, 256 * width, 1, 1)
        self.cls_convs1 = Conv(256 * width, 256 * width, 3, 1, 1)
        self.cls_convs2 = Conv(256 * width, 256 * width, 3, 1, 1)
        self.reg_convs1 = Conv(256 * width, 256 * width, 3, 1, 1)
        self.reg_convs2 = Conv(256 * width, 256 * width, 3, 1, 1)
        self.cls_preds = nn.Conv2d(256 * width, self.nc * self.na, 1)
        self.reg_preds = nn.Conv2d(256 * width, 4 * self.na, 1)
        self.obj_preds = nn.Conv2d(256 * width, 1 * self.na, 1)

    def forward(self, x):
        x = self.merge(x)
        x1 = self.cls_convs1(x)
        x1 = self.cls_convs2(x1)
        x1 = self.cls_preds(x1)
        x2 = self.reg_convs1(x)
        x2 = self.reg_convs2(x2)
        x21 = self.reg_preds(x2)
        x22 = self.obj_preds(x2)
        out = torch.cat([x21, x22, x1], 1)
        return out
And I changed __init__ in class Detect to:
def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
    super().__init__()
    self.n_anchors = 1
    self.nc = nc  # number of classes
    self.no = nc + 5 + 180  # number of outputs per anchor
    self.nl = len(anchors)  # number of detection layers
    self.na = len(anchors[0]) // 2  # number of anchors
    self.grid = [torch.zeros(1)] * self.nl  # init grid
    self.anchor_grid = [torch.zeros(1)] * self.nl  # init anchor grid
    self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
    # self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
    self.m = nn.ModuleList(DecoupledHead(x, nc, 1, anchors) for x in ch)
    self.inplace = inplace  # use in-place ops (e.g. slice assignment)
The original __init__ was:
def __init__(self, nc=80, anchors=(), ch=(), inplace=True):  # detection layer
    super().__init__()
    self.n_anchors = 1
    self.nc = nc  # number of classes
    self.no = nc + 5  # number of outputs per anchor
    self.nl = len(anchors)  # number of detection layers
    self.na = len(anchors[0]) // 2  # number of anchors
    self.grid = [torch.zeros(1)] * self.nl  # init grid
    self.anchor_grid = [torch.zeros(1)] * self.nl  # init anchor grid
    self.register_buffer('anchors', torch.tensor(anchors).float().view(self.nl, -1, 2))  # shape(nl,na,2)
    self.m = nn.ModuleList(nn.Conv2d(x, self.no * self.na, 1) for x in ch)  # output conv
    self.inplace = inplace  # use in-place ops (e.g. slice assignment)
How should I modify the DecoupledHead class to make it work?
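One way to read the error is through channel arithmetic. The sketch below is only an illustration. It assumes 3 anchors per layer and 16 classes (neither is stated in the question, but 3 * (16 + 5) = 63 matches the error message), and head_channels is a hypothetical helper name. The three prediction convs in a decoupled head of this shape emit na * (nc + 5) channels, while a Detect layer with no = nc + 5 + 180 expects na * (nc + 5 + 180), so the head probably needs an extra branch for the 180 angle outputs.

```python
def head_channels(nc: int, na: int, angle_bins: int = 0) -> int:
    """Channels a detection head must output for one feature map."""
    # Per anchor: 4 box values + 1 objectness + nc class scores + optional angle bins.
    return na * (4 + 1 + nc + angle_bins)


# Assumed values: 16 classes and 3 anchors are inferred from the error, not given.
decoupled_out = head_channels(nc=16, na=3)                    # reg + obj + cls branches only
detect_expects = head_channels(nc=16, na=3, angle_bins=180)   # no * na with no = nc + 5 + 180
print(decoupled_out, detect_expects)
```

If the two numbers differ, any later reshape or convolution that relies on no * na channels will fail, which is consistent with the 63-channel tensor in the traceback.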


How to calculate F1 Score for Multi-label Classification

I am trying to calculate the F1 score (and accuracy) for my multi-label classification problem. Could you please give feedback on whether I'm calculating it correctly? Note that when the model predicts an object as 1, I compute the IOU (intersection over union) and count the prediction as a TP only if the IOU is greater than or equal to 0.5.
GT labels: 14 x 10 x 128
Output: 14 x 10 x 128
where 14 is the batch_size, 10 is the sequence_length, and 128 is the object vector (i.e., 1 if the object at an index belongs to the sequence and 0 otherwise).
def calculate_performance_metrics(gt_labels, predicted_labels, total_padded_elements):
    # parameter order matches the call in the training loop below
    # check if TP pred objects overlap with TP gt objects
    TP_INDICES = (torch.logical_and(predicted_labels == 1, gt_labels == 1)).nonzero()  # we only want the batch and object indices, i.e. the 0 and 2 indices
    TP = calculate_tp_with_iou()  # details of this don't matter for now
    FP = torch.sum(torch.logical_and(predicted_labels, 1 - gt_labels)).item()
    TN = torch.sum(torch.logical_and(1 - predicted_labels, 1 - gt_labels)).item()
    FN = torch.sum(torch.logical_and(1 - predicted_labels, gt_labels)).item()
    return float(TP), float(FP), float(TN - total_padded_elements), float(FN)

for epoch in range(10):
    EPOCH_TP = EPOCH_FP = EPOCH_TN = EPOCH_FN = 0.0  # reset the running counts each epoch
    for inputs, gt_labels, masks in tr_dl:
        outputs = model(inputs)  # out shape: (14, 10, 128)
        # mask shape: (14, 10). So need to expand it to the shape of output
        masks = masks[:, :, None].expand_as(outputs)
        pred_labels = (torch.sigmoid(outputs) >= 0.5).float().type(torch.int64)  # consider all predictions above 0.5 as 1, rest 0
        pred_labels = pred_labels * masks
        gt_labels = (gt_labels * masks).type(torch.int64)
        total_padded_elements = masks.numel() - masks.sum()  # need this to get accurate true negatives
        batch_tp, batch_fp, batch_tn, batch_fn = calculate_performance_metrics(gt_labels, pred_labels, total_padded_elements)
        EPOCH_TP += batch_tp
        EPOCH_FP += batch_fp
        EPOCH_TN += batch_tn
        EPOCH_FN += batch_fn
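The question stops at accumulating the four counts. As a minimal sketch of the remaining step, assuming micro-averaging (summing counts over all labels before dividing) is what is wanted, the epoch totals could be turned into metrics as follows. The function name f1_from_counts and the example numbers are illustrative, not from the original post.

```python
def f1_from_counts(tp: float, fp: float, tn: float, fn: float) -> dict:
    """Micro-averaged precision, recall, F1 and accuracy from summed counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0  # no true positives at all
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up epoch totals, only to show the call.
metrics = f1_from_counts(tp=8.0, fp=2.0, tn=85.0, fn=5.0)
print(metrics)
```

One thing that may be worth double-checking in the method above: a position with pred == 1 and gt == 1 that fails the IOU test is not a TP, but as written it is not added to FP or FN either, so unless calculate_tp_with_iou handles it, that prediction goes uncounted.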

PyTorch Boolean - Stop Backpropagation?

I need to create a neural network where I use binary gates to zero out certain tensors, which are the outputs of disabled circuits.
To improve runtime speed, I was hoping to use torch.bool binary gates to stop backpropagation along disabled circuits in the network. However, I ran a small experiment using the official PyTorch example for the CIFAR-10 dataset, and the runtime is exactly the same for any values of gate_A and gate_B, which means the idea is not working:
import torch
import torch.nn as nn
import torch.nn.functional as F
from random import randint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.conv1a = nn.Conv2d(3, 6, 5)
        self.conv2a = nn.Conv2d(6, 16, 5)
        self.conv1b = nn.Conv2d(3, 6, 5)
        self.conv2b = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Only one gate is supposed to be enabled at random
        # However, for the experiment, I fixed the values to [1,0] and [1,1]
        choice = randint(0, 1)
        gate_A = torch.tensor(choice, dtype=torch.bool)
        gate_B = torch.tensor(1 - choice, dtype=torch.bool)
        a = self.pool(F.relu(self.conv1a(x)))
        a = self.pool(F.relu(self.conv2a(a)))
        b = self.pool(F.relu(self.conv1b(x)))
        b = self.pool(F.relu(self.conv2b(b)))
        a *= gate_A
        b *= gate_B
        x = torch.cat([a, b], dim=1)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
How can I define gate_A and gate_B so that backpropagation effectively stops when they are zero?
PS. Changing the concatenation dynamically at runtime would also change which weights are assigned to each module (for example, the weights associated with a could be assigned to b in another pass, disrupting how the network operates).
You could use torch.no_grad (the code below can probably be made more concise):
def forward(self, x):
    # Only one gate is supposed to be enabled at random
    # However, for the experiment, I fixed the values to [1,0] and [1,1]
    choice = randint(0, 1)
    gate_A = torch.tensor(choice, dtype=torch.bool)
    gate_B = torch.tensor(1 - choice, dtype=torch.bool)
    if choice:
        a = self.pool(F.relu(self.conv1a(x)))
        a = self.pool(F.relu(self.conv2a(a)))
        a *= gate_A
        with torch.no_grad():  # disable gradient computation
            b = self.pool(F.relu(self.conv1b(x)))
            b = self.pool(F.relu(self.conv2b(b)))
            b *= gate_B
    else:
        with torch.no_grad():  # disable gradient computation
            a = self.pool(F.relu(self.conv1a(x)))
            a = self.pool(F.relu(self.conv2a(a)))
            a *= gate_A
        b = self.pool(F.relu(self.conv1b(x)))
        b = self.pool(F.relu(self.conv2b(b)))
        b *= gate_B
    x = torch.cat([a, b], dim=1)
    x = torch.flatten(x, 1)  # flatten all dimensions except batch
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
On a second look, I think the following is a simpler solution to the specific problem:
def forward(self, x):
    # Only one gate is supposed to be enabled at random
    # However, for the experiment, I fixed the values to [1,0] and [1,1]
    choice = randint(0, 1)
    if choice:
        a = self.pool(F.relu(self.conv1a(x)))
        a = self.pool(F.relu(self.conv2a(a)))
        b = torch.zeros(shape_of_conv_output)  # replace shape of conv output here
    else:
        b = self.pool(F.relu(self.conv1b(x)))
        b = self.pool(F.relu(self.conv2b(b)))
        a = torch.zeros(shape_of_conv_output)  # replace shape of conv output here
    x = torch.cat([a, b], dim=1)
    x = torch.flatten(x, 1)  # flatten all dimensions except batch
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = self.fc3(x)
    return x
Easy solution: simply define a tensor of zeros when a or b is disabled :)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.conv1a = nn.Conv2d(3, 6, 5)
        self.conv2a = nn.Conv2d(6, 16, 5)
        self.conv1b = nn.Conv2d(3, 6, 5)
        self.conv2b = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        if randint(0, 1):
            a = self.pool(F.relu(self.conv1a(x)))
            a = self.pool(F.relu(self.conv2a(a)))
            b = torch.zeros_like(a)
        else:
            b = self.pool(F.relu(self.conv1b(x)))
            b = self.pool(F.relu(self.conv2b(b)))
            a = torch.zeros_like(b)
        x = torch.cat([a, b], dim=1)
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
PS. I thought about this while I was having a coffee.
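The difference between the gated version in the question and the branching versions in the answers can be illustrated without PyTorch. The following is a framework-free sketch, where expensive_branch and its call counter are hypothetical stand-ins for a conv circuit. Multiplying by a zero gate still runs the branch and only zeroes its result afterwards, while an if/else skips the branch entirely, so nothing is left to compute or to backpropagate through.

```python
calls = {"a": 0, "b": 0}


def expensive_branch(name: str, x: float) -> float:
    """Hypothetical stand-in for a conv circuit; counts how often it runs."""
    calls[name] += 1
    return x * 2.0


def gated_forward(x: float, choice: int) -> float:
    # Both branches always execute; the gate only zeroes the result afterwards.
    a = expensive_branch("a", x) * choice
    b = expensive_branch("b", x) * (1 - choice)
    return a + b


def branching_forward(x: float, choice: int) -> float:
    # Only the enabled branch executes; the other is a constant zero.
    if choice:
        a, b = expensive_branch("a", x), 0.0
    else:
        a, b = 0.0, expensive_branch("b", x)
    return a + b


gated_result = gated_forward(3.0, choice=1)
gated_calls = dict(calls)  # snapshot: both branches ran

calls.update(a=0, b=0)
branching_result = branching_forward(3.0, choice=1)
branching_calls = dict(calls)  # snapshot: only branch "a" ran
print(gated_calls, branching_calls)
```

Both functions return the same value, but only the branching one avoids the disabled work, which is why the runtime in the original experiment did not change with the gate values.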

Why does cv2.dnn run faster on the CPU than on the GPU?

I am new to OpenCV with CUDA, so I have been testing the simplest case, loading a model on the GPU rather than the CPU, to see how fast the GPU is, and I am horrified by the results:
GPU                            CPU
21.913758993148804 seconds     3.0586464405059814 seconds
22.379303455352783 seconds     3.1384341716766357 seconds
21.500431060791016 seconds     2.9400241374969482 seconds
21.292986392974854 seconds     3.3738017082214355 seconds
20.88358211517334 seconds      3.388749599456787 seconds
Here is my code snippet, in case I am doing something wrong that causes the GPU time to spike so high.
def loadYolo():
    net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
    classes = []
    with open("coco.names", "r") as f:
        classes = [line.strip() for line in f.readlines()]
    layer_names = net.getLayerNames()
    output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]
    return net,classes,layer_names,output_layers

def image(data_image):
    start_time = time.time()  # assumed position; the assignment is missing in the source
    net, classes, layer_names, output_layers = loadYolo()  # assumed: the model is reloaded on every call
    sbuf = StringIO()
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        return  # empty payload, nothing to decode (branch body assumed)
    pimg = Image.open(b)  # assumed: the right-hand side is missing in the source
    frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
    frame = resize(frame, width=700)
    frame = cv2.flip(frame, 1)
    height, width, channels = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)  # assumed: required before forward()
    outs = net.forward(output_layers)
    print("--- %s seconds ---" % (time.time() - start_time))
    class_ids = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.5:
                # Object detected
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                # Rectangle coordinates
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))  # assumed: needed by NMSBoxes below
                class_ids.append(class_id)  # assumed: needed for the labels below
    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
    colors = np.random.uniform(0, 255, size=(len(classes), 3))
    font = cv2.FONT_HERSHEY_PLAIN  # assumed: the font definition is missing in the source
    for i in range(len(boxes)):
        if i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            color = colors[class_ids[i]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    imgencode = cv2.imencode('.jpg', frame)[1]
    stringData = base64.b64encode(imgencode).decode('utf-8')
    b64_src = 'data:image/jpg;base64,'
    stringData = b64_src + stringData
    emit('response_back', stringData)
My GPU is an Nvidia 1050 Ti and my CPU is a 9th-gen i5, in case someone needs the specifications. Can someone please enlighten me? I am super confused right now. Thank you very much.
EDIT 1: I tried cv2.dnn.DNN_TARGET_CUDA instead of cv2.dnn.DNN_TARGET_CUDA_FP16, but the time is still terrible compared to the CPU. Below is the GPU result:
--- 10.91195559501648 seconds ---
--- 11.344025135040283 seconds ---
--- 11.754926204681396 seconds ---
--- 12.779674530029297 seconds ---
Below is the CPU result:
--- 4.780993223190308 seconds ---
--- 4.910650253295898 seconds ---
--- 4.990436553955078 seconds ---
--- 5.246175050735474 seconds ---
It is still slower than the CPU.
EDIT 2: OpenCV is 4.5.0, CUDA is 11.1, and cuDNN is 8.0.1.
You should definitely only load YOLO once. Recreating it for every image that comes through the socket is slow for both CPU and GPU, but GPU takes longer to initially load which is why you're seeing it run slower than CPU.
I don't understand what you mean by using an LRU cache for your YOLO model. Without seeing the rest of your code structure I can't make any real suggestions, but can you try at least temporarily putting the network into the global space just to see if it runs faster? (remove the function altogether and put its body in the global space)
Something like this:
net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
classes = []
with open("coco.names", "r") as f:
    classes = [line.strip() for line in f.readlines()]
layer_names = net.getLayerNames()
output_layers = [layer_names[i[0] - 1] for i in net.getUnconnectedOutLayers()]

def image(data_image):
    start_time = time.time()  # assumed position; the assignment is missing in the source
    sbuf = StringIO()
    b = io.BytesIO(base64.b64decode(data_image))
    if(str(data_image) == 'data:,'):
        return  # empty payload, nothing to decode (branch body assumed)
    pimg = Image.open(b)  # assumed: the right-hand side is missing in the source
    frame = cv2.cvtColor(np.array(pimg), cv2.COLOR_RGB2BGR)
    frame = resize(frame, width=700)
    frame = cv2.flip(frame, 1)
    height, width, channels = frame.shape
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)  # assumed: required before forward()
    outs = net.forward(output_layers)
    print("--- %s seconds ---" % (time.time() - start_time))
    class_ids = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            class_id = np.argmax(scores)
            confidence = scores[class_id]
            if confidence > 0.5:
                # Object detected
                center_x = int(detection[0] * width)
                center_y = int(detection[1] * height)
                w = int(detection[2] * width)
                h = int(detection[3] * height)
                # Rectangle coordinates
                x = int(center_x - w / 2)
                y = int(center_y - h / 2)
                boxes.append([x, y, w, h])
                confidences.append(float(confidence))  # assumed: needed by NMSBoxes below
                class_ids.append(class_id)  # assumed: needed for the labels below
    indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
    colors = np.random.uniform(0, 255, size=(len(classes), 3))
    font = cv2.FONT_HERSHEY_PLAIN  # assumed: the font definition is missing in the source
    for i in range(len(boxes)):
        if i in indexes:
            x, y, w, h = boxes[i]
            label = str(classes[class_ids[i]])
            color = colors[class_ids[i]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            cv2.putText(frame, label, (x, y + 30), font, 1, color, 2)
    imgencode = cv2.imencode('.jpg', frame)[1]
    stringData = base64.b64encode(imgencode).decode('utf-8')
    b64_src = 'data:image/jpg;base64,'
    stringData = b64_src + stringData
    emit('response_back', stringData)
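If moving the network into the global namespace is awkward, the same load-once behaviour can be had by caching the loader function. This is a minimal sketch using only the standard library. load_model is a hypothetical stand-in for loadYolo (a real version would call cv2.dnn.readNet inside it), handle_image is a hypothetical per-frame handler, and load_count exists only to make the effect visible.

```python
from functools import lru_cache

load_count = 0  # illustrative counter, not needed in real code


@lru_cache(maxsize=1)
def load_model() -> dict:
    """Hypothetical stand-in for loadYolo(); the body runs once per process."""
    global load_count
    load_count += 1
    # A real implementation would build and return the cv2.dnn network here.
    return {"name": "yolov4", "loaded": True}


def handle_image(data_image: str) -> str:
    """Hypothetical per-frame handler that reuses the cached model."""
    model = load_model()  # cache hit on every call after the first
    return f"{model['name']} processed {len(data_image)} chars"


for frame in ("aaaa", "bbbbbb", "cc"):
    handle_image(frame)
print(load_count)
```

The handler still calls load_model on every frame, but the expensive body runs only on the first call, so the per-frame cost is reduced to the forward pass.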
From the previous two answers I managed to get to a solution. Changing the target from:
cv2.dnn.DNN_TARGET_CUDA_FP16
into:
cv2.dnn.DNN_TARGET_CUDA
doubled the GPU speed, because my GPU type is not compatible with FP16 (thanks to Amir Karami for this). And although Ian Chu's answer did not solve my problem directly, it gave me the basis to force all images to use a single net instance. This lowered the processing time significantly, from about 10 seconds per image to 0.03-0.04 seconds, surpassing the CPU speed many times over. I did not accept either answer because neither fully solved my problem, but both became a strong basis for my solution, so I still upvoted them. I am leaving my answer here in case anyone else encounters this problem.
DNN_TARGET_CUDA_FP16 refers to 16-bit floating point. Since your GPU is a 1050 Ti, it does not seem to work well with FP16. You can look this up along with your card's compute capability.
I think you should change the target from cv2.dnn.DNN_TARGET_CUDA_FP16 into cv2.dnn.DNN_TARGET_CUDA.

PyTorch: slicing a tensor causes RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

I wrote an RNN with LSTM cells in PyCharm. The peculiarity of this network is that the output of the RNN is fed into an integration operation, computed with a Runge-Kutta scheme.
The integration takes some input and propagates it one step ahead in time. To do so, I need to slice the feature tensor X along the batch dimension and pass each slice to the Runge-Kutta routine.
class MyLSTM(torch.nn.Module):
    def __init__(self, ni, no, sampling_interval, nh=10, nlayers=1):
        super(MyLSTM, self).__init__()
        self.device = torch.device("cpu")
        self.dtype = torch.float
        self.ni = ni
        self.no = no
        self.nh = nh
        self.nlayers = nlayers
        self.lstms = torch.nn.ModuleList(
            [torch.nn.LSTMCell(self.ni, self.nh)] + [torch.nn.LSTMCell(self.nh, self.nh) for i in range(nlayers - 1)])
        self.out = torch.nn.Linear(self.nh, self.no)
        self.dropout = torch.nn.Dropout(p=0.2)  # attribute name assumed; it is missing in the source
        self.actfn = torch.nn.Sigmoid()
        self.sampling_interval = sampling_interval
        self.scaler_states = None
        # Options

    # description of the whole block
    def forward(self, x, h0, train=False, integrate_ode=True):
        x0 = x.clone().requires_grad_(True)
        hs = x  # initiate hidden state
        if h0 is None:
            h = torch.zeros(hs.shape[0], self.nh, device=self.device)
            c = torch.zeros(hs.shape[0], self.nh, device=self.device)
        else:
            (h, c) = h0
        # LSTM cells
        for i in range(self.nlayers):
            h, c = self.lstms[i](hs, (h, c))
            if train:
                hs = self.dropout(h)  # assumed: the right-hand side is missing in the source
            else:
                hs = h
        # Output layer
        # y = self.actfn(self.out(hs))
        y = self.out(hs)
        if integrate_ode:
            p = y
            y = self.integrate(x0, p)
        return y, (h, c)

    def integrate(self, x0, p):
        # RK4 steps per interval
        M = 4
        DT = self.sampling_interval / M
        X = x0
        # X = self.scaler_features.inverse_transform(x0)
        for b in range(X.shape[0]):
            xx = X[b, :]
            for j in range(M):
                k1 = self.ode(xx, p[b, :])
                k2 = self.ode(xx + DT / 2 * k1, p[b, :])
                k3 = self.ode(xx + DT / 2 * k2, p[b, :])
                k4 = self.ode(xx + DT * k3, p[b, :])
                xx = xx + DT / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            X_all[b, :] = xx
        return X_all

    def ode(self, x0, y):
        # Here I evaluate a dynamic model (body omitted)
        ...
I get this error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor []], which is output 0 of SelectBackward, is at version 64; expected version 63 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
The problem is in the operations xx = X[b, :] and p[b, :]. I know this because if I choose a batch dimension of 1, I can replace those two expressions with xx = X and p, and it works. How can I split the tensor without losing the gradient?
I had the same question, and after a lot of searching, I added a .detach() call after h and c in the RNN cell.
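For reference, the fixed-step scheme that integrate applies to each batch row can be sketched in plain Python. This is a minimal illustration, not the author's model: the ode function here is a hypothetical scalar decay dx/dt = -p * x (the original dynamic model is not shown), and rk4_interval is an assumed name. Each new state is built with out-of-place arithmetic, as in the inner loop of the question.

```python
import math


def ode(x: float, p: float) -> float:
    """Hypothetical dynamics: exponential decay dx/dt = -p * x."""
    return -p * x


def rk4_interval(x0: float, p: float, sampling_interval: float, m: int = 4) -> float:
    """Advance x0 by one sampling interval using m classical RK4 sub-steps."""
    dt = sampling_interval / m
    x = x0
    for _ in range(m):
        k1 = ode(x, p)
        k2 = ode(x + dt / 2 * k1, p)
        k3 = ode(x + dt / 2 * k2, p)
        k4 = ode(x + dt * k3, p)
        # A new value is bound to x each step; nothing is overwritten in place.
        x = x + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x


x_next = rk4_interval(x0=1.0, p=0.5, sampling_interval=1.0)
exact = math.exp(-0.5)  # closed-form solution of the assumed decay model
print(x_next, exact)
```

For this test problem the numerical result should agree closely with the closed-form solution, which gives a quick sanity check of the sub-stepping before the scheme is moved into a tensor-based model.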

I am trying to do arithmetic on table values and keep getting an error. Here is my code

I am trying to perform arithmetic on table values and keep getting an error. Here is my complete code. I am basically trying to generate simplex noise. I have created a multidimensional array (table) and am trying to perform operations on its values, but I keep getting an error that says I cannot perform arithmetic on a table value. I don't know whether I have to convert it to something else. Please help.
totalNoiseMap = {}

function simplex_noise(width, height)
    simplexnoise = {}
    for i = 1,512 do
        simplexnoise[i] = {}
        for j = 1, 512 do
            simplexnoise[i][j] = 0
        end
    end
    frequency = 5.0 / width
    for x = 1, width do
        for y = 1, height do
            simplexnoise[x][y] = noise(x * frequency,y * frequency)
            simplexnoise[x][y] = (simplexnoise[x][y] + 1) / 2
        end
    end
    return simplexnoise
end

function noise(x, y, frequency)
    return simplex_noise(x / frequency, y / frequency)
end

function generateOctavedSimplexNoise(width,height,octaves,roughness,scale)
    totalnoise = {}
    for i = 1,512 do
        totalnoise[i] = {}
        for j = 1, 512 do
            totalnoise[i][j] = 0
        end
    end
    layerFrequency = scale
    layerWeight = 1
    weightSum = 0
    for octave = 1, octaves do
        for x = 1, width do
            for y = 1, height do
                totalnoise[x][y] = (totalnoise[x][y] + noise(x * layerFrequency,y * layerFrequency, 2) * layerWeight)
            end
        end
        --Increase variables with each incrementing octave
        layerFrequency = layerFrequency * 2
        weightSum = weightSum + layerWeight
        layerWeight = layerWeight * roughness
    end
    return totalnoise
end

totalNoiseMap = generateOctavedSimplexNoise(512, 512, 3, 0.4, 0.005)
totalnoise[x][y] + noise(x * layerFrequency,y * layerFrequency, 2) * layerWeight
Here noise(x * layerFrequency,y * layerFrequency, 2) returns a table. You multiply it by the scalar layerWeight and then add it to the scalar totalnoise[x][y].
I can see how to multiply a table by a scalar. It should be something like:
for i = 1,512 do
    for j = 1,512 do
        a[i][j] = t[i][j] * scalar
    end
end
But I can't tell what you're trying to do with the addition. I suppose it should be the addition of two tables:
for i = 1,512 do
    for j = 1,512 do
        a[i][j] = b[i][j] + c[i][j]
    end
end
But that works only with same-sized tables.
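The octave loop in the question reads as a weighted sum of scalar noise samples, which only works if noise returns a number rather than a table. Here is a sketch of that reading, written in Python rather than Lua. The noise function is a hypothetical smooth placeholder, not real simplex noise, and dividing by the accumulated weight is an assumption (the question computes weightSum but never uses it).

```python
import math


def noise(x: float, y: float) -> float:
    """Hypothetical placeholder returning a scalar in [-1, 1]; not simplex noise."""
    return math.sin(x * 12.9898 + y * 78.233)


def octaved_noise(width: int, height: int, octaves: int,
                  roughness: float, scale: float) -> list:
    """Sum scalar noise samples over octaves, then normalise by the total weight."""
    total = [[0.0] * height for _ in range(width)]
    frequency, weight, weight_sum = scale, 1.0, 0.0
    for _ in range(octaves):
        for x in range(width):
            for y in range(height):
                # scalar + scalar * scalar: no table arithmetic involved
                total[x][y] += noise(x * frequency, y * frequency) * weight
        frequency *= 2
        weight_sum += weight
        weight *= roughness
    # Assumed normalisation step, so the result stays within [-1, 1].
    return [[value / weight_sum for value in row] for row in total]


noise_map = octaved_noise(8, 8, octaves=3, roughness=0.4, scale=0.005)
print(len(noise_map), len(noise_map[0]))
```

The same structure carries over to Lua once noise is a function from two coordinates to one number, and the recursion between noise and simplex_noise is removed.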
