I am currently working on instance segmentation. I follow these two tutorials:
These two tutorials work perfectly with a single class, such as person + background. In my case, however, I have two classes (person and car) + background, and I couldn't find any resources on making Mask R-CNN work with multiple object classes.
Note that:
I am using PyTorch (torchvision): torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0
I am using Pascal VOC annotations
I used the segmentation class (not the XML file) + the images
This is my dataset class:
import os
import numpy as np
import torch
from PIL import Image
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "img"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "imgMask"))))
    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "img", self.imgs[idx])
        mask_path = os.path.join(self.root, "imgMask", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
    def __len__(self):
        return len(self.imgs)
Can anyone help me?
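One possible direction, offered as a hedged sketch rather than a tested fix: with more than one class, the labels can no longer be all ones, because each instance needs its own class id. The pure-Python example below assumes two masks per image: an instance mask (one id per object, as in the dataset class above) and a class mask (one id per class, e.g. 1 = person, 2 = car), similar to Pascal VOC's SegmentationObject and SegmentationClass folders. The function name `labels_from_masks` and the toy masks are hypothetical.

```python
from collections import Counter

def labels_from_masks(instance_mask, class_mask):
    """Return {instance_id: class_id}, ignoring background (id 0).

    Both masks are 2D lists of ints with identical shape. Each instance
    gets the class id that covers most of its pixels.
    """
    votes = {}
    for inst_row, cls_row in zip(instance_mask, class_mask):
        for inst_id, cls_id in zip(inst_row, cls_row):
            if inst_id == 0 or cls_id == 0:
                continue  # skip background pixels
            votes.setdefault(inst_id, Counter())[cls_id] += 1
    return {inst_id: counts.most_common(1)[0][0]
            for inst_id, counts in sorted(votes.items())}

# Toy 3x4 example: instances 1 and 2 are persons (class 1),
# instance 3 is a car (class 2).
instance_mask = [
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 3, 3, 3],
]
class_mask = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 2, 2, 2],
]
labels = labels_from_masks(instance_mask, class_mask)
print(labels)  # {1: 1, 2: 1, 3: 2}
```

In the dataset class, these values (in instance order) could take the place of the all-ones labels tensor, and the model would then need to be built for 3 classes (two object classes + background). Whether both kinds of mask are available depends on how the data was annotated.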


Pytorch: Add information to images in image prediction

I would like to add information to my current dataset. At the moment, I have six-frame sequences in folders. The DataLoader reads all six frames and uses the first 3 to predict the last 1, 2, or 3 (depending on how many I tell it to). This is the Dataset class used by the DataLoader:
import glob
import cv2
import torch
from torch.utils.data import Dataset
from torchvision.transforms import ToTensor
class TrainFeeder(Dataset):
    def __init__(self, data_set):
        super(TrainFeeder, self).__init__()
        self.input_data = data_set
        if torch.cuda.current_device() == 0:
            print('There are total %d sequences in trainset' % len(self.input_data))
    def __getitem__(self, index):
        path = self.input_data[index]
        imgs_path = sorted(glob.glob(path + '/*.png'))
        imgs = []
        for img_path in imgs_path:
            img = cv2.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = cv2.resize(img, (256,448))
            img = cv2.resize(img, (0, 0), fx=0.5, fy=0.5, interpolation=cv2.INTER_CUBIC) #has been 0.5 for official data, new is fx = 2.63 and fy = 2.84
            img_tensor = ToTensor()(img).float()
            imgs.append(img_tensor)  # collect each frame so the list can be stacked below
        imgs = torch.stack(imgs, dim=0)
        return imgs
    def __len__(self):
        return len(self.input_data)
Now I'd like to add one value to these images. It is a boolean that I have stored in a list in a .json file in the same folder as the six-frame sequences. But I don't know how to add the values from the list in the .json to the tensor. Which dimension should I use? Will the system work at all if I change the shape of the input?
The __getitem__ method can return anything, so you can return a tuple instead of just the images:
def __getitem__(self, index):
    path = ...
    # load your 6 images
    imgs = torch.stack( ... )
    # load your boolean metadata
    metadata = load_json_data( ... )
    # return them both
    return (imgs, metadata)
You will need to make metadata a tensor before returning it; otherwise I expect that PyTorch will complain about not being able to collate (i.e. stack) the samples into batches.
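As a rough sketch of the metadata step, the example below assumes the .json file contains a flat list of booleans and lives in the sequence folder under the hypothetical name `flags.json`; the helper `load_flags` is likewise made up for illustration. It converts the booleans to floats, a form that could then be wrapped with `torch.tensor(...)` before being returned from __getitem__.

```python
import json
import os
import tempfile

def load_flags(sequence_dir, filename="flags.json"):
    """Read a JSON list of booleans and return it as a list of floats."""
    with open(os.path.join(sequence_dir, filename)) as f:
        flags = json.load(f)
    # True -> 1.0, False -> 0.0, so the values can become a float tensor later
    return [float(bool(flag)) for flag in flags]

# Demo with a temporary folder standing in for one six-frame sequence folder
with tempfile.TemporaryDirectory() as sequence_dir:
    with open(os.path.join(sequence_dir, "flags.json"), "w") as f:
        json.dump([True, False, True], f)
    flags = load_flags(sequence_dir)

print(flags)  # [1.0, 0.0, 1.0]
```

If every sequence has a list of the same length, the resulting tensors share one shape, which is what the default batching needs in order to stack them.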
"Will the system work" is a question only you can answer, since you did not provide the code of your ML model. I would bet on : "no but it won't require significant changes to work". Most likely you currently have a loop like
for imgs in dataloader:
    # do some training
    output = model(imgs)
And you will have to change it to
for imgs, metadata in dataloader:
    # do some training
    output = model(imgs)

Any way to efficiently stack/ensemble pre-trained models for image classification?

I am trying to stack a few pre-trained models that I have through taking the last hidden layer of each model and then concatenating them together and then plugging them into a meta-learner model (e.g. XGBoost).
I am running into a big problem of having to process each image of my dataset multiple times since each base model requires a different processing method. This is causing my model to take a really long time to train and is infeasible. Is there any way to work past this?
For example:
model_1, processor_1 = pretrained_model(), pretrained_processor()
model_2, processor_2 = pretrained_model2(), pretrained_processor2()
for img in images:
    input_1 = processor_1(img)
    input_2 = processor_2(img)
    out_1 = model_1(input_1)
    out_2 = model_2(input_2)
    torch.cat((out_1, out_2), dim=1) #concatenates hidden representations to feed into another model
Here's a recommendation if you want to process your images faster:
Note: I did not test this out
import torch
import torch.nn as nn
# Create a stacked nn module
class StackedModel(nn.Module):
    def __init__(self, model1, model2):
        super(StackedModel, self).__init__()
        self.model1 = model1
        self.model2 = model2
    def forward(self, input_1, input_2):
        out_1 = self.model1(input_1)
        out_2 = self.model2(input_2)
        return torch.cat((out_1, out_2), dim=1)
# Init model
model = StackedModel(model_1, model_2)
# Try to stack and run in a larger batch, assuming you have extra GPU space
stacked_preproc1 = []
stacked_preproc2 = []
max_batch_size = 16
total_output = []
for index, img in enumerate(images):
    stacked_preproc1.append(processor_1(img))
    stacked_preproc2.append(processor_2(img))
    # Run the model once a full batch is collected, or at the last image
    if len(stacked_preproc1) == max_batch_size or index == len(images) - 1:
        batch_1 = torch.stack(stacked_preproc1)
        batch_2 = torch.stack(stacked_preproc2)
        total_output.append(model(batch_1, batch_2))
        # Reset arrays
        stacked_preproc1 = []
        stacked_preproc2 = []
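The batching idea can be illustrated on its own, without any model. This is a minimal sketch; the helper name `batched` is an assumption for illustration. It groups items into lists of at most `max_batch_size`, including the final partial batch, which is easy to drop by accident.

```python
def batched(items, max_batch_size):
    """Yield consecutive lists of at most max_batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch

batches = list(batched(range(5), max_batch_size=2))
print(batches)  # [[0, 1], [2, 3], [4]]
```

Each yielded list could then be preprocessed, stacked, and passed through the models in a single forward call.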

Custom dataset Loader pytorch

I am doing COVID-19 classification with a dataset from Kaggle. It has a folder named dataset which contains 3 folders (normal, pneumonia, and covid-19), each containing the images for that class. I am stuck on writing __getitem__ for a custom PyTorch dataset.
The dataset has 189 COVID images, but with this __getitem__ I get 920 COVID images. Kindly help.
import os
import random
import shutil
class_names = ['normal', 'viral', 'covid']
root_dir = 'COVID-19 Radiography Database'
source_dirs = ['NORMAL', 'Viral Pneumonia', 'COVID-19']
if os.path.isdir(os.path.join(root_dir, source_dirs[1])):
    os.mkdir(os.path.join(root_dir, 'test'))
    for i, d in enumerate(source_dirs):
        os.rename(os.path.join(root_dir, d), os.path.join(root_dir, class_names[i]))
    for c in class_names:
        os.mkdir(os.path.join(root_dir, 'test', c))
    for c in class_names:
        images = [x for x in os.listdir(os.path.join(root_dir, c)) if x.lower().endswith('png')]
        selected_images = random.sample(images, 30)
        for image in selected_images:
            source_path = os.path.join(root_dir, c, image)
            target_path = os.path.join(root_dir, 'test', c, image)
            shutil.move(source_path, target_path)
The above code is used to create the test dataset, which has 30 images of each class.
class ChestXRayDataset(torch.utils.data.Dataset):
    def __init__(self, image_dirs, transform):
        def get_images(class_name):
            images = [x for x in os.listdir(image_dirs[class_name]) if x.lower().endswith('png')]
            print(f'Found {len(images)} {class_name} examples')
            return images
        self.images = {}
        self.class_names = ['normal', 'viral', 'covid']
        for class_name in self.class_names:
            self.images[class_name] = get_images(class_name)
        self.image_dirs = image_dirs
        self.transform = transform
    def __len__(self):
        return sum([len(self.images[class_name]) for class_name in self.class_names])
    def __getitem__(self, index):
        class_name = random.choice(self.class_names)
        index = index % len(self.images[class_name])
        image_name = self.images[class_name][index]
        image_path = os.path.join(self.image_dirs[class_name], image_name)
        image = Image.open(image_path).convert('RGB')
        return self.transform(image), self.class_names.index(class_name)
**I am stuck on the __getitem__ of this class**
The images in the folder are arranged as follows
The dataset is as follows
**The code for the confusion matrix is:**
nb_classes = 3
confusion_matrix = torch.zeros(nb_classes, nb_classes)
with torch.no_grad():
    for data in tqdm_notebook(dl_train,total=len(dl_train),unit='batch'):
        img,lab = data
        img,lab = img.to(device),lab.to(device)
        _,output = torch.max(model(img),1)
        for t, p in zip(lab.view(-1), output.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1
According to the confusion matrix output, only one class is being learned.
Putting your images in a dictionary complicates the manipulation; use a list instead. Also, your Dataset should not have any randomness: shuffling of the data should happen in the DataLoader, not in the Dataset.
Use something like the following:
class ChestXRayDataset(torch.utils.data.Dataset):
    def __init__(self, image_dirs, transform):
        def get_images(class_name):
            images = [x for x in os.listdir(image_dirs[class_name]) if x.lower().endswith('png')]
            print(f'Found {len(images)} {class_name} examples')
            return images
        self.images = []
        self.labels = []
        self.class_names = ['normal', 'viral', 'covid']
        for class_name in self.class_names:
            images = get_images(class_name)
            # This is a list containing all the images
            self.images.extend(images)
            # This is a list containing all the corresponding image labels
            self.labels.extend([class_name] * len(images))
        self.image_dirs = image_dirs
        self.transform = transform
    def __len__(self):
        return len(self.images)
    # Will return the image and its label at the position `index`
    def __getitem__(self, index):
        # image at index position of all the images
        image_name = self.images[index]
        # Its label
        class_name = self.labels[index]
        image_path = os.path.join(self.image_dirs[class_name], image_name)
        image = Image.open(image_path).convert('RGB')
        return self.transform(image), self.class_names.index(class_name)
If you iterate over it, say using
ds = ChestXRayDataset(image_dirs, transform)
for x, y in ds:
    print(x.shape, y)
you should see all the images and their labels in sequential order.
In a real use case, however, you would use a torch DataLoader and pass it the ds object with the shuffle parameter set to True. The DataLoader then takes care of shuffling by calling __getitem__ with shuffled index values.
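A torch-free sketch of the flat-list idea is shown below. The file names and class sizes are made up for illustration; the point is only that a flat, deterministic index visits every image exactly once, so the per-class counts match the folder contents.

```python
from collections import Counter

# Hypothetical per-class file lists (sizes chosen only for illustration)
files_by_class = {
    "normal": [f"n{i}.png" for i in range(5)],
    "viral": [f"v{i}.png" for i in range(4)],
    "covid": [f"c{i}.png" for i in range(2)],
}

class FlatIndex:
    """Deterministic index -> (file name, class name) mapping."""
    def __init__(self, files_by_class):
        self.images, self.labels = [], []
        for class_name, files in files_by_class.items():
            self.images.extend(files)
            self.labels.extend([class_name] * len(files))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        return self.images[index], self.labels[index]

ds = FlatIndex(files_by_class)
counts = Counter(label for _, label in (ds[i] for i in range(len(ds))))
print(dict(counts))  # {'normal': 5, 'viral': 4, 'covid': 2}
```

By contrast, picking the class with random.choice on every call draws each of the three classes about a third of the time regardless of its size, which would be consistent with a 189-image class showing up roughly 920 times.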

Loading custom dataset of images using PyTorch

I'm using the coil-100 dataset which has images of 100 objects, 72 images per object taken from a fixed camera by turning the object 5 degrees per image. Following is the folder structure I'm using:
data/train/obj1/obj01_0.png, obj01_5.png ... obj01_355.png
data/train/obj85/obj85_0.png, obj85_5.png ... obj85_355.png
data/test/obj86/obj86_0.png, obj86_5.png ... obj86_355.png
data/test/obj100/obj100_0.png, obj100_5.png ... obj100_355.png
I have used the ImageFolder and DataLoader classes. The train and test datasets loaded properly and I can print the class names.
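import torch
import torchvision
from torchvision import transforms
train_path = 'data/train/'
test_path = 'data/test/'
data_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_data = torchvision.datasets.ImageFolder(
    root = train_path,
    transform = data_transforms
)
test_data = torchvision.datasets.ImageFolder(
    root = test_path,
    transform = data_transforms
)
# DataLoader arguments (batch size, shuffling) are left at their defaults here
train_loader = torch.utils.data.DataLoader(train_data)
test_loader = torch.utils.data.DataLoader(test_data)
classes = train_data.class_to_idx
print("detected classes: ", classes)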
In my model I wish to pass every image through a pretrained ResNet and build a dataset from the ResNet outputs to feed into a bidirectional LSTM.
For this I need to access the images by class name and index.
For example, pre_resnet_train_data['obj01'][0] should be obj01_0.png and post_resnet_train_data['obj01'][0] should be the ResNet output of obj01_0.png, and so on.
I'm a beginner in PyTorch, and for the past 2 days I have read many tutorials and Stack Overflow questions about creating a custom dataset class, but I couldn't figure out how to achieve what I want.
Please help!
Assuming you only plan on running ResNet on the images once and saving the output for later use, I suggest you write your own dataset class, derived from ImageFolder.
Save each ResNet output at the same location as the image file, with a .pth extension.
import os
import torch
import torchvision
class MyDataset(torchvision.datasets.ImageFolder):
    def __init__(self, root, transform):
        super(MyDataset, self).__init__(root, transform)
    def __getitem__(self, index):
        # override ImageFolder's method
        """
        Args:
            index (int): Index
        Returns:
            tuple: (sample, resnet, target) where target is class_index of the target class.
        """
        path, target = self.samples[index]
        sample = self.loader(path)
        if self.transform is not None:
            sample = self.transform(sample)
        if self.target_transform is not None:
            target = self.target_transform(target)
        # this is where you load your resnet data
        resnet_path = os.path.splitext(path)[0] + '.pth'  # replace image extension with .pth
        resnet = torch.load(resnet_path)  # load the stored features
        return sample, resnet, target
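For the access pattern asked about (class name first, then index), one option is to group the (path, class_index) pairs that ImageFolder keeps in its samples attribute. The sketch below is pure Python so that it runs without the dataset; `group_by_class` and `angle_of` are hypothetical helper names, and the sample list is made up. It sorts each class by the rotation angle in the file name, because plain string sorting would place obj01_10.png before obj01_5.png.

```python
import os
import re
from collections import defaultdict

def angle_of(path):
    """Extract the trailing rotation angle, e.g. 'obj01_355.png' -> 355."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return int(re.search(r"_(\d+)$", stem).group(1))

def group_by_class(samples, classes):
    """Map class name -> list of image paths ordered by rotation angle.

    samples: list of (path, class_index) pairs, as in ImageFolder.samples
    classes: list of class names, as in ImageFolder.classes
    """
    grouped = defaultdict(list)
    for path, class_index in samples:
        grouped[classes[class_index]].append(path)
    return {name: sorted(paths, key=angle_of) for name, paths in grouped.items()}

# Made-up stand-ins for ImageFolder.samples and ImageFolder.classes
classes = ["obj1", "obj2"]
samples = [
    ("data/train/obj1/obj01_10.png", 0),
    ("data/train/obj1/obj01_5.png", 0),
    ("data/train/obj1/obj01_0.png", 0),
    ("data/train/obj2/obj02_0.png", 1),
]
pre_resnet_train_data = group_by_class(samples, classes)
print(pre_resnet_train_data["obj1"][0])  # data/train/obj1/obj01_0.png

# Feature file path next to the image: swap the extension for .pth
feature_path = os.path.splitext(pre_resnet_train_data["obj1"][1])[0] + ".pth"
print(feature_path)  # data/train/obj1/obj01_5.pth
```

The dictionary keys are the folder names, so with the layout in the question the key would be 'obj1' rather than 'obj01'. A matching dictionary of ResNet outputs could be built by loading each feature path in the same order.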

GAN changing input size causes error

The code below accepts only 32*32 input. I want to feed in 128*128 images; how should I go about it? The code is from the tutorial:
def generator(z):
    zP = slim.fully_connected(z,4*4*256,normalizer_fn=slim.batch_norm,\
    zCon = tf.reshape(zP,[-1,4,4,256])
    gen1 = slim.convolution2d_transpose(\
        activation_fn=tf.nn.relu,scope='g_conv1', weights_initializer=initializer)
    gen2 = slim.convolution2d_transpose(\
        activation_fn=tf.nn.relu,scope='g_conv2', weights_initializer=initializer)
    gen3 = slim.convolution2d_transpose(\
        activation_fn=tf.nn.relu,scope='g_conv3', weights_initializer=initializer)
    g_out = slim.convolution2d_transpose(\
        scope='g_out', weights_initializer=initializer)
    return g_out
def discriminator(bottom, reuse=False):
    dis1 = slim.convolution2d(bottom,16,[4,4],stride=[2,2],padding="SAME",\
    dis2 = slim.convolution2d(dis1,32,[4,4],stride=[2,2],padding="SAME",\
        reuse=reuse,scope='d_conv2', weights_initializer=initializer)
    dis3 = slim.convolution2d(dis2,64,[4,4],stride=[2,2],padding="SAME",\
    d_out = slim.fully_connected(slim.flatten(dis3),1,activation_fn=tf.nn.sigmoid,\
        reuse=reuse,scope='d_out', weights_initializer=initializer)
    return d_out
Below is the error which I get when I feed 128*128 images.
Trying to share variable d_out/weights, but specified shape (1024, 1) and found shape (16384, 1).
The generator produces 32*32 images, so feeding images of any other size to the discriminator results in the given error.
The solution is to generate 128*128 images from the generator, by either
1. Adding more layers (2 in this case), or
2. Changing the input to the generator
zP = slim.fully_connected(z,16*16*256,normalizer_fn=slim.batch_norm,\
zCon = tf.reshape(zP,[-1,16,16,256])
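The size arithmetic behind this can be checked with a few lines of plain Python. The sketch assumes, as the discriminator code and the error message suggest, that every strided layer uses stride 2 with SAME padding, so that each transposed convolution doubles the spatial size and each convolution halves it (rounding up). That the generator has three such layers is also an assumption, and the helper names are hypothetical.

```python
import math

def generator_output_size(start_size, num_stride2_layers):
    """Spatial size after stride-2, SAME-padded transposed convolutions."""
    return start_size * 2 ** num_stride2_layers

def discriminator_flat_features(image_size, num_stride2_layers, channels):
    """Flattened feature count after stride-2, SAME-padded convolutions."""
    size = image_size
    for _ in range(num_stride2_layers):
        size = math.ceil(size / 2)  # SAME padding: output = ceil(input / stride)
    return size * size * channels

print(generator_output_size(4, 3))    # 32  (4x4 start, 3 layers)
print(generator_output_size(4, 5))    # 128 (4x4 start, two extra layers)
print(generator_output_size(16, 3))   # 128 (16x16 start, same 3 layers)

print(discriminator_flat_features(32, 3, 64))   # 1024
print(discriminator_flat_features(128, 3, 64))  # 16384
```

The last two numbers match the two shapes in the error message, (1024, 1) and (16384, 1).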
