Colab crashes when trying to create confusion matrix - machine-learning

I am trying to create a confusion matrix for my test set. My test set consists of 3585 images. Whenever I try to run the following code:
x_test, y_test = next(iter(dataloader))
y_pred = resnet(x_test)
Google Colab crashes after using all the available RAM. Does anyone have a workaround for this? Should I do this in batches?

Should I do this in batches?
Yes! Try reducing the batch size.
dataloader = ...  # recreate the DataLoader with a smaller batch size
...
y_pred = []
for x_batch, y_batch in dataloader:  # the loader yields (images, labels) pairs
    batch_y_pred = resnet(x_batch)
    y_pred.append(batch_y_pred)
I used a list with append here; you can collect the results another way.
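For reference, here is a minimal sketch of the batched approach end to end; it assumes the DataLoader yields (images, labels) pairs and that scikit-learn is available for the final confusion matrix:
import torch
from sklearn.metrics import confusion_matrix
resnet.eval()                            # inference mode
all_preds, all_labels = [], []
with torch.no_grad():                    # no gradients needed, saves memory
    for x_batch, y_batch in dataloader:  # small batches keep RAM usage low
        logits = resnet(x_batch)
        all_preds.append(logits.argmax(dim=1))
        all_labels.append(y_batch)
y_pred = torch.cat(all_preds).cpu().numpy()
y_true = torch.cat(all_labels).cpu().numpy()
cm = confusion_matrix(y_true, y_pred)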

Related

Can I change the batch size in a YOLO detection model in OpenCV?

I am loading a YOLO model with OpenCV in Python using cv2.dnn_DetectionModel(cfg,weights)
and then calling net.detect(img). I think I can get a per-image speed-up by using batches, but I don't see any support for a batch size other than one.
Is it possible to set the batch size?
net.detect does not support batch size > 1.
However, it's possible to do inference with batch size > 1 on darknet models with some extra work. Here is some partial code:
net = cv2.dnn.readNetFromDarknet(cfg, weights)
net.setInputNames(["input"])
net.setInputShape("input", (batch_size, 3, h, w))
blob = cv2.dnn.blobFromImages(image_list)  # one 4D blob holding the whole batch
net.setInput(blob)
results = net.forward(net.getUnconnectedOutLayersNames())
Now loop over all the images; for each layer output in results, extract the boxes and confidences for that image and each class, and once you have collected this info across every layer, pass it through cv2.dnn.NMSBoxes. This part is non-trivial, but doable.
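As a rough, untested sketch of that loop (it reuses batch_size, w, h, and results from the snippet above, and assumes each entry of results has shape (batch_size, num_detections, 5 + num_classes) with rows of (cx, cy, w, h, objectness, class scores...); verify those shapes on your own model):
import numpy as np
conf_threshold, nms_threshold = 0.5, 0.4
for img_idx in range(batch_size):
    boxes, confidences = [], []
    for layer_out in results:                 # one array per output layer
        for det in layer_out[img_idx]:
            scores = det[5:]
            confidence = float(scores[np.argmax(scores)])
            if confidence < conf_threshold:
                continue
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(confidence)
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    # `keep` holds the indices of detections that survive NMS for this image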
One idea is to combine the images manually, pass the combined image into the net, and then separate the results afterwards:
h1, w1 = im1.shape[:2]
h2, w2 = im2.shape[:2]
# create an empty matrix
vis = np.zeros((max(h1, h2), w1 + w2, 3), np.uint8)
# combine the 2 images side by side
vis[:h1, :w1, :3] = im1
vis[:h2, w1:w1 + w2, :3] = im2
After inference, you can separate them again:
result1 = pred[:h1, :w1, :]
result2 = pred[:h2, w1:w1 + w2, :]

How to load Omniglot in PyTorch

I'm trying to do some experiments on the Omniglot dataset, and I saw that PyTorch implements it. I've run the command
from torchvision.datasets import Omniglot
but I have no idea how to actually load the dataset. Is there a way to open it equivalent to how we open MNIST? Something like the following:
train_dataset = dsets.MNIST(root='./data',
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)
The final goal is to be able to open training and test set separately and run experiments on it.
You can apply exactly the same transformations; Omniglot contains images and labels just like MNIST. For example:
import torchvision
dataset = torchvision.datasets.Omniglot(
    root="./data", download=True, transform=torchvision.transforms.ToTensor()
)
image, label = dataset[0]
print(type(image))  # torch.Tensor
print(type(label))  # int
Instead of train and test, the Omniglot dataset uses background and evaluation terminology:
background_set = datasets.Omniglot(root='./data', background=True, download=True,
                                   transform=transforms.ToTensor())
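For completeness, a small sketch that loads both splits and wraps them in DataLoaders (background=False selects the evaluation split; batch size 32 is just an arbitrary choice):
import torchvision
from torch.utils.data import DataLoader
transform = torchvision.transforms.ToTensor()
background_set = torchvision.datasets.Omniglot(
    root="./data", background=True, download=True, transform=transform
)
evaluation_set = torchvision.datasets.Omniglot(
    root="./data", background=False, download=True, transform=transform
)
train_loader = DataLoader(background_set, batch_size=32, shuffle=True)
test_loader = DataLoader(evaluation_set, batch_size=32, shuffle=False)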

How to put datasets created by torchvision.datasets in GPU in one operation?

I'm dealing with CIFAR10 and I use torchvision.datasets to create it. I need the GPU to accelerate the computation, but I can't find a way to put the whole dataset onto the GPU at once. My model needs to use mini-batches, and it is really time-consuming to deal with each batch separately.
I've tried to put each mini-batch onto the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt  # assumed for the plot call below
num_workers = 1  # Set this as needed
def time_gpu_cast(batch_size=1):
    # `dataset` is the CIFAR10 dataset built with torchvision.datasets
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time
# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]
# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))
plt.plot(*zip(*cast_times))  # Plot the time taken
For num_workers = 1, the resulting plot showed no time saved by casting the whole dataset at once; with parallel loading (num_workers = 8) the difference was even clearer. (The plots are not reproduced here.)
I've got an answer and I'm gonna try it later. It seems promising.
You can write a dataset class where, in the __init__ function, you read the entire dataset, apply all the transformations you need, and convert it to tensor format. Then send this tensor to the GPU (assuming there is enough memory). Then, in __getitem__, you simply use the index to retrieve elements of that tensor, which is already on the GPU.
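A minimal sketch of that idea, assuming the whole of CIFAR10 fits in GPU memory; the class name GPUCachedCIFAR10 is just illustrative:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
class GPUCachedCIFAR10(Dataset):
    def __init__(self, root, train=True, device="cuda"):
        base = datasets.CIFAR10(root=root, train=train, download=True,
                                transform=transforms.ToTensor())
        # Load and transform everything once, then move the tensors to the GPU
        self.x = torch.stack([img for img, _ in base]).to(device)
        self.y = torch.tensor(base.targets).to(device)
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        # Index into tensors that already live on the GPU
        return self.x[idx], self.y[idx]
train_data = GPUCachedCIFAR10("./data", train=True)
# Keep num_workers at its default of 0, since the data is already on the GPU
loader = DataLoader(train_data, batch_size=128, shuffle=True)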

Very large networks in tensorflow

I am trying to train a neural network in tensorflow, but my array of weights is sufficiently large that I am running into the 2GB GraphDef limit. What is my best recourse in this situation?
Note: I am not really using the full functionality of tensorflow (e.g. my network has no optimizers). Rather, I am just using tensorflow as a way to perform some basic array ops on the GPU.
You are probably accidentally initialising a tf.Variable with a large constant. See https://github.com/tensorflow/tensorflow/issues/2382
Workaround from the github issue:
init_val = np.array(...) # Construct a large numpy array.
init_placeholder = tf.placeholder(tf.float32, shape=init_val.shape)
v = tf.Variable(init_placeholder)
# ...
sess.run(v.initializer, feed_dict={init_placeholder: init_val})

How to split documents into training set and test set?

I am trying to build a classification model. I have 1000 text documents in a local folder. I want to divide them into a training set and a test set with a split ratio of 70:30 (70% training, 30% test). What is the best approach to do so? I am using Python.
I wanted a programmatic approach to splitting the training set and test set: first, read the files in the local directory; second, build a list of those files and shuffle them; third, split them into a training set and a test set.
I tried a few ways using only built-in Python keywords and functions, but failed. Eventually I worked out how to approach it. Cross-validation is also a good option to consider when building general classification models.
Not sure exactly what you're after, so I'll try to be comprehensive. There will be a few steps:
1. Get a list of the files
2. Randomize the files
3. Split files into training and testing sets
4. Do the thing
1. Get a list of the files
Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. What we want to do is get a list of all of these files. This is done simply with the os module. I'm assuming you have no subdirectories; this would change if there were.
import os
def get_file_list_from_dir(datadir):
    all_files = os.listdir(os.path.abspath(datadir))
    data_files = list(filter(lambda file: file.endswith('.data'), all_files))
    return data_files
So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (equivalent in the shell to the glob /ml/data/*.data).
2. Randomize the files
We don't want the sampling to be predictable, as that is considered a poor way to train an ML classifier.
from random import shuffle
def randomize_files(file_list):
    shuffle(file_list)
Note that random.shuffle performs an in-place shuffling, so it modifies the existing list. (Of course this function is rather silly since you could just call shuffle instead of randomize_files; you can write this into another function to make it make more sense.)
3. Split files into training and testing sets
I'll assume a 70:30 ratio instead of any specific number of documents. So:
from math import floor
def get_training_and_testing_sets(file_list):
    split = 0.7
    split_index = floor(len(file_list) * split)
    training = file_list[:split_index]
    testing = file_list[split_index:]
    return training, testing
4. Do the thing
This is the step where you open each file and do your training and testing. I'll leave this to you!
Cross-Validation
Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that you use every document for training and testing. You can customize how many documents are used for training in each "fold". I could go more into depth on this if you like, but I won't if you don't want to do it.
Edit: Alright, since you asked, I'll explain this a little more.
So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing — just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.
Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times. The first time will train on documents 1-100 and test on 101-1000. The second fold will train on 101-200 and test on 1-100 and 201-1000.
If we did, say, a 40-fold CV system, the first fold would train on documents 1-25 and test on 26-1000; the second fold would train on 26-50 and test on 1-25 and 51-1000; and so on.
To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator — a function which we can iterate through like a list.
def cross_validate(data_files, folds):
    if len(data_files) % folds != 0:
        raise ValueError(
            "invalid number of folds ({}) for the number of "
            "documents ({})".format(folds, len(data_files))
        )
    fold_size = len(data_files) // folds
    for split_index in range(0, len(data_files), fold_size):
        training = data_files[split_index:split_index + fold_size]
        testing = data_files[:split_index] + data_files[split_index + fold_size:]
        yield training, testing
That yield keyword at the end is what makes this a generator. To use it, you would use it like so:
def ml_function(datadir, num_folds):
    data_files = get_file_list_from_dir(datadir)
    randomize_files(data_files)
    for train_set, test_set in cross_validate(data_files, num_folds):
        do_ml_training(train_set)
        do_ml_testing(test_set)
Again, it's up to you to implement the actual functionality of your ML system.
As a disclaimer, I'm no expert by any means, haha. But let me know if you have any questions about anything I've written here!
That's quite simple if you use NumPy. First load the documents and make them a NumPy array, and then:
import numpy as np
docs = np.array([
    'one', 'two', 'three', 'four', 'five',
    'six', 'seven', 'eight', 'nine', 'ten',
])
idx = np.hstack((np.ones(7), np.zeros(3)))  # generate indices
np.random.shuffle(idx)  # shuffle to make training data and test data random
train = docs[idx == 1]
test = docs[idx == 0]
print(train)
print(test)
The result (one possible random shuffle):
['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
['four' 'five' 'seven']
Just make a list of the filenames using os.listdir(). Use random.shuffle() to shuffle the list, and then take training_files = filenames[:700] and testing_files = filenames[700:].
You can use the train_test_split function provided by scikit-learn. See the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
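For example, a short sketch applying it to a list of filenames (test_size=0.3 gives the 70:30 split asked for; the /ml/data path and .data extension follow the assumptions used earlier):
import os
from sklearn.model_selection import train_test_split
filenames = [f for f in os.listdir("/ml/data") if f.endswith(".data")]
train_files, test_files = train_test_split(filenames, test_size=0.3,
                                           random_state=42)  # random_state only for reproducibility
print(len(train_files), len(test_files))  # e.g. 700 and 300 for 1000 files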
