How to resize (reshape) images for a CNN? Mathematical intuition behind resizing - image-processing

I have been working with images for a few months during my internship, and recently I have been wondering whether there is a mathematical way of resizing images.
Resizing is a fairly difficult task because newcomers like me often have little experience with image pre-processing.
My problem statement was gender classification using the human eye. I found it difficult because:
The images had 3 channels
The images were rectangular (17:11)
I tried resizing the images by following a few blogs that said to start small and then go up. That might have worked, but I still did not understand how small to start. I arbitrarily resized them to 800x800 and got a Resource Exhausted error (I was using a GPU).
So I ask the community whether there is a mathematical formula or a generalized way of doing the resizing.
Thank you in advance.

This only partially answers your question. Most people use transfer learning with a pre-designed architecture for computer vision tasks. Since almost all of these architectures are designed for square inputs, you will usually get better results by making your input images square. Another option is to pad your 17x11 image with zeros until it is square. (You need to test which one works best in your case, but the common practice is resizing to a square.)
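To make the padding option concrete, here is a minimal sketch with NumPy. The function name `pad_to_square` and the 170x110 pixel size are my own assumptions for illustration (the question only gives the 17:11 ratio), and the padding is centered, which is one choice among several.

```python
import numpy as np

def pad_to_square(image: np.ndarray, fill: int = 0) -> np.ndarray:
    """Pad an H x W (x C) image with `fill` so that height == width."""
    h, w = image.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    # Pad only the two spatial axes; leave any channel axis untouched.
    pad = [(top, side - h - top), (left, side - w - left)]
    pad += [(0, 0)] * (image.ndim - 2)
    return np.pad(image, pad, mode="constant", constant_values=fill)

# Hypothetical 17:11 image: 170 pixels wide, 110 pixels high, 3 channels.
img = np.ones((110, 170, 3), dtype=np.uint8)
square = pad_to_square(img)
print(square.shape)  # (170, 170, 3)
```

After padding, the square image can be resized to the model's input size without distorting the eye.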
Having 3-channel images is fine; almost all models are designed for 3-channel input (even for black-and-white images it is suggested to repeat the channel so the model gets a 3-channel input).
About resizing
In theory, you need to resize the image to the input size of the model you are going to use. For example, LeNet-5 was designed for MNIST images, which are 28x28 (the original network pads them to 32x32). In theory, larger images result in better model performance, but since your images are very low resolution you can start with a 28x28 or 224x224 architecture and later try bigger ones to see if they help in your case.
About the error: it's pretty normal. The memory your model needed at that input size was bigger than your GPU memory, so you got an out-of-memory error. You can use a smaller model (and a smaller input image size) on your device, or use a device with more GPU memory.
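To get a feel for why 800x800 can run out of memory, here is a rough back-of-the-envelope sketch. The batch size of 32 and the 64 channels are assumptions for illustration, not values from the question, and it counts only a single float32 feature map; a real network keeps many of these, plus gradients.

```python
def activation_bytes(batch, channels, height, width, bytes_per_value=4):
    """Memory for one float32 feature map of shape (batch, channels, height, width)."""
    return batch * channels * height * width * bytes_per_value

# Assumed values: batch of 32 images, 64 output channels in the first conv layer.
for side in (800, 224, 28):
    gib = activation_bytes(32, 64, side, side) / 2**30
    print(f"{side}x{side}: {gib:.2f} GiB")
```

The cost grows with height times width, so going from 224x224 to 800x800 multiplies this term by roughly 12.8.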
Finally, look at the input size of the architecture you are going to reuse to determine the right size for your dataset. If you are designing your own model, a good starting point is something around 28x28 (basically using LeNet), developing further based on needs and performance.
Resizing can be as easy as adding a transform with PyTorch's torchvision transforms (I mean you don't need to manually create a resized copy of the dataset):
import torchvision.transforms as T

# Resize every image to 224x224 on the fly when it is loaded.
transform = T.Compose([
    T.Resize((224, 224)),
])

Related

Image processing technique for image segmentation

I'm trying to create a model that segments various parts of an aerial image.
I'm using a dataset found on Kaggle: https://www.kaggle.com/datasets/bulentsiyah/semantic-drone-dataset
My question is about the right way to treat images for semantic segmentation.
Is it better to simply resize the images (e.g. 6000x4000 to 256x256 pixels), or to resize them less and then create patches from them (e.g. 6000x4000 to 1024x1024 pixels and then 256x256-pixel patches)?
I think that resizing an image too much may cause loss of information, but at the same time patching may not give the model a full view of the image.
I also found a notebook that got 96% accuracy just by resizing, so I'm not sure how to proceed:
https://www.kaggle.com/code/yesa911/aerial-semantic-segmentation-96-acc/notebook
I think there is no single correct answer to this. Depending on the number and size of the areas you want to segment, it seems unlikely that you will get an accurate segmentation with images of your size. However, if the image contains only large, easily detectable areas, I would definitely go for the approach without patches, since the patch approach is far more complex: it has more variables to consider (patch size, overlap between patches, edge treatment). Skipping it would save you a lot of implementation time for preprocessing and for stitching afterwards.
TL;DR: I would start without patching and, if the result is sufficient, stop there. Otherwise, try the patching approach afterwards.
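If you do end up trying patches, the simplest variant (no overlap, no edge treatment) can be sketched with NumPy as follows. The function name `split_into_patches` is hypothetical, and the sketch assumes the image has already been resized to a multiple of the patch size.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Assumes H and W are multiples of `patch`; overlap and edge handling
    are deliberately left out.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of patch"
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    # Reorder to (tile_row, tile_col, patch, patch, C), then flatten the tile grid.
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

resized = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a resized drone image
patches = split_into_patches(resized, 256)
print(patches.shape)  # (16, 256, 256, 3)
```

The masks have to be split the same way, and the predictions stitched back in the same tile order.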

how can i improve on my convnet performance with small dataset

I have a very small image dataset (about 8 images). I am aware that my model may overfit with such a small dataset, and I would like some ideas on how to deal with situations where the dataset is this small.
The best way to deal with this kind of issue is image augmentation. Several libraries provide it, such as OpenCV, Keras, and scikit-image. The basic idea behind image augmentation is to artificially create more images from one image by introducing changes such as rotating it, blurring parts of it, zooming in or out, changing the colors, flipping it, and many more. You can create 10x, 20x, 40x, etc. as many images from one image.
This method will help you generate more images, but remember that 8 images is a very small dataset, and the augmented images will, to some extent, share the features of the originals.
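As a small illustration of the idea, here is a sketch that uses only NumPy and only flips and 90-degree rotations. The function name `augment` is my own; the libraries mentioned above offer many more transforms (blur, zoom, color changes) that are not shown here.

```python
import numpy as np

def augment(image: np.ndarray) -> list:
    """Return the 8 flip/rotation variants of an H x W x C image."""
    variants = []
    for k in range(4):  # rotations by 0, 90, 180 and 270 degrees
        rotated = np.rot90(image, k, axes=(0, 1))
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # mirrored copy of each rotation
    return variants

image = np.arange(2 * 3 * 1).reshape(2, 3, 1)  # tiny stand-in image
print(len(augment(image)))  # 8
```

Applied to 8 images this gives 64 samples, all of them still closely tied to the originals.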

training custom object in YOLOv3, how does it work?

I have a project that needs to detect people in anime-style videos.
I just tested YOLOv3 608x608 with COCO on a GTX 1050 Ti.
However, the speed is only about 1.5 FPS, and I need at least 10 FPS on the 1050 Ti for my project.
1. Does the number of classes affect detection speed? (I assume COCO is about finding 80 kinds of objects in a picture. If I only need to find one kind of object, will it go 80x faster?)
2. My original training images are 1920x1080. Should I resize them to 608x608 before labeling and training?
3. Is there a labeling tool I should use? In README.md at https://github.com/AlexeyAB/darknet, <x> <y> <width> <height> seem to need to be calculated and entered by hand, which seems too hard. Maybe there is a tool where I just need to crop where the object is in the image?
4. If the object is not a square in the image, how does YOLO know which part is the object? How do I avoid training the background as the object?
Do I have to remove all the background and fill it with black, keeping only the object in the image?
5. Is the output always a box? Can I train and get a mask as output? If I detect a mask, will it be slower than a box, since it seems to carry more information?
6. To get a good result, how many training and test images should I make?
I know these are just beginner questions in CV, but I really want to know this without spending weeks training to find out the answers myself. An answer will be appreciated!
3.
https://en.wikipedia.org/wiki/List_of_manual_image_annotation_tools
You should be able to get the corner coordinates as output from an image annotation tool.
4.
With enough training images with different backgrounds, the model should learn to ignore the background. A black background is still a background. I guess that's a kind of data augmentation, so it might help reduce overfitting.
5.
If it does not support masks out of the box, you may want to do background subtraction as an extra step to process the output.
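For point 3, turning corner coordinates from an annotation tool into a darknet label line is simple arithmetic. The darknet README describes each line as a class id followed by the box center, width and height, all relative to the image size. A minimal sketch (the function name `corners_to_yolo` and the example box are my own):

```python
def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h, class_id=0):
    """Convert pixel corner coordinates to a darknet-style label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# A box covering the middle of a 1920x1080 frame.
print(corners_to_yolo(480, 270, 1440, 810, img_w=1920, img_h=1080))
# 0 0.500000 0.500000 0.500000 0.500000
```

Because the values are relative, the labels stay valid when the network resizes the image.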
1) In my opinion, a GTX 1050 Ti is not enough to test YOLOv3, because the model size (i.e. the number of layers) of YOLOv3 is much larger than in previous versions. The number of classes will not matter in this case. If you want faster inference, you should upgrade your GPU, for example to a 1070 Ti.
2) Whatever the size of the input images, they will be resized to the pre-defined size specified in the cfg file, so you don't need to resize the input images yourself.
1) I think it may affect the speed a bit, because with fewer classes you get fewer convolutional filters before each YOLO layer (you set it up in the .cfg file), but it's not likely to be an 80x speed-up.
2) Maybe? YOLO is going to resize them when training and testing anyway, so you could if you really want to, but high-res images usually work better, in my experience.
3) I like OpenLabeling (you can just Google it; it's on GitHub).
4) You may want to give YOLO negative images that contain none of the objects, to prevent it from picking up on the background.
5) YOLO doesn't do masks.
6) About 1k per class will probably work; you can get by with 500, but the rule of thumb is: the more, the better.
If you're interested, I've put out a whole series on YOLO on YouTube, so you may want to check it out: https://youtu.be/TP67icLSt1Y
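On point 1, the darknet README ties the number of filters in the convolutional layer before each [yolo] layer to the class count as filters = (classes + 5) x 3. A quick sketch of that arithmetic (the function name is my own):

```python
def yolo_v3_filters(num_classes, anchors_per_scale=3):
    """Filters in the conv layer before each [yolo] layer: (classes + 5) * anchors."""
    return (num_classes + 5) * anchors_per_scale

print(yolo_v3_filters(80))  # 255
print(yolo_v3_filters(1))   # 18
```

Only those final layers shrink; the rest of the network is unchanged, which is why dropping from 80 classes to 1 gives nowhere near an 80x speed-up.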

Extracting foreground objects from an image to run through convolution neural net

I am new to computer vision and image recognition. For my first CV project I am developing a tool that detects apples (the fruit) in images.
What I have so far:
I developed a convolutional neural network in Python using TensorFlow that determines whether something is an apple or not. The drawback is that my CNN only works on images where the apple is the only object in the image. My training data set looks something like:
What I want to achieve: I would like to be able to detect an apple in an image and put a border around it. The images, however, would be full of other objects, like in this image of a picnic:
Possible approaches:
Sliding Window: I would break my photo down into smaller images. I would start with a large window in the top-left corner and move right by a step size. When I reach the right border of the image, I would move down a certain number of pixels and repeat. This is effectively a sliding window, and every one of these smaller images would be run through my CNN.
The window size would get smaller and smaller until an apple is found. The downside is that I would be running hundreds of smaller images through my CNN, so detecting an apple would take a long time. Additionally, if there isn't an apple in the image, a lot of time would be wasted for nothing.
Extracting foreground objects: Another approach could be to extract all the foreground elements from an image (using OpenCV maybe?) and run those objects through my CNN.
Compared to the sliding window approach, I would be running a handful of images through my CNN instead of hundreds.
These are the two approaches I could think of, but I was wondering if there are better ones in terms of speed. The sliding window approach would eventually work, but it would take a really long time to get the bounding window of the apple.
I would really appreciate if someone could give me some guidance (maybe I'm on a completely wrong track?), a link to some reading material or some examples of code for extracting foreground elements. Thanks!
A better way to do this is to use the Single Shot MultiBox Detector (SSD) or "You Only Look Once" (YOLO). Until these approaches were designed, it was common to detect objects the way you suggest in the question.
A Python implementation of SSD is here. OpenCV is used in the YOLO implementation. You can train the networks anew for apples in case the current versions do not detect them, or if your project requires you to build a system from scratch.

find mosquitos' head in the image

I have images of mosquitos similar to these ones, and I would like to automatically draw a circle around the head of each mosquito in the images. They are obviously in different orientations, and there is a random number of them in each image. Some error is fine. Any ideas for algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to create your training set. For this you would extract small images with examples of what is and what is not a mosquito head.
Then you can use those images to train a classification algorithm. Be careful to have a balanced training set, since data skewed towards one class would hurt the performance of the algorithm. Since images are 2D and algorithms usually take 1D arrays as input, you will also need to arrange your images in that format (for instance: http://en.wikipedia.org/wiki/Row-major_order).
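As a tiny illustration of the row-major arrangement, assuming the patches are held as NumPy arrays:

```python
import numpy as np

patch = np.array([[1, 2, 3],
                  [4, 5, 6]])
row = patch.reshape(-1)  # NumPy flattens in row-major (C) order by default
print(row)  # [1 2 3 4 5 6]
```

Every training patch must have the same size so that all rows end up with the same length.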
I normally use support vector machines, but other algorithms such as logistic regression could do the trick too. If you decide to use support vector machines, I strongly recommend checking libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings for several programming languages. They also have a very easy-to-follow guide targeted at beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, the classifier should become tolerant to orientation on its own. If you don't have enough data, you could create more training rows from rotated copies of some samples, so you would have a more representative training set.
As for prediction: given an image, cut it using a grid where each cell has the same dimensions as the images in your training set. Then pass each of these cells to the classifier and mark the squares where the classifier gives a positive output. If you really need circles, take the center of the given square, with a radius of half the square's side (sorry for stating the obvious).
After you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others), since we have not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos at the same scale, we might still miss some of them just because they didn't fit our grid perfectly. To address this, we need to repeat this procedure (grid cut and predict) after rescaling the given image to different sizes. How many sizes? Here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" you are using, which is also something I would recommend experimenting with.
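The grid cut over several scales can be sketched like this. It only generates the cells; the classifier call is left out, and all names, the 32-pixel cell, and the two scales are assumptions for illustration.

```python
def grid_cells(img_w, img_h, cell):
    """Yield (x, y, cell, cell) boxes tiling an image; partial edge cells are dropped."""
    for y in range(0, img_h - cell + 1, cell):
        for x in range(0, img_w - cell + 1, cell):
            yield (x, y, cell, cell)

def multiscale_cells(img_w, img_h, cell, scales):
    """Scan rescaled copies of the image and map each cell back to original pixels."""
    for s in scales:
        w, h = int(img_w * s), int(img_h * s)
        for (x, y, cw, ch) in grid_cells(w, h, cell):
            # Dividing by the scale expresses the cell in original-image coordinates.
            yield (x / s, y / s, cw / s, ch / s)

def cell_to_circle(box):
    """Circle (center_x, center_y, radius) inscribed in a square cell."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2, w / 2)

boxes = list(multiscale_cells(640, 480, cell=32, scales=(1.0, 0.5)))
print(len(boxes))  # 370 (300 cells at scale 1.0 plus 70 at scale 0.5)
```

Each cell would be cropped, flattened and passed to the trained classifier; positive cells are then drawn with `cell_to_circle`.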
Some research that may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a plain white background, perhaps even one illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example, here I did a quick try taking the red channel, then subtracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be mosquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
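Two of these steps, the channel threshold and the PCA direction, can be sketched with NumPy alone. This is only an illustration under assumptions of mine: the image is in RGB channel order, pixels above the threshold are the ones kept (depending on the images the comparison may need to be flipped), and the mask already contains a single blob. Morphology and connected component labeling are not shown.

```python
import numpy as np

def threshold_mask(rgb: np.ndarray, blue_weight=5, level=80) -> np.ndarray:
    """Boolean mask where red - blue_weight * blue exceeds `level`."""
    red = rgb[..., 0].astype(np.int32)   # widen to avoid uint8 overflow
    blue = rgb[..., 2].astype(np.int32)
    return (red - blue_weight * blue) > level

def principal_axis(mask: np.ndarray) -> np.ndarray:
    """Unit vector (dx, dy) along the direction of largest spread of the blob."""
    ys, xs = np.nonzero(mask)
    coords = np.stack([xs, ys], axis=1).astype(float)
    coords -= coords.mean(axis=0)
    cov = np.cov(coords, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, np.argmax(eigvals)]

# Synthetic test image: one horizontal bar standing in for a mosquito body.
rgb = np.zeros((20, 40, 3), dtype=np.uint8)
rgb[8:12, 5:35, 0] = 200
mask = threshold_mask(rgb)
axis = principal_axis(mask)
print(np.round(np.abs(axis), 2))
```

The axis only gives the orientation of the body; deciding which end is the head still needs the cross-section comparison described above.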

Resources