Object detection training on large images - machine-learning

Trying to detect number plates by using faster RCNN on the images 4096 by 8192 pixels.
Instead of resizing for training I have cropped some parts of the image and labeled the number plate and trained, This way it's working but It cannot detect on actual images but only on small images.
Please guide me what is the best way to achieve such a job. How should I feed the training, and how the configuration has to be on faster_rcnn_inception_v2_pets.config. Or if you think faster RCNN is not good for this kind of job please suggest a better way, I need over %80 accuracy at least.
I have searched a lot on google but couldn't find anyone working on above 8k images.
I am attaching a sample image below as well.
https://ibb.co/NKWWd7q
I have tried to do annotated training on 4096 by 8192 pixels images over google cloud servers, it consumed over 250gb RAM on single batch size.
Kind Regards.

Here my answer as collections of my comments:
It seems that your number plates might be too small compared to the rest of the image.
Try to first extract the car with e.g. YOLO, extract it and then run your number-plate detection network again. Keep in mind that you maybe have to adjust the bounding box (the size of the car extracted) to fit your networks input size.
An example to detect cars with YOLO can be found here.

Related

How to resize(reshape) the images in CNN? Mathematical intuition behind resizing

I have been working on Images for few months for my internship, and recently I have been wondering that is there a mathematical way of resizing the images.
This becomes a fairly difficult task to resize the images because many a times freshers like me have little experience about the pre-processing in Images.
Given that my problem statement was Gender classification using the human eye. However I found it difficult because
The images were 3 channel
The images were in rectangular shape (17:11)
I did try to resize the images by following few blogs which said to start small and then go up, while it could have worked I still did not understand how small. I resized them to 800,800 randomly and go Resource Exhaustive error(I was using GPU).
So I ask the community if there is any such mathematical formula or a generalized way of doing the resizing task.
Thank you in advance.
This partially answers your question. But, normally many people use transfer learning and a pre-designed architecture for computer vision tasks. Since almost all architecture is designed for square input shape, you can get a better results by making the shape of your input image squared. Another solution would be only padding your 17X11 to make it square by 0 values. (you need to test to see which one works best in your case, but the common practice is re-shaping to square.)
It is fine to have 3 channel images, almost all images are designed for 3 channel input ( even for BW images it is suggested to repeat the channel to have 3 channel input for the model)
About resizing
About resizing the image, in theory, you need to resize the image to the model you are going to use. For example, LeNet-5 accepts images of Mnist with size 28x28. In theory, larger images result in better model performance, but in your case, the images are super low resolution you can start with 28x28 or 224x224 architectures and later use bigger ones and see if it helps in your case.
About the error it's pretty normal your model size was going to be bigger than your GPU memory so, you see Out of memory error. you can use a smaller model ( and smaller input image size) with your device, or you need to use a device with bigger GPU memory.
Finally, you should consider the size of architecture you are going to reuse to determine the correct resize of the dataset you need. If you are designing your model then best starting point can be something around 28x28 ( basically using Lenet) and later developing based on needs/performance.
the resizing can be as easy as calling a Transform with Pytorch transforms like ( i mean you don't need to manually recreate a copy of the dataset just for resizing)
T.Compose([
T.RandomResize(224)
])

training custom object in YOLOv3, how does it work?

I got an project needs to detect person in anime-like style vedios
I just tested YOLOv3 608x608 with COCO in GTX 1050TI
however speed is only at about ~1.5FPS , but I need at least 10 FPS on 1050TI for my project
1.I want to know that does the number of the classes will effect detection speed? (I assume COCO is about finding 80 kinds object in picture? if I just need find one kind of object, will it go 80x faster?)
2.when I input image for training ,original image are 1920*1080, should I resize them to 608x608 before labeling and training?
3.is there any labeling tool should I use? in README.md at https://github.com/AlexeyAB/darknet <x> <y> <width> <height> seems need to be calculate and input by hand which seems too hard, maybe there is a tool I just need to crop where the object is in image?
4.if the object is not a square in image, how does YOLO know which part are object? How to avoid it train background as object?
do I have to remove all background and fill it as black, only keep the object in image?
5.is the output always a box? can I train and get output as mask? if I detect as mask, will it slower then box because it seems to be more information?
6.to get a good result, how many training image and test image should I make?
I know it's just some noob question in CV, however I really want to know this without spending weeks in training and find out answer myself , an answer will be appreciated!
3.
https://en.wikipedia.org/wiki/List_of_manual_image_annotation_tools
You should be able to get output of corners coordinates by using some image annotation tool.
4.
With enough images with different background for training, supposedly the model should be able to ignore background. A black background is still a background. I guess that's a kind of data augmentation, so it might help reduce overfitting.
5.
If it does not support mask out-of-the-box, maybe you want to do background-subtraction as an extra step to process the output.
1) In my opinion, GTX 1050Ti is not enough to test YOLO v3. Because, the model size (i.e. the number of layers) of the YOLO v3 becomes extremely large compared with the previous versions. The number of classes will be not matter in this case. If you want fast test computing speed, you should upgrade your GPU like 1070Ti.
2) Whatever the size of input images, it will be resized into the pre-defined size, which is depicted as cfg file, by force, so you don't need to resize the input image.
1) I think it may affect the speed a bit in because as you use less classes you get less convolutional filters before each YOLO layer (you set it up in the .cfg file), but it's not likely gonna be an 80x speed up
2) Maybe? I mean, YOLO's gonna resize them when training and then testing, so maybe if you really want to you could, but high res images usually work better, in my experience.
3)I like the OpenLabelling (you can just Google it and it's on GitHub)
4) You may wanna give YOLO negative images that have nothing in them to prevent them picking up on a background, where there's nothing there
5)YOLO doesn't do masks
6)About 1k per class is what probably will work, you can get by with 500 but the rule of thumb is that the more, the better)
If you're interested, I've put out the whole series on YOLO on YouTube, so you may wanna check it out: https://youtu.be/TP67icLSt1Y

TensorFlow for image recognition, size of images

How can size of an image effect training the model for this task?
My current training set holds images that are 2880 X 1800, but I am worried this may be too large to train. In total my sample size will be about 200-500 images.
Would this just mean that I need more resources (GPU,RAM, Distribution) when training my model?
If this is too large, how should I go about resizing? -- I want to mimic real-world photo resolutions as best as possible for better accuracy.
Edit:
I would also be using TFRecord format for the image files
Your memory and processing requirements will be proportional to the pixel size of your image. Whether this is too large for you to process efficiently will depend on your hardware constraints and the time you have available.
With regards to resizing the images there is no one answer, you have to consider how to best preserve information that'll be required for your algorithm to learn from your data while removing information that won't be useful. Reducing the size of your input images won't necessarily be a negative for accuracy. Consider two cases:
Handwritten digits
Here the images could be reduced considerably in size and maintain all the structural information necessary to be correctly identified. Have a look at the MNIST data set, these images are distributed at 28 x 28 resolution and identifiable to 99.7%+ accuracy.
Identifying Tree Species
Imagine a set of images of trees where individual leaves could help identify species. Here you might find that reducing the image size reduces small scale detail on leaf shape in a way that's detrimental to the model, but you might find that you get a similar result with a tight crop (which preserves individual leaves) rather than an image resize. If this is the case you may find that creating multiple crops from the same image gives you an augmented data set for training that considerably improves results (which is something to consider, if possible, given your training set is very small)
Deep learning models are achieving results around human level in many image classification tasks: if you struggle to identify your own images then it's less likely you'll train an algorithm to. This is often a useful starting point when considering the level of scaling that might be appropriate.
If you are using GPUs to train, this will def affect your training time. Tensorflow does most of the GPU allocation so you don't have to worry about that. But with big photos you will be experiencing long training time although your dataset is small. You should consider data-augmentation.
You could complement your resizing with the data-augmentation. Resize in equal dimensions and then perform reflection and translation (as in geometric movement)
If your images are too big, your GPU might run out of memory before it can start training because it has to store the convolution outputs on its memory. If that happens, you can do some of the following things to reduce memory consumption:
resize the image
reduce batch size
reduce model complexity
To resize your image, there are many scripts just one Google search away, but I will add that in your case 1440 by 900 is probably a sweet spot.
Higher resolution images will result in a higher training time and an increased memory consumption (mainly GPU memory).
Depending on your concrete task, you might want to reduce the image size in order to therefore fit a reasonable batch size of let's say 32 or 64 on the GPU - for stable learning.
Your accuracy is probably affected more by the size of your training set. So instead of going for image size, you might want to go for 500-1000 sample images. Recent publications like SSD - Single Shot MultiBox Detector achieve high accuracy values like an mAP of 72% on the PascalVOC dataset - with "only" using 300x300 image resolution.
Resizing and augmentation: SSD for instance just scales every input image down to 300x300, independent of the aspect ratio - does not seem to hurt. You could also augment your data by mirroring, translating, ... etc (but I assume there are built-in methods in Tensorflow for that).

Preprocessing before CNN: Resizing vs Cropping

I'm using a simple neural network (similar to AlexNet) to classify images into categories. As a preprocessing stage, input images are resized to 256x256 before being fed into the network.
Lately, I have run into the following problem: Many of the images I deal with are of very high resolution (say, 2000x2000). In this case, doing a "hard resize" results in a severe loss of information. For example, a small 100x100 face, easily recognisable in the original image, would be unrecognisable in the resized version. In such cases, I may prefer taking several crops of the 2000x2000 image and run the classification on each crop.
I'm looking for a method to automatically determine which type of pre-processing is most adequate. Ideally, it would be able to recognize, for example, that a high resolution image of a single face should be resized, whereas a high resolution image of a crowd should be cropped several times. The basic requirements, on my part:
As computationally efficient as possible. Hence, something like a "sliding window" would be probably be ruled out (it is computationally cheaper to just crop all the images).
Ability to balance between recall and precision
What I considered thus far:
"Low-level" (image processing) approach: Implement an algorithm that uses local image information (like gradients) to distinguish between high resolution and low resolution images.
"High-level" (semantic) approach: Run the images through a pre-trained network for segmentation of some sort, and use its oputput to determine the appropriate pre-procssing.
I want to try the first option first, but not exactly sure how to go about it. Is there anything I can do in the Fourier domain? Something in OpenCv I can try? Does anyone have any suggestions/thoughts? Other ideas would be very welcome too. Thanks!

semantic segmentation for large images

I am working on a limited number of large size images, each of which can have 3072*3072 pixels. To train a semantic segmentation model using FCN or U-net, I construct a large sample of training sets, each training image is 128*128.
In the prediction stage, what I do is to cut a large image into small pieces, the same as trainning set of 128*128, and feed these small pieces into the trained model, get the predicted mask. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism to perform the semantic segmentation against the large images?
Your solution is often used for this kind of problem. However, I would argue that it depends on the data if it truly makes sense. Let me give you two examples you can still find on kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics what you're trying to solve, I can only give a general recommendation: Try to keep the input images as large as your hardware allows. In many situation I would rather downsample the original images than breaking it down into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extend, that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advice feeding the big image (3072x3072) directly into the caffe.
Batch of small images will fit better into the memory and parallel programming will too come into play.
Data Augmentation will also be feasible.
Output for big Image:
As for the output of big Image, you better recast the input size of FCN to 3072x3072 during test phase. Because, layers of FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.

Resources