YOLOv5 image augmentation - machine-learning

I am using roboflow to augment images currently, but some augmentations seem to overlap between roboflow and YOLOv5.
I saw that YOLOv5 augments images, but by how much does it change the total sample size of the images that are inputted into the model? Would I be better off implementing the augmentations on my own for better control?

Related

Autoencoder vs Pre-trained network for feature extraction

I wanted to know if anyone has any sort of guidance on what is better for image classification with a small amount of samples per class (arround 20) yet a lot of classes (about 400) for relatively big RGB images (arround 600x600).
I know that Autoencoders can be used for feature extraction, such that I can just let an autoencoder run on the images unsupervised, and thus reduce the dimensionality of the images to train on those dimensionally-reduced images.
Similarly, I also know that you can just use a pre-trained network, strip the final layer and change it into a linear layer to your own dataset's number of classes, and then just train that final layer or a few layers before it to fit your dataset.
I haven't been able to find any resources online that determine which of these two techniques for feature extraction is better and under which conditions; anyone has any advice?

How to verify if the image contains noise in background before ‘OCR’ing

I have several types of images that I need to extract text from.
I can manually classify the images into 3 categories based on the noise on the background:
Images with no noise.
Images with some light noise in the background.
Heavy noise in the background.
For the category 1 images, I could apply OCR’ing fine without problems. → basic case.
For the category 2 images and some of the category 3 images, I could manage to extract the texts by applying the following methods:
Grayscale, Gaussian blur, Otsu’s threshold
Morph open to remove noise and invert the image
→ then perform text extraction.
For the OCR’ing task, one removing noise method is obviously not working for all images. So, Is there any method for classifying the level background noise of the images?
Please all suggestions are welcome.
Thanks in advance.
Following up on your comment from other question here are some things you could try. Some combinations of ideas below should help.
Image Embedding and Vector Clustering
Manual
Use a pretrained network such as resnet on imagenet (may not work good) or a simple pretrained network trained on MNIST/EMNIST.
Extract and concat some layers flattened weight vectors toward end of network. Apply dimensionality reduction and apply nearest neighbor/approximate nearest neighbor algorithms to find closest matches. Set number of clusters 3 as you have 3 types of images.
For nearest neighbor start with KNN. There are also many libraries in github that may help such as faiss, annoy etc.
More can be found from,
https://github.com/topics/nearest-neighbor-search
https://github.com/topics/approximate-nearest-neighbor-search
If result of above is not good enough try finetuning only last few layers MNIST/EMNIST trained network.
Using Existing Libraries
For grouping/finding similar images look into,
https://github.com/jina-ai/jina
You should be able to find more similarity clustering using tags neural-search, image-search on github.
https://github.com/topics/neural-search
https://github.com/topics/image-search
OCR
Try easyocr as it worked better for me than tesserect last time used ocr.
Run it first on whole document to see if requirements met.
Use not so tight cropping instead some/large padding around text if possible with no other text nearby. Another way is try padding in all direction in tight cropped text to see if it improves ocr result.
For tesserect see if tools mentioned in improving quality doc helps.
Classification
If you already have data sorted into 3 different directory and want to classify future images only then I suggest a neural network. Modify mnist or cifar example of pytorch or tensorflow to train and classify test images.
Based on sample images it looks like computer font instead of handwritten text. If that is the case Template matching at multiple scales may help. You have to see if the noise affects the matching result. Image from, https://www.pyimagesearch.com/2021/03/22/opencv-template-matching-cv2-matchtemplate/
Noise Removal
Here also you can go with a neural network. Train a denoising autoencoder with Category 1 images, corrupted type 1 images by adding noise that mimicks Category 2 and Category 3 images. This way the neural network will classify the 3 image categories without needing manually create dataset and in post processing you can use another neural network or image processing method to remove noise based on category type. Image from, https://keras.io/examples/vision/autoencoder/
Try existing libraries or pretrained networks on github to remove noise in the whole document/cropped region. Look into rembg if it works on text documents.
Your samples are not very convincing. All images binarize easily (threshold 25).

How to apply MNIST trained NN on a big image

I have a Neural network (NN) in deeplearning4j trained with MNIST to recognise digits on an image. As MNIST set contains 28x28 pixel images, I am able to predict class of an 28x28 image using this NN.
I am trying to find out how to apply this NN on a picture of a handwritten page? How to convert text from that image to actual text (OCR)? Basically, What kind of preprocessing is needed and how to find out part of image where the text is? How to derive smaller pieces of that image to apply NN individually?
You may want to explore HTR using Tensorflow (Handwritten text recognition). There are some interesting implementations already available and used widely as baseline models for the same. One such can be found here.
The architecture above details out how to design such as system. You can modify this further to suit your requirements of course.
If you are working with a combination of data, or trying to understand preprocessing steps for such images, here is a link that could guide you.
The primary preprocessing step is to detect and crop words so that those are manageable by the underlying TensorFlow HTR or tesseract architecture.
You may want to look at cropyble, which packages the cropping and the word extraction in one go. You can use this specifically for just cropping the image to extract word sequences for other downstream tasks

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (on tensorflow) for digit recognition using MNIST dataset.
Accuracy on test set was close to 98%.
I wanted to predict the digits using data which I created myself and the results were bad.
What I did to the images written by me?
I segmented out each digit and converted to grayscale and resized the image into 28x28 and fed to the model.
How come that I get such low accuracy on my data set where as such high accuracy on test set?
Are there other modifications that i'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your hand written digits in a way that is too different from your training set.
When capturing your data you should try to mimic as much as possible the process used to create the MNIST dataset:
From the oficial MNIST dataset website:
The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the center of mass
of the pixels, and translating the image so as to position this point
at the center of the 28x28 field.
If your data has a different processing in the training and test phases then your model is not able to generalize from the train data to the test data.
So I have two advices for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset;
Add some of your examples to your training data to allow your model to train on images similar to the ones you are classifying;
For those still have a hard time with the poor quality of CNN based models for MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.

image augmentation algorithms for preparing deep learning training set

To prepare large amounts of data sets for training deep learning-based image classification models, we usually have to rely on image augmentation methods. I would like to know what are the usual image augmentation algorithms, are there any considerations when choosing them?
The litterature on data augmentation is very very large and very dependent on your kind of applications.
The first things that come to my mind are the galaxy competition's rotations and Jasper Snoeke's data augmentation.
But really all papers have their own tricks to get good scores on special datasets for exemples stretching the image to a specific size before cropping it or whatever and this in a very specific order.
More practically to train models on the likes of CIFAR or IMAGENET use random crops and random contrast, luminosity perturbations additionally to the obvious flips and noise addition.
Look at the CIFAR-10 tutorial on TF website it is a good start. Plus TF now has random_crop_and_resize() which is quite useful.
EDIT: The papers I am referencing here and there.
It depends on the problem you have to address, but most of the time you can do:
Rotate the images
Flip the image (X or Y symmetry)
Add noise
All the previous at the same time.

Resources