Labeling runways for localization and detection using deep learning - machine-learning

Shown above is a sample image of runway that needs to be localized(a bounding box around runway)
i know how image classification is done in tensorflow, My question is how do I label this image for training?
I want model to output 4 numbers to draw bounding box.
In CS231n they say that we use a classifier and a localization head.
but how does my model knows where are the runnway in 400x400 images?
In short How do I LABEL this image for training? So that after training my model detects and localizes(draw bounding box around this runway) runways from input images.
Please feel free to give me links to lectures, videos, github tutorials from where I can learn about this.
**********Not CS231n********** I already took that lecture and couldnt understand how to solve using their approach.
Thanks

If you want to predict bounding boxes, then the labels are also bounding boxes. This is what most object detection systems use for training. You can just have bounding box labels, or if you want to detect multiple object classes, then also class labels for each bounding box would be required.

Collect data from google or any resources that contains only runway photos (From some closer view). I would suggest you to use a pre-trained image classification network (like VGG, Alexnet etc.) and fine tune this network with downloaded runway data.
After building a good image classifier on runway data set you can use any popular algorithm to generate region of proposal from the image.
Now take all regions of proposal and pass them to classification network one by one and check weather this network is classifying given region of proposal as positive or negative. If it classifying as positively then most probably your object(Runway) is present in that region. Otherwise it's not.
If there are a lot of region of proposal in which object is present according to classifier then you can use non maximal suppression algorithms to reduce number of positive proposals.

Related

How to verify if the image contains noise in background before ‘OCR’ing

I have several types of images that I need to extract text from.
I can manually classify the images into 3 categories based on the noise on the background:
Images with no noise.
Images with some light noise in the background.
Heavy noise in the background.
For the category 1 images, I could apply OCR’ing fine without problems. → basic case.
For the category 2 images and some of the category 3 images, I could manage to extract the texts by applying the following methods:
Grayscale, Gaussian blur, Otsu’s threshold
Morph open to remove noise and invert the image
→ then perform text extraction.
For the OCR’ing task, one removing noise method is obviously not working for all images. So, Is there any method for classifying the level background noise of the images?
Please all suggestions are welcome.
Thanks in advance.
Following up on your comment from other question here are some things you could try. Some combinations of ideas below should help.
Image Embedding and Vector Clustering
Manual
Use a pretrained network such as resnet on imagenet (may not work good) or a simple pretrained network trained on MNIST/EMNIST.
Extract and concat some layers flattened weight vectors toward end of network. Apply dimensionality reduction and apply nearest neighbor/approximate nearest neighbor algorithms to find closest matches. Set number of clusters 3 as you have 3 types of images.
For nearest neighbor start with KNN. There are also many libraries in github that may help such as faiss, annoy etc.
More can be found from,
https://github.com/topics/nearest-neighbor-search
https://github.com/topics/approximate-nearest-neighbor-search
If result of above is not good enough try finetuning only last few layers MNIST/EMNIST trained network.
Using Existing Libraries
For grouping/finding similar images look into,
https://github.com/jina-ai/jina
You should be able to find more similarity clustering using tags neural-search, image-search on github.
https://github.com/topics/neural-search
https://github.com/topics/image-search
OCR
Try easyocr as it worked better for me than tesserect last time used ocr.
Run it first on whole document to see if requirements met.
Use not so tight cropping instead some/large padding around text if possible with no other text nearby. Another way is try padding in all direction in tight cropped text to see if it improves ocr result.
For tesserect see if tools mentioned in improving quality doc helps.
Classification
If you already have data sorted into 3 different directory and want to classify future images only then I suggest a neural network. Modify mnist or cifar example of pytorch or tensorflow to train and classify test images.
Based on sample images it looks like computer font instead of handwritten text. If that is the case Template matching at multiple scales may help. You have to see if the noise affects the matching result. Image from, https://www.pyimagesearch.com/2021/03/22/opencv-template-matching-cv2-matchtemplate/
Noise Removal
Here also you can go with a neural network. Train a denoising autoencoder with Category 1 images, corrupted type 1 images by adding noise that mimicks Category 2 and Category 3 images. This way the neural network will classify the 3 image categories without needing manually create dataset and in post processing you can use another neural network or image processing method to remove noise based on category type. Image from, https://keras.io/examples/vision/autoencoder/
Try existing libraries or pretrained networks on github to remove noise in the whole document/cropped region. Look into rembg if it works on text documents.
Your samples are not very convincing. All images binarize easily (threshold 25).

CNN approaches to detection which are not bounding box based

Most of the approaches to detection problems are based more or less on some form of bounding box proposal, which turns to two class (positive/negative) classification. I found lots of materials on these topics.
But I was wondering, aren't there some approaches taking the whole image as an input, then sending it through several convolutional and pooling layers, output of which would be two numbers (x, y position of the object)? Of course this would mean that there's just one object in the image. So far I haven't find anything about this, should I consider it not usable?

Image recognition and Uniqueness detection

I am new to AI/ML and am trying to use the same for solving the following problem.
I have a set of (custom) images which while having common characteristics also will have a unique pattern/signature and color value. What set of algorithms should I use to have the pass in following manner:
1. Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
2. Identify the unique pattern/signature to identify each image individually. The pattern/signature could be shape (visible to human eye or hidden like using an overlay shape using background image with no boundaries).
3. Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
While this is in way similar to face recognition algo, for me saturation/shadow will matter while being direction independent.
I figure that using CNN may be the way to go for step#2 and SVN for step#1, any input on training, specifics will be appreciated. What about step#3, use BGR2HSV? The objective is to use ML/AI and not get into machine-vision.
Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
In a sense, what you want is a classifier that can detect patterns in an image. However, we can train classifiers to detect certain types of patterns in images.
For example, I can train a classifier to recognise squares and circles, but if I show it an image with a triangle in it, I cannot expect it to tell me it is a triangle, because it has never seen it before. The downside is, your classifier will end up misclassifying it as one of the shapes it knows to exist: either square or circle. The upside is, you can prevent this.
Identify the unique pattern/signature to identify each image individually.
What you want to do is train a classifier on a large amount of labelled data. If you want the classifier to detect squares, circles, or triangles in an image, you must train it with a large amount of labelled images of squares, circles and triangles.
Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
Now, you are leaving the territory of simple image labelling and entering the world of computer vision. This is not as simple as a vanilla image classifier, but it is possible and there are a lot of online tools to help you do this. For example, you may take a look at OpenCV. They have an implementation in python and C++.
I figure that using CNN may be the way to go for step#2 and SVN for
step#1
You can combine step 1 and step 2 with a Convolutional Neural Network (CNN). You do not need to use a two step prediction process. However, beware, if you pass the CNN an image of a car, it will still label it as a shape. You can, again circumvent this by training it on a million positive samples of shapes, and a million negative samples of random other images with the class "Other". This way, anything that is not a shape will get classified into "Other". This is one possibility.
What about step#3, use BGR2HSV? The objective is to use ML/AI and not
get into machine-vision.
With the inclusion of this step, there is no option but to get into computer vision. I am not exactly sure how to go about this, but I can guarantee OpenCV will provide you a way to do this. In fact, with OpenCV, you will no longer need to implement your own CNN, because OpenCV has its own image labelling libraries.

CNN Object Localization Preprocessing?

I'm trying to use a pretrained VGG16 as an object localizer in Tensorflow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and either toss on a 4D/4000D fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning ,crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the Detection or segmentation (like MS-CoCo) challenges, or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled, and you try to output a pixel mask of the same size

Extracting Image Attributes

I am doing a project in computer vision and I need some help.
The objective of my project is to extract the attributes of any object - for example if I have a Nike running shoe, I should be able to figure out that it is a shoe in the first place, then figure out that it is a Nike shoe and not an Adidas shoe (possibly because of the Nike tick) and then figure out that it is a running shoe and not football studs.
I have started off by treating this as an image classification problem and I am using the following steps:
I have taken training samples (around 60 each) of say shoes, heels, watches and extracted their features using Dense SIFT.
Creating a vocabulary using k-means clustering (arbitrarily chosen the vocabulary size to be 600).
Creating a Bag-Of-Words representation for the images.
Training an SVM classifier to obtain a bag-of-words (feature vector) for every class (shoe,heel,watch).
For testing, I extracted the feature vector for the test image and found its bag-of-words representation from the already created vocabulary.
I compared the bag-of-words of the test image with that of each class and returned the class which matched closest.
I would like to know how I should proceed from here? Will feature extraction using D-SIFT help me identify the attributes as it only represents the gradient around certain points?
And sometimes, my classification goes wrong, for example if I have trained the classifier with the images of a left shoe, and a watch, a right shoe is classified as a watch. I understand that I have to include right shoes in my training set to solve this problem, but is there any other approach that I should follow?
Also is there any way to understand the shape? For example if I have trained the classifier for watches, and there are watches with both circular and rectangular dials in the training set, can I identify the shape of any new test image? Or do I simply have train it separately for watches with circular and rectangular dials?
Thanks

Resources