Extracting foreground objects from an image to run through a convolutional neural net - OpenCV

I am new to computer vision and image recognition. For my first CV project I am developing a tool that detects apples (the fruit) in images.
What I have so far:
I developed a convolutional neural net in Python using TensorFlow that determines whether something is an apple or not. The drawback is that my CNN only works on images where the apple is the only object in the image. My training data set looks something like this:
What I want to achieve: I would like to be able to detect apples in an image and put a border around each of them. The images, however, would be full of other objects, like in this image of a picnic:
Possible approaches:
Sliding Window: I would break my photo down into smaller images. I would start with a large window in the top-left corner and move it to the right by a step size. When I reach the right border of the image, I would move down a certain number of pixels and repeat. This is effectively a sliding window, and every one of these smaller images would be run through my CNN.
The window size would get smaller and smaller until an apple is found. The downside is that I would be running hundreds of smaller images through my CNN, which would take a long time to detect an apple. Additionally, if there isn't an apple in the image, a lot of time would be wasted for nothing.
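For reference, a minimal sketch of the window generation I have in mind (the sizes, step, and filename are just placeholders):

import cv2

def sliding_windows(image, window_size, step):
    # yield (x, y, crop) for every window position, left to right, top to bottom
    h, w = image.shape[:2]
    for y in range(0, h - window_size + 1, step):
        for x in range(0, w - window_size + 1, step):
            yield x, y, image[y:y + window_size, x:x + window_size]

image = cv2.imread("picnic.jpg")  # placeholder filename
for size in (400, 300, 200, 100):  # arbitrary window sizes, largest first
    for x, y, crop in sliding_windows(image, size, step=size // 2):
        pass  # resize `crop` to the CNN's input size and classify it here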
Extracting foreground objects: Another approach could be to extract all the foreground elements from an image (using OpenCV maybe?) and run those objects through my CNN.
Compared to the sliding window approach, I would be running a handful of images through my CNN vs. hundreds of images.
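Something like edge detection plus contours in OpenCV might give me those foreground candidates (a very rough, untested sketch; the filename, thresholds, and area cutoff are placeholders, and it assumes OpenCV 4.x):

import cv2

image = cv2.imread("picnic.jpg")  # placeholder filename
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)               # edge map of the scene
edges = cv2.dilate(edges, None, iterations=2)  # close small gaps in the edges

# OpenCV 4.x returns (contours, hierarchy)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

candidates = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 2500:  # drop tiny regions (arbitrary cutoff)
        candidates.append(image[y:y + h, x:x + w])  # crops to feed into the CNN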
These are two approaches I could think of, but I was wondering if there are better ones out there in terms of speed. The sliding window approach would eventually work, but it will take a really long time to get the border window of the apple.
I would really appreciate if someone could give me some guidance (maybe I'm on a completely wrong track?), a link to some reading material or some examples of code for extracting foreground elements. Thanks!

A better way to do this is to use the Single Shot MultiBox Detector (SSD) or "You Only Look Once" (YOLO). Before these approaches were designed, it was common to detect objects the way you suggest in the question.
There is a Python implementation of SSD here. OpenCV is used in the YOLO implementation. You can retrain the networks for apples, in case the current versions do not detect them, or if your project requires you to build a system from scratch.
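If you just want to try a pretrained detector first, OpenCV's dnn module can run the standard Darknet YOLOv3 files, and COCO already includes an "apple" class. A rough sketch, assuming you have downloaded yolov3.cfg and yolov3.weights separately (the image filename is a placeholder):

import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

image = cv2.imread("picnic.jpg")
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

h, w = image.shape[:2]
for output in outputs:
    for det in output:                     # det = [cx, cy, bw, bh, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))  # index into coco.names
        if scores[class_id] > 0.5:         # confidence threshold
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            cv2.rectangle(image, (x, y), (x + int(bw), y + int(bh)), (0, 255, 0), 2)

cv2.imwrite("detections.jpg", image)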

Related

How to resize (reshape) images for a CNN? Mathematical intuition behind resizing

I have been working on images for a few months for my internship, and recently I have been wondering whether there is a mathematical way of resizing images.
Resizing images becomes a fairly difficult task because, many a time, freshers like me have little experience with image pre-processing.
My problem statement was gender classification using images of the human eye. However, I found it difficult because:
The images were 3-channel
The images were rectangular (17:11 aspect ratio)
I did try to resize the images by following a few blogs that said to start small and then go up; while that could have worked, I still did not understand how small. I resized them to 800x800 at random and got a resource-exhausted error (I was using a GPU).
So I ask the community if there is any such mathematical formula or a generalized way of doing the resizing task.
Thank you in advance.
This partially answers your question. Normally, many people use transfer learning and a pre-designed architecture for computer vision tasks. Since almost all architectures are designed for a square input shape, you can get better results by making your input images square. Another solution would be to pad your 17:11 images with zero values to make them square. (You need to test which works best in your case, but the common practice is reshaping to square.)
It is fine to have 3-channel images; almost all architectures are designed for 3-channel input (even for black-and-white images it is suggested to repeat the channel so that the model gets 3-channel input).
About resizing
In theory, you need to resize the images to the input size of the model you are going to use. For example, LeNet-5 accepts MNIST images of size 28x28. Generally, larger images result in better model performance, but since your images are very low resolution you can start with 28x28 or 224x224 architectures and later use bigger ones to see if that helps in your case.
About the error: it is pretty normal; your model was bigger than your GPU memory, so you see an out-of-memory error. You can use a smaller model (and a smaller input image size) on your device, or you need a device with more GPU memory.
Finally, you should consider the input size of the architecture you are going to reuse to determine the correct resize for your dataset. If you are designing your own model, then the best starting point can be something around 28x28 (basically using LeNet), later developing it based on needs/performance.
The resizing can be as easy as applying a PyTorch transform (i.e. you don't need to manually recreate a resized copy of the dataset), for example:
import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),  # resize every image on the fly as it is loaded
    T.ToTensor(),
])

Quantifying differences in an image sequence to measure activity

I'm looking for a program that will enable me to quantify the difference between images in an image sequence over time.
We are hoping to use timelapse images to measure the activity of tadpoles by comparing how the images change over time. Tracking the movement of individuals isn’t necessary. The tadpoles are dark and the background of the aquarium is light, however the background isn’t uniform and some of the decor items like dark rocks and foliage make it so that all the tadpoles aren’t visible at all times.
Basically, I need a program that will allow me to quantify the differences/motion detected in an image sequence (i.e. 209 images) and produce data that can be exported...
Any and all suggestions appreciated!!
Your question is rather vague and you don't supply any images or real indication of what you expect as results, so my answer will not be as thorough as it might otherwise be.
You don't mention any tools you are familiar with, but my recommendation would be Python and OpenCV. Alternatives are probably scikit-image or Python Wand.
In general, when trying to detect movement across a series of images, you would:
try and work out what the background is
look for movement by subtracting, or differencing, frames from the background
clean up the difference image
identify objects - maybe by shape or size or colour
maybe track objects
produce statistics
As regards working out the background, I did an example here by finding the median pixel across all images at each location in the images. There is also an OpenCV tutorial here.
As regards cleaning up images, you can probably remove noise in the background subtraction with a small median filter, say 3x3 or 5x5 depending on the resolution of your images.
As regards detecting tadpoles, you will probably want to use OpenCV findContours() and filter by size, colour, or circularity. There are some fairly decent tutorials on PyImageSearch. There is also an ImageMagick "Connected Components" analysis I did here to find a tennis player.
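Putting those steps together, a minimal sketch in Python/OpenCV might look like the following (the file pattern, blur size, and threshold are guesses you would need to tune for your images):

import cv2
import csv
import glob
import numpy as np

files = sorted(glob.glob("frames/*.png"))  # placeholder path to the image sequence
frames = [cv2.imread(f, cv2.IMREAD_GRAYSCALE) for f in files]

# estimate the static background as the per-pixel median over all frames
background = np.median(np.stack(frames), axis=0).astype(np.uint8)

with open("activity.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["frame", "changed_pixels"])
    for name, frame in zip(files, frames):
        diff = cv2.absdiff(frame, background)  # difference from the background
        diff = cv2.medianBlur(diff, 5)         # suppress small noise
        _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
        writer.writerow([name, int(cv2.countNonZero(mask))])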

Training a custom object in YOLOv3 - how does it work?

I have a project that needs to detect people in anime-style videos.
I just tested YOLOv3 608x608 with COCO on a GTX 1050 Ti;
however, the speed is only about 1.5 FPS, and I need at least 10 FPS on the 1050 Ti for my project.
1. Does the number of classes affect detection speed? (I assume COCO is about finding 80 kinds of objects in a picture; if I only need to find one kind of object, will it go 80x faster?)
2. When I input images for training, the originals are 1920x1080; should I resize them to 608x608 before labeling and training?
3. Is there a labeling tool I should use? In the README.md at https://github.com/AlexeyAB/darknet, <x> <y> <width> <height> seems to have to be calculated and entered by hand, which seems too hard; maybe there is a tool where I just need to mark where the object is in the image?
4. If the object is not a square in the image, how does YOLO know which part is the object? How do I avoid it training on the background as the object?
Do I have to remove all the background and fill it with black, keeping only the object in the image?
5. Is the output always a box? Can I train it to output a mask instead? If I detect with a mask, will it be slower than a box, since it seems to carry more information?
6. To get a good result, how many training and test images should I make?
I know these are just noob questions in CV, but I really want to know this without spending weeks training and finding the answers out myself. An answer will be appreciated!
3.
https://en.wikipedia.org/wiki/List_of_manual_image_annotation_tools
You should be able to get the corner coordinates as output by using one of these image annotation tools.
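If the tool gives you pixel corner coordinates, converting them into the normalized <x> <y> <width> <height> values that the AlexeyAB repo expects is just arithmetic; a small sketch (the example numbers are made up):

def to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    # Darknet label format: <class> <x_center> <y_center> <width> <height>,
    # with all four values relative to the image dimensions
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / float(img_w)
    height = (y_max - y_min) / float(img_h)
    return x_center, y_center, width, height

# e.g. a box with corners (500, 300) and (900, 800) in a 1920x1080 frame
print(to_yolo(500, 300, 900, 800, 1920, 1080))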
4.
With enough training images that have different backgrounds, the model should supposedly be able to ignore the background. A black background is still a background. I guess that's a kind of data augmentation, so it might help reduce overfitting.
5.
If it does not support masks out of the box, maybe you want to do background subtraction as an extra step to process the output.
1) In my opinion, a GTX 1050 Ti is not enough to test YOLOv3, because the model size (i.e. the number of layers) of YOLOv3 is extremely large compared with the previous versions. The number of classes does not matter much in this case. If you want fast inference speed, you should upgrade your GPU to something like a 1070 Ti.
2) Whatever the size of the input images, they will be forcibly resized to the pre-defined size specified in the cfg file, so you don't need to resize the input images yourself.
1) I think it may affect the speed a bit, because with fewer classes you get fewer convolutional filters before each YOLO layer (you set this up in the .cfg file), but it's not likely going to be an 80x speed-up.
2) Maybe? YOLO is going to resize the images during training and testing anyway, so you could if you really wanted to, but high-resolution images usually work better, in my experience.
3) I like OpenLabelling (you can just Google it; it's on GitHub).
4) You may want to give YOLO negative images that have nothing in them, to prevent it from picking up on the background.
5) YOLO doesn't do masks.
6) About 1k images per class will probably work; you can get by with 500, but the rule of thumb is: the more, the better.
If you're interested, I've put out a whole series on YOLO on YouTube, so you may want to check it out: https://youtu.be/TP67icLSt1Y

How to match images with unknown rotation differences

I have a collection of about 3000 images that were taken from a camera suspended from a weather balloon in flight. The camera is pointing in a different direction in each image but is generally aimed down, so all the images share a significant area (40-50%) with the previous image, but at a slightly different scale and rotated by an arbitrary (and not consistent) amount. The image metadata includes a timestamp, so I do know with certainty the correct order of images and the elapsed time between each.
I want to process these images into a single video. If I simply string them together it will be great for making people seasick, but won't really capture the amazingness of the set :)
The specific part I need help with is finding the rotation of each image relative to the previous image. Is there a library somewhere that can identify regions of overlap between two images when the images themselves are rotated relative to each other? If I can find 2-3 common points (or more), I can do the remaining calculations to determine the amount of rotation and the offset so I can put them together correctly. Alternatively, if there is a library that calculates both of those things for me, that would be even better.
I can do this in any language, with a slight preference for either Java or Python. The data is in Hadoop, so Java is the most natural language, but I can use scripting languages as well if necessary.
Since I'm new to image processing, I don't even know where to start. Any help is greatly appreciated!
For a problem like this, you could look into SIFT. This algorithm detects local features in images. OpenCV has an implementation of it; you can read about it here.
You could also try SURF, which is a similar type of algorithm. OpenCV also has this implemented; you can read about that here.
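As a rough sketch of turning such feature matches into a rotation/scale estimate with OpenCV (the filenames are placeholders, and this assumes an opencv-python build that ships SIFT):

import cv2
import numpy as np

img1 = cv2.imread("frame_001.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filenames
img2 = cv2.imread("frame_002.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Lowe's ratio test keeps only distinctive correspondences
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in good])
dst = np.float32([kp2[m.trainIdx].pt for m in good])

# similarity transform (rotation + uniform scale + translation), robust to outliers
M, inliers = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
angle = np.degrees(np.arctan2(M[1, 0], M[0, 0]))
scale = np.hypot(M[0, 0], M[1, 0])
print(f"rotation: {angle:.1f} deg, scale: {scale:.3f}")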

Sparse Image matching in iOS

I am building an iOS app that, as a key feature, incorporates image matching. The problem is that the images I need to recognize are small 10x10 orienteering plaques with simple, large text on them. They can be quite reflective and will be outside (so the light conditions will be variable). Sample image
There will be up to 15 of these types of image in the pool and really all I need to detect is the text, in order to log where the user has been.
The problem I am facing is that the image matching software I have tried, Aurasma and (slightly more successfully) ARLabs, can't distinguish between the plaques, as it is primarily built to work with detailed images.
I need to accurately detect which plaque is being scanned, and I have considered using GPS to refine the selection, but the only reliable way I have found is to get the user to enter the text manually. One of the key attractions we have based the product around is being able to detect these plaques that are already in place, without having to set up any additional material.
Can anyone suggest a piece of software that would work (and is iOS friendly), or a method of detection that would be effective and interactive/pleasing for the user?
Sample environment:
http://www.orienteeringcoach.com/wp-content/uploads/2012/08/startfinishscp.jpeg
The environment can change substantially: the plaques are basically anywhere they can be positioned, on fences, walls, and posts in either wooded or open areas, but overwhelmingly outdoors.
I'm not an iOS programmer, but I will try to answer from an algorithmic point of view. Essentially, you have a detection problem ("Where is the plaque?") and a classification problem ("Which one is it?"). Asking the user to keep the plaque in a pre-defined region is certainly a good idea. This solves the detection problem, which is often harder to solve with limited resources than the classification problem.
For classification, I see two alternatives:
The classic "Computer Vision" route would be feature extraction and classification. Local Binary Patterns and HOG are feature extractors known to be fast enough for mobile (the former more than the latter), and they are not too complicated to implement. Classifiers, however, are non-trivial, and you would probably have to search for an appropriate iOs library.
Alternatively, you could try to binarize the image, i.e. classify pixels as "plate" / white or "text" / black. Then you can use an error-tolerant similarity measure for comparing your binarized image with a binarized reference image of the plaque. The chamfer distance measure is a good candidate. It essentially boils down to comparing the distance transforms of your two binarized images. This is more tolerant to misalignment than comparing binary images directly. The distance transforms of the reference images can be pre-computed and stored on the device.
Personally, I would try the second approach. A (non-mobile) prototype of the second approach is relatively easy to code and evaluate with a good image processing library (OpenCV, Matlab + Image Processing Toolbox, Python, etc).
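A minimal sketch of such a prototype in Python/OpenCV, assuming the captured plaque region has already been cropped and scaled to the same size as the reference image (the filenames are placeholders):

import cv2
import numpy as np

def chamfer_score(query_bin, ref_dist):
    # mean distance-transform value of the reference at the query's "ink" pixels;
    # lower means the binarized query lines up better with that reference
    ys, xs = np.nonzero(query_bin)
    return ref_dist[ys, xs].mean()

# pre-compute once per reference plaque
ref = cv2.imread("plaque_ref.png", cv2.IMREAD_GRAYSCALE)
_, ref_bin = cv2.threshold(ref, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
ref_dist = cv2.distanceTransform(cv2.bitwise_not(ref_bin), cv2.DIST_L2, 3)

# at run time, binarize the captured region the same way and compare
query = cv2.imread("capture.png", cv2.IMREAD_GRAYSCALE)
_, query_bin = cv2.threshold(query, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
print(chamfer_score(query_bin, ref_dist))

The score would be computed against each of the reference plaques (up to 15 in your case), taking the smallest as the match.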
I managed to find a solution that is working quite well. It's not fully optimized yet, but I think it just needs some filter tweaking, as I'll explain later on.
Initially I tried to set up OpenCV, but it was very time consuming and had a steep learning curve, although it did give me an idea. The key to my problem is really detecting the characters within the image and ignoring the background, which was basically just noise. OCR was designed exactly for this purpose.
I found the free library Tesseract (https://github.com/ldiqual/tesseract-ios-lib) easy to use and with plenty of customizability. At first the results were very random, but applying a sharpening filter, a monochromatic filter, and a color invert worked well to clean up the text. Next I marked out a target area on the UI and used it to cut out the rectangle of the image to process. Processing is slow on large images, and this cut the time dramatically. The OCR settings allowed me to restrict the allowable characters, and as the plaques follow a standard configuration this narrowed things down and improved the accuracy.
So far it's been successful with the grey-background plaques, but I haven't found the correct filter for the red and white editions. My goal is to add color detection and remove the need to feed in the data type.
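For anyone who wants to prototype the same filtering outside iOS first, a rough desktop equivalent using pytesseract (not the iOS library I used; the whitelist and filename are only examples) would be something like:

import cv2
import pytesseract

img = cv2.imread("plaque.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filename
img = cv2.medianBlur(img, 3)  # knock down background noise
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # monochrome
# img = cv2.bitwise_not(img)  # invert if the plaque has light text on a dark background

# restrict the characters Tesseract may return (this whitelist is just an example)
config = "--psm 7 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(pytesseract.image_to_string(img, config=config).strip())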
