I'm trying to set up an object classification system with OpenCV. When I detect a new object in a scene, I want to know whether it belongs to a known object class (is it a box, a bottle, something unknown, etc.).
My steps so far:
Cropping the image down to the ROI where a new object could appear
Calculating keypoints for every image (cv::SurfFeatureDetector)
Calculating descriptors for each keypoint (cv::SurfDescriptorExtractor)
Generating a vocabulary using Bag of Words (cv::BOWKMeansTrainer)
Calculating response histograms (cv::BOWImgDescriptorExtractor)
Using the response histograms to train a cv::SVM for every object class
Using the same set of images again to test the classification
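In rough Python terms, the pipeline looks like the sketch below (this is just an outline of the steps above, not my actual code; the Python bindings mirror the C++ classes, SURF needs an opencv-contrib non-free build, and `train_paths` is a stand-in for my training ROIs):

```python
import cv2

# SURF acts as both detector and descriptor extractor here
surf = cv2.xfeatures2d.SURF_create(400)

# Steps 2-4: collect descriptors from all training ROIs, then cluster
# them into a visual vocabulary with k-means.
bow_trainer = cv2.BOWKMeansTrainer(50)           # 50 visual words
for path in train_paths:                         # stand-in for the ROIs
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = surf.detectAndCompute(img, None)
    if descriptors is not None:
        bow_trainer.add(descriptors)
vocabulary = bow_trainer.cluster()

# Step 5: a response histogram counts the nearest visual word per descriptor
matcher = cv2.BFMatcher(cv2.NORM_L2)
bow_extractor = cv2.BOWImgDescriptorExtractor(surf, matcher)
bow_extractor.setVocabulary(vocabulary)

def response_histogram(img):
    # Returns a 1 x 50 float32 row vector, the input for the SVM
    return bow_extractor.compute(img, surf.detect(img, None))
```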
I know there is still something wrong with my code, since the classification doesn't work yet.
But I don't really know whether I should use the full image (cropped down to the ROI) or extract the new object from the image and use just the object itself.
It's my first step into object recognition/classification, and I've seen people use both full images and extracted objects, but I just don't know when to use which.
I hope someone can clarify this for me.
You should not use the same images for both testing and training.
In training, ideally you should extract an ROI that contains just one dominant object, since the algorithm assumes that the codewords extracted from positive samples are the ones that must be present in a test image for it to be labelled positive. However, if you have a really big dataset like ImageNet, the algorithm should generalize.
In testing, you don't need to extract an ROI, because SIFT/SURF are scale-invariant features. However, it's good to have one dominant object in the test set as well.
I think you should train one classifier for each object class. This is called a one-vs-all classifier.
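A minimal sketch of that one-vs-all setup with OpenCV's SVM, assuming `histograms` and `labels` hold the BoW response histograms and class ids from the question's pipeline:

```python
import cv2
import numpy as np

# histograms: (n_samples x vocabulary_size) float32 BoW histograms
# labels: (n_samples,) int array of class ids -- both hypothetical names
svms = {}
for cls in np.unique(labels):
    svm = cv2.ml.SVM_create()
    svm.setType(cv2.ml.SVM_C_SVC)
    svm.setKernel(cv2.ml.SVM_RBF)
    binary = np.where(labels == cls, 1, -1).astype(np.int32)
    svm.train(histograms, cv2.ml.ROW_SAMPLE, binary)
    svms[cls] = svm

def classify(hist):
    # hist: 1 x vocabulary_size float32 row. Raw output is the decision
    # value; verify the sign convention of your OpenCV version on known
    # samples before trusting the argmax.
    scores = {c: s.predict(hist, flags=cv2.ml.STAT_MODEL_RAW_OUTPUT)[1][0][0]
              for c, s in svms.items()}
    return max(scores, key=scores.get)
```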
One little note: if you don't want to worry about these issues and you have a big dataset, just go with Convolutional Neural Networks. They have really good generalization capability and are inherently multi-class thanks to their fully connected last layer.
Related
I have many evolution curves (over time) of a system, stored as images.
These evolution curves are plotted when the system behaves in a normal way ('ok').
I want to train a model that learns the shapes of the curves (or parts of the shapes) under normal behaviour, so it will be able to classify new curves as normal (or abnormal).
Any ideas on which model to use, or how to proceed?
Thank you
You can perform PCA and then classify. Also look into functional data analysis.
Here is a nice getting-started guide for PCA
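A minimal sketch of that suggestion with scikit-learn, assuming each curve image has been flattened into a row of the hypothetical `curves` array:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# curves: hypothetical (n_samples x n_pixels) array, one flattened
# curve image per row; labels: 0 = normal, 1 = abnormal
model = make_pipeline(PCA(n_components=20), LogisticRegression())
model.fit(curves, labels)
print(model.predict(new_curves))   # classify unseen curves
```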
You can start by labeling (annotating) the images. The labels can be Normal/Not Normal as 0/1, or as many classes as you want to divide the data into.
Since it's a chart, the orientation is important; a wrong orientation can destroy the meaning of the image.
So write an algorithm that always orients the chart the same way when reading it.
Now that the labeling is done, you need to train on these images for correct classification.
Augment the data if needed
Find an image classification model
Use the trained weights
Feed your images and annotations in the desired format
Train the model
Check the output error or classification errors
Create an evaluation matrix, such as a confusion matrix in the case of classification (see the sketch below)
If the model is right and the training is done properly, you will get good accuracy
Otherwise, repeat the steps.
This is just an overview; with this you can start working towards your goal.
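For the evaluation step, a small sketch with scikit-learn, where `y_true` and `y_pred` are hypothetical held-out labels and model predictions:

```python
from sklearn.metrics import confusion_matrix, classification_report

# y_true: annotated labels of a held-out set; y_pred: the model's
# predictions on that set (both hypothetical names)
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```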
I'd like to know whether I can use a dataset of signs captured with a Kinect to retrain Inception's final layer, as described on the TensorFlow tutorial website that uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is: "No, you cannot fine-tune only the last layer, but you can fine-tune the whole pre-trained network." The first layers of the pre-trained network are looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with some contrast stretching and data whitening ((x - mu) / sigma). This will take a very long time for ivy-league networks like Inception and ResNet. Also, keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume images have 8-bit channels mapped into the range [0, 1]; these loaders cast all 16-bit pixels to ones.
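A small sketch of that preprocessing, assuming `depth16` is a raw 16-bit Kinect depth frame loaded with something other than the default PIL loader:

```python
import numpy as np

def preprocess_depth(depth16):
    """Contrast-stretch a raw 16-bit depth frame to [0, 1], then
    whiten it with (x - mu) / sigma, bypassing 8-bit image loaders."""
    d = depth16.astype(np.float32)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-6)   # contrast stretch
    return (d - d.mean()) / max(d.std(), 1e-6)         # whitening
```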
If the recognised/tracked objects (hands) are masked, which means your background is set to a constant value or barely has any gradient in it, the network will overfit on the silhouette of the object, because that is where the strongest edges are. The solution is to colourise the depth image using normal maps or HSA/HSV/JET colour coding, converting it into a three-channel 8-bit image. This makes training converge much faster, and in my recent experiments we found that you can fine-tune the ivy-league networks on the colourised depth.
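For the colourisation option, a sketch using OpenCV's JET colour map (one of the codings mentioned above) to turn a 16-bit depth frame into a three-channel 8-bit image:

```python
import cv2
import numpy as np

def colourise_depth(depth16):
    # Scale the 16-bit frame down to 8 bits, then JET-encode it into
    # a 3-channel image a pre-trained RGB network can consume.
    depth8 = cv2.normalize(depth16, None, 0, 255,
                           cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.applyColorMap(depth8, cv2.COLORMAP_JET)
```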
Since you are new to this field, I would suggest you read about transfer learning and all three of its commonly mentioned types, and apply whichever form fits your dataset. If your dataset is very similar to the one the pre-trained model was built on, you can retrain just the last layers; if your data is not similar, you have to fine-tune more of the existing model before using it.
As you go deeper into a neural network, the feature extraction becomes more specific to the training data, so you have to take care of the later layers if your dataset is not very similar to the pre-built model's dataset. The early layers contain more generic features.
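A hedged sketch of that idea in Keras, freezing the generic early layers of Inception and retraining the rest; the layer cut-off and the 5-class head are placeholders, not values from the tutorial:

```python
import tensorflow as tf

# Load Inception pre-trained on ImageNet, freeze the generic early
# layers, and retrain the deeper layers plus a new classification head.
base = tf.keras.applications.InceptionV3(weights="imagenet",
                                         include_top=False, pooling="avg")
for layer in base.layers[:200]:       # placeholder cut-off
    layer.trainable = False           # keep generic features frozen
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # placeholder classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```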
I am new to AI/ML and am trying to use the same for solving the following problem.
I have a set of (custom) images which, while having common characteristics, also have a unique pattern/signature and colour value. What set of algorithms should I use to make a pass in the following manner:
1. Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
2. Identify the unique pattern/signature to identify each image individually. The pattern/signature could be shape (visible to human eye or hidden like using an overlay shape using background image with no boundaries).
3. Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
While this is in a way similar to a face recognition algorithm, in my case saturation/shadow will matter while the result should be direction independent.
I figure that using a CNN may be the way to go for step #2 and an SVM for step #1; any input on training specifics will be appreciated. What about step #3, use BGR2HSV? The objective is to use ML/AI and not get into machine vision.
Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
In a sense, what you want is a classifier that can detect arbitrary patterns in an image. However, we can only train classifiers to detect certain types of patterns in images.
For example, I can train a classifier to recognise squares and circles, but if I show it an image with a triangle, I cannot expect it to tell me it is a triangle, because it has never seen one before. The downside is that your classifier will end up misclassifying it as one of the shapes it knows: either square or circle. The upside is that you can prevent this.
Identify the unique pattern/signature to identify each image individually.
What you want to do is train a classifier on a large amount of labelled data. If you want the classifier to detect squares, circles, or triangles in an image, you must train it with a large amount of labelled images of squares, circles and triangles.
Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
Now, you are leaving the territory of simple image labelling and entering the world of computer vision. This is not as simple as a vanilla image classifier, but it is possible, and there are a lot of online tools to help you do this. For example, you may take a look at OpenCV. It has implementations in Python and C++.
I figure that using CNN may be the way to go for step #2 and SVM for step #1
You can combine step 1 and step 2 with a Convolutional Neural Network (CNN); you do not need a two-step prediction process. However, beware: if you pass the CNN an image of a car, it will still label it as a shape. You can, again, circumvent this by training it on a million positive samples of shapes and a million negative samples of random other images with the class "Other". This way, anything that is not a shape gets classified as "Other". This is one possibility.
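As an illustration, a minimal Keras CNN whose last class is that catch-all "Other" bucket; input size and layer widths are arbitrary placeholders:

```python
import tensorflow as tf

# A small CNN whose final class is "Other", to be trained with negative
# samples so non-shapes are not forced into square/circle/triangle.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    # square, circle, triangle, Other
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```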
What about step #3, use BGR2HSV? The objective is to use ML/AI and not get into machine-vision.
With the inclusion of this step, there is no option but to get into computer vision. I am not exactly sure how to go about this, but I can guarantee OpenCV provides a way. In fact, with OpenCV you will no longer need to implement your own CNN, because OpenCV ships its own image-labelling libraries.
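For step #3, a small OpenCV sketch that converts BGR to HSV and summarises hue/saturation/value, which could be compared across capture sources; `colour_signature` is a hypothetical helper:

```python
import cv2
import numpy as np

def colour_signature(img_bgr):
    # Convert BGR to HSV and summarise tone; comparing these means
    # between the original and a re-captured image gives a crude
    # measure of colour loss or drift.
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    return float(np.mean(h)), float(np.mean(s)), float(np.mean(v))
```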
I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on a 4-D/4000-D fully connected layer for bounding-box regression. I'm not trying to do anything fancy here (sliding windows, R-CNN), just to get some mediocre results.
I'm somewhat new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say they scale the image to 256 on its shortest side, then take the central 224x224 crop and train on that. I've looked all over and can't find a simple explanation of how to handle the localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.image.sample_distorted_bounding_box op, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the detection or segmentation challenges (like MS-COCO), or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled in, and try to output a pixel mask of the same size.
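On the preprocessing question: one hedged sketch of the scale-to-256 / central-224-crop transform that carries the bounding box along, clipping it instead of allowing negative coordinates:

```python
import cv2
import numpy as np

def resize_and_crop(img, box, short_side=256, crop=224):
    """Rescale so the shortest side is `short_side`, take the central
    crop, and carry the (x1, y1, x2, y2) box along, clipping it to the
    crop so coordinates never go negative."""
    h, w = img.shape[:2]
    scale = short_side / min(h, w)
    img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    nh, nw = img.shape[:2]
    top, left = (nh - crop) // 2, (nw - crop) // 2
    patch = img[top:top + crop, left:left + crop]
    box = np.array(box, dtype=np.float32) * scale   # scale the box too
    box -= [left, top, left, top]                   # shift into crop frame
    return patch, np.clip(box, 0, crop)
```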
I want to train a 3-class classifier on tissue images, but I only have around 50 labelled images in total. I can't take patches from the images and train on those, so I am looking for another way to deal with this problem.
Can anyone suggest an approach to this? Thank you in advance.
The question is very broad but here are some recommendations:
It could make sense to generate variations of your input images: modifying contrast, brightness or colour, rotating the image, adding noise. Which of these operations make sense, if any, really depends on the type of classification problem.
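A minimal sketch of such augmentation with OpenCV/NumPy; whether rotations or brightness shifts are label-preserving for tissue images is an assumption you would need to check:

```python
import cv2
import numpy as np

def augment(img):
    """Yield simple variations of one image: rotations, brightness
    shifts and Gaussian noise. Whether each operation preserves the
    label for tissue images is an assumption to verify."""
    for k in (1, 2, 3):
        yield np.rot90(img, k)                                # 90-deg turns
    for beta in (-30, 30):
        yield cv2.convertScaleAbs(img, alpha=1.0, beta=beta)  # brightness
    noise = np.random.normal(0, 10, img.shape)
    yield np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```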
Generally, the less data you have, the fewer parameters (weights etc.) your model should have. Otherwise it will overfit, meaning that your classifier will classify the training data correctly but nothing else.
You should check for overfitting. A simple method is to split your training data into a training set and a control (validation) set. Once you have found that the classification is also correct for the control set, you can do additional training that includes it.
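A small sketch of that split with scikit-learn, where `images` and `labels` are hypothetical arrays holding the ~50 samples:

```python
from sklearn.model_selection import train_test_split

# A stratified split keeps all three classes represented in the
# control (validation) set despite the small dataset.
X_train, X_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.2, stratify=labels, random_state=0)
```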