Project: Content Based Image Retrieval - Semi-supervised (manual tagging is done on images while training)
Description
I have 1000000 images in the database. The training is manual (supervised) - title and tags are provided for each image.
Example:
coke.jpg
Title : Coke
Tags : Coke, Can
Using the images and tags, I have to train the system. After training, when I give it a new image (either already in the database or completely new), the system should output the possible tags the image may belong to and display a few images belonging to each tag. The system may also say that no match was found.
Questions:
1) What is meant by an image fingerprint? What fingerprint size should be expected? (important because there will be millions of images to be inserted in the database)
2) What is the field format of that fingerprint in the database? (important because a fast search is needed: the script should search a 1M image database in less than 1 second)
3) Which descriptors (algorithms) should we use to analyze them?
Thanks in advance
Well, this topic is very large, but here is a brief overview of a possible solution
Image fingerprints are collections of SIFT descriptors
These are quantized both to reduce size, and to allow indexing
Build an inverted index of your database to allow looking up an image by quantized descriptors (you can use any full text search engine / DB for this)
Given an image, look up images which share a large number of descriptors with it
For those potential candidates, you should validate that the spatial arrangement of descriptors is similar enough
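The lookup step above can be sketched in a few lines of Python. This is only a toy illustration, not the method from the papers below: the integer IDs stand in for quantized SIFT descriptors ("visual words"), the image names and the `min_shared` cutoff are made up, and a real system would use a search engine or database for the index and would still run spatial verification on the candidates.

```python
from collections import Counter, defaultdict


def build_inverted_index(image_words):
    """Map each visual-word ID to the set of images that contain it.

    image_words: dict of image name -> iterable of quantized descriptor IDs.
    """
    index = defaultdict(set)
    for image, words in image_words.items():
        for word in set(words):
            index[word].add(image)
    return index


def find_candidates(index, query_words, min_shared=2):
    """Rank database images by the number of visual words shared with the query."""
    votes = Counter()
    for word in set(query_words):
        for image in index.get(word, ()):
            votes[image] += 1
    # Keep only images sharing enough words; these still need spatial verification.
    return [(img, n) for img, n in votes.most_common() if n >= min_shared]


# Hypothetical toy database: the integers stand in for quantized SIFT descriptors.
database = {
    "coke.jpg": [3, 17, 42, 99],
    "pepsi.jpg": [3, 18, 57],
    "tree.jpg": [5, 6, 7],
}
index = build_inverted_index(database)
print(find_candidates(index, [17, 42, 99, 100]))
```

Because each query word touches only the images listed under it, the cost depends on how many images share words with the query rather than on the total database size.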
Some articles to get you started:
Philbin, James, et al. "Object retrieval with large vocabularies and
fast spatial matching." Computer Vision and Pattern Recognition, 2007.
CVPR'07. IEEE Conference on. IEEE, 2007.
Philbin, James, et al. "Lost in quantization: Improving particular
object retrieval in large scale image databases." Computer Vision and
Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE,
2008.
Mikulík, Andrej, et al. "Learning a fine vocabulary." Computer
Vision–ECCV 2010 (2010): 1-14.
I would suggest training an SVM model on a list of image features extracted from the training images.
1) Image fingerprint: a meaningful representation of the image. You can't use the raw pixels directly, of course. The most rational approach is to minimise the correlation between the basis components. In simple words, if you take a 64x64 image, the two pixels at the top left corner will probably be the same or similar. It's useless to feed in each of the 64^2 pixels separately; you need something better. Have a look at what Principal Component Analysis does.
2) It's entirely up to you. At the extreme, you could use a single bit that tells you whether the image is dark or not. Better, run PCA on the image and experiment with different numbers of features (more features is not always better).
3) Whatever you want; there are a lot of algorithms you can use. I'd recommend Support Vector Machines: easy to use and well supported. If you have a lot of different tags, you will probably have to train one SVM for each tag. That may not be ideal, and you may want to try something else.
Related
I was planning on doing some classification/segmentation on whole slide images. Since the images are huge, I was wondering about the methods that can be applied to process them. So far I've come across techniques that split the image into multiple parts, process those parts and combine the results. However, I would like to know about other, better approaches, and whether this one is a good one. Any reference to existing literature would be of great help.
pyvips has a feature for generating patches from slide images efficiently.
This benchmark shows how it works. It can generate about 25,000 64x64 patches a second in the 8 basic orientations from an SVS file:
https://github.com/libvips/pyvips/issues/100#issuecomment-493960943
It's handy for training. I don't know how that compares to the other patch generation systems people use.
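To make the "8 basic orientations" concrete, here is a small pure-Python sketch. It assumes they mean the four 90-degree rotations of a patch plus the four rotations of its mirror image; it is not how pyvips does it internally, and the helper names are made up for illustration.

```python
def rotate90(patch):
    """Rotate a 2D patch (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_horizontal(patch):
    """Mirror a 2D patch left to right."""
    return [row[::-1] for row in patch]


def eight_orientations(patch):
    """Return the 4 rotations of the patch, then the 4 rotations of its mirror image."""
    results = []
    for start in (patch, flip_horizontal(patch)):
        current = [list(row) for row in start]
        for _ in range(4):
            results.append(current)
            current = rotate90(current)
    return results


# A tiny 2x2 "patch" with distinct values so every orientation is different.
for oriented in eight_orientations([[1, 2], [3, 4]]):
    print(oriented)
```

For training, each orientation is usually treated as an extra sample of the same class.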
To read these images, the standard library is OpenSlide [https://openslide.org/api/python/]. With OpenSlide you can read, e.g., patches or thumbnails.
For basic image processing operations like filtering, "libvips" and its python binding "pyvips" are quick and convenient to use [https://libvips.github.io/pyvips/vimage.html].
If you need to pass data (like random patches) to a machine learning model, I would personally suggest "PyDmed". When training, e.g., a classifier or a generative model, the loading speed of "PyDmed" is suitable for feeding batches of data to GPU(s).
Here is the link to PyDmed public repo:
https://github.com/amirakbarnejad/PyDmed
Here is the link to PyDmed quick start:
https://amirakbarnejad.github.io/Tutorial/tutorial_section1.html
As akbarnejad mentioned, my preference is to use openslide.
I usually end up writing bespoke dataloaders to feed into pytorch models. They use openslide to first do some simple segmentation, applying various thresholds to a low resolution (thumbnail) image of the slide to get patch coordinates, and then pull out the relevant patches of tissue to feed into the training model.
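As a rough sketch of that thumbnail-thresholding step (not my actual dataloader code), the function below works on a plain 2D list of grayscale values so it runs without any slide file. In practice the thumbnail would come from something like OpenSlide's `get_thumbnail` and the returned coordinates would be passed to `read_region`. The function name, the threshold of 200 and the 50% tissue fraction are assumptions for illustration.

```python
def tissue_patch_coordinates(thumbnail, downsample, patch_size, threshold=200, min_fraction=0.5):
    """Return top-left full-resolution coordinates of patches that contain tissue.

    thumbnail: 2D list of grayscale values (0 = dark, 255 = white background).
    downsample: how many full-resolution pixels one thumbnail pixel covers per side.
    patch_size: patch side length in full-resolution pixels (a multiple of downsample).
    """
    cell = patch_size // downsample  # patch side length measured in thumbnail pixels
    height, width = len(thumbnail), len(thumbnail[0])
    coords = []
    for ty in range(0, height - cell + 1, cell):
        for tx in range(0, width - cell + 1, cell):
            pixels = [thumbnail[y][x] for y in range(ty, ty + cell) for x in range(tx, tx + cell)]
            # Tissue is darker than the bright slide background.
            tissue = sum(1 for p in pixels if p < threshold)
            if tissue / len(pixels) >= min_fraction:
                coords.append((tx * downsample, ty * downsample))
    return coords


# Toy 4x4 thumbnail: dark tissue in the top-left and bottom-right quadrants.
thumb = [
    [50, 50, 255, 255],
    [50, 50, 255, 255],
    [255, 255, 50, 50],
    [255, 255, 50, 50],
]
print(tissue_patch_coordinates(thumb, downsample=32, patch_size=64))
```

Working on the thumbnail keeps the segmentation cheap, since only the selected patches are ever read at full resolution.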
There are a few good examples of this and tools that try to make it simpler for both pytorch and Keras -
Pytorch
wsi-preprocessing
Keras
Deeplearning-digital-pathology
Both
deep-openslide
I have a large image (5400x3600) that has multiple CCTVs that I need to detect.
The detection takes a lot of time (4-7 minutes) with rotation, and it still fails to resolve certain CCTVs.
What is the best method to match a template like this?
I am using skImage - openCV is not an option for me, but I am open to suggestions on that too.
For example: in the images below, the template is correctly matched in the second image, but not in the first image. I guess this is due to the noise created by the text "BLDG..."
Template:
Source image:
Match result:
The fastest method is probably a cascade of boosted classifiers trained with several variations of your logo, possibly a few rotations, and some negative examples too (non-logos). You have to roughly scale your overall image so that the test and training examples approximately match in scale. Unlike SIFT or SURF, which spend a lot of time searching for interest points and creating descriptors for both learning and searching, binary classifiers shift most of the burden to the training stage, so your testing or search will be much faster.
In short, the cascade runs in such a way that the very first test discards a large portion of the image. If the first test passes, the others follow and refine the result. They are very fast, consisting of just a few intensity comparisons on average around each point. Only a few locations will pass the whole cascade, and these can be verified with additional tests such as your rotation-correlation routine.
Thus, the classifiers are effective not only because they quickly detect your object but also because they quickly discard non-object areas. To read more about boosted classifiers, see the following OpenCV section.
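The early-rejection idea can be illustrated with a toy Python sketch. Note that this is not a trained boosted cascade: in a real one (Viola-Jones style) the stage tests and thresholds are learned from the positive and negative examples, whereas the three stages and their thresholds below are hand-written assumptions.

```python
def run_cascade(window, stages):
    """Return True only if the window passes every stage; stop at the first failure."""
    for stage in stages:
        if not stage(window):
            return False  # early rejection: later, costlier stages never run
    return True


def mean(values):
    return sum(values) / len(values)


# Hypothetical stages, ordered from cheapest to most selective.
# Each takes a window (a flat list of pixel intensities) and returns True or False.
stages = [
    lambda w: mean(w) < 200,         # reject almost-white background
    lambda w: max(w) - min(w) > 50,  # reject flat, low-contrast regions
    lambda w: mean(w[: len(w) // 2]) < mean(w[len(w) // 2:]),  # dark-then-bright pattern
]

windows = {
    "background": [250, 252, 251, 249],
    "flat_gray": [120, 121, 119, 120],
    "candidate": [30, 40, 180, 190],
}
survivors = [name for name, w in windows.items() if run_cascade(w, stages)]
print(survivors)
```

Only the survivors would then be passed to a slower check such as the rotation-correlation routine.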
This problem in general is addressed by Logo Detection. See this for similar discussion.
There are many robust methods for template matching. See this or google for a very detailed discussion.
But from your example I would guess that the following approach would work.
Create a feature for your search image. It essentially has a rectangle enclosing the word "CCTV". So the width, height, angle, and individual character features for matching the textual information could be a suitable choice. (Or you may also use the image containing "CCTV". In that case the method will not be scale invariant.)
Now, when searching, first detect rectangles. Then use the angle to prune your search space, and use an image transformation to align the rectangles parallel to the axes. (This should take care of the need for rotation.) Then, according to the feature chosen in step 1, match the text content. If you use individual character features, then your template matching step is probably essentially a classification step. Otherwise, if you use the image for matching, you may use cv::matchTemplate.
Hope it helps.
Symbol spotting is more complicated than logo spotting because interest points work poorly on document images such as architectural plans. Many conferences deal with pattern recognition, and each year there are many new algorithms for symbol spotting, so naming the best method is not possible. You could check the IAPR conferences: ICPR, ICDAR, DAS, GREC (Workshop on Graphics Recognition), etc. These researchers focus on this topic: M Rusiñol, J Lladós, S Tabbone, J-Y Ramel, M Liwicki, etc. They work on several techniques for improving symbol spotting, such as vectorial signatures, graph based signatures and so on (check Google Scholar for more papers).
An easy way to start a new approach is to work with simple shapes such as lines, rectangles and triangles instead of matching everything at once.
Your example can be recognized by shape matching (contour matching), much faster than 4 minutes.
For a good match, you need good preprocessing and denoising.
Examples can be found at http://www.halcon.com/applications/application.pl?name=shapematch
I have a number of .jpeg from the websites of musicians. These images are comprised of posters for upcoming shows and band photos (photos of the band in real life).
Here is an example poster:
I am not well-versed in any modern techniques or algorithms (if they exist?), but this is what I thought I might look for:
Text in the image is usually a dead giveaway of a poster.
Maybe realistic photos (ie. non-posters) follow a different color distribution?
Posters are probably less likely to have faces in them - but that's a pretty weak assertion.
Is there any classification algorithm that can detect if an image is a poster?
Your question is very broad. "Poster" and "photo" are not well defined objects. What is a poster? In real life, posters are often photos, combinations of photos, or slightly edited photos.
If we narrow it down to what you refer to in the first part of your question (band photos vs. posters for upcoming shows), then the answer is: probably yes (even though I have never seen anyone do it). As you are looking for a binary classifier, I would suggest taking some machine learning model (Naive Bayes should be sufficient, but if you want to use more complex features then try out SVM, ELM, or some Random Forests/Decision Tree) and applying it to the data encoded in vectors containing:
Binary features:
"is there any word on the image?" - you will need external text detection algorithm
"is there a number on the image" - events should have dates
"is there a date on the image"
"is there any face on the image"
Naive Bayes would estimate conditional probabilities such as P(poster|there is a word), P(poster|there is a number) etc., which will give you not only a classifier but also some insight into how important your features are (a probability close to 0.5 suggests that a particular feature is useless).
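A minimal sketch of such a classifier in plain Python, assuming the four binary features listed above. Internally Naive Bayes estimates the per-class likelihoods P(feature | class) and combines them with the class priors through Bayes' rule to rank the classes. The training rows are invented for illustration, and the function names are my own, not from any library.

```python
import math


def train_naive_bayes(samples, labels):
    """Estimate class priors and P(feature = 1 | class) with Laplace smoothing."""
    model = {}
    n_features = len(samples[0])
    for cls in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        prior = len(rows) / len(samples)
        likelihoods = [
            (sum(r[j] for r in rows) + 1) / (len(rows) + 2) for j in range(n_features)
        ]
        model[cls] = (prior, likelihoods)
    return model


def predict(model, features):
    """Return the class with the highest log-posterior for a binary feature vector."""
    def log_posterior(cls):
        prior, likelihoods = model[cls]
        score = math.log(prior)
        for value, p in zip(features, likelihoods):
            score += math.log(p if value else 1 - p)
        return score
    return max(model, key=log_posterior)


# Hypothetical training data; columns: has_word, has_number, has_date, has_face.
X = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 1]]
y = ["poster", "poster", "poster", "photo", "photo", "photo"]
model = train_naive_bayes(X, y)
print(predict(model, [1, 1, 1, 0]))
print(predict(model, [0, 0, 0, 1]))
```

Comparing a feature's likelihood across the two classes (for example `model["poster"][1][0]` against `model["photo"][1][0]`) gives a quick feel for how much that feature separates posters from photos.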
I would not use histograms etc. due to the wide range of possible photos, photo session styles etc., unless you are willing to create a really big training set.
If this is not enough, you could change these to more complex features and use a more powerful classifier than Naive Bayes.
Complex features:
How many words are there on the image?
How many numbers are there on the image?
How many dates are there on the image?
How many faces are there on the image?
Image histogram
And one last option, if everything else fails: you could try to train a modern model, like a Deep Belief Network, on the raw images. It would require serious computational power, but the results would also be very valuable for the scientific community.
I am working with SVM-light. I would like to use SVM-light to train a classifier for object detection. I figured out the syntax to start a training:
svm_learn example2/train_induction.dat example2/model
My problem: how can I build the "train_induction.dat" from a
set of positive and negative pictures?
There are two parts to this question:
What feature representation should I use for object detection in images with SVMs?
How do I create an SVM-light data file with (whatever feature representation)?
For an intro to the first question, see Wikipedia's outline. Bag of words models based on SIFT or sometimes SURF or HOG features are fairly standard.
For the second, it depends a lot on what language / libraries you want to use. The features can be extracted from the images using something like OpenCV, vlfeat, or many others. You can then convert those features to the SVM-light format as described on the SVM-light homepage (no anchors on that page; search for "The input file").
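As a language-neutral starting point, here is a short Python sketch of the conversion step. It assumes you already have one fixed-length feature vector per image (for example a bag-of-words histogram) and writes lines in the `<target> <feature>:<value> ...` layout, with 1-based feature indices in increasing order. The function names and the output file name are made up for illustration.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with 1-based indices."""
    parts = ["+1" if label > 0 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # zero-valued features may be omitted in the sparse format
            # ':g' keeps about 6 significant digits; widen it if you need more precision.
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def write_svmlight_file(path, labels, feature_vectors):
    """Write one line per training example to the given path."""
    with open(path, "w") as handle:
        for label, features in zip(labels, feature_vectors):
            handle.write(to_svmlight_line(label, features) + "\n")


# Positive pictures get label +1, negative pictures get label -1.
print(to_svmlight_line(1, [0.5, 0, 2]))
print(to_svmlight_line(-1, [0, 0, 1.25]))
```

Calling `write_svmlight_file("train.dat", labels, vectors)` would then produce a file that can be passed to `svm_learn` in place of `train_induction.dat`.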
If you update with what language and library you want to use, we can give more specific advice.
I am developing an application to read the letters and numbers from an image using opencv in c++. I first converted the given colour image and colour template to binary images, then called the method cvMatchTemplate(). This method just highlighted the areas where the template matches, and not clearly. I don't just want to see the area; I need to parse the characters (letters & numbers) from the image. I am new to openCV. Does anybody know any other method to get the result?
The image is taken from a camera. The sample image is shown above. I need to get all the text from the LED display (130 and Delft Tanthaf).
I tried the face detection sample application, and it detects the faces. The HaarCascade file is provided with openCV; I just loaded that file and called the method cvHaarDetectObjects(). To detect the letters I created the xml file using the letter_recog.cpp application provided by openCV. But when I load this file, it shows an error (OpenCV error: UnSpecified error > in unknown function, file ........\ocv\opencv\src\cxcore\cxpersistence.cpp,line 4720). I searched the web for this error and found information about the lib files used. I followed it, but the error still remains. Is the error in my xml file, or in the call that loads this xml file ((CvHaarClassifierCascade*)cvLoad("builded xml file name",0,0,0);)? Please help.
Thanks in advance
As of OpenCV 3.0 (in active dev), you can use the built-in "scene text" object detection module ~
Reference: http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html
Example: https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/textdetection.cpp
The text detection is built on these two papers:
[Neumann12] Neumann L., Matas J.: Real-Time Scene Text Localization
and Recognition, CVPR 2012. The paper is available online at
http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf
[Gomez13] Gomez L. and Karatzas D.: Multi-script Text Extraction from
Natural Scenes, ICDAR 2013. The paper is available online at
http://refbase.cvc.uab.es/files/GoK2013.pdf
Once you've found where the text in the scene is, you can run any sort of standard OCR against those slices (Tesseract OCR is common). And there's now an end-to-end sample in opencv using OpenCV's new interface to Tesseract:
https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/end_to_end_recognition.cpp
Template matching tends not to be robust for this sort of application because of lighting inconsistencies, orientation changes, scale changes etc. The typical way of solving this problem is to bring in machine learning. What you are trying to do by training your own boosting classifier is one possible approach. However, I don't think you are doing the training correctly. You mentioned that you gave it 1 logo as a positive training image and 5 other images not containing the logo as negative examples? Generally you need the number of training samples to be on the order of hundreds or thousands or more. You cannot possibly train with 6 training samples and expect it to work.
If you are unfamiliar with machine learning, here is roughly what you should do:
1) You need to collect many positive training samples (from a hundred upwards, but generally the more the merrier) of the object you are trying to detect. If you are trying to detect individual characters in the image, then get cropped images of individual characters. You can start with the MNIST database for this. Better yet, to train the classifier for your particular problem, get many cropped images of the characters on the bus from photos. If you are trying to detect the entire rectangular LED board panel, then use images of such panels as your positive training samples.
2) You will need to collect many negative training samples. Their number should be in the same order as the number of positive training samples you have. These could be images of the other objects that appear in the images you will run your detector on. For example, you could crop images of the front of the bus, road surfaces, trees along the road etc. and use them as negative examples. This is to help the classifier rule out these objects in the image you run your detector on. Hence, negative examples are not just any image containing objects you don't want to detect. They should be objects that could be mistaken for the object you are trying to detect in the images you run your detector on (at least for your case).
See the following link on how to train the cascade of classifiers and produce the XML model file: http://note.sonots.com/SciSoftware/haartraining.html
Even though you mentioned you only want to detect the individual characters instead of the entire LED panel on the bus, I would recommend first detecting the LED panel so as to localize the region containing the characters of interest. After that, either perform template matching within this smaller region, or run a classifier trained to recognize individual characters on patches of pixels in this region obtained using a sliding window approach, possibly at multiple scales. (Note: The haarcascade boosting classifier you mentioned above will detect characters, but it won't tell you which character it detected unless you train it to detect only that particular character...) Detecting characters in this region in a sliding window manner will give you the order in which the characters appear, so you can string them into words etc.
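To make the sliding window idea concrete, here is a small Python sketch that only generates the window positions. The character classifier itself is left out, and the function name, window size, step and scales are assumptions for illustration.

```python
def sliding_windows(width, height, window, step, scales=(1.0,)):
    """Yield (x, y, size) squares covering a width x height region.

    Windows are produced left to right, top to bottom, one scale after another.
    """
    for scale in scales:
        size = int(round(window * scale))
        if size > width or size > height:
            continue  # this scale does not fit inside the region
        stride = max(1, int(round(step * scale)))
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield x, y, size


# Toy region of 4x4 pixels, base window of 2 pixels, scanned at two scales.
for x, y, size in sliding_windows(4, 4, window=2, step=2, scales=(1.0, 2.0)):
    print(x, y, size)
```

Each window would be cropped from the localized LED panel and passed to the classifier; sorting the accepted windows by their x coordinate then gives the reading order of the characters.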
Hope this helps.
EDIT:
I happened to chance upon this old post of mine after separately discovering the scene text module in OpenCV 3 mentioned by @KaolinFire.
For those who are curious, this is the result of running that detector on the sample image given by the OP. Notice that the detector is able to localize the text region, even though it returns more than one bounding box.
Note that this method is not foolproof (at least this implementation in OpenCV with the default parameters). It tends to generate false-positives, especially when the input image contains many "distractors".
Here are more examples obtained using this OpenCV 3 text detector on the Google Street View dataset:
Notice that it has a tendency to find "text" between parallel lines (e.g., windows, walls etc). Since the OP's input image is likely going to contain outdoor scenes, this will be a problem especially if he/she does not restrict the region of interest to a smaller region around the LED signs.
It seems that if you are able to localize a "rough" region containing just the text (e.g., just the LED sign in the OP's sample image), then running this algorithm can help you get a tighter bounding box. But you will have to deal with the false-positives though (perhaps discarding small regions or picking among the overlapping bounding boxes using a heuristic based on knowledge about the way letters appear on the LED signs).
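One simple way to do that post-filtering, sketched in plain Python: drop boxes below a minimum area, then keep only the largest box in each group of heavily overlapping boxes. This is just one possible heuristic, not part of the OpenCV detector, and the `min_area` and `max_overlap` values are assumptions you would tune for your LED signs.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; empty or inverted boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def filter_boxes(boxes, min_area=100, max_overlap=0.5):
    """Drop small boxes, then keep the largest box of each overlapping group."""
    kept = []
    for box in sorted((b for b in boxes if area(b) >= min_area), key=area, reverse=True):
        if all(iou(box, other) <= max_overlap for other in kept):
            kept.append(box)
    return kept


# Hypothetical detections: two overlapping boxes on one sign, one tiny false positive,
# and one separate text region.
detections = [(0, 0, 100, 40), (5, 2, 98, 38), (200, 200, 205, 205), (300, 0, 360, 30)]
print(filter_boxes(detections))
```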
Here are more resources (discussion + code + datasets) on text detection.
Code
Extracting text OpenCV
http://libccv.org/doc/doc-swt/
Stroke Width Transform (SWT) implementation (Python)
https://github.com/subokita/Robust-Text-Detection
Datasets
You will find the google streetview and MSRA datasets here. Although the images in these datasets are not exactly the same as the ones for the LED signs on buses, they may be helpful either for picking the "best" performing algorithm from among several competing algorithms, or to train a machine learning algorithm from scratch.
http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
See my answer to How to read time from recorded surveillance camera video? You can/should use cvMatchTemplate() to do that.
If you are working with a fixed set of bus destinations, template matching will do.
However, if you want the system to be more flexible, I would imagine you would need some form of contour/shape analysis for each individual letter.
You can also look at EAST: Efficient Scene Text Detector - https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
Under this link, you have examples with C++ and Python. I used this code to detect numbers of buses (after detecting that given object is a bus).