I have the following problem:
I am given a set of images and I need to divide them into photos and pictures (graphics) by means of the OpenCV library.
I've already tried
to analyze the RGB histogram (on average, a picture has empty histogram bins),
to analyze the HSV histogram (on average, a picture has few colors),
to search for contours (on average, a picture has fewer contours than a photo).
So I have a 7% error rate (tested on 2000 images). I'm a little confused, because I don't have much experience with computer vision methods.
For example, this photo below. Its histograms (RGB and HSV) are very poor and the number of contours is rather small. In addition, there is a lot of background color, so I need to find the object and calculate only its histogram (I use findContours() for this). But in any case my algorithm classifies this image as a picture.
And one more example:
The problem with pictures is noise. I have small images (200*150) and in some cases the noise is so perceptible that my algorithm classifies the image as a photo. I've tried blurring the images, but then the number of colors increases because pixels get mixed, and the number of contours decreases (some dim boundaries become indistinguishable).
Example of pictures:
I've also tried color segmentation and MSER, but my best result is still a 7% error rate.
Could you advise me what other methods I can try?
I've used your dataset to create really simple models. To do this, I've used the Rattle library in R.
Input data
rgbh1 - number of bins in RGB histogram, which value > #param#, in my case #param# = 30 (340 is maximum value)
rgbh2 - number of bins in RGB histogram, which value > 0 (not empty)
hsvh1 - number of bins in HSV histogram, which value > #param#, in my case #param# = 30 (340 is maximum value)
hsvh2 - number of bins in HSV histogram, which value > 0 (not empty)
countours - number of contours on image
PicFlag - flag indicating picture/photo (picture = 1, photo = 0)
Data exploration
To better understand your data, here is a plot of distribution of individual variables by picture/photo group (there is percentage on y axis):
It clearly shows that there are variables with predictive power. Most of them can be used in our model. Next I created a simple scatter plot matrix to see whether some combinations of variables can be useful:
You can see that, for example, the combination of the number of contours (countours) and rgbh1 looks promising.
On the following chart you can see that there is also quite strong correlation among the variables. (Generally, we like to have a lot of variables with low correlation, while you have only a limited number of correlated variables.) The pie charts show how strong the correlation is: a full circle means 1, an empty circle means 0. In my opinion, if the correlation exceeds 0.4 it might not be a good idea to have both variables in the model.
Model
Then I created simple models (keeping Rattle's defaults) using a decision tree, random forest, logistic regression and neural network. As input I used your data with a 60/20/20 split (training, validation and testing datasets). These are my results (please refer to Google if you don't understand the error matrix):
Error matrix for the Decision Tree model on pics.csv [validate] (counts):
      Predicted
Actual   0   1
     0 167  22
     1   6 204
Error matrix for the Decision Tree model on pics.csv [validate] (%):
      Predicted
Actual  0  1
     0 42  6
     1  2 51
Overall error: 0.07017544
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Random Forest model on pics.csv [validate] (counts):
      Predicted
Actual   0   1
     0 170  19
     1   8 202
Error matrix for the Random Forest model on pics.csv [validate] (%):
      Predicted
Actual  0  1
     0 43  5
     1  2 51
Overall error: 0.06766917
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Linear model on pics.csv [validate] (counts):
      Predicted
Actual   0   1
     0 171  18
     1  13 197
Error matrix for the Linear model on pics.csv [validate] (%):
      Predicted
Actual  0  1
     0 43  5
     1  3 49
Overall error: 0.07769424
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
Error matrix for the Neural Net model on pics.csv [validate] (counts):
      Predicted
Actual   0   1
     0 169  20
     1  15 195
Error matrix for the Neural Net model on pics.csv [validate] (%):
      Predicted
Actual  0  1
     0 42  5
     1  4 49
Overall error: 0.0877193
Rattle timestamp: 2013-01-02 11:35:40
======================================================================
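The overall error figures reported by Rattle can be reproduced from the count matrices: the error is the share of off-diagonal entries. The following is a minimal Python sketch of that arithmetic (Python is used here only for illustration, the original analysis was done in R; the function name overall_error is made up for this example):

```python
def overall_error(matrix):
    """Return the fraction of misclassified samples in a square error matrix.

    matrix[i][j] is the number of samples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Counts copied from the validation matrices above (rows: actual 0, actual 1).
models = {
    "Decision Tree": [[167, 22], [6, 204]],
    "Random Forest": [[170, 19], [8, 202]],
    "Linear": [[171, 18], [13, 197]],
    "Neural Net": [[169, 20], [15, 195]],
}

for name, counts in models.items():
    print(f"{name}: {overall_error(counts):.8f}")
```

Each validation set has 399 images, so for example the decision tree's 22 + 6 = 28 mistakes give 28 / 399.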
Results
As you can see, the overall error rate oscillates between roughly 6.8% and 8.8%. I do not think that this result can be significantly improved by tuning the parameters of the methods used. There are two ways to decrease the overall error rate:
add more uncorrelated variables (we do usually have 100+ input variables in the modeling dataset and +/- 5-10 are in the final model)
add more data (we can then tune the model without being scared by overfitting)
Software used:
R http://www.r-project.org/
Rattle http://rattle.togaware.com/
Code used to create corrgram and scatterplot (other outputs were generated using Rattle GUI):
# install.packages("lattice",dependencies=TRUE)
# install.packages("corrgram")
# install.packages("car")
library(lattice)
library(corrgram)  # provides corrgram() and the panel.* functions used below
library(car)       # provides scatterplotMatrix()
setwd("C:/")
indata <- read.csv2("pics.csv")
str(indata)
# Corrgram
corrgram(indata, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="Picture/Photo correlation matrix")
# Scatterplot Matrices (data= replaces the original attach(indata))
scatterplotMatrix(~rgbh1+rgbh2+hsvh1+hsvh2+countours|PicFlag, data=indata,
                  main="Picture/Photo scatterplot matrix",
                  diagonal=c("histogram"), legend.plot=TRUE, pch=c(1,1))
Well, a generic suggestion would be to increase the number of features (or get better features) and to build a classifier using these features, trained with an appropriate machine learning algorithm. OpenCV already has a couple of good machine learning algorithms that you can make use of.
I have never worked on this problem, but a quick Google search led me to this paper by Cutzu et al.: Distinguishing paintings from photographs.
One feature that should be useful is the gradient histogram. Natural images have a particular distribution of gradient strengths.
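As a rough illustration of what such a feature could look like, here is a minimal pure-Python sketch that builds a normalised histogram of gradient magnitudes from a grayscale image given as a list of rows. The function name, the central-difference gradient, the bin count and the fixed range of 181 (just above the largest central-difference magnitude for 8-bit data) are all assumptions made for this example; in practice an OpenCV gradient operator would do the same job faster.

```python
import math


def gradient_histogram(gray, bins=8, max_magnitude=181.0):
    """Normalised histogram of gradient magnitudes of a grayscale image.

    gray is a list of rows of pixel values (assumed 8-bit, 0..255).
    Gradients are central differences evaluated at interior pixels only.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    hist = [0] * bins
    count = 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0
            gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0
            magnitude = math.hypot(gx, gy)
            # Clamp into the last bin in case the value reaches the upper bound.
            index = min(int(magnitude / max_magnitude * bins), bins - 1)
            hist[index] += 1
            count += 1
    return [h / count for h in hist]


# Two toy images: a flat area and a hard vertical edge.
flat = [[10] * 4 for _ in range(4)]
edge = [[0, 0, 255, 255] for _ in range(4)]
print(gradient_histogram(flat))
print(gradient_histogram(edge))
```

The resulting bin values could be appended to the existing histogram and contour features before training the classifier.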
I will try to explain what the problem is.
I have 5 materials, each composed of 3 different minerals from a set of 10 different minerals. For each material I have measured the intensity vs wavelength. Each intensity vs wavelength vector can be mapped to a binary vector of ones and zeros corresponding to the minerals the material is composed of.
So material 1 has an intensity of [0.51 0.53 0.57 0.68...... ] measured at different wavelengths [470 480 490 500 510 ......] and a binary vector
[1 0 0 0 1 0 0 1 0 0]
and so on for each material.
For each material I have 5000 examples, so 25000 examples for all. Each example will have a 'similar' intensity vs wavelength behaviour but will give the 'same' binary vector.
I want to design a NN classifier so that if I give it as an input the intensity vs wavelength, it gives me the corresponding binary vector.
The intensity vs wavelength has a length of 450 so I will have 450 units in the input layer
the binary vector has a length of 10, so 10 output neurons
the hidden layer/s will have as a beginning 200 neurons.
Can I simply design a NN classifier this way, and would it solve the problem, or do I need something else?
You can do that, but take care to use the right cost function and output-layer activation. In your case, you should use sigmoid units in the output layer and binary cross-entropy as the cost function.
Another way to go about this would be to use one-hot encoding so that you can use ordinary multi-class classification (this will probably not make sense, since your output is probably sparse).
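To make the sigmoid plus binary cross-entropy suggestion concrete, here is a small pure-Python sketch of just the output layer: each of the 10 output units gets its own sigmoid, the loss is averaged over the units, and the final binary vector is obtained by thresholding at 0.5. The logits are made-up numbers standing in for the output of a trained network, and the function names are chosen for this example only.

```python
import math


def sigmoid(z):
    """Squash a raw output (logit) into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean binary cross-entropy over independent output units."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def to_binary_vector(probs, threshold=0.5):
    """Turn per-mineral probabilities into a 0/1 vector."""
    return [1 if p >= threshold else 0 for p in probs]


target = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # binary vector of material 1
logits = [3.0, -2.0, -4.0, -3.0, 2.5, -1.0, -2.0, 1.5, -3.0, -2.5]  # hypothetical
probs = [sigmoid(z) for z in logits]

print(to_binary_vector(probs))
print(binary_cross_entropy(target, probs))
```

Because every unit is scored independently, several outputs can be 1 at the same time, which is what distinguishes this setup from a softmax output.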
I'm using the ScikitLearn flavour of the DecisionTree.jl package to create a random forest model for a binary classification problem of one of the RDatasets data sets (see bottom of the DecisionTree.jl home page for what I mean by ScikitLearn flavour). I'm also using the MLBase package for model evaluation.
I have built a random forest model of my data and would like to create a ROC Curve for this model. Reading the documentation available, I do understand what a ROC curve is in theory. I just can't figure out how to create one for a specific model.
From the Wikipedia page the last part of the first sentence that I have marked in bold italics below is the one that is causing my confusion: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." There is more on the threshold value throughout the article but this still confuses me for binary classification problems. What is the threshold value and how do I vary it?
Also, in the MLBase documentation on ROC Curves it says "Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres." But doesn't mention this threshold anywhere else really.
Example code for my project is given below. Basically, I want to create a ROC curve for the random forest but I'm not sure how to or if it's even appropriate.
using DecisionTree
using RDatasets
using MLBase
quakes_data = dataset("datasets", "quakes");
# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)
# Getting features and labels where label = 1 is mag > 5 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels.==0] = 2
# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)
# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)
# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))
# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
# p::T # positive in ground-truth
# n::T # negative in ground-truth
# tp::T # correct positive prediction
# tn::T # correct negative prediction
# fp::T # (incorrect) positive prediction when ground-truth is negative
# fn::T # (incorrect) negative prediction when ground-truth is positive
I also read this question and didn't find it helpful really.
The task in binary classification is to give a 0/1 (or true/false, red/blue) label to a new, unlabeled data point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points with known or predicted label 1, and lower for points with known or predicted label 0. To turn this value into a 0/1 prediction, an additional threshold is used. Points with a value higher than the threshold are predicted to be labeled 1, and points with a value lower than the threshold are predicted to be labeled 0.
Why is this setup useful? Because sometimes mispredicting a 0 instead of a 1 is more costly, and then you can set the threshold low, making the algorithm predict 1s more often.
In an extreme case when predicting 0 instead of a 1 costs nothing for the application, you can set the threshold at infinity, making it always output 0 (which is obviously the best solution, since it incurs no cost).
The threshold trick cannot eliminate errors from the classifier - no classifier in real-world problems is perfect or free from noise. What it can do is change the ratio between the 0-when-really-1 errors and 1-when-really-0 errors for the final classification.
As you increase the threshold, more points are classified with a 0 label. Consider a chart with the fraction of points classified with 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot a point for the resulting classifier on this chart. Plotting a point for every threshold gives a curve. This is (some variant of) the ROC curve, which summarizes the abilities of the classifier; the standard form plots the true positive rate against the false positive rate. An often used metric for the quality of classification is the AUC, the area under this curve, but in fact the whole curve can be of interest in applications.
A summary like this appears in many texts on machine learning, which are a google query away.
Hope this clarifies the role of the threshold and its relation to ROC curves.
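The threshold sweep described above can be sketched in a few lines of Python. This is an illustrative sketch in the standard form (false positive rate on the x-axis, true positive rate on the y-axis), not the MLBase implementation; the function names and the toy scores are made up. For a random forest, the fraction of trees voting for class 1 could serve as the continuous score.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    labels are 0/1 ground truth, scores are continuous classifier outputs;
    a point is predicted 1 when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: everything predicted 0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # hypothetical classifier outputs
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

If the classifier only returns hard 0/1 predictions, as in the question's code, there is just one threshold and the "curve" collapses to a single point.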
I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 - 5.
0.02 - 0.05
10-15.
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then in the converted range, it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has seen during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization (subtracting the mean and dividing by the standard deviation) is most often used in the machine learning field. Please note that when your test data arrives, it makes more sense to scale it with the mean and standard deviation of your training samples. If you have a very large amount of training data and assume it roughly follows a normal distribution, the chance that new test data falls far out of range won't be that high. Refer to this post for more details.
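A minimal sketch of that idea in Python, assuming the training data is a list of feature rows: the mean and standard deviation are computed once on the training set and then reused unchanged for any later sample. The function names and the three training rows (taken from the ranges in the question) are illustrative only.

```python
import math


def fit_standardizer(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def standardize(row, means, stds):
    """Scale one sample with statistics learned from the training set."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


train = [[3, 0.02, 10], [4, 0.035, 12.5], [5, 0.05, 15]]
means, stds = fit_standardizer(train)

# A sample "in the wild" whose first feature (7) exceeds the training maximum (5).
wild = standardize([7, 0.035, 12.5], means, stds)
print(wild)
```

An out-of-range value simply maps to a larger z-score instead of breaking the scaling; whether the SVM still predicts well that far from the training data is a separate question.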
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200], it will normalise to the same unit vector as above. The magnitudes of the features are "cancelled out" by the normalisation and we are left with relative values between 0 and 1.
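The worked example above can be checked with a short Python sketch (the function name is chosen for this example):

```python
import math


def to_unit_vector(vector):
    """Divide each component by the Euclidean length of the vector."""
    length = math.sqrt(sum(v * v for v in vector))
    if length == 0:
        raise ValueError("cannot normalise a zero vector")
    return [v / length for v in vector]


small = to_unit_vector([4, 0.02, 12])
large = to_unit_vector([400, 2, 1200])
print([round(v, 4) for v in small])
print([round(v, 4) for v in large])
```

Both calls give the same direction. Note that the feature with the smallest range contributes almost nothing to the normalised vector.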
This may be a weird request, so some explanation first. I recently had a sudden HD crash and lost a data file I was using to generate model files with libSVM. I do have the SVM model and scaling file that I generated from this data file, and I was wondering if there is a way to generate a data file from the support vectors in the model file, something like model_sv_to_instances(model, &instances), since the process for obtaining instances is very costly. (I know it won't be the same as the original, but it's still better than nothing.) I'm using a probabilistic SVM with an RBF kernel.
If you open a given model file in any text editor you would find something like this:
svm_type c_svc
kernel_type sigmoid
gamma 0.5
coef0 0
nr_class 2
total_sv 4
rho 0
label 0 1
nr_sv 2 2
SV
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
Where the interesting thing for you is after the line with SV.
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
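Those are the data points that were selected as support vectors, so you just have to parse the file. Note that in libsvm model files the leading field of each line is the support vector's coefficient rather than the original label; which class a vector belongs to follows from the label and nr_sv lines. The format is as follows:
[coef] [index1]:[value1] [index2]:[value2] ... [indexn]:[valuen]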
For instance, from my example you can conclude that my training set was:
x  y  desired val
0  0  -1
0  1   1
1  0   1
1  1  -1
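Parsing the SV section can be sketched in Python as follows. This is an illustrative parser written against the example file above, not part of libsvm; the function name is made up. It treats every leading field without a colon as a coefficient and every index:value field as a feature, and it assumes the sparse format in which missing indices mean zero.

```python
def parse_support_vectors(model_text):
    """Extract (coefficients, features) pairs from the SV section of a model file.

    features maps the 1-based feature index to its value.
    """
    lines = model_text.strip().splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip() == "SV") + 1
    support_vectors = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue
        coefficients = [float(f) for f in fields if ":" not in f]
        features = {int(i): float(v)
                    for i, v in (f.split(":") for f in fields if ":" in f)}
        support_vectors.append((coefficients, features))
    return support_vectors


MODEL = """svm_type c_svc
kernel_type sigmoid
gamma 0.5
coef0 0
nr_class 2
total_sv 4
rho 0
label 0 1
nr_sv 2 2
SV
1 1:0 2:0
1 1:1 2:1
-1 1:1 2:0
-1 1:0 2:1
"""

for coefficients, features in parse_support_vectors(MODEL):
    print(coefficients, features)
```

If the data was scaled before training, the recovered values are in the scaled space, so the scaling file would be needed to map them back to the original ranges.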
A few considerations and warnings. The ratio between the number of SVs and the number of data points depends on the parameters that you used. In some cases the SVs are only a small fraction of your data.
Another thing to keep in mind is that this reduction is likely to change the problem: if you train again with just the SVs as data points, you would probably get a completely different model with a completely different set of parameters.
Good luck!
In the case of RBF you are lucky. According to the libsvm FAQ you can extract the support vectors from the model file:
In the model file, after parameters and other informations such as labels , each line represents a support vector.
But remember, these are only the support vectors, which are only a fraction of your original input data.
To the best of my knowledge, SVM models in general, and libSVM models in particular, consist of only the support vectors. These vectors represent the borderline between the classes; most probably, they don't represent the vast majority of your data points. So, unfortunately, I don't think there's a way to regenerate your data from the model.
Having said that, I can think of an esoteric case where there might be some value in the model: there are companies specializing in recovering data in such cases (e.g. from crashed HDs). However, the recovered data sometimes has gaps; in certain cases, the model might be reverse-engineered to fill in some missing spots. This is highly theoretical, though.
EDIT: as the other answers state, the proportion of data points represented by the support vectors might vary, depending on the specific problem and parameters. However, as stated above, in most common cases, you'll be able to reconstruct only a small fraction of your original data set.
I'm trying to implement the P. Viola and M. Jones detection framework in C++ (to begin with, simply a sequential classifier, not the cascaded version). I think I have designed all the required classes and modules (e.g. integral images, Haar features), except one, the most important: the core AdaBoost algorithm.
I have read the original paper by P. Viola and M. Jones and many other publications. Unfortunately, I still don't understand how I should find the best threshold for a single weak classifier. I have found only brief references to "weighted median" and "Gaussian distribution" algorithms and many pieces of mathematical formulas...
I have tried to use the OpenCV Train Cascade module sources as a template, but the code is so extensive that reverse engineering it is very time-consuming. I also wrote my own simple code to understand the idea of Adaptive Boosting.
The question is: could you explain the best way to calculate the best threshold for a single weak classifier?
Below I present the AdaBoost pseudo code, rewritten from a sample found on Google, but I'm not convinced it's the correct approach. Calculating one weak classifier is very slow (a few hours), and I especially have doubts about the method of calculating the best threshold.
(1) AdaBoost::FindNewWeakClassifier
(2) AdaBoost::CalculateFeatures
(3) AdaBoost::FindBestThreshold
(4) AdaBoost::FindFeatureError
(5) AdaBoost::NormalizeWeights
(6) AdaBoost::FindLowestError
(7) AdaBoost::ClassifyExamples
(8) AdaBoost::UpdateWeights
DESCRIPTION (1)
-Generates all possible arrangement of features in detection window and put to the vector
DO IN LOOP
-Runs main calculating function (2)
END
DESCRIPTION(2)
-Normalizes weights (5)
DO FOR EACH HAAR FEATURE
-Puts sequentially next feature from list on all integral images
-Finds the best threshold for each feature (3)
-Finds the error for each the best feature in current iteration (4)
-Saves errors for each the best feature in current iteration in array
-Saves threshold for each the best feature in current iteration in array
-Saves the threshold sign for each the best feature in current iteration in array
END LOOP
-Finds for classifier index with the lowest error selected by above loop (6)
-Gets the value of error from the best feature
-Calculates the value of the best feature in the all integral images (7)
-Updates weights (8)
-Adds new, weak classifier to vector
DESCRIPTION (3)
-Calculates an error for each feature threshold on positive integral images - separately for the "+" and "-" sign (4)
-Returns threshold and sign of the feature with the lowest error
DESCRIPTION(4)
- Returns feature error for all samples, by calculating inequality f(x) * sign < sign * threshold
DESCRIPTION (5)
-Ensures that samples weights are probability distribution
DESCRIPTION (6)
-Finds the classifier with the lowest error
DESCRIPTION (7)
-Calculates a value of the best features at all integral images
-Counts false positives number and false negatives number
DESCRIPTION (8)
-Corrects weights, depending on classification results
Thank you for any help
In the original Viola-Jones paper here, section 3.1 Learning Discussion (para 4, to be precise), you will find the procedure for finding the optimal threshold.
I'll sum up the method quickly below.
The optimal threshold for each feature depends on the sample weights and is therefore recalculated in every iteration of AdaBoost. The best weak classifier's threshold is saved as mentioned in the pseudo code.
In every round, for each weak classifier, you have to arrange the N training samples according to the feature value. Putting a threshold will separate this sequence in 2 parts. Both parts will have either positive or negative samples in majority along with a few samples of other type.
T+ : total sum of positive sample weights
T- : total sum of negative sample weights
S+ : sum of positive sample weights below the threshold
S- : sum of negative sample weights below the threshold
Error for this particular threshold is -
e = MIN((S+) + (T-) - (S-), (S-) + (T+) - (S+))
Why the minimum? here's an example:
If the samples and threshold are like this -
+ + + + + - - | + + - - - - -
In the first round, if all weights are equal(=w), taking the minimum will give you the error of 4*w, instead of 10*w.
You calculate this error for all N possible ways of separating the samples.
The minimum error will give you the range of threshold values. The actual threshold is probably the average of the adjacent feature values (I'm not sure though, do some research on this).
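The procedure can be sketched in Python as a single pass over the samples sorted by feature value, keeping running sums for S+ and S-. This is an illustrative sketch rather than the OpenCV or the paper's reference code: the function names are made up, placing the threshold midway between adjacent feature values is the assumption mentioned above, and the polarity convention (+1 means "predict positive below the threshold") is chosen for this example.

```python
def split_error(s_pos, s_neg, t_pos, t_neg):
    """Return (error, polarity) for one split, following the formula above."""
    err_neg_below = s_pos + (t_neg - s_neg)  # samples below labelled negative
    err_pos_below = s_neg + (t_pos - s_pos)  # samples below labelled positive
    if err_pos_below < err_neg_below:
        return err_pos_below, +1
    return err_neg_below, -1


def best_threshold(values, labels, weights):
    """Find (error, threshold, polarity) with the lowest weighted error.

    values: feature value per sample; labels: 1 (positive) or 0 (negative).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    t_pos = sum(w for w, y in zip(weights, labels) if y == 1)  # T+
    t_neg = sum(w for w, y in zip(weights, labels) if y == 0)  # T-
    s_pos = s_neg = 0.0  # S+ and S-: weights below the current threshold
    best = None
    previous = None
    for i in order:
        value = values[i]
        if previous is None or value != previous:  # skip splits between ties
            threshold = value if previous is None else (previous + value) / 2.0
            error, polarity = split_error(s_pos, s_neg, t_pos, t_neg)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
        if labels[i] == 1:
            s_pos += weights[i]
        else:
            s_neg += weights[i]
        previous = value
    return best


# The example sequence above: + + + + + - - | + + - - - - -, all weights w = 1.
values = list(range(1, 15))
labels = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
weights = [1.0] * 14

print(split_error(5.0, 2.0, 7.0, 7.0))  # the split drawn in the example
print(best_threshold(values, labels, weights))
```

Sorting once per feature and updating the running sums makes each feature cost O(N log N) per boosting round instead of re-evaluating every threshold from scratch, which may help with the multi-hour training time mentioned in the question. For the example sequence, the split drawn above costs 4*w, while moving the threshold to just after the fifth sample lowers the error further.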
This was the second step in your DO FOR EACH HAAR FEATURE loop.
The cascades given along with OpenCV were created by Rainer Lienhart and I don't know what method he used.
You could closely follow the OpenCV source codes to get any further improvements on this procedure.