Responses of machine learning functions? - opencv

When going through any of the machine learning functions explained here. They all follow the format of cvStatModel.
For example the train function of NormalBayes is achieved by:
CvNormalBayesClassifier::train(const Mat& trainData, const Mat& responses, const Mat& varIdx=Mat(), const Mat& sampleIdx=Mat(), bool update=false )
The documentation tells you to check out cvStatModel for details on parameters.
What I dont understand is what is responses supposed to take? I know that trainData is the data we used for training the system using bag of words, but what to place in responses?
In an example on bag of words the responses element was handled as follows:
float label=atof(entryPath.filename().c_str());
labels.push_back(label);
NormalBayesClassifier classifier;
classifier.train(trainingData, labels);
So here the filenames of the images were converted to doubles and used as the responses element.
I don't understand this and am confused by it. Can some one please explain what the responses element is supposed to take? and why is atof used in the above example?

Those models are supervised machine learning techniques, it means that training the model requires not only the training data (i.e. the vectors of measurements), but also the labels (or continuous values) associated with each sample. For example, if you are trying to detect images containing cats, you have a training set of, say, 500 images not containing cats and 500 containing cats. You compute your descriptors for all 1000 images, and you assign a number to each category (by convention, -1 for "non-cats", 1 for "cats). Then, responses will be a matrix of 1000x1 integers, the first 500 values being -1, while the remaining beeing 1.
In you example, atof is used to convert a directory name to a unique number, representing the category, because training examples are probably sorted by folders (folder cats, dogs, bicycles, etc).

Related

which machine learning technique should be used for message classification

I have a dataset having customer message and final category one of example is following-
key message final category
1 i want customer care no i want to talk with ur team other
2 hi I 9986443603cjhh had qkuiv1uhqllljqvocally q illgi vq noclass
3 hai points not coming checking
like. The dataset is huge file with at least 20 final category type. Please suggest appropriate method to classify the data with a message which will be its final category. I am thinking of making feature_vector with message word and feed it into Bayesian would it be great? Or I have to use other technique.
Thanks a lot.
You can consider word-embedding.
You can download from here the embbedings (in this link- Glove, you can alternatively use word2vec).
The idea is that similar words will have similar vectors.
After you convert each word in your message to a vector you can average all the vectors (or, average using TF-IDF for better results) to get the vector-representation of your message.
Of course, words like qkuiv1uhqllljqvocally will not appear in the vocabulary.
To check your results, you can cluster(using 20-means clustering, if you have 20 classes) all your vectors to see that similar messages cluster to the same group.

How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

OpenCV Principal Component Analysis terminology - what actually is a 'sample'?

I'm working with Principal Component Analysis (PCA) in openCV. The constructor inputs for the case I'm interested in are:
PCA(InputArray data, InputArray mean, int flags, double retainedVariance);
Regarding the InputArray 'data' the documents state the appropriate flags should be:
CV_PCA_DATA_AS_ROW indicates that the input samples are stored as
matrix rows.
CV_PCA_DATA_AS_COL indicates that the input samples are
stored as matrix columns.
My question pertains to the use of the term 'samples' in that I'm not sure what a sample is in this context.
For example let's say I have 4 sets of data and for the sake of illustration let's label them A-D. Now each set A through D has 8 elements. They are then set up in the Mat variable I'll use as InputArray as follows:
The question is, which is it:
My sets are samples?
My data elements are samples?
Another way of asking:
Do I have 4 samples (CV_PCA_DATA_AS_COL)
Or do I have 4 sets of 8 samples (CV_PCA_DATA_AS_ROW)
?
As a guess, I'd choose CV_PCA_DATA_AS_COL (i.e. I have 4 samples) - but that's just where my head is at... Until I learn the correct terminology it seems the word 'sample' could apply to either reasoning.
Ugh...
So the answer was found by reversing the logic behind the documentation for the PCA::project step...
Mat PCA::project(InputArray vec)
vec – input vector(s); must have the same dimensionality and the same
layout as the input data used at PCA phase, that is, if
CV_PCA_DATA_AS_ROW are specified, then vec.cols==data.cols (vector
dimensionality)
i.e. 'sample' is equivalent to 'set', and the elements are the 'dimension'.
(and my guess was correct :)

Encog query classification

I'm trying to process this dataset using Encog. In order to do so, I combined the outputs into one (can't seem to figure out how to use multiple expected outputs, even tho I unsuccessfully tried to manually train a NN with 4 output neurons) with the values: "disease1", "disease2", "none" and "both".
Starting from there, used the analyst wizard in the CSV, and the automatic process trained a NN with the expected outputs. A peak from the file:
"field:1","field:2","field:3","field:4","field:5","field:6","field:7","Output:field:7"
40.5,yes,yes,yes,yes,no,both,both
41.2,no,yes,yes,no,yes,second,second
Now my problem is: how do I query it? I tried with classification, but as far as I've understood, the result only gives me the values {0,1,2}, so there are two classes which I can't differentiate (both are 0).
This same problem applies to the Iris example presented in the wiki. Also, how does Encog extrapolate from the output neuron values to the 0/1/2 results?
Edit: the solution I have found was to use a separate network for disease 1 and disease 2, but I really would like to know if it was possible to combine those into one.
You are correct, that you will need to combine the output column to a single value. Encog analyst will only classify to a single output column. That output column can have many different values. So normalizing the two output columns to none,first,second,both will work. If you use the underlying neural networks directly, you could actually train for two outputs each doing an independent classification. But for this discussion I will assume we are dealing with the analyst.
Are you querying the network using the workbench, or in code? By default Encog analyst encodes to the neural network using equilateral encoding. This results in a number of output neurons equal to n-1, where n is the number of classes. If you choose one-of-n encoding in the analyst wizard, then the regular classify method on the BasicNetwork will work, as it is only designed for one-of-n.
If you would like to query (in code) using equilateral, then you can use a method similar to the following. I am adding this to the next version of Encog.
/**
* Used to classify a neural network that has been encoded using equilateral encoding.
* This is the default for the Encog analyst. Equilateral encoding uses an output count
* equal to the number of classes minus one.
* #param input The input to the neural network.
* #param high The high value of the activation range, usually 1.
* #param low The low end of the normalization range, usually -1 or 0.
* #return The class that the input belongs to.
*/
public int classifyEquilateral(final MLData input,double high, double low) {
MLData result = this.compute(input);
Equilateral eq = new Equilateral(getOutputCount()+1,high,low);
return eq.decode(result.getData());
}

opencv kmeans doesn't classify data in some classes

I'm trying to implement Scalable Recognition with a Vocabulary Tree
and I'm using opencv kmeans function to cluster feature vectors so I put all my vectors in one Mat object and pass it to the function like this:
TermCriteria criteria;
criteria.epsilon = 0.1;
int attempts = 1;
int flags = KMEANS_RANDOM_CENTERS;
int K = 10;
Mat Centers;
Mat Labels;
kmeans(descriptors, K, Labels, criteria, attempts, flags, Centers);
So in the function fills "Centers" and "Labels" Mat objects like this:
Centers has K rows, 64 columns (I'm using SURF features) and one channel
Labels has as many rows as "descriptors", one column and one channel and it's values are in the range of [0 K-1]
These are the things I have checked. After I do this to all the vectors I copy vectors with the same label to a new Mat and pass it to the function again.
My problem is that sometimes one of the values in the range [0 k-1] is missing in "Label" so none of the feature vectors is classified in that cluster. I've checked it for different K's and It usually happens at least once at some level (never in the first call though). Even for K = 3.
I assume at those times the data I pass to the function is not right. So my question is that when could this happen? What things should I check on the data that I pass to the function to make sure they are valid?
Also if you have a link of any good implementations of the paper I would really appreciate it if you post it here.
It turned out that some times some clusters have less than K number of members in them so in the next level the function returns an error. Though I still haven't figured out why sometimes a cluster is empty.

Resources