How to use ContinuousValueEncoder in Mahout

I am trying to classify the MNIST data set using the Mahout library. It has 784 features, each a continuous variable. I am encoding the vector using ContinuousValueEncoder, and I am unsure whether I have to create 784 ContinuousValueEncoder instances:
encoder_feature_1.addToVector((byte[]) null,value1,v);
encoder_feature_2.addToVector((byte[]) null,value2,v);
encoder_feature_3.addToVector((byte[]) null,value3,v);
...........
...........
...........
encoder_feature_784.addToVector((byte[]) null,value784,v);
//where v is RandomAccessSparseVector with cardinality 784
Is this the correct way, or do I need to use a single instance of the encoder?
It seems I have not understood the concept properly.

Related

Convert dgeMatrix for downstream tasks

I am trying to cluster sentence embeddings based on a GloVe model from text2vec. I generated the embeddings using the GloVe model like so (I create the iterator, vocabulary, etc. in the standard way):
# create document term matrix
dtm = create_dtm(it, vectorizer)
# assign the word embeddings
common_terms = intersect(colnames(dtm), rownames(word_vectors) )
# normalise
dtm_averaged <- text2vec::normalize(dtm[, common_terms], "l1")
# compute average sentence embeddings
sentence_vectors = dtm_averaged %*% word_vectors[common_terms, ]
The resulting object is of class dgeMatrix, which, as I understand it, is equivalent to the matrix class. The dgeMatrix class isn't supported by many downstream tasks, so I would like to convert the matrix. The object, however, is 6 GB in size, and I have problems converting it to a data frame or even writing it to a text file for further processing.
Ideally, I'd use this matrix in Spark for further analysis such as k-means clustering. My question is: what would be the best strategy for using the matrix in downstream tasks?
a) Convert to matrix class or data frame
b) write the matrix to file?
c) something completely different
I run the models on Google Cloud on a machine with 32 GB of RAM and 28 CPUs.
Thanks for your help.

How to apply a CNN to multi-channel pixel data, based on weights for each channel?

I have an image with 8 channels. I have a conventional algorithm in which weights are applied to each of these channels to get an output of '0' or '1'. This works fine with several samples and complex scenarios. I would like to implement the same thing in machine learning using a CNN.
I am new to ML and started looking at tutorials, which seem to deal exclusively with image processing problems: handwriting recognition, feature extraction, etc.
http://cv-tricks.com/tensorflow-tutorial/training-convolutional-neural-network-for-image-classification/
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/neural_networks.html
I have set up Keras with Theano as the backend. The basic Keras samples work without problems.
What steps do I need to follow to achieve the same result using a CNN? I do not understand the use of filters, kernels, and stride in my use case. How do we provide training data to Keras if the pixel channel values and outputs are in the form below?
Pixel#1 f(C1,C2...C8)=1
Pixel#2 f(C1,C2...C8)=1
Pixel#3 f(C1,C2...C8)=0
...
Pixel#N f(C1,C2...C8)=1
I think you should treat this the same way you would use a CNN for semantic segmentation. For an example, look at
https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf
You can use the same architecture as they use, but in the first layer use filters for 8 channels instead of 3.
For the loss function, you can use the same loss function or something more specific to binary classification.
There are several implementations for Keras, but with the TensorFlow backend:
https://github.com/JihongJu/keras-fcn
https://github.com/aurora95/Keras-FCN
Since the input is a sequence of channel values, I would suggest using Convolution1D. Here you take each pixel's channel values as the input and predict one output for each pixel. Try this,
e.g.:
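from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

# Each sample is one pixel: its 8 channel values form a length-8 sequence.
# filters=16, kernel_size=3 and pool_size=2 are illustrative assumptions.
model = Sequential()
model.add(Conv1D(filters=16, kernel_size=3, strides=1, padding='valid', activation='relu', input_shape=(8, 1)))
model.add(Conv1D(filters=16, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
# ... add as many layers as you want ...
model.add(Flatten())  # collapse to a flat vector before the dense output
model.add(Dense(1, activation='sigmoid'))  # one 0/1 prediction per pixel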
Use binary_crossentropy as the loss function.
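To make the training-data part of the question concrete, here is a minimal sketch in plain Python of how an H x W image with 8 channels could be flattened into one sample per pixel, matching the Pixel#k f(C1,C2...C8)=label form above. The function name pixels_to_samples, the tiny 2 x 2 image, and the labels are hypothetical and only for illustration. With Keras one would typically convert the result to NumPy arrays, and a Conv1D first layer would presumably need each sample reshaped to (8, 1).

```python
def pixels_to_samples(image, mask):
    """Flatten an H x W x C image and an H x W label mask into per-pixel samples.

    image: nested lists, image[row][col] is a list of C channel values.
    mask:  nested lists, mask[row][col] is the 0/1 label of that pixel.
    Returns (X, y): X holds one row of C channel values per pixel, y the labels.
    """
    X, y = [], []
    for img_row, mask_row in zip(image, mask):
        for channels, label in zip(img_row, mask_row):
            X.append(list(channels))
            y.append(int(label))
    return X, y


# Hypothetical 2 x 2 image with 8 channels per pixel (values are made up).
image = [
    [[0.1] * 8, [0.9] * 8],
    [[0.5] * 8, [0.2] * 8],
]
mask = [
    [0, 1],
    [1, 0],
]

X, y = pixels_to_samples(image, mask)
print(len(X), len(X[0]), y)  # 4 8 [0, 1, 1, 0]
```

Each row of X is then one training sample and the matching entry of y is its target, so the per-pixel weighting of the conventional algorithm becomes something the network can learn from examples.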

How to get jlibsvm prediction probability in multi-class classification

I am new to SVMs. I am using jlibsvm for a multi-class classification problem, specifically sentence classification. There are 3 classes. As I understand it, I am doing one-against-all classification. I have a comparatively small training set: 75 sentences in total, with 25 sentences belonging to each class.
I am building 3 SVMs (so 3 different models). While training SVM_A, sentences belonging to CLASS A get the true label, i.e. 1, and all other sentences get the label -1. SVM_B and SVM_C are trained correspondingly.
While testing, to get the true label of a sentence, I give the sentence to all 3 models and take the prediction probability returned by each. The model that returns the highest probability determines the class the sentence belongs to.
This is how I am doing it, but I am getting the same prediction probability for every sentence in the test set, for all models:
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there a better way to do multi-class classification than this?
You are getting the same output because you generate the same model three times.
The reason is that jlibsvm can perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects that more than two class labels are present in the given data, it automatically performs multiclass classification. So there is no need for a manual 1-vs-N approach. Just supply the data with a class label for each category.
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed since. For a more intuitive Java binding (in comparison to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.
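As a rough illustration of the difference, the sketch below (plain Python, not jlibsvm code) contrasts the manual one-vs-rest labels described in the question with the single multi-class labelling recommended here. The six labels, the scores, and the helper names one_vs_rest_labels and pick_class are hypothetical.

```python
def one_vs_rest_labels(labels, positive_class):
    """Labels for one binary problem: +1 for positive_class, -1 otherwise."""
    return [1 if lab == positive_class else -1 for lab in labels]


def pick_class(scores):
    """Return the class whose binary model gave the highest score."""
    return max(scores, key=scores.get)


# Hypothetical training labels for six sentences, two per class.
labels = ["A", "A", "B", "B", "C", "C"]

# Manual one-vs-rest: three binary label vectors, one per model.
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
print(binary_problems["A"])  # [1, 1, -1, -1, -1, -1]

# Native multi-class: a single problem whose labels are the class ids.
class_ids = {c: i for i, c in enumerate(sorted(set(labels)))}
multiclass_labels = [class_ids[lab] for lab in labels]
print(multiclass_labels)  # [0, 0, 1, 1, 2, 2]

# With one-vs-rest, the final class is the model with the highest score.
print(pick_class({"A": 0.2, "B": 0.7, "C": 0.1}))  # B
```

With the single labelling, only one training problem and one model are needed, and the library handles the decomposition into binary problems internally.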

libsvm not giving support vectors / no support vectors

I am using jlibsvm to do SVM regression. My data set is very small (42 samples). When I use the data set to create a model using epsilon-SVR with a sigmoid kernel, no support vectors are generated.
This is what I get in my model file :
svm_type epsilon_svr
kernel_type sigmoid
gamma 0.02380952425301075
coef0 0.0
label
rho -66.42803
total_sv 0
probA -1.0
SV
When I use other data sets from the libsvm website, I get a model file with support vectors as expected.
Can someone please suggest why no support vectors are being generated for my data set?
My data set file is formatted correctly, so there are no issues there.
This could mean that the best found classification, given your data and the hyperparameters, is to assign the same label to all samples.
Are your samples unbalanced? What's the number of positive and negative samples? You might want to try adding a weighting to positive/negative samples to account for that.
It could also be that the samples are hard to separate given their structure and the kernel type. Have you tried a different structure?
With only 42 data samples, maybe you could add them to your question and get better answers.

Import trained SVM from scikit-learn to OpenCV

I'm porting an algorithm that uses a Support Vector Machine from Python (using scikit-learn) to C++ (using the machine learning library of OpenCV).
I have access to the trained SVM in Python, and I can import SVM model parameters from an XML file into OpenCV. Since the SVM implementation of both scikit-learn and OpenCV is based on LibSVM, I think it should be possible to use the parameters of the trained scikit SVM in OpenCV.
The example below shows an XML file which can be used to initialize an SVM in OpenCV:
<?xml version="1.0"?>
<opencv_storage>
<my_svm type_id="opencv-ml-svm">
<svm_type>C_SVC</svm_type>
<kernel><type>RBF</type>
<gamma>0.058823529411764705</gamma></kernel>
<C>100</C>
<term_criteria><epsilon>0.0</epsilon>
<iterations>1000</iterations></term_criteria>
<var_all>17</var_all>
<var_count>17</var_count>
<class_count>2</class_count>
<class_labels type_id="opencv-matrix">
<rows>1</rows>
<cols>2</cols>
<dt>i</dt>
<data>
0 1</data></class_labels>
<sv_total>20</sv_total>
<support_vectors>
<_>
2.562423055146794554e-02 1.195797425735170838e-01
8.541410183822648050e-02 9.395551202204914520e-02
1.622867934926303379e-01 3.074907666176152077e-01
4.099876888234874062e-01 4.697775601102455179e-01
3.074907666176152077e-01 3.416564073529061440e-01
5.124846110293592716e-01 5.039432008455355660e-01
5.466502517646497639e-01 1.494746782168964394e+00
4.168208169705446942e+00 7.214937388193202183e-01
7.400275229357797802e-01</_>
<!-- omit 19 vectors to keep it short -->
</support_vectors>
<decision_functions>
<_>
<sv_count>20</sv_count>
<rho>-5.137523249549433402e+00</rho>
<alpha>
2.668992955678978518e+01 7.079767098112181145e+01
3.554240018130368384e+01 4.787014908624512088e+01
1.308470223155845069e+01 5.499185410034550614e+01
4.160483074010306126e+01 2.885504210853826379e+01
7.816431542954153144e+01 6.882061506693679576e+01
1.069534676985309574e+01 -1.000000000000000000e+02
-5.088050252552544350e+01 -1.101740897543916375e+01
-7.519686789702373630e+01 -3.893481464245511603e+01
-9.497774056452135483e+01 -4.688632332663718927e+00
-1.972745089701982835e+01 -8.169343841768861125e+01</alpha>
<index>
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
</index></_></decision_functions></my_svm>
</opencv_storage>
I would now like to fill this XML file with values from the trained scikit-learn SVM. But I'm not sure how the parameters of scikit-learn and OpenCV correspond. Here is what I have so far (clf is the classifier object in Python):
<kernel><gamma> corresponds to clf.gamma
<C> corresponds to clf.C
<term_criteria><epsilon> corresponds to clf.tol
<support_vectors> corresponds to clf.support_vectors_
Is this correct so far? Now here are the items I'm not really sure about:
What about <term_criteria><iterations>?
Does <decision_functions><_><rho> correspond to clf.intercept_?
Does <decision_functions><_><alpha> correspond to clf.dual_coef_? Here I'm not sure because the scikit-learn documentation says "dual_coef_ which holds the product y_i α_i". It looks like OpenCV expects only α_i, and not y_i α_i.
You don't need epsilon and iterations anymore; those are only used when solving the training optimization problem. You can set them to your favorite number or ignore them.
Porting the support vectors may require some fiddling, as indexing may differ between scikit-learn and OpenCV. For example, the XML in your question does not use a sparse format.
As for the other parameters:
rho should correspond to intercept_, but you may need to change the sign.
scikit-learn's dual_coef_ corresponds to sv_coef in standard libsvm models (which is alpha_i*y_i).
If OpenCV complains about the values you provide for alpha when porting, use the absolute values of scikit-learn's dual_coef_ (i.e. all positive). These are the true alpha values of an SVM model.
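To check a hand-ported model, it can help to evaluate the decision function directly and compare the result with what each library returns for the same input. The sketch below assumes the standard LIBSVM form f(x) = sum_i coef_i * K(sv_i, x) - rho with an RBF kernel, where coef_i = alpha_i * y_i. Whether scikit-learn or OpenCV flips a sign relative to this form is exactly what such a comparison is meant to reveal. All numbers and function names here are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coefs, rho, gamma):
    """LIBSVM-style decision value: sum_i coef_i * K(sv_i, x) - rho."""
    total = sum(c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coefs))
    return total - rho


# Hypothetical two-support-vector model (values are made up).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
signed_coefs = [2.0, -2.0]               # alpha_i * y_i, as in sv_coef
alphas = [abs(c) for c in signed_coefs]  # plain alpha_i, always non-negative
labels = [1 if c > 0 else -1 for c in signed_coefs]  # y_i recovered from the sign
rho = 0.5
gamma = 0.5

value = decision_value([0.0, 0.0], support_vectors, signed_coefs, rho, gamma)
predicted = 1 if value > 0 else -1
print(round(value, 4), predicted)  # 0.7642 1
```

If the ported model predicts the opposite class for every sample, negating rho or the coefficients in this sketch is a quick way to find out which sign convention the target library uses.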
