Support Vector Machine in Apache Spark - machine-learning

I would like to have some insights in running Support Vector Machine (SVM) in Apache Spark.
When I use the run-example script given in the Spark home directory and using the argument org.apache.spark.mllib.classification.SVMWithSGD it displays the following Usage: SVM <master> <input_dir> <step_size> <regularization_parameter> <niters> message. I understand the <master>, the <input_dir> and the <niters> arguments.
Can you please help me figure out the rest of the arguments or at least direct me to some tutorial site of some sort?

<step_size> is the starting point for the learning rate. For convergence the step-size should be decreasing. In SGD this is implemented by taking the input value for step_size and dividing by the square root of the iteration.
<reg_param>is the scalar adjusting the strength of the constraints. Small values imply a soft margin and large values imply a hard margin. Infinity is the hardest margin.

They are explained in the docs in the last section.

Related

How to implement randomised log space search of learning rate in PyTorch?

I am looking to fine tune a GNN and my supervisor suggested exploring different learning rates. I came across this tutorial video where he mentions that a randomised log space search of hyper parameters is typically done in practice. For sake of the introductory tutorial this was not covered.
Any help or pointers on how to achieve this in PyTorch is greatly appreciated. Thank you!
Setting the scale in logarithm terms let you take into account more desirable values of the learning rate, usually values lower than 0.1
Imagine you want to take learning rate values between 0.1 (1e-1) and 0.001 (1e-4). Then you can set this lower and upper bound on a logarithm scale by applying a logarithm base 10 on it, log10(0.1) = -1 and log10(0.001) = -4. Andrew Ng provides a clearer explanation in this video.
In Python you can use np.random.uniform() for this
searchable_learning_rates = [10**(-4 * np.random.uniform(0.5, 1)) for _ in range(10)]
searchable_learning_rates
>>>
[0.004890650359810075,
0.007894672127828331,
0.008698831627963768,
0.00022779163472045743,
0.0012046829055603172,
0.00071395500159473,
0.005690032483124896,
0.000343368839731761,
0.0002819402550629178,
0.0006399571804618883]
as you can see you're able to try learning rate values from 0.0002819402550629178 up to 0.008698831627963768 which is close to the upper bound. The longer the array the more values you will try.
Following the example code in the video you provided you can implement the randomized log search for the learning rate by replacing learning_rates for searchable learning_rates
for batch_size in batch_sizes:
for learning_rate in searchable_learning_rates:
...
...

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design a image classification network. There this constraint in the assignment I can have 500,000 number of hidden/tunable parameters at most in this design.
How can I count or observe the number of these hidden parameters especially if I am using this tensor flow tutorial as initial code/design.
Thanks in advance
How can I count or observe the number of these hidden parameters especially if I am using this tensor flow tutorial as initial code/design.
Instead of me doing the work for you I'll show you how to count free parameters
Glancing quickly it looks like the code at cifar10 uses layers of max pooling, convolution, bias, fully connected weights. Let's review how many free parameters each of these layers adds to your architecture.
max pooling : FREE! That's right, there are no "free parameters" from max pooling.
conv : Convolutions are defined using parameters like [1,3,3,1] where the numbers correspond to your tensor like so [batch_size, CONV_SIZE, CONV_SIZE, FEATURE_DEPTH]. Multiply all the dimension sizes together to find the total size of your free parameters. In the case of [1,3,3,1], the total is 1x3x3x1 = 9.
bias : A Bias is similar to convolutions in that it is defined by a shape like [10] or [1,342,342,3]. Same thing, just multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected : A fully connected layer usually has a 2d shape like [1024,32]. This means that it is a 2d matrix, and you calculate the total free parameters just like the convolution. In this example [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
500 000 parmeters? You use an R, G and B value of each pixel? If yes there is some problems
1. too much data (long calculating time)
2. in image clasification companys always use some other image analysis technique(preprocesing) befor throwing data into NN. if you have to identical images. Second is moved by one piksel. For the network they can be very diffrend.
Imagine other neural network. Use two parameters maybe weight and height. If you swap this parametrs what will happend.
Yes during learning of your image network can decrease this effect but when I made experiments with 5x5 binary images that was very hard to network. I start using 4 layers but this help only a little.
The image used to lerning can be good clasified, after destoring also but mooving for one pixel and you have a problem.
If no make eksperiments or use genetic algoritm to find it.
After laerning you should use some algoritm to find dates with network recognize as "no important"(big differnce beetwen weight of this input and the rest, If this input weight are too close to 0 network "think" it is no important)

bayesianoptimization in machine learning

Thanks for reading this. I am currently studying bayesoptimization problem and follow the tutorial. Please see the attachment.bayesian optimization tutorial
In page 11, about the acquisition function. Before I raise my question I need state my understanding about bayesian optimization to see if there is anything wrong.
First we need take some training points and assume them as multivariable gaussian ditribution. Then we need use acquisiont function to find the next point we want to sample. So for example we use x1....x(t) as training point then we need use acquisition function to find x(t+1) and sample it. Then we'll assume x1....x(t),x(t+1) as multivariable gaussian ditribution and then use acquisition function to find x(t+2) to sample so on and so forth.
In page 11, seems we need find the x that max the probability of improvement. f(x+) is from the sample training point(x1...xt) and easy to get. But how to get u(x) and that variance here? I don't know what is the x in the eqaution. It should be x(t+1) but the paper doesn't say that. And if it is indeed x(t+1), then how could I get its u(x(t+1))? You may say use equation at the bottom page 8, but we can use that equation on condition that we have found the the x(t+1) and put it into multivariable gaussian distribution. Now we don't know what is the next point x(t+1) so I have no way to calculate, in my opinion.
I know this is a tough question. Thanks for answering!!
In fact I have got the answer.
Indeed it is x(t+1). The direct way is we compute every u and varaince of the rest x outside of the training data and put it into acquisition function to find which one is the maximum.
This is time consuming. So we use nonlinear optimization like DIRECT to get the x that max the acquisition function instead of trying one by one

How to optimize affinity propagation results?

I was wondering if anybody knows anything deeper than what I do about the Affinity Propagation clustering algorithm in the python scikit-learn package?
All I know right now is that I input an "affinity matrix" (affmat), which is calculated by applying a heat kernel transformation to the "distance matrix" (distmat). The distmat is the output of a previous analysis step I do. After I obtain the affmat, I pass that into the algorithm by doing something analogous to:
af = AffinityPropagation(affinity = 'precomputed').fit(affmat)
On one hand, I appreciate the simplicity of the inputs. On the other hand, I'm still not sure if I've got the input done correctly?
I have tried evaluating the results by looking at the Silhouette scores, and frequently (but not a majority of the time) I am getting scores close to zero or in the negative ranges, which indicate to me that I'm not getting clusters that are 'naturally grouped'. This, to me, is a problem.
Additionally, when I observe some of the plots, I'm getting weird patterns of clustering, such as the image below, where I clearly see 3 clusters of points, but AP is giving me 10:
I also see other graphs like this, where the clusters overlap a lot.:
How are the cluster centers' euclidean coordinates identified? I notice when I put in percent values (99.2, 99.5 etc.), rather than fraction values (0.992, 0.995 etc.), the x- and y-axes change in scale from [0, 1] to [0, 100] (but of course the axes are scaled appropriately such that I'm getting, on occasion, [88, 100] or [92, 100] and so on).
Sorry for the rambling here, but I'm having quite a hard time understanding the underpinnings of the python implementation of the algorithm. I've gone back to the original paper and read it, but I don't know how the manipulate it in the scikit-learn package to get better clustering.

Which of the parameters in LibSVM is the slack variable?

I am a bit confused about the namings in the SVM. I am using this library LibSVM. There are so many parameters that can be set. Does anyone know which of these is the slack variable?
thx
The "slack variable" is C in c-svm and nu in nu-SVM. These both serve the same function in their respective formulations - controlling the tradeoff between a wide margin and classifier error. In the case of C, one generally test it in orders of magnitude, say 10^-4, 10^-3, 10^-2,... to 1, 5 or so. nu is a number between 0 and 1, generally from .1 to .8, which controls the ratio of support vectors to data points. When nu is .1, the margin is small, the number of support vectors will be a small percentage of the number of data points. When nu is .8, the margin is very large and most of the points will fall in the margin.
The other things to consider are your choice of kernel (linear, RBF, sigmoid, polynomial) and the parameters for the chosen kernel. Generally one has to do a lot of experimenting to find the best combination of parameters. However, be careful of over-fitting to your dataset.
Burges wrote a great tutorial: A Tutorial on Support Vector Machines for Pattern
Recognition
But if you mostly just want to know how to USE it and less about how it works, read "A Practical Guide to Support Vector Classication" by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (authors of libsvm)
First decide which type of SVM are u intending to use: C-SVC, nu-SVC , epsilon-SVR or nu-SVR. In my opinion u need to vary C and gamma most of the time... the rest are usually fixed..

Resources