What does the gamma parameter in svm.SVC() actually do?

I am trying to tune an SVM model that I built with different values of gamma and C. I also trained the model on scaled input data.
After trying multiple combinations I found a pair of gamma and C that gave the best accuracy, though without having any idea of what gamma is actually doing. The combination was:
svc = svm.SVC(gamma=0.025, C=25)
I read the docs to get a sense of what gamma actually does (they say "Kernel coefficient for 'rbf', 'poly' and 'sigmoid'"), and now I'm even more confused.
Any help will be highly appreciated.
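For the record, in the RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2), gamma controls how far the influence of a single training example reaches: a small gamma gives a smoother decision boundary, a large gamma makes each point's influence very local (and invites over-fitting), while C trades training error against margin width. A common way to explore the two together is a grid search on the scaled data; below is a minimal sketch, where the parameter grid and the names X_scaled and y are assumptions for illustration, not part of the question.

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the hand-tuned values; widen or shrink it for your data.
param_grid = {'gamma': [0.005, 0.01, 0.025, 0.05, 0.1],
              'C': [1, 5, 25, 100]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_scaled, y)                     # X_scaled, y: your scaled features and labels
print(search.best_params_, search.best_score_)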

Related

Questions around XGBoost

I am trying to understand the XGBoost algorithm and have a few questions around it.
I have read various blogs but all seem to tell a different story. Below is a snippet from the code that I am using (only for reference).
param <- list(objective = 'reg:linear',
              eta = 0.01,
              max_depth = 7,
              subsample = 0.7,
              colsample_bytree = 0.7,
              min_child_weight = 5)
Below are the 4 questions that I have:
1) It seems that XGBoost uses gradient descent to minimise the cost function by changing the coefficients. I understand how that works for a gblinear model, which uses linear regression.
However, for a gbtree model, how can XGBoost apply gradient descent when there are no coefficients in a tree-based model for it to change? Or are there?
2) Similarly, the gbtree model uses the parameters lambda for L2 regularisation and alpha for L1 regularisation. I understand that regularisation applies constraints on coefficients, but again a gbtree model has no coefficients. So what does it apply constraints to?
3) What is the job of the objective function, e.g. reg:linear? From what I understand, the objective function only tells the model which evaluation metric to use, but then there is a separate eval_metric parameter for that. So why do we need an objective function at all?
4) What is min_child_weight in simple terms? I thought it was just the minimum number of observations in a leaf node, but I think it has something to do with the Hessian, which I don't understand well.
I would really appreciate it if anyone could shed some more light on these in simple, easy-to-understand terms.
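As a point of reference for questions 2–4, here is a rough Python equivalent of the parameter list above; the data names X and y, the rmse eval_metric, and the lambda/alpha values are illustrative assumptions. The point it tries to make visible is that the objective defines the loss whose gradients and Hessians drive each boosting step, eval_metric only controls what is reported on the watchlist, and lambda/alpha penalise the leaf weights of each tree rather than linear coefficients.

import xgboost as xgb

dtrain = xgb.DMatrix(X, label=y)            # hypothetical feature matrix and target

params = {
    'objective': 'reg:linear',              # the training loss ('reg:squarederror' in newer versions)
    'eval_metric': 'rmse',                  # only affects what is reported / used for early stopping
    'eta': 0.01,
    'max_depth': 7,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 5,                  # minimum sum of instance Hessians in a leaf
    'lambda': 1.0,                          # L2 penalty on each tree's leaf weights
    'alpha': 0.0,                           # L1 penalty on each tree's leaf weights
}

booster = xgb.train(params, dtrain, num_boost_round=500, evals=[(dtrain, 'train')])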

Why does the hypothesis have to introduce two parameters, namely θ0 and θ1?

I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1·x
In supervised learning, we have some training data and, based on it, we try to "deduce" a function that closely maps the inputs to the corresponding outputs. To deduce that function, we introduce the hypothesis as a linear function of the input x.
My question is: why is a function involving two θs chosen? Why can't it be as simple as y(i) = a * x(i), where a is a coefficient, after which we go about finding a "good" value of a for a given example (i) using an algorithm? This question might sound very basic; I apologize, I'm not very good at machine learning, I'm just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x that never exceeds 0.01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the constant (D.C.) component.
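A quick numerical sketch of that point, assuming scikit-learn (the constant-5 data is just the example above, and LinearRegression stands in for the course's gradient-descent fit):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.full(50, 5.0)                                  # constant output y = 5, as in the example above

with_intercept = LinearRegression().fit(x, y)
no_intercept = LinearRegression(fit_intercept=False).fit(x, y)   # forces θ0 = 0

print(with_intercept.intercept_, with_intercept.coef_)   # ≈ 5.0 and ≈ 0.0: a perfect fit
print(np.abs(no_intercept.predict(x) - y).mean())        # large residual: a*x alone cannot absorb the constant

Without the intercept the model can only tilt a line through the origin, so its error never reaches zero on this data.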

Scikit-learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62–65% accuracy for each split (cv=5).
I thought that if I made the features polynomial, the accuracy was supposed to increase. However, I'm seeing the opposite effect (each cross-validation split is now in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when expanding the data.
Here is the code I'm using,
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression    # this import was missing from the original snippet
from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in newer versions

X_scaled = preprocessing.scale(X)
poly = PolynomialFeatures(3)
poly_x = poly.fit_transform(X_scaled)
classifier = LogisticRegression(penalty='l2', max_iter=200)
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
This makes me suspect I'm doing something wrong.
I also tried expanding the raw data first and then scaling it with preprocessing.scale, but that resulted in a warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going down that route.
The other thing that's bothering me is the speed of the polynomial computations: cross_val_score takes a couple of hours to produce the scores when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, running Windows 7.
Thank you.
Have you tried using the MinMaxScaler instead of scale? scale outputs values that lie both above and below 0, so you will run into situations where a value scaled to -0.1 and one scaled to 0.1 have the same squared value, despite not really being similar at all. Intuitively, that seems like something that would lower the score of a polynomial fit. That said, I haven't tested this; it's just my intuition.
Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
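A minimal sketch of that suggestion, reusing the X, y and cv=5 from the question; the degree-2 expansion and the use of sklearn.model_selection (rather than the older sklearn.cross_validation) are assumptions for illustration:

from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_minmax = MinMaxScaler().fit_transform(X)               # maps every feature into [0, 1], so squaring preserves order
poly_x = PolynomialFeatures(2).fit_transform(X_minmax)   # degree 2 keeps the expansion smaller than degree 3
scores = cross_val_score(LogisticRegression(penalty='l2', max_iter=200), poly_x, y, cv=5)
print(scores)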
There is a statement that "the accuracy is supposed to increase" with polynomial features. That is true only if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially when every feature is made to interact and raised to powers, may move the model further from the data-generating process, so worse results can be entirely appropriate.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 penalties, the grouped lasso, RFE, or Bayesian methods. Ask subject-matter experts who may be able to identify the specific features that should be polynomial, and plot the data to see which features interact or benefit from a polynomial term.
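One way to act on the L1 suggestion, as a rough sketch (poly_x and y are the expanded matrix and labels from the question; the C value and SelectFromModel's default threshold are assumptions to tune, not recommendations):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# An L1-penalised logistic regression zeroes out many of the 41k+ expanded columns;
# SelectFromModel keeps only the surviving ones before the final L2 model is fit.
selector = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1))
model = make_pipeline(selector, LogisticRegression(penalty='l2', max_iter=200))
print(cross_val_score(model, poly_x, y, cv=5))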
I have not looked at it for a while, but I recall discussions of hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your best model turns out to keep an interaction while dropping one of its main effects.

TensorFlow and the MNIST data set

First of all: I'm completely new to machine learning and TensorFlow; I've only been playing around with this technology for a few weeks, and I really like it.
But I have a (maybe simple) question about the MNIST data set in combination with TensorFlow: I'm currently working through the "MNIST for ML Beginners" tutorial (https://www.tensorflow.org/versions/r0.11/tutorials/mnist/beginners/index.html#mnist-for-ml-beginners). I fully understand how the whole thing works and what the source code accomplishes.
My question is the following:
Is it possible to see the individual weight parameters for each pixel? As far as I understand, I can't access them directly, because the tf.matmul() operation returns the sum over all weighted pixels for a given class.
I want to access the individual weight parameters because I want to see how these values change over the course of training the neural network.
Thanks for your help,
-Klaus
You can get the actual weights by just doing something like:
w = sess.run(W, feed_dict={x: batch_xs, y_: batch_ys})
print(w.shape)
If you want the per-pixel results, just do an element-wise multiply of batch_xs * w (reshaped appropriately).
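For concreteness, a small NumPy sketch of that idea, assuming the shapes from the beginners tutorial (W is 784x10 and images are flattened 28x28); the names W, sess and batch_xs come from the tutorial, the rest is illustrative:

import numpy as np

w = sess.run(W)                                       # shape (784, 10): one weight per pixel and per class
weights_for_3 = w[:, 3].reshape(28, 28)               # the weight image the model has learned for digit 3
contrib = batch_xs[:, :, np.newaxis] * w[np.newaxis, :, :]   # per-pixel contribution to every class score
print(weights_for_3.shape, contrib.shape)             # (28, 28) and (batch_size, 784, 10)

Running this at different points during training (e.g. every few hundred steps) shows how the per-pixel weights evolve.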

Parameter selection in Adaboost

After using OpenCV for boosting I'm trying to implement my own version of the Adaboost algorithm (check here, here and the original paper for some references).
By reading all the material I've come up with some questions regarding the implementation of the algorithm.
1) It is not clear to me how the weights a_t of each weak learner are assigned.
In all the sources I've pointed to, the choice is a_t = k * ln( (1-e_t) / e_t ), k being a positive constant and e_t the error rate of the particular weak learner.
On page 7 of this source it says that this particular value minimizes a certain convex differentiable function, but I really don't understand the passage.
Can anyone please explain it to me?
2) I have some doubts on the procedure of weight update of the training samples.
Clearly it should be done in such a way as to guarantee that the weights remain a probability distribution. All the references adopt this choice:
D_{t+1}(i) = D_t(i) * e^(-a_t * y_i * h_t(x_i)) / Z_t   (where Z_t is a normalization factor chosen so that D_{t+1} is a distribution).
But why is the particular choice of weight update multiplicative with the exponential of error rate made by the particular weak learner?
Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?
I hope this is the right place to post this question, if not please redirect me!
Thanks in advance for any help you can provide.
1) Your first question:
a_t = k * ln( (1-e_t) / e_t )
The training error is bounded by the product of the Z_t(alpha) terms, and each Z_t(alpha) is convex with respect to alpha, so there is a single "global" optimum alpha that minimizes this upper bound on the error. That is the intuition behind how the "magic" alpha is found.
2) Your 2nd question:
But why is the particular choice of weight update multiplicative with the exponential of error rate made by the particular weak learner?
To cut it short: that choice of alpha does indeed improve the accuracy. This is not surprising: you trust more (by giving a larger weight alpha) the learners that work better than the others, and trust less (by giving a smaller alpha) those that work worse. For learners that bring no new knowledge beyond the previous ones, you assign a weight alpha equal to 0.
It is possible to prove that the final boosted hypothesis has training error bounded by
exp(-2 * Σ_t (1/2 - ε_t)^2)
3) Your 3rd question:
Are there any other updates possible? And if yes is there a proof that this update guarantees some kind of optimality of the learning process?
This is hard to say. Just remember that this update improves accuracy on the training data (at the risk of over-fitting); it is much harder to say anything about how well it generalizes.
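To make the alpha formula and the multiplicative weight update concrete, here is a minimal NumPy sketch of a single boosting round, with k = 1/2 and labels in {-1, +1}; the toy data and the function name boosting_round are purely illustrative, not a reference implementation:

import numpy as np

def boosting_round(D, y, h):
    # D: current sample weights (a probability distribution over the training set)
    # y: true labels in {-1, +1}; h: weak learner predictions in {-1, +1}
    e_t = np.sum(D * (h != y))                   # weighted error of this weak learner
    a_t = 0.5 * np.log((1 - e_t) / e_t)          # a_t = k * ln((1-e_t)/e_t) with k = 1/2
    D_new = D * np.exp(-a_t * y * h)             # shrink weights of correct samples, grow the mistakes
    return a_t, D_new / D_new.sum()              # the normalizer D_new.sum() is exactly Z_t

# Toy usage: four samples, uniform initial weights, one mistake by the weak learner.
y = np.array([1, 1, -1, -1])
h = np.array([1, -1, -1, -1])
a_t, D = boosting_round(np.full(4, 0.25), y, h)
print(a_t, D)                                    # the misclassified sample now carries half the total weight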
