Deriving model learnable parameters - machine learning

I have 1000 input attributes which I am trying to categorize into 100 categories.
When training a multi-class logistic regression, how many model parameters need to be learned?
Will it be (1000*100 + 100) or 1000+100?

Logistic regression is a binary classification model, meaning that it can only recognise one class from another. In order to apply it to multi-class classification you need to modify it, and there is no single way of doing so; there are some common approaches, though:
The "most standard" way would be "1 vs ALL classification", which means you effectively build 100 logistic regression models, each recognising one class vs all the rest, in this case you have 100*(1000 + 1) parameters.
Another option is "1 vs 1" approach, where you build a logistic regression for each pair of classes, thus leading to 100*(100-1)/2 * (1000 + 1) parameters.
Finally, in principle you could train a model with just 1000 + 100 parameters, where each class has only its own bias while the projection (the 1000 weights) is shared across classes; however, this makes little sense unless your categories are orderable.
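As a quick sanity check of the one-vs-all count, here is a minimal sketch assuming scikit-learn and purely synthetic data (fitting 100 models on a 5000x1000 matrix takes a little while):

    # Minimal sketch: count the parameters of a one-vs-all logistic regression.
    # The data is synthetic and only exists so the model can be fitted.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    n_samples, n_features, n_classes = 5000, 1000, 100
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n_samples, n_features))
    y = rng.integers(0, n_classes, size=n_samples)

    clf = OneVsRestClassifier(LogisticRegression(max_iter=200)).fit(X, y)

    n_params = sum(est.coef_.size + est.intercept_.size
                   for est in clf.estimators_)
    print(n_params)  # 100 * (1000 + 1) = 100100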

Related

Logistic Regression to support multiple classes directly

My understanding is that Softmax Regression is a generalization of Logistic Regression to support multiple classes.
The Softmax Regression model first computes a score for each class, then estimates the probability of each class by applying the softmax function to the scores.
Each class has its own dedicated parameter vector.
My question: why can't we use Logistic Regression to classify into multiple classes in a much simpler way, e.g. if the probability is 0 to 0.3 then class A; 0.3 to 0.6 then class B; 0.6 to 0.9 then class C; etc.?
Why is a separate coefficient vector always needed?
I'm new to ML, so I'm not sure whether this question stems from a gap in my understanding of some fundamental concept.
First up, in terms of terminology: a more established term is multinomial logistic regression.
The softmax function is a natural choice for computing probabilities because it corresponds to maximum likelihood estimation (MLE). The cross-entropy loss has a probabilistic interpretation as well: it is the "distance" between two distributions (output and target).
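For concreteness, here is a minimal sketch of the softmax and cross-entropy computations (plain NumPy; the scores and the one-hot target are toy values of my own):

    # Minimal sketch of softmax + cross-entropy (NumPy; toy scores).
    import numpy as np

    def softmax(scores):
        z = scores - np.max(scores)  # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    scores = np.array([2.0, 1.0, 0.1])  # one score per class
    probs = softmax(scores)             # roughly [0.66, 0.24, 0.10]
    target = np.array([1.0, 0.0, 0.0])  # one-hot ground truth: class 0

    loss = -np.sum(target * np.log(probs))  # cross-entropy
    print(probs, loss)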
What you suggest is to discriminate classes in an artificial way: output a binary distribution and somehow compare it to a multi-class distribution. In theory this is possible and may work, but it surely has drawbacks. For example, it is harder to train.
Suppose the output is 0.2 (i.e. class A) and the ground truth is class B. You would like to tell the network to shift towards a higher value. Next time, the output is 0.7: the network actually learned and moved in the right direction, but you punish it again. In fact, there are unstable points (0.3 and 0.6 in your example) that the network needs time to learn as critical ones. Two values, 0.2999999 and 0.3000001, are almost indistinguishable to the network, yet they determine whether the result is correct or not.
In general, outputting a probability distribution is always better than direct discrimination, because it conveys more information.

Can anyone give me some pointers for using SVM for user recognition using keystroke timing?

I am trying to perform user identification using keystroke dynamics. The data consists of the timing of individual keystrokes. I am using an SVM for binary classification. How can I train this for multiple users?
To clarify: I have keystroke timings for a typed word, for many users. For example, for "hello": h->16, e->10, l->30, o->20. So I don't have a binary class (1 positive, -1 negative).
SVMs are binary classifiers. However, an SVM does give you a confidence score (a function of the distance from the separating hyperplane), and you can use this information in one of two popular ways to convert a binary classifier into a multiclass classifier: One-vs-All and One-vs-One.
See this article on how to use SVMs in a multiclass setting.
For example, in the One-vs-All setting, for each class you split the training data into samples that belong to that class and samples that belong to any other class, and fit an SVM on that split. At the end you have k classifiers for k classes. Then you run your test data through all k classifiers and return the class with the highest confidence score.
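A minimal sketch of that setup, assuming scikit-learn; the keystroke-timing features and user labels below are placeholders:

    # Minimal sketch: one-vs-all SVMs for multi-user identification.
    # X stands in for keystroke-timing feature vectors, y for user ids.
    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.random((300, 5))           # 5 timing features per sample
    y = rng.integers(0, 10, size=300)  # 10 users

    clf = OneVsRestClassifier(LinearSVC()).fit(X, y)  # fits 10 binary SVMs

    scores = clf.decision_function(X[:3])  # one confidence score per user
    print(scores.argmax(axis=1))           # user with the highest score
    print(clf.predict(X[:3]))              # predict() does the same argmax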

Logistic Regression only recognizing predominant classes

I am participating in the Kaggle San Francisco Crime competition and am currently trying a number of different classifiers to establish benchmark performance. I am using LogisticRegression from sklearn, without any parameter tuning, and I noticed from sklearn.metrics.classification_report that it is only predicting the predominant classes, i.e. the classes with the highest number of occurrences in my training set.
Intuition tells me this comes down to parameter tuning, but I am not sure which parameters I have to tweak to make the classifier more aware of the less predominant classes (LogisticRegression has quite a few). At the moment it is predicting only about 3 classes out of 38, so it definitely needs improvement.
Any ideas?
If your model is classifying only the predominant classes, then you are facing the problem of imbalanced classes. There are some good reads on tackling this in machine learning.
Logistic regression is a binary classifier and relies on the one-vs-all or one-vs-one technique for multiclass classification, which does not work well when you have a large number of output classes (38 in your case). Try a different classifier. For a start, use the softmax classifier, which is an extension of the logistic classifier with support for multi-class classification; in scikit-learn, set the multi_class parameter to "multinomial" to use softmax regression.
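A minimal sketch of the class-weighting fix, assuming scikit-learn (note: on current scikit-learn versions, LogisticRegression with the default lbfgs solver already uses the multinomial/softmax formulation); the dataset is synthetic:

    # Minimal sketch: logistic regression on imbalanced synthetic data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                               n_classes=5, weights=[0.8, 0.1, 0.05, 0.03, 0.02],
                               random_state=0)

    plain = LogisticRegression(max_iter=1000).fit(X, y)
    balanced = LogisticRegression(class_weight='balanced',  # reweight rare classes
                                  max_iter=1000).fit(X, y)

    print(np.unique(plain.predict(X)))     # tends to miss the rarest classes
    print(np.unique(balanced.predict(X)))  # rare classes are predicted again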
Another way to improve your model could be parameter tuning via grid search (e.g. GridSearchCV).
On a side note, I would recommend trying other models as well.

Model in Naive Bayes

When we train on a training set using a decision tree classifier, we get a tree model, and that model can be converted to rules and incorporated into Java code.
Now if I train the training set using Naive Bayes, in what form is the model? And how can I incorporate the model into my Java code?
If no model results from the training, then what is the difference between Naive Bayes and a lazy learner (e.g. kNN)?
Thanks in advance.
Naive Bayes constructs estimates of the conditional probabilities P(f_1,...,f_n|C_j), where the f_i are features and the C_j are classes. Using Bayes' rule together with estimates of the priors P(C_j) and the evidence P(f_i), these can be translated into x = P(C_j|f_1,...,f_n), which can be roughly read as "given features f_i, I think they describe an object of class C_j, and my certainty is x". In fact, NB assumes that the features are independent, so it actually uses simple probabilities of the form x = P(f_i|C_j), i.e. "given f_i, I think it is C_j with probability x".
So the form of the model is a set of probabilities:
Conditional probabilities P(f_i|C_j) for each feature f_i and each class C_j
Priors P(C_j) for each class
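To make that concrete, here is a minimal sketch assuming scikit-learn's MultinomialNB on toy count data; the fitted "model" is literally just these arrays, which you could export (e.g. as JSON) and read from your Java code:

    # Minimal sketch: the Naive Bayes "model" is just arrays of log-probabilities.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(100, 6))  # 6 count-valued features
    y = rng.integers(0, 3, size=100)       # 3 classes

    nb = MultinomialNB().fit(X, y)
    print(nb.class_log_prior_.shape)   # (3,)   -> log P(C_j)
    print(nb.feature_log_prob_.shape)  # (3, 6) -> log P(f_i | C_j)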
kNN, on the other hand, is something completely different. It is not a "learned model" in the strict sense, as you don't tune any parameters. It is rather a classification algorithm which, given a training set and a number k, simply answers the question: "for a given point x, what is the majority class of the k nearest points in the training set?"
The main difference is in the input data. Naive Bayes works on objects that are "observations": you simply need some features which are either present in the classified object or absent. It does not matter whether it is a colour, an object in a photo, a word in a sentence, or an abstract concept in a highly complex topological object. kNN, by contrast, is a distance-based classifier, which requires that you can measure a distance between the objects you classify. So in order to classify abstract objects you first have to come up with some metric, a distance measure, that describes their similarity, and the result will be highly dependent on those definitions. Naive Bayes, on the other hand, is a simple probabilistic model which does not use the concept of distance at all; it treats all objects the same way: they are there or they aren't, end of story (of course it can be generalised to continuous variables with a given density function, but that is not the point here).
Naive Bayes constructs/estimates the probability distribution from which your training samples were generated.
Now, given this probability distribution for each of your output classes, you take a test sample and, depending on which class has the highest probability of generating that sample, you assign the test sample to that class.
In short, you take the test sample, run it through all the probability distributions (one per class), and compute the probability of generating this test sample under each particular distribution.
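In code, that decision rule is a single argmax; here is a sketch reusing the toy multinomial NB fitted in the previous snippet:

    # Minimal sketch of the decision rule for the multinomial NB above:
    # pick the class maximising log P(C_j) + sum_i x_i * log P(f_i | C_j).
    def nb_predict(x, class_log_prior, feature_log_prob):
        joint = class_log_prior + feature_log_prob @ x
        return int(np.argmax(joint))

    x = X[0]
    print(nb_predict(x, nb.class_log_prior_, nb.feature_log_prob_))
    print(nb.predict(X[:1])[0])  # scikit-learn computes the same argmax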

Late fusion step of classification using libLinear

I am working on a classification task that uses libLinear as the underlying classifier.
I have trained two models on two different feature sets to make predictions for a query input.
Wishing to use late fusion to combine the results of the two models, I changed the liblinear code so that I can get the decision scores for the different classes. So we have two sets of scores to determine which class the query should belong to.
Is there a standard way to do this "late fusion", or should I just intuitively add the two scores for each class and pick the class with the highest total as the candidate?
The standard way to combine multiple classifiers would be a weighted sum of the scores of the individual classifiers. Of course, you then have the problem of specifying the weight coefficients. There are different possibilities:
set weights uniformly
set weights proportional to performance of classifier
train a new classifier which takes the scores as input
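As a minimal sketch (plain NumPy; the scores and weights below are placeholders), the weighted-sum fusion looks like this:

    # Minimal sketch: late fusion as a weighted sum of per-class scores.
    # scores1/scores2 stand in for the decision scores of the two models.
    import numpy as np

    scores1 = np.array([0.8, -0.2, 0.1])   # model 1: one score per class
    scores2 = np.array([0.3,  0.5, -0.4])  # model 2: one score per class

    w1, w2 = 0.6, 0.4  # e.g. proportional to each model's validation accuracy
    fused = w1 * scores1 + w2 * scores2

    print(int(np.argmax(fused)))  # predicted class after fusion

If the two models produce scores on different scales, normalise them (e.g. with a softmax or z-scoring) before summing.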
