Classification LDA vs. TFIDF - machine-learning

I was running multi-label classification on text data and noticed that TF-IDF outperformed LDA by a large margin: TF-IDF accuracy was around 50%, while LDA was around 29%.
Is this expected, or should LDA do better than this?

LDA is normally used for unsupervised learning, not for classification. It provides a generative model rather than a discriminative one (see: What is the difference between a Generative and Discriminative Algorithm?), which makes it less suitable for classification. LDA can also be sensitive to data preprocessing and model parameters.
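To make the gap concrete, here is a minimal sketch (not from the original question) contrasting the two feature pipelines on the same classifier. It assumes the 20 newsgroups corpus, a 50-topic LDA, and logistic regression, and uses single-label multiclass data for simplicity, whereas the question concerned a multi-label task:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in corpus; the original question used multi-label text data.
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
texts, y = data.data, data.target

clf = LogisticRegression(max_iter=1000)

# TF-IDF: high-dimensional sparse term weights, fed directly to the classifier.
X_tfidf = TfidfVectorizer(max_features=20000).fit_transform(texts)
print("TF-IDF accuracy:", cross_val_score(clf, X_tfidf, y, cv=3).mean())

# LDA: each document is compressed to 50 unsupervised topic proportions first,
# which discards discriminative detail the classifier could have used.
counts = CountVectorizer(max_features=20000, stop_words="english").fit_transform(texts)
X_lda = LatentDirichletAllocation(n_components=50, random_state=0).fit_transform(counts)
print("LDA accuracy:", cross_val_score(clf, X_lda, y, cv=3).mean())
```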

Related

Is word2vec a generalization or a memorization algorithm?

I need to know whether word2vec is a generalization algorithm, like most ML algorithms, or a memorization algorithm like kNN.
Since we have two types of algorithms, model-based and memory-based, which category does word2vec fall into when it is used for most_similar_items?
Let me define generalization as the ability of a model that has completed training to predict effectively across a whole range of inputs, including inputs that were not part of training. From that perspective, Word2Vec cannot predict words that are not part of the training dataset, because it simply would not have trained on their context to create an embedding. To qualify as a generalization method, it needs to be able to predict on an input which was not part of the training dataset.
A Word2Vec model maintains a dictionary mapping each word to its corresponding embedding/vector; in short, it cannot predict on unknown words. This is one of the important differences between the fastText model and Word2Vec.
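A minimal sketch of that difference, assuming the gensim 4.x API; the two-sentence corpus and vector size are toy placeholders:

```python
from gensim.models import FastText, Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]

w2v = Word2Vec(sentences, vector_size=16, min_count=1)
ft = FastText(sentences, vector_size=16, min_count=1)

print("cats" in w2v.wv)   # False: no dictionary entry for the unseen word;
                          # w2v.wv["cats"] would raise a KeyError.
print(ft.wv["cats"][:3])  # fastText composes a vector from character n-grams,
                          # so it can embed words it never saw in training.
```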

Should I use multinomial logistic regression or linear discriminant analysis?

In a classification problem, I cannot use a simple logit model if my data label (i.e., the dependent variable) has more than two categories. That leaves me with multinomial regression, Linear Discriminant Analysis (LDA), and the like. Why is it that multinomial logit is not as popular as LDA in machine learning? What particular advantage does LDA offer?

Weak Learners of Gradient Boosting Trees for Classification / Multiclass Classification

I am a beginner in the machine learning field and I want to learn how to do multiclass classification with Gradient Boosting Trees (GBT). I have read some articles about GBT, but only for regression problems, and I couldn't find a clear explanation of GBT for multiclass classification. I also checked GBT in the scikit-learn machine learning library. Its implementation, GradientBoostingClassifier, uses regression trees as the weak learners for multiclass classification.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
Source: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
The thing is, why do we use regression trees as our learners for GBT instead of classification trees? It would be very helpful if someone could explain why regression trees are used rather than classification trees, and how a regression tree can do classification. Thank you.
You are interpreting 'regression' too literally here (as numeric prediction), which is not the case; remember that classification is also handled via regression, namely logistic regression. See, for example, the entry for loss in the documentation page you have linked:
loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)
loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
So, a 'classification tree' is just a regression tree with loss='deviance'...
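To show how a regression tree ends up doing classification, here is a hand-rolled sketch of binary gradient boosting; the synthetic data, depth, learning rate, and stage count are arbitrary assumptions, not scikit-learn's internals. Each stage fits a DecisionTreeRegressor to the negative gradient of the log-loss (deviance), which for binary labels is simply the residual y - p:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

X, y = make_classification(n_samples=200, random_state=0)

F = np.zeros(len(y))                    # additive model F(x), in log-odds
for stage in range(10):
    p = 1.0 / (1.0 + np.exp(-F))        # sigmoid: current predicted probability
    residual = y - p                    # negative gradient of the deviance loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)               # regression tree on a *continuous* target
    F += 0.1 * tree.predict(X)          # shrinkage / learning rate

pred = (F > 0).astype(int)              # threshold the log-odds for class labels
print("train accuracy:", (pred == y).mean())
```

The weak learner must be a regression tree because its target, the gradient, is continuous even when the final task is classification.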

Text classification algorithms which are not Naive?

The Naive Bayes algorithm assumes independence among features. What are some text classification algorithms that are not naive, i.e., that do not assume independence among their features?
The answer is very straightforward, since nearly every classifier (besides Naive Bayes) is not naive. Feature independence is a very rare assumption, and it is not made by (among a huge list of others):
logistic regression (known in the NLP community as the maximum entropy model)
linear discriminant analysis (Fisher's linear discriminant)
kNN
support vector machines
decision trees / random forests
neural nets
...
You are asking about text classification, but there is nothing really special about text: once it is vectorized, you can use any existing classifier for such data.
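For instance, here is a minimal sketch of a non-naive text classifier using a scikit-learn pipeline; the TF-IDF vectorizer, logistic regression, and the tiny spam/ham corpus are all placeholder assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: 1 = spam, 0 = ham (invented placeholder data).
texts = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow?"]
labels = [1, 0, 1, 0]

# Logistic regression learns one weight per term jointly, with no
# feature-independence assumption anywhere in the model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["free money"]))
```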

How to use platt scaling with cross-validation using LIBSVM?

Could somebody give me an example showing how Platt scaling is used along with k-fold cross-validation in multiclass SVM classification in LIBSVM?
I have divided the whole dataset into two parts: training and testing. For cross-validation, I partition the training data so that one partition is for testing and the rest is for training the multiclass SVM classifier.
Platt scaling has nothing to do with your partitioning or the multiclass setting. Platt scaling is an internal technique of each individual binary SVM, and it uses only the training data. It is essentially fitting a logistic regression on top of your learned SVM decision values.
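A minimal sketch of how the two pieces fit together, using scikit-learn's SVC wrapper around LIBSVM rather than the raw LIBSVM interface (the iris data, kernel, and fold count are placeholder assumptions): probability=True turns on libsvm's internal Platt scaling, fitted on each training fold only, while an outer k-fold loop handles evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True enables libsvm's Platt scaling inside each binary SVM;
# it is fit via libsvm's own internal cross-validation on the training data.
clf = SVC(kernel="rbf", probability=True, random_state=0)

# The outer k-fold CV is purely for evaluation and is independent of Platt scaling.
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5))
print("fold accuracies:", scores)
```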
