Using MinMaxScaler with Truncated SVD output - machine-learning

I am trying to use reduce the dimensionality of my features before using Multinomial NB classifier. Now the thing is, Multinomial NB does not take negative values in X_train. One of the suggestions I found online is to use MinMaxScaler to scale the SVD output to range (0,1) but I am not sure how feasible that is. (Dealing with negative values in sklearn MultinomialNB).
How can I use the TruncatedSVD output as an input to Multinomial NB classifier? Thanks!
Edit: Sample code below.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df[col])
#After applying Truncated SVD
transformer = TruncatedSVD()
X_red = pd.DataFrame(transformer.fit_transform(X))
The dataset X_red has negative values which I can not use in a Multimonial NB later.

First of all, what is your data? Multinomial NB classifier and TruncatedSVD are both related to Nature Language Processing (NLP). If your data from NLP, then where the negative value from? And what kind of problem do you want to address?
You should also do the data transformation/processing, here MinMaxScale first before applying dimension reduction.

Related

What is the difference between LinearRegression and SGDRegressor?

I understand that both LinearRegression class and SGDRegressor class from scikit-learn performs linear regression. However, only SGDRegressor uses Gradient Descent as the optimization algorithm.
Then what is the optimization algorithm used by LinearRegression, and what are the other significant differences between these two classes?
LinearRegression always uses the least-squares as a loss function.
For SGDRegressor you can specify a loss function and it uses Stochastic Gradient Descent (SGD) to fit. For SGD you run the training set one data point at a time and update the parameters according to the error gradient.
In simple words - you can train SGDRegressor on the training dataset, that does not fit into RAM. Also, you can update the SGDRegressor model with a new batch of data without retraining on the whole dataset.
To understand the algorithm used by LinearRegression, we must have in mind that there is (in favorable cases) an analytical solution (with a formula) to find the coefficients which minimize the least squares:
theta = (X'X)^(-1)X'Y (1)
where X' is the the transpose matrix of X.
In the case of non-invertibility, the inverse can be replaced by the Moore-Penrose pseudo-inverse calculated using "singular value decomposition" (SVD). And even in the case of invertibility, the SVD method is faster and more stable than applying the formula (1).
PS - No LaTeX (MathJaX) in Stackoverflow ???
--
Pierre (from France)

Magnitude of Sample Weights in Keras

Keras model.fit supports per-sample weights. What is the range of acceptable values for these weights? Must they sum to 1 across all training samples? Or does keras accept any weight values and then perform some sort of normalization? The keras source includes, e.g. training_utils.standardize_weights but that does not appear to be doing statistical standardization.
After looking at the source here, I've found that you should be able to pass any acceptable numerical values (within overflow bounds) for both sample weights and class weights. They do not need to sum to 1 across all training samples and each weight may be greater than one. The only sort of normalization that appears to be happening is taking the max of 2D class weight inputs.
If both class weights and samples weights are provided, it provides the product of the two.
I think the unspoken component here is that the activation function should be dealing with normalization.

Classifier for data with varying dimensionality

I need to train a classifier with data whose dimensionality can vary. For example (and this is made-up date for illustration):
class-1,0,1,2,3
class-2,0,3,2,4,5,7
class-3,1,8,8,8,2,8,0,0,0
:
:
and so on...
I am trying to train a Linear SVM using scikit-learn which requires the dimensionality to be fixed. A simple zero-padding of the smaller dims to match the dim of the largest, is giving me disappointing results.
Should I be using a different classifier for such data? How should I approach this?
Feature hashing is the algorithm you need to use to convert your variable-length input into constant-length input. Then, you could use your transformed vectors with any appropiate learning algorithm.
Wikipedia: Feature Hashing
Try padding with feature mean/median, that's another way to deal with missing data.
Are those measurements made in the same points/features ?

Loss function for OneVsRestClassifier

I have a OneVsRestClassifier (scikit-learn) which has been trained.
clf = OneVsRestClassifier(LogisticRegression(C=1.2, penalty='l1')).fit(X_train, y_train)
I want to find out the loss for my test data. I used log_loss function but it does not seem to work because I have multiple classes as outputs for each test case. What do I do?
The classification problem that you are referring to is known as a Multi-Label Classification problem. You have made a good decision of using the OneVsRestClassifier for this purpose. By default the score method uses the subset accuracy which is a very harsh metric as it requires you to guess the entire subset of labels correctly.
Some other loss functions, provided by scikit-learn, that you can use are as follows:
Hamming Loss - This measures the hamming distance between your prediction of labels and the true label. This is an intuitive formula to understand the hamming distance.
Jaccard Similarity Coefficient Score - This measures the Jaccard similarity between your predicted labels and the true labels.
Precision, Recall and F-Measures - In the case of multi-label classification, the notion of Precision, Recall and F-Measures can be applied to each class independently. The following guide explains how to combine them across all labels in multi-label classification.
If you need to also rank the labels as it is done in multi-label ranking problems, then there are other more advanced techniques available in scikit-learn which are very well documented with examples here. If you are dealing with this kind of a problem, then let me know in the comments, I will explain each of these metrics in more details.
Hope this helps!

CvSVM.predict() gives 'NaN' output and low accuracy

I am using CvSVM to classify only two types of facial expression. I used LBP(Local Binary Pattern) based histogram to extract features from the images, and trained using cvSVM::train(data_mat,labels_mat,Mat(),Mat(),params), where,
data_mat is of size 200x3452, containing normalized(0-1) feature histogram of 200 samples in row major form, with 3452 features each(depends on number of neighbourhood points)
labels_mat is corresponding label matrix containing only two value 0 and 1.
The parameters are:
CvSVMParams params;
params.svm_type =CvSVM::C_SVC;
params.kernel_type =CvSVM::LINEAR;
params.C =0.01;
params.term_crit=cvTermCriteria(CV_TERMCRIT_ITER,(int)1e7,1e-7);
The problem is that:-
while testing I get very bad result (around 10%-30% accuracy), even after applying with different kernel and train_auto() function.
CvSVM::predict(test_data_mat,true) gives 'NaN' output
I will greatly appreciate any help with this, it's got me stumped.
I suppose, that your classes linearly hard/non-separable in feature space you use.
May be it will be better to apply PCA to your dataset before classifier training step
and estimate effective dimensionality of this problem.
Also I think it will be userful test your dataset with other classifiers.
You can adapt for this purpose standard opencv example points_classifier.cpp.
It includes a lot of different classifiers with similar interface you can play with.
The SVM generalization power is low.In the first reduce your data dimension by principal component analysis then change your SVM kerenl type to RBF.

Resources