Weighted Naive Bayes Classifier in Apache Mahout - machine-learning

I am using a Naive Bayes classifier for sentiment analysis on customer support. Unfortunately I don't have a large annotated data set in the customer support domain, only a small amount of annotated data in that domain (around 100 positive and 100 negative examples). I also have the Amazon product review data set.
Is there any way I can implement a weighted Naive Bayes classifier using Mahout, so that I can give more weight to the small customer support set and less weight to the Amazon product review data? I guess training on such a weighted data set would drastically improve accuracy. Kindly help me with the same.

One really simple approach is oversampling, i.e. just repeating the customer support examples in your training data multiple times.
Though it's not quite the same problem, you might get some further ideas by looking into the approaches used for class imbalance; in particular oversampling (as mentioned) and undersampling.
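For example, a minimal oversampling sketch in Python with pandas; the file names and the repeat factor here are purely hypothetical and the repeat factor is something to tune on held-out data:

```python
import pandas as pd

# Hypothetical file names -- substitute your own labelled data sources.
support = pd.read_csv("customer_support_reviews.csv")   # ~200 in-domain examples
amazon = pd.read_csv("amazon_product_reviews.csv")      # large out-of-domain set

# Repeat the small in-domain set several times so it carries more weight
# during training.
REPEAT = 10
train = pd.concat([amazon] + [support] * REPEAT, ignore_index=True)
train = train.sample(frac=1, random_state=0)  # shuffle before training
```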

Related

The impact of number of negative samples used in a highly imbalanced dataset (XGBoost)

I am trying to model a classifier using XGBoost on a highly imbalanced data set, with a limited number of positive samples and a practically infinite number of negative samples.
Is it possible that having too many negative samples (making the data set even more imbalanced) will weaken the model's predictive power? Is there a reason to limit the number of negative samples, aside from running time?
I am aware of the scale_pos_weight parameter, which should address the issue, but my intuition says even this method has its limits.
To answer your question directly: adding more negative examples will likely decrease the predictive power of the trained classifier. For the negative class, choose the most representative examples and discard the rest.
Learning from an imbalanced dataset can hurt the predictive power, and even the ability of a classifier to converge at all. The generally recommended strategy is to keep the number of training examples per class roughly similar. How much class imbalance hurts learning depends on the shape of the decision space and the width of the boundaries between classes: the wider the boundaries and the simpler the decision space, the more successful training will be, even on imbalanced datasets.
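For reference, here is a minimal sketch of the scale_pos_weight parameter mentioned in the question, using the XGBoost Python API on synthetic data; the numbers are illustrative only, not a recommendation:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for a highly imbalanced dataset (about 1% positives).
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)

# Common heuristic: scale_pos_weight = number of negatives / number of positives,
# which up-weights the minority (positive) class in the loss.
ratio = (y == 0).sum() / (y == 1).sum()

model = xgb.XGBClassifier(n_estimators=300, scale_pos_weight=ratio, random_state=0)
model.fit(X, y)
```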
TL;DR
For a quick overview of the methods of imbalanced learning I recommend these two articles:
SMOTE and AdaSyn by example
How to Handle Imbalanced Data: An Overview
Dealing with Imbalanced Classes in Machine Learning
Learning from Imbalanced Data by Prof. Haibo He (more scientific)
There is a Python package called imbalanced-learn which has extensive documentation of these algorithms; I recommend it for an in-depth review.
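As an illustration, a minimal SMOTE sketch with imbalanced-learn on synthetic data (assuming a recent version of the package, which exposes fit_resample):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data as a stand-in for a real dataset.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE creates synthetic minority-class examples by interpolating between
# existing minority samples and their nearest neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```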

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic regression
Naive Bayes
Random Forest
Adaboost
I read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It is more like a preprocessing technique.
My question is: is it best practice to perform feature importance separately for each algorithm, or just use Information Gain? If the former, what are the techniques used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier (a short sketch follows this list).
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
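As a small illustration of the mutual-information approach from the list above, here is a scikit-learn sketch on synthetic data, fitting the selector on the training split only, as stressed earlier:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a dataset with ~30 features.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the selector on the training split only, then apply it to both splits.
selector = SelectKBest(mutual_info_classif, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)
print(selector.get_support(indices=True))  # indices of the retained features
```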
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value < 0.05). (The same holds for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
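For example, a minimal sketch of the Random Forest variable importance mentioned above, using scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean-decrease-in-impurity importances: one value per feature, summing to 1.
ranking = sorted(enumerate(forest.feature_importances_), key=lambda t: -t[1])
for idx, score in ranking[:10]:
    print(f"feature {idx}: {score:.3f}")
```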
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you use a validation and/or a test set. These points are very important because we will analyse how the model takes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of your classifier) and can be used for both purposes; a short sketch follows below.
For detailed instructions about this process and more tools, you can look at the excellent fast.ai machine learning course series, where lessons 2/3/4/5 are about this subject.
Hope it helps!
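To make the SHAP suggestion concrete, here is a minimal sketch on synthetic data with a Random Forest; the dataset and model are stand-ins, not part of the original answer:

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer computes per-prediction feature contributions for tree models.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall.
shap.summary_plot(shap_values, X)
```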

Feature engineering for fraud detection

I'm doing some research into fraud detection for academic purposes.
I'd like to know specifically about techniques for feature selection/engineering from a transactional dataset.
In more details, given a dataset of transactions (credit card for example), what kind of features are selected to be used on the model and how are they engineered?
All the papers I've come across focus on the model itself (SVM, NN, ...) not really touching on this subject.
Also, if anyone knows of public datasets that are not anonymized - that would also help.
Thanks
Having a good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A good grasp of these methods leads to better performing models, better understanding of the underlying structure and characteristics of the data and leads to better intuition about the algorithms that underlie many machine learning models.
There are in general two reasons why feature selection is used:
1. To reduce the number of features, in order to reduce overfitting and improve the generalization of models.
2. To gain a better understanding of the features and their relationship to the response variables.
Possible methods:
Univariate feature selection:
Pearson Correlation
Mutual information and maximal information coefficient (MIC)
Distance correlation
Model based ranking
Tree based methods:
Random forest feature importance (Mean decrease impurity, Mean decrease accuracy)
Others:
Stability selection
Recursive feature elimination (RFE)
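As one concrete illustration of the last item, a minimal RFE sketch with scikit-learn on synthetic data; logistic regression is used here only as an example estimator:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=30, n_informative=6, random_state=0)

# Recursive feature elimination: repeatedly fit the estimator and drop the
# weakest feature until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected
```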

Collecting Machine learning training data

I am very new to machine learning, and need a couple of things clarified. I am trying to predict the probability of someone liking an activity based on their Facebook likes. I am using the Naive Bayes classifier, but am unsure about a couple of things. 1. What would my labels/inputs be? 2. What info do I need to collect for training data? My guess is to create a survey and ask whether the person would enjoy an activity (on a scale from 1-10).
In supervised classification, all classifiers need to be trained with known labeled data; this data is known as training data. Each example should be a vector of features followed by a special value called the class, which in your problem is whether the person enjoyed the activity or not.
Once you have trained the classifier, you should test its behaviour on another dataset so the evaluation is not biased. This dataset must be labeled with the class just like the training data. If you train and test on the same dataset, your classifier's predictions may look really good, but the evaluation is unfair.
I suggest you take a look at evaluation techniques like k-fold cross-validation.
Another thing you should know is that the common Naïve Bayes classifier is typically used to predict binary classes, so your class should be 0 or 1, meaning the surveyed person enjoyed the activity or not. It is implemented in packages like Weka (Java) or SkLearn (Python).
If you are really interested in Bayesian classifiers, I should say that Naïve Bayes is in fact not the best one for binary classification, because Minsky showed in 1961 that its decision boundaries are hyperplanes. Its Brier score is also really bad, and it is said that this classifier is not well calibrated. But it makes good predictions after all.
Hope it helps.
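As a small illustration of the points above (binary like-features, a 0/1 class, and k-fold cross-validation), here is a sketch with scikit-learn on made-up data; the feature layout is purely hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

# Hypothetical data: each row holds 0/1 indicators for 50 possible Facebook likes,
# and the label is 1 if the surveyed person said they would enjoy the activity.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 2, size=200)

clf = BernoulliNB()
print(cross_val_score(clf, X, y, cv=5).mean())  # k-fold cross-validation, as suggested above
```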
This may be fairly difficult with Naive Bayes. You'll need to collect (or calculate) samples of whether or not a person likes activity X, and also details on their Facebook likes (organized in some consistent way).
Basically, for Naive Bayes, your training data should be the same data type as your testing data.
The survey approach may work, if you have access to each person's Facebook like history.

How to improve classification accuracy for machine learning

I have used an extreme learning machine for classification and found that my classification accuracy is only around 70%, which led me to use an ensemble method: creating more classification models and classifying the test data based on the majority vote of the models. However, this method only increases classification accuracy by a small margin. Can I ask what other methods can be used to improve the classification accuracy of a 2-dimensional, linearly inseparable dataset?
Your question is very broad... There's no way to help you properly without knowing the real problem you are treating. But, generally speaking, some methods to improve classification accuracy are:
1 - Cross validation: split your training dataset into groups, always hold one group out for prediction and change the held-out group on each run (a small sketch follows this list). Then you will know which data it is better to train on to get a more accurate model.
2 - Cross dataset: the same as cross validation, but using different datasets.
3 - Tuning your model: basically, change the parameters you're using to train your classification model (I don't know which classification algorithm you're using, so it's hard to help more).
4 - Improve, or use (if you're not already using) a normalization process: find out which techniques (changing the geometry, colours, etc.) will give you more concise data to train on.
5 - Understand the problem you're treating better... Try to implement other methods to solve the same problem. There is always more than one way to solve the same problem; you may not be using the best approach.
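As a small illustration of item 1, here is a cross-validation sketch with scikit-learn; the RBF SVM is only a stand-in non-linear classifier for the 2-D, linearly inseparable case, not something suggested in the original answer:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic 2-D data that is not linearly separable.
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_clusters_per_class=2, random_state=0)

# Score an RBF-kernel SVM with 5-fold cross-validation.
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma="scale"), X, y, cv=5)
print(scores.mean(), scores.std())
```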
Enhancing a model's performance can be challenging at times. I'm sure a lot of you would agree with me if you've found yourself stuck in a similar situation. You try all the strategies and algorithms that you've learnt, yet you fail to improve the accuracy of your model. You feel helpless and stuck. And this is where 90% of data scientists give up. Let's dig deeper now and check out proven ways to improve the accuracy of a model:
Add more data
Treat missing and Outlier values
Feature Engineering
Feature Selection
Multiple algorithms
Algorithm Tuning
Ensemble methods
Cross Validation
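To illustrate the "Algorithm Tuning" and "Cross Validation" items above, a minimal GridSearchCV sketch on synthetic data; the model and parameter grid are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Grid-search a few hyperparameters, scoring each combination with 5-fold CV.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "learning_rate": [0.05, 0.1],
                "max_depth": [2, 3]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```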
If you feel the information is lacking, this link should help you learn more: https://www.analyticsvidhya.com/blog/2015/12/improve-machine-learning-results/
Sorry if the information I give is less than satisfactory.
