How to use a different dataset with scikit-learn and NLTK? - machine-learning

I am trying to apply the built-in naive Bayes classifiers of scikit-learn and NLTK to raw data I have. The data is a set of tab-separated rows, each containing a label, a paragraph and some other attributes.
I am interested in classifying the paragraphs.
I need to convert this data into a format suitable for the built-in classifiers of scikit-learn and NLTK.
I want to run Gaussian, Bernoulli and Multinomial naive Bayes on all paragraphs.
Question 1:
For scikit-learn, the example given imports the iris data. I checked the iris data, and it contains precomputed numeric feature values. How can I convert my data into such a format and call the Gaussian classifier directly? Is there a standard way of doing so?
Question 2:
For NLTK, what should the input to the NaiveBayesClassifier.classify function be? Is it a dict with boolean values? How can it be made multinomial or Gaussian?

# question 2:
nltk.NaiveBayesClassifier.classify expects a so-called 'featureset'. A featureset is a dictionary with feature names as keys and feature values as values, e.g. {'word1': True, 'word2': True, 'word3': False}. NLTK's naive Bayes classifier cannot be used as a multinomial model. However, you can install scikit-learn and use the nltk.classify.scikitlearn wrapper module to deploy scikit-learn's multinomial classifier.
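As a rough sketch of the two options described above, the snippet below trains NLTK's own classifier on boolean featuresets and then wraps scikit-learn's MultinomialNB with SklearnClassifier on word-count featuresets. The tiny training set, the labels and the helper names bool_features and count_features are made up for illustration; real data would come from the tab-separated file.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for (paragraph, label) rows.
train = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

def bool_features(text):
    """Featureset with boolean values: word -> True if present."""
    return {word: True for word in text.lower().split()}

def count_features(text):
    """Featureset with word counts, as a multinomial model expects."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# NLTK's built-in naive Bayes on boolean featuresets.
nb = nltk.NaiveBayesClassifier.train(
    [(bool_features(text), label) for text, label in train]
)

# scikit-learn's multinomial naive Bayes behind the NLTK classifier interface.
mnb = SklearnClassifier(MultinomialNB()).train(
    [(count_features(text), label) for text, label in train]
)

nb_label = nb.classify(bool_features("buy cheap pills"))
mnb_label = mnb.classify(count_features("buy cheap pills"))
print(nb_label, mnb_label)
```

Swapping MultinomialNB() for BernoulliNB() in the wrapper should work the same way. GaussianNB needs dense input, so passing sparse=False to SklearnClassifier would probably be required for it.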

Related

Encode my multiclass classification problem for ordinal NN

I want to encode my multiclass classification output variable in a specific way to take ordinality into account. I want to use this in a NN with sigmoid outputs.
I have a couple of questions about this:
How could I encode my classes in this way?
This would not change the problem from multiclass to multilabel classification, right?
P.S. here is a link to the paper I based this on. And here is a figure representing the change from a normal NN to their adaptation:
1. How could I encode my classes in this way?
It depends on the framework. A PyTorch example can be found here; it also includes a code snippet for converting predictions back to labels.
2. This would not change the problem from multiclass to multilabel classification, right?
No. You would have multiple binary outputs, but they are subsequently converted to a single label, so it is still multiclass classification.
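The paper and the PyTorch example are not reproduced here, so the following is only a framework-independent sketch of one common ordinal encoding, sometimes called cumulative or thermometer encoding. It assumes K ordered classes numbered 0 to K-1 and K-1 sigmoid outputs, where output k answers "is the label greater than k?". The function names and the 0.5 threshold are illustrative choices, not taken from the paper.

```python
def encode_ordinal(label, num_classes):
    """Encode class `label` (0-based) as num_classes - 1 binary targets.

    Target k is 1 when label > k, so class 2 of 4 becomes [1, 1, 0].
    """
    return [1 if label > k else 0 for k in range(num_classes - 1)]

def decode_ordinal(outputs, threshold=0.5):
    """Convert sigmoid outputs back to a single class index.

    Counts the leading outputs above the threshold and stops at the
    first one that falls below it.
    """
    rank = 0
    for p in outputs:
        if p <= threshold:
            break
        rank += 1
    return rank

targets = [encode_ordinal(c, 4) for c in range(4)]
predicted = decode_ordinal([0.9, 0.7, 0.2])
print(targets, predicted)
```

Because every output vector maps back to exactly one class index, the task stays multiclass even though the network is trained with several binary targets.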

How to code Naïve Bayes using Information Gain (IG)

I read in a paper that Naive Bayes with IG is the best model for text classification when the dataset is small and has few positives. However, I'm not sure how to code this specific model in Python. Would this use TF or scikit-learn, with some parameter adjusted?
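One possible reading, offered as an assumption since the paper is not identified here: information gain is used as a feature-selection step before the classifier. For a binary "term present" feature, information gain with respect to the class is the mutual information between term and class, which scikit-learn exposes as mutual_info_classif. The corpus, labels and k=5 below are invented for illustration.

```python
from functools import partial

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; 1 = positive class, 0 = negative class.
docs = [
    "refund my order now",
    "order arrived broken refund",
    "great product fast shipping",
    "fast shipping great price",
    "love this product",
    "broken item want refund",
]
labels = [1, 1, 0, 0, 0, 1]

# Treat term presence as a discrete feature so the score is the
# mutual information (information gain) between term and class.
info_gain = partial(mutual_info_classif, discrete_features=True)

pipeline = make_pipeline(
    CountVectorizer(binary=True),   # term present / absent
    SelectKBest(info_gain, k=5),    # keep the 5 highest-scoring terms
    MultinomialNB(),
)
pipeline.fit(docs, labels)

selected = pipeline.named_steps["selectkbest"].get_support()
predictions = pipeline.predict(["please refund this broken order"])
print(int(selected.sum()), predictions)
```

The number of kept terms k is the parameter to tune here, ideally with cross-validation given how small the dataset is.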

Predicting over data that has categorical, numerical and text

I am trying to build a classifier for my dataset. Each observation in the data has categorical and numerical values, as well as a more general description in free text. I understand how to build a boosting algorithm to handle the categorical and numerical values, and I have already trained a neural network that predicted over the text quite successfully. What I'm trying to wrap my head around is how to integrate both approaches.
Embed your free text using a language model (e.g. averaging fastText word embeddings, or using Google's Universal Sentence Encoder) into an N-dim vector of floats. One-hot encode the categorical stuff. Concatenate [embedding, one_hot_encoding, numericals] and badabing badaboom, you've got yourself one vector representing your datapoint.
TensorFlow Hub's KerasLayer + https://tfhub.dev/google/universal-sentence-encoder/4 is definitely a good starting point. If you need to train something yourself, you could look into tf.keras.layers.Embedding.
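To make the concatenation step concrete, here is a small NumPy sketch. The toy_embed function is a hypothetical stand-in for a real sentence encoder (it just averages pseudo-random per-word vectors derived from a hash), and the category list, dimensions and example values are made up.

```python
import hashlib

import numpy as np

EMBED_DIM = 8
CATEGORIES = ["red", "green", "blue"]  # hypothetical categorical values

def toy_embed(text, dim=EMBED_DIM):
    """Stand-in for a real text encoder: average of hashed word vectors."""
    vectors = []
    for word in text.lower().split():
        seed = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (2**32)
        vectors.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def one_hot(value, categories=CATEGORIES):
    """One-hot encode a single categorical value."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def featurize(text, category, numericals):
    """Concatenate [embedding, one_hot_encoding, numericals] into one vector."""
    return np.concatenate(
        [toy_embed(text), one_hot(category), np.asarray(numericals, dtype=float)]
    )

row = featurize("small red widget", "red", [3.5, 12.0])
print(row.shape)
```

Stacking such rows gives a plain feature matrix that a boosting model or a dense network can consume. With a real encoder, only toy_embed would need replacing.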

How do I combine text and numerical features in training set for machine learning?

I am trying to predict the number of likes on a post in a social network based on both numerical features and text features. I have a dataframe with the required features, but I don't know what to do with the post text data. Should I vectorize it, or do something else, to get a suitable training matrix? I am going to use LinearSVC from sklearn for the analysis.
There are a lot of different ways to transform your text features into numerical ones.
One of the most common is the Bag of Words approach, where you transform your text into an array with the number of occurrences of each word.
If you are using scikit-learn, I recommend reading its Text Feature Extraction User Guide.
Also look at the NLTK toolkit for more complex ways to process your text data.
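A minimal sketch of the Bag of Words route, assuming the text and numeric columns have already been pulled out of the dataframe: vectorize the text with CountVectorizer, scale the numeric columns, and stack both blocks side by side into one sparse training matrix. The posts, the numeric columns and the "low"/"high" labels are invented for illustration. Since LinearSVC is a classifier, the like counts are assumed to be binned into classes here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical posts with two numeric features: followers, hour posted.
texts = [
    "check out my new puppy",
    "quarterly tax filing reminder",
    "puppy learns a new trick",
    "reminder about the filing deadline",
]
numeric = np.array([[1200, 18], [300, 9], [1500, 20], [250, 8]], dtype=float)
labels = ["high", "low", "high", "low"]  # binned like counts (assumption)

vectorizer = CountVectorizer()
scaler = StandardScaler()

X_text = vectorizer.fit_transform(texts)           # sparse word counts
X_num = csr_matrix(scaler.fit_transform(numeric))  # scaled numeric block
X = hstack([X_text, X_num]).tocsr()                # combined training matrix

clf = LinearSVC().fit(X, labels)
predicted = clf.predict(X)
print(X.shape, list(predicted))
```

New posts should go through vectorizer.transform and scaler.transform (not fit_transform) before being stacked the same way. If the raw like count is the target, a regressor such as LinearSVR may be a better fit than LinearSVC.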

In Mahout, is there any method for data classification with Naive Bayes?

I am still a newbie with Mahout, and I am currently studying Naive Bayes for data classification.
As far as I know, Mahout has two related programs: trainnb, which trains a Bayes model, and testnb, which evaluates the model. In the current implementation of Mahout, is there a way to apply the model to new data with a single command? Or do I need to code an implementation from scratch in Java (e.g. use the model as a base to calculate the likelihood of each class, then return the class with the highest value)?
