Train a text model to predict true or false - machine-learning

So I am new to the whole machine learning topic but I think I have an interesting problem to solve. I basically just want to know if a sentence conforms to TRUE or FALSE
Here a some sample sentences:
Yes this is me -> TRUE
This was me -> TRUE
Yes true -> TRUE
This was not me -> FALSE
....
Now I would need some hints how I can successfully train a model in e.g. Keras, Caffe or other tools and what kind of principal I should follow.
Thanks for any hints
Update
So from what I understood I need to do natural language classification. I would need to create 2 classes and get the probability for each of the classes back.
Could something like https://github.com/Russell91/nlpcaffe be useful?

If my understanding is correct, you want to categorize various kinds of responses into true/false, which are perhaps responses to questions as part of a conversation.
For this scenario, you should be creating/having a dataset with a large number of examples for both true and false classes and train a binary text classifier. You can read up on SVM and Naive Bayes which are very good for text classification and easily implementable using Scikit-Learn.

Related

Evaluate CNN model for multiclass image classification

i want to ask what metric can be used to evalutate my CNN model for multi class, i have 3 classes for now and i’m just using accuracy and confussion matrix also plot the loss of model, is there any metric can be used to evaluate my model performance?
Evaluating the performance of a model is one of the most crucial phase of any Machine Learning project cycle and must be done effectively. Since, you have mentioned that you are using accuracy and confusion metrics for the evaluation. I would like to add some points for developing a better evaluation strategy:
Consider you are developing a classifier that classifies an EMAIL into SPAM or NON - SPAM (HAM), now one of the possible evaluation criteria can be the FALSE POSITIVE RATE because it can be really annoying if a non-spam email ends in spam category (which means you will read a valuable email)
So, I recommend you to consider metrics based on the problem you are targeting. There are many metrics such as F1 score, recall, precision that you can choose based on the problem you are havning.
You can visit: https://medium.com/apprentice-journal/evaluating-multi-class-classifiers-12b2946e755b for better understanding.

How do I create a feature vector if I don’t have all the data?

So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all this data- but for the actual thing I want to classify I might only have what kind of house it is, and a couple other things- not all the data ie.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a SVM linear classifier- I don’t have any core to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused with feature and label.
You said that you want to predict whether a new item is "gasHeated", then "gasHeated" should be a label rather than a feature.
btw, one of the most-common ways to deal with missing value is to set it as "zero" (or some unused value, say -1). But normally, you should have missing value in both training data and testing data to make this trick be effective. If this only happened in your testing data but not in your training data, it means that your training data and testing data are not from the same distribution, which basically violated the basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. Just like I have mentioned above, although you can get two new predictions, but what if each one has different predictions? It is difficult to assign a probability to results of SVM in my opinion unless you use logistic regression or Naive Bayes. I would prefer Random Forest in this situation.

Machine Learning Text Classification technique

I am new to Machine Learning.I am working on a project where the machine learning concept need to be applied.
Problem Statement:
I have large number(say 3000)key words.These need to be classified into seven fixed categories.Each category is having training data(sample keywords).I need to come with a algorithm, when a new keyword is passed to that,it should predict to which category this key word belongs to.
I am not aware of which text classification technique need to applied for this.do we have any tools that can be used.
Please help.
Thanks in advance.
This comes under linear classification. You can use naive-bayes classifier for this. Most of the ml frameworks will have an implementation for naive-bayes. ex: mahout
Yes, I would also suggest to use Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, Rapidminer, etc. these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could use the Word2Vec Word Cosine distance between descriptions of each your category and keywords in the dataset and then simple match each keyword to a category with the closest distance
Alternatively, you could create a training dataset from already matched to category, keywords and use any ML classifier, for example, based on artificial neural networks by using vectors of keywords Cosine distances to each category as an input to your model. But it could require a big quantity of data for training to reach good accuracy. For example, the MNIST dataset contains 70000 of the samples and it allowed me reach 99,62% model's cross validation accuracy with a simple CNN, for another dataset with only 2000 samples I was able reached only about 90% accuracy
There are many classification algorithms. Your example looks to be a text classification problems - some good classifiers to try out would be SVM and naive bayes. For SVM, liblinear and libshorttext classifiers are good options (and have been used in many industrial applcitions):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext:https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learna and WEKA.
With classifiers, it is still some operation to build and validate a pratically useful classifier. One of the challenges is to mix
discrete (boolean and enumerable)
and continuous ('numbers')
predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straight-forward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go. N-fold cross validation is being used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is desired category.

percentage in classify of learning algorithem

I'm using weka, I have a training set, and the classify of the examples in the training set is boolean.
After I have the training set, I want to predict the percentage of new input to be true or false. I want to get a number between 0-1, and not only o or 1.
How can I do that, I have seen that in the prediection there are only the possibels classifes.
Thanks in advance.
You can only make the same kind of prediction with the learned classifier -- it learns to make the predictions you train it to make. The kind of prediction you want sounds more like regression. That is, you're don't want a strict classification, but a continuous value designating the membership probability.
The easiest way to achieve what you want is to replace the Booleans in your training set with 0/1 values and learn a regression model. This will give you numbers, although not necessarily only between 0 and 1.
To get real probabilities, you would need to use a classifier that calculates probabilities (such as Naive Bayes) and write some custom code (using the Weka library) to retrieve them. See the javadoc of the method that gives you access to the class probabilities.

Weka machine learning:how to interprete Naive Bayes classifier?

I am using the explorer feature for classification. My .arff data file has 10 features of numeric and binary values; (only the ID of instances is nominal).I have abt 16 instances. The class to predict is Yes/No.i have used Naive bayes but i cantnot interpret the results,,does anyone know how to interpret results from naive Bayes classification?
Naive Bayes doesn't select any important features. As you mentioned, the result of the training of a Naive Bayes classifier is the mean and variance for every feature. The classification of new samples into 'Yes' or 'No' is based on whether the values of features of the sample match best to the mean and variance of the trained features for either 'Yes' or 'No'.
You could use others algorithms to find the most informative attributes. In that case you might want to use a decision tree classifier, e.g. J48 in WEKA (which is the open-source implementation of C4.5 decision tree algorithm). The first node in the resulting decision tree tells you which feature has the most predictive power.
Even better (as stated by Rushdi Shams in the other post); Weka's Explorer offers purpose build options to find the most useful attributes in a dataset. These options can be found under the Select attributes tab.
As Sicco said NB cannot offer you the best features. Decision tree is a good choice because the branching can sometimes tell you the feature that is important- BUT NOT ALWAYS. In order to handle simple to complex featureset, you can use WEKA's SELECT ATTRIBUTE tab. There, you can find search methods and attribute evaluator. Depending on your task, you can choose the one that best suits you. They will provide you a ranking of the features (either from training data or from a k-fold cross validation). Personally, I believe that decision trees perform poor if your dataset is overfitting. In that case, a ranking of features is the standard way to select best features. Most of the times I use infogain and ranker algorithm. When you see your attributes are ranked from 1 to k, it is really nice to figure out the required features and unnecessary ones.

Resources