Interpretation of Classifier Result in Weka


I am running a classification algorithm in Weka, but I am unsure about some of the results that Weka generates for reporting purposes.
In a binary classification problem (Yes = has the disease, No = does not have the disease), Weka produces a result for each class, and also provides a weighted result at the bottom across both classes.
Image
My question is, from a reporting perspective, which score should I be reporting? (Basically, I want to compare my results with other people's results.)
As per the Weka result (attached) for F-Measure, will it be 91 percent or 89 percent? The same applies to all other measurements (recall and precision).
Also, I want to know which score is reported for a given classifier in research papers: the weighted one, or the one for the class we are trying to predict? For example, in my case, should I only report the result for the 'Yes' class?
Many thanks,

The use case defines what you report. In general, research papers report the entire confusion matrix and statistics tables. This allows readers to extract the data needed for the way they will use the research.
If a patient receives a "disease-free" result from this classifier, there's a chance of about 18% that the person actually does have the disease. Is this acceptable? That's not a question SO (Stack Overflow) can answer: that's the use case.
If you insist on describing the test with a single, scalar statistic, you need to clarify the use case, and report that single metric accurately. In general, the summary f-measure (weighted) is what you report.
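To make the difference between the per-class and the weighted figures concrete, here is a minimal sketch that derives both from a confusion matrix. The counts are hypothetical (they are not the numbers from the attached Weka output), and the weighting by class support is an assumption about how the weighted average row is computed.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_and_weighted_f(confusion, labels):
    """confusion[i][j] = number of instances of actual class i predicted as class j."""
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    scores = {}
    weighted = 0.0
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # actual k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, actual other
        f = f_measure(tp, fp, fn)
        scores[label] = f
        # Assumption: each class is weighted by its share of the instances.
        weighted += f * sum(confusion[k]) / total
    return scores, weighted


# Hypothetical counts, for illustration only.
labels = ["Yes", "No"]
confusion = [[80, 20],
             [10, 90]]
scores, weighted = per_class_and_weighted_f(confusion, labels)
print(scores, weighted)
```

Whichever figure you report, state explicitly whether it is the 'Yes' score or the weighted one, so that others can compare like with like.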

Related

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first, identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?

Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.

Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...

You can go for Chi-Square tests or Entropy depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).

If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Kolmogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selective overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li; I'm sure a free copy is floating around the web somewhere.
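The two screening steps above can be sketched in plain Python as follows. This is only an illustration under simple assumptions: numeric features, a 0/1 outcome, and a hand-written two-sample KS statistic. The function names and the toy data are made up for the example.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def rank_features(rows, outcomes):
    """Return (KS statistic, feature index) pairs, most separating feature first."""
    scores = []
    for f in range(len(rows[0])):
        pos = [r[f] for r, y in zip(rows, outcomes) if y == 1]
        neg = [r[f] for r, y in zip(rows, outcomes) if y == 0]
        scores.append((ks_statistic(pos, neg), f))
    return sorted(scores, reverse=True)


# Toy data: feature 0 separates the two outcomes, feature 1 mostly does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 2.0], [1.0, 5.5], [1.1, 1.5]]
outcomes = [0, 0, 0, 1, 1, 1]
ranked = rank_features(rows, outcomes)
print(ranked)
```

With millions of features you would compute the statistic column by column (or in parallel) and keep only the top-ranked ones for the later feature selection step.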

4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network: a feature is not one pixel of the input; rather, it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/#ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

How can I prove my results after mining a dataset?

I wonder if there's any way to prove the correctness of my results after applying some data mining algorithms to a set of data. When I say data mining algorithms, I'm talking about the basic algorithms.

If you have many examples, a simple way is to split available data in three partitions:
training data (around 50%-60% of available examples, randomly chosen);
validation data (20%-25%);
test data (20%-25%).
Training data are used to adjust parameters of the data mining algorithms.
With validation data you can compare models/algorithms/parameters and choose a winner.
Test data can give you a forecast of winner's performance in the "real world" because they are independent (during the training/validation phase you don't make any choice based on test data).
Anyway there are many schemes and probably the best place to delve deeper into the matter is http://stats.stackexchange.com
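A minimal sketch of the three-way split described above, using only the standard library. The 60/20/20 fractions and the fixed seed are illustrative choices, not a recommendation.

```python
import random


def three_way_split(examples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and test partitions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is the test set
    return train, valid, test


train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

The important point is that the test partition is set aside before any model or parameter choice is made and is only touched once, at the very end.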

There are several ways to demonstrate the correctness of your results. First, you have to choose your performance criteria:
Accuracy of the algorithm
Standard deviation of results
Computation time
Depending on which of these criteria you choose, you need a different mechanism to demonstrate the correctness of your algorithm.
1. Accuracy of algorithm
For this you have to understand which points can be questioned when you claim that your algorithm's accuracy is XY.WZ%.
First question: is your algorithm giving better results because of over-fitting?
To check for over-fitting, you can divide your data into three parts:
training data
validation data
testing data
By doing so, if you get good testing results, you can be reasonably confident that your algorithm did not over-fit. If there is a big difference between training and testing accuracy, that is a sign of over-fitting.
What if you find out that your algorithm over-fits?
You can use several regularization techniques that keep the values of the weight coefficients small and help prevent over-fitting. You can learn more about this in Andrew Ng's machine learning lectures on Coursera.
Second question: is your data set fairly chosen?
Suppose you have 100 examples and you divide them into a 50-30-20 split (training-validation-testing). The question then is which 50 go to training, which 30 to validation, and so on. Different selections will give you different accuracy values, so you should take 5-10 different splits and report the average of the results. This technique is known as cross-validation.
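As a rough sketch of the averaging idea, the snippet below uses k-fold cross-validation, one common way of producing the different splits. The labels and the majority-class "model" are placeholders invented for the example; in practice `evaluate` would train and score your real algorithm.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in range(k) if f != i for j in folds[f]]
        yield train, test


def cross_validate(n, k, evaluate):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores), scores


# Placeholder data and model: 70 positives, 30 negatives, majority-class baseline.
labels = [1] * 70 + [0] * 30


def majority_baseline(train, test):
    majority = 1 if 2 * sum(labels[i] for i in train) >= len(train) else 0
    return sum(labels[i] == majority for i in test) / len(test)


mean_acc, fold_scores = cross_validate(len(labels), 5, majority_baseline)
print(mean_acc, fold_scores)
```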
Another way to support the correctness of your algorithm is to provide a confusion matrix in the case of multiclass classification, and sensitivity and specificity in the case of binary classification. You can look at their wiki pages.
2. Standard deviation of results
If your algorithm is based on random population generation or on heuristics, you are likely to get a different solution on each run. In this case, you should report the standard deviation over multiple runs on the same data set with the same parameter settings.
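A small sketch of that reporting step. `noisy_run` is a hypothetical stand-in for one run of a stochastic algorithm; only the bookkeeping around it is the point.

```python
import random
import statistics


def noisy_run(seed):
    """Hypothetical stand-in for one run of a randomized algorithm; returns an accuracy."""
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.02, 0.02)


# Same data set and parameter settings for every run; only the random seed changes.
scores = [noisy_run(seed) for seed in range(10)]
mean = statistics.mean(scores)
spread = statistics.stdev(scores)  # sample standard deviation
print(f"{mean:.3f} +/- {spread:.3f} over {len(scores)} runs")
```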
3. Computation time of algorithm
This might not be important in every case, but if you are comparing your algorithm with other algorithms, you should provide a comparison of computation time. This has nothing to do with the correctness of your algorithm, but it does give an idea of its practicality.

What good are proven results?
At most you will be able to prove that your implementation matches some theoretical mathematical model, or that an approximative algorithm approximates this mathematical model.
But in practice, real data will not satisfy your mathematical assumptions anyway.
Often, the best proof is: does it work?
That is, on real, unseen data. Not on the data that you used to choose your parameters, because then you are prone to overfitting.

Possible mistakes which may result in higher classification accuracy?

I am doing text classification with the 20NewsGroup data set, using the 20NewsGroup_ByDate version. I extracted the stemmed documents provided here
http://web.ist.utl.pt/~acardoso/datasets/
I applied tf-idf conversion, Information Gain feature selection, and Naive Bayes for classification in Weka. My results are higher than the results mentioned on the page above (82%). I have thought a lot and searched for possible mistakes I may have made, but couldn't find any, as I am using their processed documents.
I only need to apply tf-idf, IG, and the classifier. Kindly provide some insight into what possible mistakes could result in higher accuracy than expected.

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
One way to do this is to use precision and recall, as defined here in Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?

I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold validation
Bootstrap
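As an illustration of one of the resampling techniques listed above, here is a minimal bootstrap sketch. It assumes you already have, for each test example, a 1 or 0 saying whether the model classified it correctly; the data, the number of resamples, and the 95% percentile interval are all illustrative choices.

```python
import random


def bootstrap_accuracy(correct_flags, n_resamples=1000, seed=0):
    """Percentile bootstrap: rough 95% interval for accuracy from per-example 0/1 flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    estimates = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute the accuracy.
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(sample) / n)
    estimates.sort()
    lower = estimates[int(0.025 * n_resamples)]
    upper = estimates[int(0.975 * n_resamples) - 1]
    return lower, upper


# Made-up outcome: 80 of 100 test examples classified correctly.
flags = [1] * 80 + [0] * 20
lower, upper = bootstrap_accuracy(flags)
print(lower, upper)
```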

Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding, for the moment, what your algorithm would look like.
Now on the one hand, depending on the algorithm, if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand, you may use all of the remaining repetitions to test the learned algorithm. You pass each of the repetitions to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop

There are various metrics for evaluating the performance of an ML model, and there is no fixed list of 20 or 30 metrics. You can create your own metrics depending on your problem; when solving real-world problems there are many cases where you need a custom metric.
As for the existing ones, they are already listed in the first answer, so I will just highlight the merits and demerits of a few to give a better understanding.
Accuracy is the simplest metric and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. In a 2-class problem, for example, some points belong to class 1 and some to class 2, and accuracy counts the correct predictions over both. It is not preferred when the dataset is imbalanced, because a model that always predicts the majority class still scores high, which makes the number hard to interpret.
Log loss is a metric computed on predicted probability scores, which give you a better understanding of how confidently a specific point is assigned to class 1. A nice property of this metric is that it is built into logistic regression, a well-known ML technique, which is trained by minimizing it.
The confusion matrix is easiest to read for a 2-class classification problem, where it gives four numbers, and the diagonal entries give an idea of how good your model is. Other metrics such as precision, recall, and F1-score are derived from it and are interpretable.
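To illustrate two of these metrics, here is a short sketch with made-up labels. It shows why accuracy can look good on an imbalanced data set, and how log loss scores predicted probabilities; the clipping constant `eps` is an implementation convenience assumed here to avoid taking the log of zero.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probs[i] is the predicted P(class 1) for example i."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Imbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
always_zero = [0] * 100
print(accuracy(y_true, always_zero))  # high, although no positive is ever found

# Confident correct probabilities score a lower loss than confident wrong ones.
print(log_loss([1, 0], [0.9, 0.1]), log_loss([1, 0], [0.1, 0.9]))
```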

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this, some papers say AdaBoost with NB doesn't give better results, others do). Do you know of any other extensions to NB that may possibly give better accuracy?

In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train, noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places:
tune your classifier (adjusting the classifier's tunable parameters);
apply some sort of classifier combination technique (e.g., ensembling, boosting, bagging); or you can
look at the data fed to the classifier: either add more data, improve your basic parsing, or refine the features you select from the data.
W/r/t naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data, i.e., the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform through a series of processing steps into a structured vector (1D array) such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to its frequency.
stemming: either manually or by using a stemming library? The popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want.
synonym finding: same idea as stemming, fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term
neutral words: words with similar frequencies across classes make poor features
II. Feature Selection
consider a prototypical use case for NBCs: filtering spam; you can quickly see how it fails and just as quickly you can see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in title, and the occurrence of exclamation point in the title. In addition, the best features are often not single words but e.g., pairs of words, or larger word groups.
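A rough sketch of the kind of richer features mentioned above: word pairs in addition to single words, the share of all-caps words, and an exclamation point in the title. The feature names (`__caps_ratio__`, `__title_exclaim__`) and the tokenizing regex are arbitrary choices made for this example.

```python
import re
from collections import Counter


def extract_features(title, body):
    """Build a feature dictionary from single words, word pairs, and a few style cues."""
    words = re.findall(r"[A-Za-z']+", body)
    lowered = [w.lower() for w in words]
    features = Counter(lowered)  # single-word counts
    # Word pairs (bigrams), joined with an underscore.
    features.update(f"{a}_{b}" for a, b in zip(lowered, lowered[1:]))
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    features["__caps_ratio__"] = len(caps) / len(words) if words else 0.0
    features["__title_exclaim__"] = int("!" in title)
    return features


example = extract_features("WIN NOW!", "Click HERE to win FREE money")
print(example)
```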
III. Specific Classifier Optimizations
Instead of 30 classes use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else') then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc.
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). I think of Fisher as normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.

I would suggest using a SGDClassifier as in this and tune it in terms of regularization strength.
Also try to tune the TF-IDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems, SVM or Logistic Regression trained one-versus-all outperforms NB. As you can see in this nice article by Stanford people, SVM outperforms NB for longer documents. The code for the paper, which uses a combination of SVM and NB (NBSVM), is here.
Second, tune your TFIDF formula (e.g. sublinear tf, smooth_idf).
Normalize your samples with l2 or l1 normalization (l2 is the default in TfidfVectorizer) because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. have titles) consider using different features for different parts. For example add title_word1 to your document if word1 happens in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code which performs very well across many tasks, be sure to try it.

Using Laplacian Correction along with AdaBoost.
In AdaBoost, first a weight is assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to be 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
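A minimal sketch of the two ends of that procedure: the initial tuple weights and the final weighted vote. Only `init_weights` is named in the description above; `vote_weight` and `weighted_vote` are hypothetical helpers, and the log((1 - error) / error) vote weight is an assumption, since the text does not say how the classifiers are weighted.

```python
import math
from collections import defaultdict


def init_weights(d):
    """Give each of the d training tuples the same initial weight 1/d."""
    return [1.0 / d] * d


def vote_weight(error):
    """Assumed vote weight for a classifier with weighted error rate 0 < error < 1."""
    return math.log((1 - error) / error)


def weighted_vote(predictions, errors):
    """Combine the k classifiers' predicted labels for one test tuple."""
    totals = defaultdict(float)
    for label, err in zip(predictions, errors):
        totals[label] += vote_weight(err)
    return max(totals, key=totals.get)


print(init_weights(4))
# One accurate classifier can outvote two weak ones.
print(weighted_vote(["spam", "ham", "ham"], [0.1, 0.4, 0.4]))
```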

Ways to improve a Naive Bayes classifier in general
Work with the logarithms of your probabilities
We move from probability space to log-probability space because the class score is computed by multiplying many probabilities, and the result becomes very small. Working with log probabilities avoids this underflow problem.
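A tiny demonstration of the underflow problem, with made-up probabilities: multiplying 100 small per-feature probabilities collapses to zero in floating point, while summing their logarithms stays well-behaved.

```python
import math

# Made-up per-feature likelihoods for one document and one class.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-500, is far below the smallest positive float

log_score = sum(math.log(p) for p in probs)  # same information, no underflow

print(product)    # collapses to 0.0
print(log_score)
```

Since the logarithm is monotonic, comparing classes by their summed log probabilities picks the same winner as comparing the products would.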
Remove correlated features.
Naive Bayes relies on the assumption that features are independent. When features are correlated, meaning one feature depends on others, that assumption fails.
More about correlation can be found here
Work with enough data, not necessarily huge data
Naive Bayes requires less data than logistic regression, since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check for the zero-frequency problem
If the test data set contains feature values with zero frequency in the training data, apply a smoothing technique such as the Laplace correction when predicting the class of the test data.
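A minimal sketch of the Laplace correction for word likelihoods, with made-up counts. The function name and the default `alpha=1.0` (add-one smoothing) are illustrative.

```python
def smoothed_likelihood(word_count, total_count, vocab_size, alpha=1.0):
    """P(word | class) with additive smoothing, so unseen words never get probability 0."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Made-up training counts for one class over a 3-word vocabulary.
counts = {"a": 3, "b": 0, "c": 1}
total = sum(counts.values())
likelihoods = {w: smoothed_likelihood(c, total, len(counts)) for w, c in counts.items()}
print(likelihoods)  # "b" was never seen in this class but still gets a small probability
```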
More on this is well described in the following posts:
machinelearningmastery site post
Analyticvidhya site post

Keeping the n size small also helps NB give high accuracy; as n increases, its accuracy degrades.

Select features that have low correlation between them, and try different combinations of features.
